[00:00:35] Sure thing [00:00:42] I think that's the SWAT all done [00:00:44] Sorry for the slowness everyone [00:01:16] RoanKattouw: If it makes my mailbox less full of debate about font faces... [00:01:36] * bd808 is sure that muting those threads will continue [00:02:28] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 7.426 second response time [00:08:52] * bd808 looks for a python reviewer for: https://gerrit.wikimedia.org/r/#/c/124500/ [00:09:10] I think that will fix the 1.23wmf21 l10n problems [00:09:30] Because … mystery action at a distance! [00:12:27] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 219118 bytes in 8.455 second response time [00:24:56] !log catrope synchronized php-1.23wmf20/extensions/VisualEditor 'it helps if you run git submodule update first' [00:25:02] Logged the message, Master [00:25:05] !log catrope synchronized php-1.23wmf21/extensions/VisualEditor 'it helps if you run git submodule update first' [00:25:11] Logged the message, Master [00:27:34] (03PS1) 10BryanDavis: test2wiki to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124505 [00:28:54] RoanKattouw_away: Are you {{done}} done now? I'd like to run some more scap tests [00:38:27] (03Abandoned) 10BryanDavis: l10nupdate: Add temporary debugging captures [operations/puppet] - 10https://gerrit.wikimedia.org/r/124467 (owner: 10BryanDavis) [00:38:40] (03PS2) 10BryanDavis: test2wiki to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124505 [00:39:44] (03Abandoned) 10BryanDavis: test2wiki to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124505 (owner: 10BryanDavis) [00:41:34] (03PS1) 10BryanDavis: Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124506 [00:43:55] greg-g: Are you still on a bus? I'd like to scap group0 to 1.23wmf21 to test my band aid fix. I would be on the hook to revert immediately following if ExtensionMessages looks like it will cause a problem for l10nupdate. [00:44:03] bd808: Yes, sorry [00:44:43] RoanKattouw_away: :) thanks. I watched your idle time on tin climb until I felt safe. [00:45:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [00:46:57] * bd808 decides that greg-g won't have changed his mind in the last 1:30 and proceeds [00:48:38] (03CR) 10BryanDavis: [C: 032] "Approving to test band aid fix for ExtensionMessages generation problem. Will revert if ExtensionMessages doesn't look right after scap." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124506 (owner: 10BryanDavis) [00:48:45] (03Merged) 10jenkins-bot: Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124506 (owner: 10BryanDavis) [00:50:53] !log bd808 Started scap: group0 to 1.23wmf21 (testing python change for mwversionsinuse) [00:50:58] Logged the message, Master [00:53:12] * bd808 sees l10n cache updating yet again for 1.23wmf21 and loses all confidence in his "fix" [00:53:51] !log bd808 scap aborted: group0 to 1.23wmf21 (testing python change for mwversionsinuse) (duration: 02m 57s) [00:53:56] Logged the message, Master [00:54:30] !log bd808 Started scap: group0 to 1.23wmf21 (testing python change for mwversionsinuse) (again) [00:54:35] Logged the message, Master [00:54:56] !log bd808 scap aborted: group0 to 1.23wmf21 (testing python change for mwversionsinuse) (again) (duration: 00m 25s) [00:55:01] Logged the message, Master [00:55:12] (03PS1) 10BryanDavis: Revert "Group0 wikis to 1.23wmf21" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124507 [00:55:34] (03CR) 10BryanDavis: [C: 032] Revert "Group0 wikis to 1.23wmf21" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124507 (owner: 10BryanDavis) [00:55:42] (03Merged) 10jenkins-bot: Revert "Group0 wikis to 1.23wmf21" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124507 (owner: 10BryanDavis) [00:56:51] !log bd808 Started scap: revert group0 to 1.23wmf21 (testwiki still on 1.23wmf21) [00:56:55] Logged the message, Master [01:01:33] (03PS3) 10Ori.livneh: Add EventLogging Kafka writer plug-in [operations/puppet] - 10https://gerrit.wikimedia.org/r/85337 [01:06:45] !log bd808 Finished scap: revert group0 to 1.23wmf21 (testwiki still on 1.23wmf21) (duration: 09m 54s) [01:06:53] Logged the message, Master [01:22:25] ori: working now [01:22:29] \o/ [02:07:07] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [02:07:07] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [02:07:08] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [02:07:08] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [02:15:58] !log LocalisationUpdate completed (1.23wmf20) at 2014-04-08 02:15:58+00:00 [02:16:06] Logged the message, Master [02:34:57] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-08 02:34:56+00:00 [02:35:02] Logged the message, Master [02:45:57] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:37] PROBLEM - MySQL InnoDB on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:57] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [02:49:06] springle_: db1047 has been very sad lately [02:49:27] RECOVERY - MySQL InnoDB on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [03:00:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [03:08:06] With 1.23wmf21 not getting deployed to mediawiki.org last thursday, does that mean the deployment schedule for 1.23wmf22 will be off by a week? 
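For context on the VisualEditor syncs above: a rough sketch of the step behind "it helps if you run git submodule update first", assuming the usual /a/common staging checkout on tin as it appears elsewhere in this log. The submodule pointer has to be checked out in the staging copy before the directory is synced, otherwise the old extension code gets pushed out again.

    cd /a/common/php-1.23wmf20                              # staging checkout (path assumed from the log)
    git submodule update --init extensions/VisualEditor     # check out the commit the superproject now points at
    sync-dir php-1.23wmf20/extensions/VisualEditor 'update VisualEditor submodule'   # produces the "synchronized" !log entries above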
[03:11:07] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Apr 8 03:11:04 UTC 2014 (duration 11m 3s) [03:11:11] Logged the message, Master [03:31:47] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [03:38:12] greg-g: still around? [03:53:36] greg-g: check your mail [04:03:35] !log upgrading libssl on ssl1001,ssl1002,ssl1003,ssl1004,ssl1005,ssl1006,ssl1007,ssl1008,ssl1009,ssl3001.esams.wikimedia.org,ssl3002.esams.wikimedia.org,ssl3003.esams.wikimedia.org [04:03:41] Logged the message, Master [04:03:57] TimStarling: is this the heartbleed.com thing? [04:04:07] * Jasper_Deng didn't know we used openssl [04:15:22] Jasper_Deng: yes [04:15:47] !log also upgraded libssl on cp4001-4019. Restarted nginx on these servers and also the previous list. [04:15:51] Logged the message, Master [04:37:40] !log upgrading libssl on virt1000 [04:37:44] Logged the message, Master [04:38:21] !log upgrading libssl on virt0 [04:38:26] Logged the message, Master [04:41:03] !log upgraded libssl on zirconium.wikimedia.org,neon.wikimedia.org,netmon1001.wikimedia.org,iodine.wikimedia.org,ytterbium.wikimedia.org,gerrit.wikimedia.org,virt1000.wikimedia.org,labs-ns1.wikimedia.org,stat1001.wikimedia.org [04:43:13] !log restarted apache on the above list, failed on labs-ns1, virt1000, ytterbium [04:43:18] Logged the message, Master [04:43:47] <^d> TimStarling: I'll poke ytterbium [04:44:00] <^d> Keep moving on to other boxes if you need. [04:44:35] <^d> Seems up now. [04:45:04] yeah, labs-ns1 and virt1000 are actually the same server [04:45:19] and apache is running there with stime after the upgrade [04:46:30] !log on dataset1001: upgraded libssl and restarted lighttpd [04:46:34] Logged the message, Master [04:53:47] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [05:08:07] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [05:08:07] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [05:08:07] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [05:08:07] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [05:25:10] (03PS1) 10Aude: Enable Wikibase on Wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124516 [05:26:24] (03CR) 10Aude: [C: 04-2] "requires sites and site_identifiers tables to be added and populated on wikiquote" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124516 (owner: 10Aude) [05:31:00] <_joe_> !log upgraded openssl on cp10* and cp30* servers as well [05:31:06] Logged the message, Master [05:39:29] !log restarted apache on fenari magnesium yterrbium antimony [05:39:33] Logged the message, Master [05:39:51] with some mispellings but people will get the point [05:47:01] !log shot many old apache processes running as stats user from 2013, on stat1001 (restarting apache runs it as www-data user) [05:47:06] Logged the message, Master [06:34:37] (03PS3) 10Matanya: dataset: fix module path [operations/puppet] - 10https://gerrit.wikimedia.org/r/119212 [06:37:44] (03PS3) 10Matanya: exim: fix scoping [operations/puppet] - 10https://gerrit.wikimedia.org/r/119496 [06:43:48] springle: did you hear from otto regarding https://gerrit.wikimedia.org/r/#/c/122406/ ? 
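The libssl upgrades being logged above are the standard response to the heartbleed OpenSSL bug; a rough per-host sketch, assuming Ubuntu hosts where nginx (or apache2/lighttpd, as noted above) terminates SSL and only picks up the patched library after a restart:

    apt-get update
    apt-get install -y --only-upgrade libssl1.0.0 openssl   # pull in the patched packages
    service nginx restart                                   # running daemons keep the old library mapped until restarted
    checkrestart                                            # (from debian-goodies) lists daemons still using the deleted library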
[06:45:27] matanya: no [06:45:41] :/ i need to chase him down, thanks [06:46:04] not sure otto knows about it? i emailed analytics lists directly [06:46:29] so far the answer is: probably fine to decom db67, but let's wait for everyone to chime in [06:46:43] i'll bump it this week [06:47:05] thank you [07:30:44] (03PS1) 10Faidon Liambotis: base: add debian-goodies [operations/puppet] - 10https://gerrit.wikimedia.org/r/124524 [07:47:07] <_joe|away> !log restarted nginx on cp1044 and cp1043 [07:47:12] Logged the message, Master [07:53:07] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [07:53:07] (03CR) 10coren: [C: 032] base: add debian-goodies [operations/puppet] - 10https://gerrit.wikimedia.org/r/124524 (owner: 10Faidon Liambotis) [08:02:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [08:09:07] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [08:09:07] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [08:09:07] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [08:09:07] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [08:11:47] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [08:15:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [08:36:30] ori: still working? [09:03:47] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:04:07] hashar: help with setting up zuul for the apps? https://gerrit.wikimedia.org/r/#/c/124539/ [09:08:37] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:08:47] RECOVERY - RAID on labstore3 is OK: OK: optimal, 12 logical, 12 physical [09:08:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [09:11:47] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:16:55] (03PS1) 10RobH: Replacing the unified certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/124542 [09:24:34] (03CR) 10RobH: [C: 032] Replacing the unified certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/124542 (owner: 10RobH) [09:29:47] RECOVERY - RAID on labstore3 is OK: OK: optimal, 12 logical, 12 physical [09:33:47] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:36:37] RECOVERY - RAID on labstore3 is OK: OK: optimal, 12 logical, 12 physical [09:37:37] RECOVERY - Disk space on labstore3 is OK: DISK OK [09:39:19] YuviPanda: hello [09:39:25] hashar: hello! [09:40:00] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:37] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:57] (03PS1) 10Andrew Bogott: Add eth1 checks to nova compute hosts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124560 [09:44:12] and we lost YuviPanda [09:45:10] Noooo not our panda.
:( [09:46:25] panda \O/ [09:46:28] PROBLEM - SSH on labstore3 is CRITICAL: Connection refused [09:46:28] PROBLEM - DPKG on labstore3 is CRITICAL: Connection refused by host [09:46:47] PROBLEM - puppet disabled on labstore3 is CRITICAL: Connection refused by host [09:47:00] mutante: https://gerrit.wikimedia.org/r/#/c/124560/ [09:47:43] ACKNOWLEDGEMENT - DPKG on labstore3 is CRITICAL: Connection refused by host daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44. [09:47:44] ACKNOWLEDGEMENT - Disk space on labstore3 is CRITICAL: Connection refused by host daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44. [09:47:44] ACKNOWLEDGEMENT - RAID on labstore3 is CRITICAL: Connection refused by host daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44. [09:47:44] ACKNOWLEDGEMENT - SSH on labstore3 is CRITICAL: Connection refused daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44. [09:47:44] ACKNOWLEDGEMENT - puppet disabled on labstore3 is CRITICAL: Connection refused by host daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44. [09:49:57] so nice to see all ops in a European time zone :) [09:50:37] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [09:57:12] (03CR) 10Dzahn: [C: 04-1] Add eth1 checks to nova compute hosts. (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/124560 (owner: 10Andrew Bogott) [10:00:49] ori: what is udpprofile::collector, and can i move it from db1014 to... somewhere else? [10:02:47] springle: oh, wow. is there any indication that continues to see activity? mediawiki's profiler class can be configured to write to a database, but i didn't know anyone was using it in production. is it not ancient? [10:04:56] mutante, cmjohnson: https://wikitech.wikimedia.org/wiki/Help:Git_rebase#Don.27t_panic [10:05:21] andrewbogott: 42 [10:05:57] springle: it can go away [10:06:34] springle: it was added in this commit: . the message reads: "testing graphite 0.910 on db1014". [10:07:04] yeah, asher stole db1014 for graphite [10:07:12] trying to steal it back :) [10:07:20] ori: thanks [10:07:46] springle: it's not in any way implicated in our current graphite setup, which exists solely on tungsten.eqiad.wmnet (and labs) [10:08:13] (03PS2) 10Andrew Bogott: Add eth1 checks to nova compute hosts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124560 [10:08:18] mutante: ^ [10:09:24] (03PS1) 10Cmjohnson: adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 [10:11:07] (03CR) 10jenkins-bot: [V: 04-1] adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 (owner: 10Cmjohnson) [10:12:49] (03CR) 10Dzahn: [C: 031] Add eth1 checks to nova compute hosts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124560 (owner: 10Andrew Bogott) [10:15:34] !log update & reboot samarium [10:15:38] Logged the message, Master [10:15:48] (03CR) 10Andrew Bogott: [C: 032] Add eth1 checks to nova compute hosts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124560 (owner: 10Andrew Bogott) [10:16:26] (03PS1) 10Springle: Remove unused db1014 block. db1014 was renamed tungsten rt5871. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124575 [10:18:19] (03CR) 10Springle: [C: 032] Remove unused db1014 block. db1014 was renamed tungsten rt5871.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/124575 (owner: 10Springle) [10:21:04] !log update & reboot barium [10:21:09] Logged the message, Master [10:23:09] (03PS1) 10Dzahn: add nrpe to base [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 [10:24:10] (03CR) 10jenkins-bot: [V: 04-1] add nrpe to base [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 (owner: 10Dzahn) [11:09:28] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [11:09:28] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [11:09:28] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [11:09:28] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [11:32:05] (03PS20) 10Matanya: etherpad: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 [11:32:32] akosiaris: in a meeting or this ^ can be handled ? [11:39:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [12:32:58] (03PS2) 10Dzahn: add nrpe to base [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 [12:39:13] matanya: in ops meeting [12:39:19] sorry [12:39:27] and please tell me you did not resubmit from your local repo [12:39:48] rebase* sorry [12:39:50] (03PS2) 10Cmjohnson: adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 [12:40:26] (03CR) 10Andrew Bogott: [V: 031] "This looks good -- we'll see if it makes new alarms go off :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 (owner: 10Dzahn) [12:46:38] (03PS3) 10Cmjohnson: adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 [12:48:28] PROBLEM - DPKG on strontium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:49:28] RECOVERY - DPKG on strontium is OK: All packages OK [12:49:35] (03CR) 10Matanya: [C: 031] add nrpe to base [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 (owner: 10Dzahn) [12:50:21] paravoid: can you review please https://gerrit.wikimedia.org/r/124572 [12:50:38] mutante: https://rt.wikimedia.org/Ticket/Display.html?id=5064 [12:51:29] (03CR) 10Dzahn: [C: 031] "yep, if we want to monitor this on everything, then standard-packages sounds good to me" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 (owner: 10Cmjohnson) [12:52:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [12:53:10] (03CR) 10Alexandros Kosiaris: [C: 032] adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 (owner: 10Cmjohnson) [12:55:34] can anyone around update Elasticsearch in apt? [12:55:55] and ack nagios errors (so they don't spam to irc) for a couple hours? [12:56:39] !log reedy updated /a/common to {{Gerrit|Id15ddc665}}: Revert "Group0 wikis to 1.23wmf21" [12:56:44] Logged the message, Master [12:57:23] (03PS1) 10Reedy: Non wikipedias to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124591 [12:59:03] * Reedy pokes qchris_away and ^d [13:01:42] Any idea why https://gerrit.wikimedia.org/changes/?q=status:merged+age%3A0d&o=DETAILED_ACCOUNTS&n=100 doesn't work?
[13:02:00] (03CR) 10Cmjohnson: [C: 032] adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 (owner: 10Cmjohnson) [13:03:24] versus [13:03:24] http://review.cyanogenmod.org/changes/?q=status:open+age%3A0d&o=DETAILED_ACCOUNTS&n=100 [13:07:41] (03PS3) 10Dzahn: add nrpe to base [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 [13:12:48] (03PS4) 10Dzahn: add nrpe to base [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 [13:15:18] test [13:15:42] test akosiaris [13:15:43] apergos: :-) [13:15:51] manybubbles: [13:16:54] already pinged [13:17:06] (03PS1) 10coren: Tool Labs: forcibly upgrade libssl [operations/puppet] - 10https://gerrit.wikimedia.org/r/124594 [13:19:25] (03CR) 10Dzahn: [C: 032] "RT #80 :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 (owner: 10Dzahn) [13:21:58] <_joe_> ori: If you're here, please let me know :) [13:26:57] _joe_: Couple of hours from now [13:27:05] Though, he is around early sometimes [13:27:31] <_joe_> Reedy: thanks [13:30:38] (03CR) 10RobH: [C: 031] Tool Labs: forcibly upgrade libssl [operations/puppet] - 10https://gerrit.wikimedia.org/r/124594 (owner: 10coren) [13:31:20] ottomata: welcome! [13:31:34] can you help me get started today? [13:31:42] (03CR) 10coren: [C: 032] Tool Labs: forcibly upgrade libssl [operations/puppet] - 10https://gerrit.wikimedia.org/r/124594 (owner: 10coren) [13:31:50] manybubbles: We have an extension for that [13:31:51] * Reedy grins [13:31:57] Reedy: thanks! [13:32:01] I totally used it a while ago [13:32:27] Reedy: Because we're using /r/ to mark the reverse proxy ... [13:32:33] Reedy: https://gerrit.wikimedia.org/r/changes/?q=status:merged+age%3A0d&o=DETAILED_ACCOUNTS&n=100 [13:32:37] Reedy: ^ should work [13:32:47] Aha, sweet! [13:33:43] (03PS1) 10RobH: replace blog.wikimedia.org certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/124595 [13:35:07] ottomata: I need Elasticsearch 1.1.0 shoved into apt [13:35:37] (03PS2) 10RobH: replace blog.wikimedia.org certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/124595 [13:36:15] qchris: thanks [13:36:22] yw [13:37:04] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:33] !log restarting gitblit [13:37:33] (03CR) 10RobH: [C: 032] replace blog.wikimedia.org certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/124595 (owner: 10RobH) [13:37:37] Logged the message, Master [13:39:00] !log replacing the blog cert, if holmium crashes I didn't do it correctly. [13:39:01] (03PS1) 10Faidon Liambotis: Revert "Giving Nik shell access to analytics1004 to do some elasticsearch load testing" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124597 [13:39:03] manybubbles: ok! [13:39:03] Logged the message, RobH [13:39:04] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 305803 bytes in 9.337 second response time [13:39:08] thanks! 
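For reference, the working form of the Gerrit query discussed above: on this install the REST API lives under the /r/ prefix (which is why the un-prefixed URL fails), and Gerrit prepends an XSSI guard line that has to be stripped before the JSON parses. A hedged example:

    curl -s 'https://gerrit.wikimedia.org/r/changes/?q=status:merged+age:0d&o=DETAILED_ACCOUNTS&n=100' \
      | tail -n +2 | python -m json.tool | head     # drop the )]}' guard line, then pretty-print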
[13:39:28] !log update & reboot tellurium [13:39:33] Logged the message, Master [13:39:47] (03CR) 10jenkins-bot: [V: 04-1] Revert "Giving Nik shell access to analytics1004 to do some elasticsearch load testing" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124597 (owner: 10Faidon Liambotis) [13:41:14] PROBLEM - Host tellurium is DOWN: PING CRITICAL - Packet loss = 100% [13:42:38] (03PS2) 10Faidon Liambotis: Revert "Giving Nik shell access to analytics1004 to do some elasticsearch load testing" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124597 [13:43:27] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Revert "Giving Nik shell access to analytics1004 to do some elasticsearch load testing" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124597 (owner: 10Faidon Liambotis) [13:44:28] (03CR) 10Manybubbles: "Is there a better place to run this?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124597 (owner: 10Faidon Liambotis) [13:45:14] RECOVERY - Host tellurium is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [13:46:13] !log upgraded libssl on holmium [13:46:18] Logged the message, RobH [13:48:49] ottomata: kafka upgrade doesn't work on an1004 [13:49:41] paravoid, analytics1004 (and analytics1003) were kafka test brokers, and were never productionized or puppetized [13:49:50] i thought I had removed kafka from analytics1004, actually [13:50:38] ottomata: can you install git fat on tin? [13:50:42] I cannot [13:50:46] hm, sure, why do you need git-fat there? [13:50:55] to git deploy [13:50:58] to Elasticsearch [13:51:07] the plugins [13:51:14] or is there another server [13:51:17] you don't need git-fat on tin though [13:51:23] the git-fat commands are run on deploy hosts [13:51:27] on the targets [13:51:46] huh, I'm used to running it on the server to check the jars got there. I'll just do it without and see [13:53:21] ottomata: that worked as you said it would [13:53:35] !log synced first Elasticsearch plugin to production Elasticsearch servers [13:53:39] Logged the message, Master [13:54:01] !log they'll pick it up during the rolling restart today to upgrade to 1.1.0 [13:54:05] Logged the message, Master [13:54:08] cool [13:54:18] manybubbles: i was going to start reinstalling an elasticsearch server today [13:54:33] ottomata: not a _great_ day for it [13:54:37] because I'm upgrading to 1.1.0 [13:54:43] ok [13:54:45] that is on the deployment calendar and everything [13:55:05] maybe tomorrow? [13:57:09] sure [14:04:07] ottomata: please ping me when you get a chance to update apt [14:04:35] i was about to do it, but am in standup now [14:04:36] um [14:04:41] q for akosiaris, if you are around [14:04:54] I should change VerifyRelease, right?
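On the VerifyRelease question: reprepro's conf/updates entry wants the ID of the key that signed the upstream Release file, so one way to find it (a sketch, with the URL assumed from the elasticsearch 1.1 apt repo layout) is to verify the downloaded Release.gpg by hand and read the key ID off gpg's output:

    wget -q http://packages.elasticsearch.org/elasticsearch/1.1/debian/dists/stable/Release
    wget -q http://packages.elasticsearch.org/elasticsearch/1.1/debian/dists/stable/Release.gpg
    gpg --verify Release.gpg Release     # the "using RSA key ..." line gives the ID to put in VerifyRelease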
[14:04:54] PROBLEM - DPKG on labstore4 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:04:59] i'm trying to find the right thing to change it to [14:05:14] i downloaded 1.1's Release.gpg and am doing what the reprepro man page says to do [14:05:17] but am not sure [14:05:23] the output doesn't look like what you have [14:05:54] RECOVERY - DPKG on labstore4 is OK: All packages OK [14:09:44] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:09:44] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:09:44] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:09:44] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:11:17] (03PS1) 10Andrew Bogott: Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 [14:18:13] (03PS2) 10Andrew Bogott: Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 [14:19:21] (03PS1) 10Ottomata: reprepro/updates - upgrading elasticsearch to 1.1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/124603 [14:20:08] (03CR) 10Ottomata: [C: 032 V: 032] reprepro/updates - upgrading elasticsearch to 1.1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/124603 (owner: 10Ottomata) [14:23:54] PROBLEM - HTTPS on ssl1002 is CRITICAL: Connection refused [14:24:06] manybubbles: http://apt.wikimedia.org/wikimedia/pool/main/e/elasticsearch/ [14:24:09] look ok? [14:28:54] RECOVERY - HTTPS on ssl1002 is OK: OK - Certificate will expire on 01/20/2016 12:00. [14:29:45] ottomata: looks good - let me try elastic1001 [14:30:35] (03PS3) 10Andrew Bogott: Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 [14:30:57] mutante, ^ pls? [14:31:37] !log upgrading elastic1001 [14:31:42] Logged the message, Master [14:32:38] !log woops, just restarted elastic1002. silly me [14:32:42] Logged the message, Master [14:32:46] !log no harm done, just lost time [14:32:50] Logged the message, Master [14:33:53] ottomata: can you make nagios not bother us about Elasticsearch warning over the next few hours? [14:33:56] I'm paying attention [14:34:25] uh hm [14:35:43] i think so, how long manybubbles [14:35:45] 4 hours? [14:35:48] sure! [14:36:14] PROBLEM - NTP peers on linne is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [14:38:14] RECOVERY - NTP peers on linne is OK: NTP OK: Offset 0.016747 secs [14:44:43] andrewbogott: https://gerrit.wikimedia.org/r/#/c/77332/7/modules/base/manifests/monitoring/host.pp [14:44:51] (03PS4) 10Andrew Bogott: Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 [14:54:18] (03PS5) 10Andrew Bogott: Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 [14:54:59] (03PS3) 10Cmjohnson: add interface speed check for all hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/124606 [15:01:42] mutante: can you review https://gerrit.wikimedia.org/r/124606 [15:02:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Great idea. Minor stuff here and there like making it parameterizable but looks nice." 
(036 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/124606 (owner: 10Cmjohnson) [15:03:10] manybubbles: i think I just scheduled downtime in icinga for elastic search for the next ~4 hours [15:03:19] never done that before, so not sure what it will do [15:03:47] (03PS1) 10Rush: module to manage new python-diamond package [operations/puppet] - 10https://gerrit.wikimedia.org/r/124608 [15:04:54] ottomata: its cool! [15:04:56] thanks [15:07:45] (03CR) 10Ottomata: module to manage new python-diamond package (035 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/124608 (owner: 10Rush) [15:08:18] (03CR) 10Dzahn: [C: 031] Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 (owner: 10Andrew Bogott) [15:12:34] (03PS2) 10Rush: module to manage new python-diamond package [operations/puppet] - 10https://gerrit.wikimedia.org/r/124608 [15:13:35] (03CR) 10jenkins-bot: [V: 04-1] module to manage new python-diamond package [operations/puppet] - 10https://gerrit.wikimedia.org/r/124608 (owner: 10Rush) [15:15:36] (03PS3) 10Rush: module to manage new python-diamond package [operations/puppet] - 10https://gerrit.wikimedia.org/r/124608 [15:16:34] PROBLEM - Host virt1000 is DOWN: CRITICAL - Host Unreachable (208.80.154.18) [15:16:42] !log all ssl servers in eqiad have been updated with new cert and restarted [15:16:51] !log rolling updates on ssl3001-3003 presently [15:17:10] (03PS1) 10Dzahn: enable base monitoring for ALL hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/124609 [15:17:24] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.154.19) [15:18:04] RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [15:19:03] (03CR) 10Andrew Bogott: [C: 032] Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 (owner: 10Andrew Bogott) [15:19:04] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [15:19:07] apergos: https://gerrit.wikimedia.org/r/#/c/124609/1 [15:19:46] ugly, eh.. 
since i have to change all those lines because of indentation :p [15:22:25] (03CR) 10ArielGlenn: [C: 031] enable base monitoring for ALL hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/124609 (owner: 10Dzahn) [15:22:39] (03CR) 10Dzahn: [C: 032] enable base monitoring for ALL hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/124609 (owner: 10Dzahn) [15:23:46] (03CR) 10Ottomata: module to manage new python-diamond package (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/124608 (owner: 10Rush) [15:27:31] PROBLEM - HTTPS on cp4009 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:41] PROBLEM - HTTPS on ssl3003 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:41] PROBLEM - HTTPS on ssl1006 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:41] PROBLEM - HTTPS on cp4014 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:51] PROBLEM - HTTPS on ssl1004 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:51] PROBLEM - HTTPS on ssl1005 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:51] PROBLEM - HTTPS on cp4008 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:51] PROBLEM - HTTPS on cp4004 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:51] PROBLEM - HTTPS on cp4015 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:52] PROBLEM - HTTPS on cp4001 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:52] PROBLEM - HTTPS on cp4017 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:53] PROBLEM - HTTPS on amssq47 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:53] PROBLEM - HTTPS on ssl1002 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:54] PROBLEM - HTTPS on ssl1001 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:54] PROBLEM - HTTPS on cp4005 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:55] PROBLEM - HTTPS on cp4012 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:01] PROBLEM - HTTPS on cp4016 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:01] PROBLEM - HTTPS on sodium is CRITICAL: SSL_CERT CRITICAL lists.wikimedia.org: invalid CN (lists.wikimedia.org does not match *.wikimedia.org) [15:28:11] PROBLEM - HTTPS on ssl1007 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:11] PROBLEM - HTTPS on iodine is CRITICAL: SSL_CERT CRITICAL ticket.wikimedia.org: invalid CN (ticket.wikimedia.org does not match *.wikimedia.org) [15:28:11] PROBLEM - HTTPS on ssl3002 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: 
invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:11] PROBLEM - HTTPS on ssl3001 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:11] PROBLEM - HTTPS on cp4018 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:12] PROBLEM - HTTPS on ssl1008 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:12] PROBLEM - HTTPS on ssl1009 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:13] PROBLEM - HTTPS on ssl1003 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:13] PROBLEM - HTTPS on cp4013 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:14] PROBLEM - HTTPS on cp4003 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:14] PROBLEM - HTTPS on cp4007 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:15] PROBLEM - HTTPS on cp4011 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:15] PROBLEM - HTTPS on cp4010 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:21] PROBLEM - HTTPS on cp4020 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:21] PROBLEM - HTTPS on cp4006 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:31] PROBLEM - HTTPS on cp4002 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:31] PROBLEM - HTTPS on cp4019 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:30:02] holy fun :) [15:30:37] :o [15:32:08] aude: getting to your email :) [15:32:13] ok [15:32:25] want to see if it's ok to do today [15:32:35] anytime works for us, i suppose [15:34:45] aude: tl;dr of email: yep, looks good [15:34:50] ok [15:35:07] we were smart to put i18n stuff a while ago :) [15:35:42] PROBLEM - RAID on holmium is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [15:35:52] PROBLEM - DPKG on fenari is CRITICAL: NRPE: Command check_dpkg not defined [15:36:01] the https failures are me mucking with monitoring, nothing to worry about [15:36:02] PROBLEM - Disk space on fenari is CRITICAL: NRPE: Command check_disk_space not defined [15:36:12] PROBLEM - RAID on fenari is CRITICAL: NRPE: Command check_raid not defined [15:36:22] PROBLEM - puppet disabled on fenari is CRITICAL: NRPE: Command check_puppet_disabled not defined [15:36:57] mutante: fenari is not happy :-D [15:38:21] hashar: thanks, that's cause we just added more monitoring [15:38:33] RT #80 :) [15:38:48] mutante: yeah I noticed your puppet change. Guess fenari is missing some bits [15:41:12] hashar: wasn't running nagios-nrpe-server [15:41:52] greg-g: re: SSL certs, andrewbogott is on that one [15:41:57] ops monitoring sprint over here [15:42:11] mutante: ahh, good to know who's on point for that, thanks [15:42:23] wasn't sure if it'd be an opsen party thing or not [15:42:44] it is.
ops in Athens [15:43:05] that check is new, in that it checks for validity of cert, not just expiry [15:43:18] and wikimedia vs. wikipedia thing [15:43:30] * greg-g nods [15:44:52] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 438.266663 [15:45:02] (03PS1) 10Andrew Bogott: When checking unified certs, check for *.wikipedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/124616 [15:45:32] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 434.533325 [15:46:21] (03CR) 10Andrew Bogott: [C: 032] When checking unified certs, check for *.wikipedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/124616 (owner: 10Andrew Bogott) [15:46:22] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 12:45:20 PM UTC [15:53:10] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [15:53:17] hashar: ^ :) [15:53:20] RECOVERY - puppet disabled on fenari is OK: OK [15:53:26] nice [15:53:40] RECOVERY - Disk space on fenari is OK: DISK OK [15:53:41] RT #80 ftw [15:53:48] With any luck there'll be another flood of OKs in a minute... [15:53:50] RECOVERY - DPKG on fenari is OK: All packages OK [15:54:10] PROBLEM - puppet disabled on bast1001 is CRITICAL: NRPE: Command check_puppet_disabled not defined [15:54:10] PROBLEM - Disk space on cp3003 is CRITICAL: NRPE: Command check_disk_space not defined [15:54:10] PROBLEM - Disk space on dobson is CRITICAL: Connection refused by host [15:54:10] PROBLEM - DPKG on pdf2 is CRITICAL: Connection refused by host [15:54:20] PROBLEM - puppet disabled on iron is CRITICAL: NRPE: Command check_puppet_disabled not defined [15:54:20] PROBLEM - RAID on dobson is CRITICAL: Connection refused by host [15:54:20] PROBLEM - RAID on cp3003 is CRITICAL: NRPE: Command check_raid not defined [15:54:20] PROBLEM - Disk space on pdf2 is CRITICAL: Connection refused by host [15:54:30] PROBLEM - puppet disabled on dobson is CRITICAL: Connection refused by host [15:54:30] PROBLEM - RAID on pdf2 is CRITICAL: Connection refused by host [15:54:30] PROBLEM - DPKG on iodine is CRITICAL: NRPE: Command check_dpkg not defined [15:54:30] PROBLEM - puppet disabled on pdf2 is CRITICAL: Connection refused by host [15:54:40] PROBLEM - Disk space on iodine is CRITICAL: NRPE: Command check_disk_space not defined [15:54:40] PROBLEM - puppet disabled on cp3003 is CRITICAL: NRPE: Command check_puppet_disabled not defined [15:54:40] PROBLEM - DPKG on pdf3 is CRITICAL: Connection refused by host [15:54:48] that's not what I meant [15:54:50] PROBLEM - RAID on iodine is CRITICAL: NRPE: Command check_raid not defined [15:54:50] PROBLEM - Disk space on pdf3 is CRITICAL: Connection refused by host [15:54:50] PROBLEM - DPKG on tridge is CRITICAL: NRPE: Command check_dpkg not defined [15:54:50] PROBLEM - DPKG on bast1001 is CRITICAL: NRPE: Command check_dpkg not defined [15:54:51] PROBLEM - puppet disabled on iodine is CRITICAL: NRPE: Command check_puppet_disabled not defined [15:54:51] PROBLEM - RAID on pdf3 is CRITICAL: Connection refused by host [15:54:51] PROBLEM - Disk space on tridge is CRITICAL: NRPE: Command check_disk_space not defined [15:55:00] PROBLEM - Disk space on bast1001 is CRITICAL: NRPE: Command check_disk_space not defined [15:55:00] PROBLEM - puppet disabled on pdf3 is CRITICAL: Connection refused by host [15:55:10] PROBLEM - Disk space on iron is CRITICAL: NRPE: Command 
check_disk_space not defined [15:55:10] PROBLEM - RAID on bast1001 is CRITICAL: NRPE: Command check_raid not defined [15:55:10] PROBLEM - DPKG on dobson is CRITICAL: Connection refused by host [15:55:10] PROBLEM - DPKG on cp3003 is CRITICAL: NRPE: Command check_dpkg not defined [15:55:10] PROBLEM - DPKG on virt1000 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:55:10] PROBLEM - puppet disabled on tridge is CRITICAL: NRPE: Command check_puppet_disabled not defined [15:55:41] ahhh, so today is going to be a worthless -operations channel day, more than normal, due to the sprint? :) [15:56:03] We're about to all go to dinner though. [15:56:09] So things should quiet down shortly. [15:56:10] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 12:55:50 PM UTC [15:56:19] But the channel will still be useless if you want to talk to ops :) [15:56:50] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [15:57:03] will start nagios-nrpe-server on those [15:57:10] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 12:56:15 PM UTC [15:58:42] RECOVERY - HTTPS on ssl3001 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [15:58:42] RECOVERY - HTTPS on ssl1006 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [15:58:52] RECOVERY - HTTPS on ssl1007 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [15:58:52] RECOVERY - HTTPS on ssl1002 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [15:59:32] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:59:52] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:00:04] back in 5 min or so [16:00:06] (03Abandoned) 10Physikerwelt: WIP: Enable orthogonal MathJax config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/110240 (owner: 10Physikerwelt) [16:00:42] PROBLEM - DPKG on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:00:42] PROBLEM - Disk space on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:00:52] PROBLEM - RAID on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:01:02] PROBLEM - puppet disabled on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
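Once nagios-nrpe-server is running on a host, the NRPE alerts above can be reproduced by hand from the monitoring side; a sketch, with the plugin path of a standard Debian/Ubuntu install and a host name picked from the alerts:

    /usr/lib/nagios/plugins/check_nrpe -H fenari.wikimedia.org -c check_disk_space
    # "Connection refused"               -> the nrpe daemon isn't running on the target
    # "Command ... not defined"          -> daemon is up but the nrpe command snippet is missing
    # "Could not complete SSL handshake" -> often the caller isn't in allowed_hosts, or SSL options mismatch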
[16:02:22] PROBLEM - Puppet freshness on ms6 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:02:03 PM UTC [16:04:37] back [16:08:22] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:07:31 PM UTC [16:09:27] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:09:07 PM UTC [16:09:27] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:08:32 PM UTC [16:09:27] RECOVERY - HTTPS on cp4020 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:27] RECOVERY - HTTPS on cp4006 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:27] RECOVERY - HTTPS on cp4013 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:37] RECOVERY - HTTPS on cp4009 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:37] RECOVERY - HTTPS on cp4010 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:37] RECOVERY - HTTPS on ssl3003 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:47] RECOVERY - HTTPS on ssl3002 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:47] RECOVERY - HTTPS on ssl1004 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:56] ottomata: ping [16:09:57] RECOVERY - HTTPS on cp4012 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:10:07] RECOVERY - HTTPS on cp4016 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:10:07] RECOVERY - HTTPS on ssl1008 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:10:07] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [16:10:07] RECOVERY - HTTPS on cp4018 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:10:17] RECOVERY - HTTPS on ssl1009 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:11:23] ottomata: ping ping [16:12:47] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [16:12:49] pong pong [16:13:05] paravoid [16:13:08] wassupp [16:13:14] what's with stat1's puppet? [16:13:18] why is it admin disabled? 
[16:13:47] because it is going to be decomed very soon [16:13:56] and i wanted to make puppet changes that would apply to stat1003 but not mess with what was on stat1 [16:14:05] and I didn't want to re-write a bunch of statistics.pp stuff :/ [16:14:07] <_joe_> ori: are you around? seems like graphite is *not* working [16:14:24] ottomata: that's bad [16:14:27] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:13:54 PM UTC [16:14:35] paravoid: even if we are going to decom it soon? [16:14:36] ottomata: can you remove the "include statistics*" stuff and enable it again? [16:14:40] yes [16:14:42] yeah probably can [16:14:47] because it's messing with monitoring and all that [16:15:06] ah i see it [16:15:20] paravoid, what is the difference between the 3 numbers in each severity category in icinga? [16:15:25] ottomata: disabling puppet for more than a few hours max is almost always a really bad idea [16:15:31] mark, ok, noted. [16:15:36] thanks [16:16:27] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:16:04 PM UTC [16:16:27] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [16:17:07] <_joe_> :/ [16:17:27] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:16:39 PM UTC [16:18:10] mark, can you help with the current network ACL problems? [16:18:22] sorry, what's that? [16:18:25] analytics nodes can't talk to apt [16:18:27] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:17:50 PM UTC [16:18:30] nor statsd.eqiad.wmnet [16:18:32] https://rt.wikimedia.org/Ticket/Display.html?id=4433 [16:18:37] I added to the bottom of that ticket [16:18:51] ok [16:18:59] i think vanadium was having the same trouble, is it on the vlan too? [16:19:27] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:19:10 PM UTC [16:19:31] still working on wikiquote [16:19:35] we can look at getting rid of those ACLs perhaps [16:19:41] but we'll need to discuss what you're doing with firewalling [16:20:18] (03PS1) 10Ottomata: Disabling statistics roles on stat1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/124621 [16:20:18] the fingerprint of the wikis SSL cert apparently changed, but it is not a newly issued cert but with the same dates as the previous one that i saved. Is that okay that the fingerprint changed? [16:20:34] mark, yeah, hm, not sure, i kind of like them [16:20:45] especially since anyone with hadoop access can launch whatever mapreduce jobs they want [16:21:37] (03CR) 10Ottomata: [C: 032 V: 032] Disabling statistics roles on stat1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/124621 (owner: 10Ottomata) [16:21:37] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [16:21:44] hmmmm [16:21:48] that's weird [16:21:59] checking on that 5xx thing in a sec [16:22:05] that's surely my fault... [16:22:27] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:21:21 PM UTC [16:22:27] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:21:26 PM UTC [16:22:27] PROBLEM - Puppet freshness on lvs4002 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:22:07 PM UTC [16:22:53] hmm, graphite down?
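On the recurring "No output from Graphite for target(s): reqstats.5xx" alerts and the "graphite down?" question: a quick way to tell an empty metric from a dead graphite-web is to hit the render API directly; a sketch, assuming the tungsten graphite instance is reachable as graphite.wikimedia.org:

    curl -s 'https://graphite.wikimedia.org/render?target=reqstats.5xx&from=-1hour&format=json' | python -m json.tool | tail
    # all-null datapoints -> nothing is feeding the metric; an HTTP error -> graphite-web itself is down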
[16:23:04] ottomata: statsd access for analytics seems already there [16:23:07] maybe that 5xx thing is not my fault! [16:23:26] yeah, mark, i think we already had these set up too [16:23:27] PROBLEM - Puppet freshness on virt2 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:22:28 PM UTC [16:23:37] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Tue Apr 8 16:23:30 UTC 2014 [16:23:43] but it seems that they aren't working right now, starting yesterday when I tried [16:24:02] (03PS1) 10Hashar: beta: reenable fatalmonitor script on eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/124624 [16:24:13] and carbon is in there already too [16:24:15] mark, unless pings just aren't allowed and i'm checking wrong? [16:24:24] pings may not be allowed no [16:24:27] ori and I both had trouble running apt-get update because we couldn't talk to carbon [16:24:31] check again? [16:24:35] yeah checking [16:24:48] and i was trying to run sqstat on analytics1003 [16:24:52] so we can decom emery [16:24:59] but it couldn't talk to statsd [16:25:38] hm. [16:25:44] yeah totally working now [16:25:57] ooooook. [16:25:59] weird. [16:26:00] <_joe_> ottomata: graphite is borked [16:26:04] i think faidon did it earlier [16:26:05] (03CR) 10Hashar: "puppet is broken on deployment-bastion.eqiad.wmflabs, can't deploy the change right now :-/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124624 (owner: 10Hashar) [16:26:21] oh, fixed the acl problem? [16:26:33] maybe something else was just not working, and I assumed because I couldn't ping it was an ACL thing? [16:26:55] ping is not a good way to test that [16:27:10] yeah, i just saw the packets being filtered from ping [16:27:11] we allow specific protocols/ports, ping uses different ones [16:27:14] aye [16:27:30] yeah, just figured if i couldn't at least ping then probably other stuff was blocked too, but ja [16:27:57] but yeah, ori couldn't use apt on vanadium either, so dunno... [16:28:10] and sqstat couldn't talk to tungsten, so hm [16:28:12] but ok! [16:28:16] :) [16:28:22] we're going for dinner in a bit [16:28:44] mark [16:28:45] hm [16:28:53] so sqstat is trying to talk to tungsten on 2003 [16:28:56] !log Jenkins: killed jenkins-slave java process on gallium and repooled gallium slave. It was no more registered in Zuul :-/ [16:28:57] RECOVERY - puppet disabled on iron is OK: OK [16:29:01] is that open? [16:29:07] RECOVERY - Disk space on iron is OK: DISK OK [16:29:09] can't seem to reach it from an03 [16:29:34] ganglia seems upset [16:29:40] protocol udp; [16:29:40] destination-port 8125; [16:29:45] tables added [16:29:51] so port 2003 isn't [16:29:54] ah ok [16:30:03] that's why then, could you add? [16:30:13] ok [16:30:40] i'm going to see if reqstats gets flaky when we move it to analytics1003 [16:30:51] it was either flaky because erbium is busy [16:30:57] or because the multicast firehose is just too lossy [16:31:37] !log added sites and site_identifiers core tables on wikiquote [16:31:41] Logged the message, Master [16:32:22] 2003 should work now [16:33:36] RECOVERY - DPKG on iodine is OK: All packages OK [16:33:36] RECOVERY - Disk space on iodine is OK: DISK OK [16:33:36] RECOVERY - puppet disabled on cp3003 is OK: OK [16:33:39] ah just noticed it is udp, mark, will that work still?
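Since the analytics VLAN ACL only passes specific protocol/port pairs, ping is a poor reachability test for the holes just opened; a hedged sketch of exercising the two paths from an analytics host instead (hostnames as used above; UDP gives no feedback, so confirm arrival with tcpdump on the far end):

    echo "test.metric 1 $(date +%s)" | nc -u -w 1 tungsten.eqiad.wmnet 2003    # carbon plaintext line over UDP
    echo "test.counter:1|c" | nc -u -w 1 statsd.eqiad.wmnet 8125               # statsd counter datagram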
[16:33:46] RECOVERY - HTTPS on cp4014 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:46] RECOVERY - RAID on cp3003 is OK: OK: optimal, 2 logical, 2 physical [16:33:46] RECOVERY - RAID on iodine is OK: OK: no disks configured for RAID [16:33:46] RECOVERY - HTTPS on ssl1005 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:46] RECOVERY - HTTPS on cp4003 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:47] yes [16:33:51] ok cool [16:33:52] thanks [16:33:53] ok go eat [16:33:55] thank you! [16:33:56] RECOVERY - DPKG on bast1001 is OK: All packages OK [16:33:56] RECOVERY - puppet disabled on iodine is OK: OK [16:33:56] RECOVERY - HTTPS on cp4002 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:56] RECOVERY - HTTPS on amssq47 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:56] RECOVERY - HTTPS on cp4004 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:57] RECOVERY - HTTPS on cp4001 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:57] RECOVERY - HTTPS on cp4017 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:58] RECOVERY - HTTPS on cp4015 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:58] RECOVERY - HTTPS on cp4008 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:59] RECOVERY - HTTPS on ssl1001 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:59] RECOVERY - HTTPS on cp4005 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:34:00] RECOVERY - Disk space on bast1001 is OK: DISK OK [16:34:00] RECOVERY - HTTPS on cp4019 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:34:06] RECOVERY - RAID on bast1001 is OK: OK: no RAID installed [16:34:06] RECOVERY - DPKG on cp3003 is OK: All packages OK [16:34:06] RECOVERY - HTTPS on ssl1003 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:34:06] RECOVERY - HTTPS on cp4007 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:34:16] RECOVERY - puppet disabled on bast1001 is OK: OK [16:34:16] RECOVERY - Disk space on cp3003 is OK: DISK OK [16:34:16] RECOVERY - 
HTTPS on cp4011 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:35:36] PROBLEM - Puppet freshness on lvs4004 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:35:09 PM UTC [16:35:46] PROBLEM - HTTPS on cp1044 is CRITICAL: SSL_CERT CRITICAL *.wikimedia.org: invalid CN (*.wikimedia.org does not match *.wikipedia.org) [16:35:56] PROBLEM - HTTPS on cp1043 is CRITICAL: SSL_CERT CRITICAL *.wikimedia.org: invalid CN (*.wikimedia.org does not match *.wikipedia.org) [16:36:48] (03PS1) 10Ottomata: Putting sqstat back on analytics1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/124630 [16:37:16] (03CR) 10Ottomata: [C: 032 V: 032] Putting sqstat back on analytics1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/124630 (owner: 10Ottomata) [16:38:30] (03PS1) 10Springle: invalid MariaDB variable name: user_stat [operations/puppet] - 10https://gerrit.wikimedia.org/r/124632 [16:40:40] (03CR) 10Springle: [C: 032] invalid MariaDB variable name: user_stat [operations/puppet] - 10https://gerrit.wikimedia.org/r/124632 (owner: 10Springle) [16:46:50] (03PS1) 10RobH: replace misc-web-lb cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/124634 [16:48:11] (03CR) 10RobH: [C: 032 V: 032] replace misc-web-lb cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/124634 (owner: 10RobH) [16:49:09] sorry, being slow... populating sites table [16:49:20] (03PS1) 10Alexandros Kosiaris: Removing ethtool package from other places [operations/puppet] - 10https://gerrit.wikimedia.org/r/124637 [16:49:22] suppose no hurry [16:50:08] (03CR) 10Dzahn: [C: 031] Removing ethtool package from other places [operations/puppet] - 10https://gerrit.wikimedia.org/r/124637 (owner: 10Alexandros Kosiaris) [16:52:03] (03CR) 10Dzahn: [C: 032] "now included in base" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124637 (owner: 10Alexandros Kosiaris) [16:53:08] (03CR) 10Cmcmahon: [C: 031] "Thanks for putting this back." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/124624 (owner: 10Hashar) [16:53:36] RECOVERY - Puppet freshness on virt2 is OK: puppet ran at Tue Apr 8 16:53:29 UTC 2014 [16:53:46] RECOVERY - Puppet freshness on dataset1001 is OK: puppet ran at Tue Apr 8 16:53:39 UTC 2014 [16:55:06] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [16:55:28] rats [16:56:36] RECOVERY - Puppet freshness on amslvs2 is OK: puppet ran at Tue Apr 8 16:56:30 UTC 2014 [16:56:46] RECOVERY - Puppet freshness on lvs1003 is OK: puppet ran at Tue Apr 8 16:56:45 UTC 2014 [16:59:04] waiting for jenkins [17:01:46] RECOVERY - Puppet freshness on ms6 is OK: puppet ran at Tue Apr 8 17:01:37 UTC 2014 [17:01:48] (03PS2) 10Manybubbles: Turn on experimental highlighting in beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124003 [17:03:06] !log aude synchronized php-1.23wmf20/extensions/Wikidata 'Update Wikidata build, to allow populating sites table on wikiquote' [17:03:10] Logged the message, Master [17:05:20] RECOVERY - Puppet freshness on lvs4004 is OK: puppet ran at Tue Apr 8 17:05:14 UTC 2014 [17:05:30] PROBLEM - RAID on dataset1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) [17:06:40] PROBLEM - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection refused [17:07:40] RECOVERY - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 226 bytes in 0.012 second response time [17:08:20] RECOVERY - Puppet freshness on amslvs3 is OK: puppet ran at Tue Apr 8 17:08:15 UTC 2014 [17:08:30] RECOVERY - Puppet freshness on lvs4003 is OK: puppet ran at Tue Apr 8 17:08:25 UTC 2014 [17:08:44] (03CR) 10Chad: [C: 032] Turn on experimental highlighting in beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124003 (owner: 10Manybubbles) [17:08:53] (03Merged) 10jenkins-bot: Turn on experimental highlighting in beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124003 (owner: 10Manybubbles) [17:09:40] RECOVERY - Puppet freshness on lvs1006 is OK: puppet ran at Tue Apr 8 17:09:30 UTC 2014 [17:10:10] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:10:10] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:10:10] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:10:10] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:10:19] (03CR) 10QChris: "Prerequisite got merged." [operations/puppet] - 10https://gerrit.wikimedia.org/r/121546 (owner: 10Ottomata) [17:10:52] ^demon|away: are you deploying stuff? 
[17:11:14] i'll need to sneak in some point for a config change, but not yet [17:11:29] (03PS1) 10Ottomata: Moving sqstat back to emery :/ [operations/puppet] - 10https://gerrit.wikimedia.org/r/124641 [17:11:38] (03PS2) 10Ottomata: Moving sqstat back to emery :/ [operations/puppet] - 10https://gerrit.wikimedia.org/r/124641 [17:11:40] (03CR) 10jenkins-bot: [V: 04-1] Moving sqstat back to emery :/ [operations/puppet] - 10https://gerrit.wikimedia.org/r/124641 (owner: 10Ottomata) [17:11:50] (03CR) 10Ottomata: [C: 032 V: 032] Moving sqstat back to emery :/ [operations/puppet] - 10https://gerrit.wikimedia.org/r/124641 (owner: 10Ottomata) [17:12:28] aude: no, he just merged something for beta [17:12:34] ok [17:12:41] probably need 10 more minutes [17:12:50] done populating tables, now checking they are ok [17:13:00] then can do the config change and then done :) [17:13:19] <^demon|away> aude: Nope, just merged that for Nik for beta. [17:13:21] <^demon|away> Like he said :) [17:13:22] going slow and careful since i'm still newish [17:13:25] doign this stuff [17:13:32] <^demon|away> Someone should sync it eventually for consistency, but no biggie. [17:13:53] i can do [17:14:04] so can I [17:14:29] hoo: want to check the sites tables and site_identifiers for wikiquote? [17:14:30] RECOVERY - Puppet freshness on lvs1002 is OK: puppet ran at Tue Apr 8 17:14:22 UTC 2014 [17:14:36] they look ok to me [17:15:30] RECOVERY - Puppet freshness on lvs1005 is OK: puppet ran at Tue Apr 8 17:15:22 UTC 2014 [17:16:02] (03CR) 10Aude: "sites table and site_identifiers are added and populated" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124516 (owner: 10Aude) [17:16:10] RECOVERY - Puppet freshness on lvs1004 is OK: puppet ran at Tue Apr 8 17:16:02 UTC 2014 [17:16:28] !log finished upgrading elastic1001-1006. starting on 1007. yay progress. [17:16:32] Logged the message, Master [17:16:34] enwikiqoute looks good to me [17:16:39] alright [17:16:40] sites and site_identifiers [17:16:44] strip protocals and all [17:16:52] yep [17:16:58] https://gerrit.wikimedia.org/r/#/c/124516/ want to merge [17:17:07] i can deploy it and sync the cirrus thing [17:17:19] thanks1 [17:17:22] ok, also looks good on WD [17:17:30] ok [17:17:45] let me sync cirrus [17:17:52] go ahead [17:17:53] Oh, today is the day [17:18:06] it's *the* day :) [17:18:10] RECOVERY - Puppet freshness on lvs4001 is OK: puppet ran at Tue Apr 8 17:18:03 UTC 2014 [17:19:18] aude: You also sorted the wikidataclient dblist? :P [17:19:53] yes [17:20:04] Ok, looks good to me, can approve whenever you want [17:20:05] they will get sorted eventually [17:20:13] doing chad's thing [17:20:30] RECOVERY - Puppet freshness on amslvs1 is OK: puppet ran at Tue Apr 8 17:20:23 UTC 2014 [17:21:30] RECOVERY - Puppet freshness on lvs1001 is OK: puppet ran at Tue Apr 8 17:21:24 UTC 2014 [17:21:50] RECOVERY - Puppet freshness on amslvs4 is OK: puppet ran at Tue Apr 8 17:21:45 UTC 2014 [17:22:30] RECOVERY - Puppet freshness on lvs4002 is OK: puppet ran at Tue Apr 8 17:22:21 UTC 2014 [17:22:43] !log aude synchronized wmf-config/CirrusSearch-labs.php 'config change for beta, to enable highlighting' [17:22:47] Logged the message, Master [17:23:06] hoo: ready [17:23:45] (03CR) 10Hoo man: [C: 032] "Preparation finished, so do this! \o/" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124516 (owner: 10Aude) [17:23:49] yay! 
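For anyone repeating this for another wiki group, the steps aude walked through boil down to running the Wikibase sites-table population script against each new client wiki and then spot-checking the result. A hedged sketch, assuming the usual mwscript wrapper and that the script sits at the same path inside the Wikidata build as the rest of Wikibase lib; the path and column names are my assumptions, not copied from what was actually run:

    # populate sites / site_identifiers for one of the new client wikis
    mwscript extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=enwikiquote
    # then spot-check from a SQL prompt that the wikiquote group is present, e.g.
    #   SELECT site_global_key, site_group FROM sites WHERE site_group = 'wikiquote' LIMIT 5;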
[17:23:51] there you go ;) [17:23:53] (03Merged) 10jenkins-bot: Enable Wikibase on Wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124516 (owner: 10Aude) [17:27:20] aude: About to sync or shall I take it? [17:27:21] sync dblist then wmf-config? [17:27:31] * Nemo_bis waiting [17:27:43] no other way [17:27:52] other way round sounds sane [17:28:02] wmf-config then dblist is good [17:28:06] wmf-config changes will work w/o the rest [17:28:10] right [17:28:20] that' what ree-dy did for wikisource [17:28:52] doing [17:28:55] :) [17:28:59] !log aude synchronized wmf-config 'config changes to enable Wikibase on Wikiquote' [17:29:04] Logged the message, Master [17:29:12] (03PS1) 10Matthias Mullie: Increase Flow cache version [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124646 [17:29:52] !log aude synchronized wikidataclient.dblist 'Enable Wikibase on Wikiquote' [17:29:57] Logged the message, Master [17:30:01] oO [17:30:02] :) [17:30:12] alright time to check it's all good [17:30:17] on that [17:31:13] oh well... I think we have to bump wgCacheEpoch once again [17:31:14] aude: ^ [17:31:36] huh [17:31:45] ah, yes [17:32:00] shall I patch or will you? [17:32:26] https://www.wikidata.org/wiki/Q189119#sitelinks-wikiquote [17:32:34] Nemo_bis: Yes, the usual stuff [17:32:34] go ahead [17:33:06] it says list of values is complete [17:33:09] i assume caching [17:33:16] on Q60 [17:33:57] debug=true, i can add wikiquote [17:34:23] yep, I did action=purge [17:34:23] (03PS1) 10Hoo man: Bump wgCacheEpoch for Wikidata after enabling Wikiquote langlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124648 [17:34:24] yep [17:34:31] aude: ^ [17:34:35] ok [17:35:21] !log restarted gmetad on nickel to fix ganglia [17:35:26] Logged the message, Master [17:35:33] (03CR) 10Aude: [C: 032] Bump wgCacheEpoch for Wikidata after enabling Wikiquote langlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124648 (owner: 10Hoo man) [17:35:40] (03Merged) 10jenkins-bot: Bump wgCacheEpoch for Wikidata after enabling Wikiquote langlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124648 (owner: 10Hoo man) [17:37:00] aude: Syncing? I have to sync a touch out [17:37:10] doing [17:37:12] ok [17:37:18] !log aude synchronized wmf-config/Wikibase.php 'bump wgCacheEpoch for wikidata after enabling wikiquote site links' [17:37:19] just being careful [17:37:22] Logged the message, Master [17:37:28] !log hoo synchronized php-1.23wmf20/extensions/Wikidata/extensions/Wikibase/lib/resources/wikibase.Site.js 'touch' [17:37:32] Logged the message, Master [17:37:34] that should purge the sites cache [17:37:43] "13:37 < aude> just being careful" +1 ;) [17:37:44] in resource loader [17:37:47] :) [17:38:25] still says complete [17:38:30] mh :/ [17:38:45] sites module has always been a pain [17:40:24] maybe php-1.23wmf20/extensions/Wikidata/extensions/Wikibase/lib/includes/modules/SitesModule.php ? 
[17:40:43] aude: Wont help, RL does timestamps based on the JS scripts [17:40:50] hmmm, ok [17:41:13] works for me [17:41:16] now at least [17:41:35] trying in firefox [17:41:39] might be my caching [17:41:42] \o/ Just added the first link [17:41:46] https://www.wikidata.org/wiki/Q40904#sitelinks-wikiquote [17:41:48] already did one :) [17:41:54] with debug=true [17:41:59] Cheating :D [17:42:11] heh [17:42:23] looks good in firefox [17:42:30] i have to assume it's my cache [17:42:31] I did one ten minutes ago already :P [17:42:35] :P [17:42:36] yay [17:42:45] Nemo_bis: with debug true, I guess?! [17:42:50] lol Heisenberg [17:42:55] 19.34 < Nemo_bis> yep, I did action=purge [17:43:01] :P [17:43:01] ah [17:43:50] Is there a procedure to delete gerrit repositories? [17:45:00] i can add links in wikidata now in chrome [17:45:09] aude: https://en.wikiquote.org/w/index.php?title=Werner_Heisenberg&action=info mh [17:45:14] why is it not showing up? [17:45:34] Guest64226 / krinkle : probably you can ask on the same gerrit queue page as usual [17:45:53] ah, I see [17:45:57] unless it's not "your" repository, in which case maybe a bug is better [17:46:11] dispatching is ... :S [17:47:21] hmmm [17:47:28] https://www.wikidata.org/wiki/Special:DispatchStats [17:47:44] i did action=purge on https://en.wikiquote.org/wiki/New_York_City [17:47:46] aude: Can we safely skip theses changes? If not just waiting is also fine [17:47:54] it's catching up rather quickly AFAIS [17:47:55] removed dewikiquote [17:48:08] we can wait [17:48:16] * bd808|deploy waits in line to do a group0 to 1.23wmf21 scap [17:48:28] give us 5 more minutes to poke [17:48:43] aude: Sounds good [17:48:59] i think we're ok though... [17:49:32] or nothing we solve in 5 min, but didn't break anything [17:50:51] aude: I can bump the chd_seen fields [17:51:12] ok [17:52:05] Just looking for the right change id [17:53:43] got that [17:54:37] something is weird with wikiquote... like it's not actually enabled now [17:54:45] but sure i saw it was [17:55:29] * aude thinks this happened with wikisource [17:56:19] !log changed the Wikidata wb_changes_dispatch position of all wikiquote wikis to 118158153 [17:56:23] Logged the message, Master [17:56:39] enwikiquote is in wikidataclient.dblist [17:56:42] 20140408172900 [17:57:03] that was the timestamp, should be a few moments before anything happened regarding wikiquote [17:57:12] ok [17:57:39] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 540.333313 [17:58:28] still https://en.wikiquote.org/w/index.php?title=Werner_Heisenberg&action=info [17:58:56] Wikidata is not even loaded there... wtf [17:58:59] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 645.666687 [17:59:03] right, [17:59:05] i'm sure it was [17:59:25] do i have to sync dblist again? [17:59:37] did we somehow undo it? [18:00:58] no, looks good on a random mw* machine [18:01:09] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 1694 MB (2% inode=86%): [18:01:14] ah [18:01:50] !log hoo synchronized wmf-config/InitialiseSettings.php 'Touch to clear config. cache' [18:01:54] Logged the message, Master [18:01:55] ok [18:02:09] it's back! 
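The touch hoo synced is the standard trick for the local configuration cache: as far as I can tell, the per-host wmf-config cache is treated as fresh while it is newer than InitialiseSettings.php, so a dblist-only change can leave stale settings behind until that file's mtime is bumped and re-synced (automating this is bug 58618, mentioned just below). A hedged sketch of the sequence, run from the deployment host:

    # bump the mtime so every apache rebuilds its local configuration cache on the next request
    touch wmf-config/InitialiseSettings.php
    sync-file wmf-config/InitialiseSettings.php 'Touch to clear config cache'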
[18:02:11] Sorry, I forgot about that [18:02:33] was about to try that [18:02:37] :) [18:02:41] touch all the wikidata things :) [18:02:43] * bd808|deploy wants to fix https://bugzilla.wikimedia.org/show_bug.cgi?id=58618 so that's automatic [18:02:56] i think we are done! [18:03:19] i am sure this happened on wikisource or previously where it was enabled and then not [18:03:38] * aude puzzled but we're good now [18:04:13] Yep, looks good to me [18:04:23] aude, hoo: All clear for me to mess with /a/common on tin and then scap? [18:04:37] Yep, go ahead... we're done for now :) [18:04:47] Cool [18:05:08] done [18:06:11] (03PS1) 10BryanDavis: Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124655 [18:06:50] * greg-g crosses fingers and knocks on wood [18:07:03] (03CR) 10BryanDavis: [C: 032] Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124655 (owner: 10BryanDavis) [18:07:05] * aude too! [18:07:46] greg-g: Aaron merged my fix so in theory I should only need one scap. I'll verify the file after the first scap to be certain [18:08:21] * greg-g nods [18:08:28] (03Merged) 10jenkins-bot: Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124655 (owner: 10BryanDavis) [18:10:36] !log bd808 Started scap: group0 wikis to 1.23wmf21 (with patch for bug 63659) [18:10:39] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:10:41] Logged the message, Master [18:11:25] l10n cache did not rebuild which is a great sign [18:11:58] Unable to open /usr/local/apache/common-local/wikiversions.cdb. [18:11:58] https://pl.wikipedia.org/w/index.php?title=Dyskusja_wikiprojektu:%C5%9Ar%C3%B3dziemie&oldid=prev&diff=39218000 [18:12:01] i get a "Unable to open /usr/local/apache/common-local/wikiversions.cdb." [18:12:10] ...and same here. [18:12:12] [2014-04-08 18:11:37] Fatal error: Unable to open /usr/local/apache/common-local/wikiversions.cdb. [18:12:15] uh-oh [18:12:19] Yeah. fuck [18:12:21] yeah, you got it [18:12:22] here the same [18:12:26] It will be fixed in a few moments [18:12:30] thats everything [18:12:31] well shit [18:12:45] fuuuuck [18:12:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [18:12:57] There's my first crash all of the wikis [18:12:59] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:13:00] SNAFU? 
[18:13:05] wtf [18:13:13] down on wm [18:13:21] damn it, I was actually reading an article and I reloaded it to test [18:13:23] It was my "fix" for the scap problem [18:13:25] now I can't read it while I wait [18:13:29] PROBLEM - Apache HTTP on mw1190 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.007 second response time [18:13:29] PROBLEM - Apache HTTP on mw1055 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.013 second response time [18:13:29] PROBLEM - Apache HTTP on mw1150 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.004 second response time [18:13:29] PROBLEM - Apache HTTP on mw1101 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.005 second response time [18:13:29] PROBLEM - Apache HTTP on mw1177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.009 second response time [18:13:29] PROBLEM - Apache HTTP on mw1138 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.003 second response time [18:13:30] PROBLEM - Apache HTTP on mw1187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.006 second response time [18:13:30] PROBLEM - Apache HTTP on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.006 second response time [18:13:31] PROBLEM - Apache HTTP on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.013 second response time [18:13:31] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - [18:13:33] Whoa [18:13:34] * aude cries [18:13:39] PROBLEM - Apache HTTP on mw1213 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.018 second response time [18:13:39] PROBLEM - Apache HTTP on mw1113 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.012 second response time [18:13:39] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.008 second response time [18:13:42] PROBLEM - Apache HTTP on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.006 second response time [18:13:42] PROBLEM - Apache HTTP on mw1035 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.022 second response time [18:13:42] PROBLEM - Apache HTTP on mw1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.011 second response time [18:13:42] PROBLEM - Apache HTTP on mw1090 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.010 second response time [18:13:42] PROBLEM - Apache HTTP on mw1154 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.007 second response time [18:13:52] It will be fixed soon… scap will fix it at the end [18:13:54] !log bd808 Finished scap: group0 wikis to 1.23wmf21 (with patch for bug 63659) (duration: 03m 18s) [18:13:59] Logged the message, Master [18:14:00] alright [18:14:01] Should be fixed now [18:14:04] fixed [18:14:15] * greg-g breathes again [18:14:22] can whoever's in charge of icinga-wm bring it back to life? [18:14:35] Damn it. :P [18:14:37] jackmcbarn: it'll again automatically, I *believe* [18:14:38] Someone [18:14:39] so what happened? [18:14:47] Oh, you know about it? [18:14:48] greg-g: You accidentally a verb. 
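For context on why this one file takes down every wiki at once: the multiversion entry point looks each request's wiki up in wikiversions.cdb to decide which php-1.23wmfNN tree to load, so if the cdb is missing or half-written mid-sync there is nothing to fall back to, and every request dies with the fatal quoted above. The narrow tool for shipping just that mapping (used later in the afternoon for group1) is sync-wikiversions; a hedged sketch with an illustrative log message:

    # rebuild the dbname -> php-1.23wmfNN map from the wikiversions files and push it to the apaches
    sync-wikiversions 'group1 to 1.23wmf21'
    # paranoia check on an apache afterwards: the cdb should exist and be non-trivially sized
    ls -l /usr/local/apache/common-local/wikiversions.cdb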
[18:14:49] ok [18:14:50] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.066 second response time [18:14:50] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.073 second response time [18:14:51] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.084 second response time [18:14:51] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.111 second response time [18:14:51] Patch https://gerrit.wikimedia.org/r/#/c/124627/ [18:14:52] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.062 second response time [18:14:52] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.059 second response time [18:15:07] Marybelle: :) [18:15:16] I'll write up the email. I know exactly what I fucked up [18:15:21] bd808|deploy: thanks, I was just about to report "Unable to open /usr/local/apache/common-local/wikiversions.cdb." - glad to see it's under control [18:15:29] * aude breathes [18:15:54] what's going on? [18:16:08] we are all at dinner [18:16:23] fixed now [18:16:24] it's ok [18:16:25] paravoid: My fault. Should be fixed now [18:16:31] okay [18:16:35] paravoid: go back to dinner, all's ok again :) [18:16:36] scap temporarily broke everything though [18:16:36] do you need anything? [18:16:39] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 183.266663 [18:16:39] ok [18:16:44] manual page us if something happens [18:16:52] paravoid: nope, known ef up [18:16:57] paravoid: will do, enjoy! [18:17:05] ciao [18:18:17] (03PS2) 10Gergő Tisza: Add setting to show a survey for MediaViewer users on some sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124036 [18:18:56] (03CR) 10Gergő Tisza: "Updated to display feedback survey on beta enwiki." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124036 (owner: 10Gergő Tisza) [18:19:29] greg-g: I just reverted my patch to scap that caused that cascade of horribleness [18:19:36] :) [18:19:44] One the plus side, group0 is on wmf21 now [18:19:50] lol [18:19:58] literal-lol [18:20:09] * aude scared to change it back [18:20:20] "Don't. Touch. Any. Thing." [18:20:25] i suppose if bd808|deploy 's patch is reverted then ok [18:20:39] well, we still have the previous issue which it was trying to fix ;) [18:20:59] 1 step forward, 1 step back [18:21:23] So yes we are temporarily back to needing to double-scap, but I'll make a patch that doesn't melt the world after lunch [18:22:25] bd808|deploy: :) [18:23:15] wikiquote etc all looks fine, so i'm going home / eating [18:23:20] back in hour [18:23:26] k, I'll do the same [18:23:33] quite late dinner for berlin [18:23:47] so I told my wife we broke the internet. she told me facebook was working.... [18:24:18] Nemo_bis: It's never to late for food :P [18:24:41] ^ [18:28:38] hoo: well, I'd call death for starvation, pellagra etc. "too late" :P [18:29:07] Nemo_bis: :P To late as in time of the day... 
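As I read bd808's summary, the interim state is back to the known workaround: run the sync twice, so that the second pass ships the ExtensionMessages/l10n files the first pass only regenerates (bug 63659 has the details). A rough sketch with illustrative log messages, not the exact invocations used:

    scap 'group1 to 1.23wmf21 (first pass, regenerates ExtensionMessages/l10n)'
    scap 'group1 to 1.23wmf21 (second pass, ships what the first pass rebuilt)'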
[18:29:08] :D [18:30:17] hoo: http://p.defau.lt/?md_cbLJuORDNsGkhY6_NAg :P [18:30:55] at least the other errors are gone now, I guess [18:31:28] manybubbles: :( [18:31:42] * greg-g goes to lunch for real [18:32:34] hoo: yeah, i submitted a patch for hhvm to fix that other issue btw [18:32:49] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.144 [18:34:15] ori: Oh... nice that it's actually done in PHP :) [18:35:34] yeah yeah yeah, elasticsearch 1012 is being upgraded [18:37:56] hoo: which component should that be filed under? [18:39:25] ori: already done https://bugzilla.wikimedia.org/show_bug.cgi?id=63691 [18:39:39] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 639.299988 [18:39:40] oh cool, thanks! [18:42:09] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 530.733337 [18:42:20] ori: Any idea who to poke about https://gerrit.wikimedia.org/r/121709 ? [18:43:46] (03CR) 10Matanya: add interface speed check for all hosts (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/124606 (owner: 10Cmjohnson) [18:44:08] (03PS2) 10Ori.livneh: Change wgServer and wgCanonicalServer for arbcom wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121709 (owner: 10Hoo man) [18:44:53] (03CR) 10Ori.livneh: [C: 032] Change wgServer and wgCanonicalServer for arbcom wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121709 (owner: 10Hoo man) [18:45:06] !log ori updated /a/common to {{Gerrit|I4b18e4ce8}}: Change wgServer and wgCanonicalServer for arbcom wikis [18:45:11] Logged the message, Master [18:45:28] heh :) [18:45:39] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:45:50] !log ori synchronized wmf-config/InitialiseSettings.php 'I4b18e4ce8: Change wgServer and wgCanonicalServer for arbcom wikis' [18:45:55] Logged the message, Master [18:53:40] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:56:09] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:57:39] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 172.800003 [18:58:59] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:59:00] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:00] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:00] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:00] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:09] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10 [18:59:09] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:09] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:10] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:29] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [19:00:03] blhe [19:00:11] it recovered in a few seconds [19:00:16] not sure why it did that [19:07:39] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 341.200012 [19:12:00] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:00] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:00] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:00] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:10] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:11] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:11] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:11] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:13:16] thats right [19:13:18] horrible check [19:13:36] no errors in the logs associated with those warnings [19:18:49] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [19:20:55] https://en.wikipedia.org/wiki/Wikipedia:VPT#Heartbleed_bug.3F [19:23:39] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 531.166687 [19:24:29] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [19:24:49] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:50] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:50] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:59] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:59] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:59] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:59] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:59] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:59] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:25:09] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 635.799988 [19:25:11] * Jamesofur kicks icinga-wm [19:26:39] PROBLEM - DPKG on elastic1015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:28:39] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:29:38] huh: it is being fixed by ops [19:31:39] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:36:39] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:37:49] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:49] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:50] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:59] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:59] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:59] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:59] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:59] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:59] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:38:00] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:38:07] again? [19:38:09] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:38:10] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:38:10] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:38:10] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [19:38:10] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:38:29] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:38:39] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 224.199997 [19:39:39] RECOVERY - DPKG on elastic1015 is OK: All packages OK [19:40:19] oh shut up [19:40:52] I'm doing rolling restarts [19:41:47] got it: labswiki_content_1394813391 [19:41:53] that thing is configured without replicas [19:46:40] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 341.066681 [19:48:00] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:01] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:01] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:01] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:10] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:10] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:10] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:10] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:30] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:43] and, more noise! [19:48:49] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:49] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:49] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:59] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5308: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 303 [19:48:59] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5308: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 303 [19:48:59] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5308: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 303 [19:49:22] bit me labswiki! 
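The red/green flapping above is a side effect of the rolling 1.1.0 upgrade: each restarting node leaves its shards unassigned for a moment, and the labswiki index manybubbles identified carries no replicas at all, so its primaries briefly vanish and the whole cluster reports red. A hedged sketch of the checks and the replica fix; the host name and curl usage are illustrative, the index name is the one from the log:

    # cluster state, the same data the icinga check parses
    curl -s 'http://elastic1001:9200/_cluster/health?pretty'
    # confirm the problem index really has number_of_replicas: 0
    curl -s 'http://elastic1001:9200/labswiki_content_1394813391/_settings?pretty'
    # give it a replica so one restarting node can no longer take it offline
    curl -s -XPUT 'http://elastic1001:9200/labswiki_content_1394813391/_settings' -d '{"index": {"number_of_replicas": 1}}'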
[19:52:34] * bd808|LUNCH cheers manybubbles on [19:52:53] it'll spam us again in a few minutes [19:52:59] labswiki recovered a long time ago [19:53:05] it was only out for ~30 seconds each time [19:53:20] but ganglia wants all the shards on all the wikis to be recovered before it is happy [19:53:59] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:53:59] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:53:59] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:56:15] !log upgraded all elasticsearch servers except elastic1008. that is coming now. [19:56:20] Logged the message, Master [19:58:20] !log finished upgrading to Elasticsearch 1.1.0. The process went well with no issues other then some knocking out search in labs 3 times for 30 seconds a piece. And logging lots of nasty warnings to irc. I've started to the process to fix search in labs so it won't happen again. [19:58:25] Logged the message, Master [20:05:39] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 420.066681 [20:08:09] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 539.900024 [20:10:29] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [20:10:29] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [20:10:29] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [20:10:29] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [20:10:39] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:12:39] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:16:56] Does someone here know about dns issues with wmflabs-domains or related stuff that happened recently? [20:19:39] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:20:41] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 176.399994 [20:22:09] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:26:39] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 368.466675 [20:28:02] re:heartbleed, I think we'll be wanting a new corp certificate... do you guys have a favorite vendor for star certs these days? 
[20:28:21] it's almost due for a re-up anyway, so it's worth the effort [20:29:53] r [20:48:39] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 642.700012 [20:51:39] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:51:39] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:52:09] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 537.099976 [20:59:46] greg-g: don't believe you [20:59:58] http://lists.wikimedia.org/pipermail/wikitech-ambassadors/2014-April/000666.html [21:00:04] This is the work of the Beast [21:00:11] greg-g: Do you still want to try group1 to 1.23wmf21 today or have we had enough excitement? [21:00:53] * apergos reminds folks that all ops are out at a bar except for those who are about to go to sleep :-D [21:01:06] bd808: we're back to "if you run scap, run it twice" world, right? [21:01:10] apergos: :) [21:01:23] odder: which part? :) [21:01:36] greg-g: Yes, but for group1 to 1.23wmf21 we only need to run sync-wikiversions [21:01:49] right [21:02:09] the world looks sane on phase0? [21:02:11] * greg-g looks [21:02:34] greg-g: all of it - notice the number immediately preceding .html [21:02:39] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 232.46666 [21:02:48] odder: haha [21:03:39] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:03:54] this is neat: https://graphite.wikimedia.org/render/?title=HTTP%204xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.4xx,%224xxx%20resp/min%22%29%29,%22blue%22%29 [21:04:36] I think that's what ori told me yesterdayt to not worry about [21:05:09] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:05:25] bd808: if we do, we do now, so we have 2 hours before SWAT of settle bug report time. May I take your whole day? [21:06:36] greg-g: I'm yours to command. :) [21:06:39] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 269.866669 [21:06:42] http://heartbleed.com/ [21:06:48] Q&A [21:06:55] :-P [21:07:09] bd808: go forth, please [21:09:36] (03PS1) 10BryanDavis: Group1 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124744 [21:11:12] (03CR) 10BryanDavis: [C: 032] Group1 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124744 (owner: 10BryanDavis) [21:11:20] (03Merged) 10jenkins-bot: Group1 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124744 (owner: 10BryanDavis) [21:12:17] !log bd808 rebuilt wikiversions.cdb and synchronized wikiversions files: group1 to 1.23wmf21 [21:12:23] Logged the message, Master [21:12:47] greg-g: Have you guys already killed all user sessions? 
[21:12:52] Can't see a server admin log entry [21:15:44] greg-g: I did a https://commons.wikimedia.org/wiki/Commons:Village_pump#Users_are_being_forced_to_log_out [21:18:21] Thanks odder, I left a note about it on en VPT since I saw a question about the bug in general [21:18:48] Maybe I'll cross-post that to Meta too [21:19:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [21:20:14] !log bd808 Purged l10n cache for 1.23wmf18 [21:20:19] Logged the message, Master [21:21:46] !log bd808 Purged l10n cache for 1.23wmf19 [21:21:50] Logged the message, Master [21:21:54] hoo: in process [21:22:55] :) [21:23:09] hoo: it takes longer than you'd imagine, maybe :) [21:23:37] greg-g: group1 to 1.23wmf21 is {{done}} [21:23:40] greg-g: just change the cookie name? (like last time) [21:24:09] se4598: I'm defering to chris on it (not sure what his exact process is, honestly) [21:24:14] bd808|deploy: ty [21:24:53] mh, the tokens will be still valid I think, wasn't a good idea [21:25:14] se4598: Yeah I think that's why it takes a while [21:26:45] greg-g: Well given how many users we have and that we probably don't want to hammer the DBs to much, I can imagine this to take some time [21:26:52] * greg-g nods [21:28:16] csteipp: Why not run one process per shard? [21:29:24] Jamesofur: if you're keeping track of things, I alerted Commons and Meta; perhaps someone would need to alert the other big Wikipedias [21:29:35] Dunno if the message to tech-ambassadors will be enough; may be. [21:30:35] (03PS2) 10MaxSem: Put a safeguard on GeoData's usage of CirrusSearch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121874 [21:30:37] (03PS1) 10MaxSem: Enable $wgGeoDataDebug on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124747 [21:30:39] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:30:39] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 535.0 [21:30:54] (03CR) 10jenkins-bot: [V: 04-1] Enable $wgGeoDataDebug on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124747 (owner: 10MaxSem) [21:31:32] se4598: Assuming attacker has the login token, they could use the new name and again spoof the user [21:31:39] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:31:46] (03PS2) 10MaxSem: Enable $wgGeoDataDebug on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124747 [21:32:09] odder: yeah, I'll see if we can poke people, we're going to send out SM messages as well in a couple minutes [21:32:19] with a recommendation to password reset [21:33:09] SM? [21:33:22] sorry, Social Media (Twitter/Facebook/G+ etc) [21:33:42] TMA, Too Many Abbreviations [21:33:45] :) [21:33:59] yup lol [21:34:09] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 539.133362 [21:34:10] I abuse them, I even make up my own and forget that they are just in my head [21:34:23] https://twitter.com/Wikimedia/status/453646877397757953 [21:34:49] Jamesofur: EUS IAA. TA IANAL. [21:34:58] *EYS :p [21:35:42] thanks HaeB, retweeted [21:40:46] woah, new code on wikidata? [21:40:46] Jamesofur: using mass-message might be a good idea [21:41:15] aude: yep, all ok? [21:41:26] HaeB: ^ what do you think? (about MM) [21:41:48] wdyt? 
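One caveat on the libssl timeline being pieced together above: upgrading the package is only half of the Heartbleed fix, because long-running services keep the old, vulnerable library mapped in memory until they are restarted. A hedged sketch of the sort of check involved, not the exact commands ops ran; the package names are the Ubuntu ones of that era:

    # pull in the patched packages and confirm the versions
    sudo apt-get update && sudo apt-get install --only-upgrade openssl libssl1.0.0
    dpkg -l openssl libssl1.0.0
    # list processes still mapping the old, now-deleted libssl so they can be restarted
    sudo lsof -n | grep libssl | grep DEL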
[21:42:08] greg-g: itjdi [21:42:12] so we're confident? [21:42:39] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 187.866669 [21:42:53] aude: in that it won't break at 2:00 utc? yeah [21:43:06] aude: the only thing we're still not confident about is scap on thursday [21:44:19] alright [21:44:39] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:44:40] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 320.200012 [21:44:55] Jamesofur, matanya: i think for the session ending, massmessage would be overkill. regarding the password reset, it's a judgment call (how high one estimates the risk for users who don't change it) [21:45:24] HaeB: it depends on user rights as well [21:45:27] aude: The bug that caused all the 1.23wmf21 l10n issues is https://bugzilla.wikimedia.org/show_bug.cgi?id=63659 [21:46:31] are there any other major sites who notified all users? [21:46:54] not that I've seen yet, but I have a feeling some are still going through the fixing process [21:46:55] interesting [21:46:59] (to recommend a password change) [21:47:10] e.g. just got stuff from CloudBees [21:47:15] github also logged me out [21:47:37] would also be interesting to know how quickly the wikis were fixed after the news broke yesterday [21:47:40] latimes has an article about resetting your password, but that's different [21:48:09] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:48:13] last night (PT) i filed an RT ticket for the blog, which was vulnerable at the time, but at that point the wikis tested ok already [21:48:36] The wikis auto-update OpenSSL via puppet [21:49:00] hoo: well ya ;) the question is when we updated puppet ;) [21:49:24] Jamesofur: The servers do that themselves [21:49:39] per https://wikitech.wikimedia.org/wiki/Server_admin_log , the blog (holmium) was pretty late in the game [21:49:50] The timeline is all in SAL from last night [21:49:51] Yesterday I posted about that to the internal ops list, but forgot to poke a root to do an apt-cache clean and force puppet run [21:50:08] "04:03 Tim: upgrading libssl on ssl1001,ssl1002,ssl1003,ssl1004,ssl1005,ssl1006,ssl1007,ssl1008,ssl1009,ssl3001.esams.wikimedia.org,ssl3002.esams.wikimedia.org,ssl3003.esams.wikimedia.org" - is that the entry for the wikis? [21:50:37] Mostly yes [21:53:39] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:53:39] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:53:59] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [21:54:55] (03PS1) 10Jean-Frédéric: Add Musées de la Haute-Saône to wgCopyUploadsDomains [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124754 [22:01:11] greg-g, poking you because I'm not sure who's on point for the i18n / scap stuff -- but I recall getting pinged a couple of days ago (on a centralnotice keyword) saying that the i18n update was failing due to exceptions on CN (and others). I'm wondering if CN's fail was due to being on a deployment branch that did not have the JSON updates (until just now).
[22:01:46] shouldn't be [22:01:57] there's backward compat in l10nupdate [22:02:17] mwalker: see https://bugzilla.wikimedia.org/show_bug.cgi?id=63659 for all the gory details [22:02:33] * mwalker puts on tyvek suit [22:02:38] :) [22:30:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [22:33:06] greg-g: Could I push a small centralauth update soon? [22:33:44] yeah, now is fine, 30 minutes until swat [22:34:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:36:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:37:04] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 34.533333 [22:37:34] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 260.733337 [22:38:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:40:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:42:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:44:14] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 625.166687 [22:44:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:45:36] marktraceur: I see in deploy-calendar that you have a changeset which specifically activates MediaViewer on en-beta. You(r pc) may get hit by https://bugzilla.wikimedia.org/show_bug.cgi?id=63709 [22:46:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:47:22] se4598: Is there a fix? [22:47:50] I'm guessing it's an SSL problem [22:48:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:48:43] se4598: Replied on bug [22:49:09] (03PS1) 10BryanDavis: Create symlink for compile-wikiversions in /usr/local/bin [operations/puppet] - 10https://gerrit.wikimedia.org/r/124763 [22:49:23] marktraceur: We in #wikimedia-labs don't have one. And that's not about https but DNS resolution, so I don't understand what you mean by https.
[22:49:35] Oh, hm [22:49:37] Never mind, sorry [22:50:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:52:04] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [22:52:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:52:34] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [22:52:56] marktraceur: currently the fix is.....: it may work if you try multiple times or wait some time (minutes, hours) ;P [22:54:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:56:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:56:54] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [22:58:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:58:41] greg-g: csteipp: got both core changes ready [22:58:53] I mean changes to the deploy branch [22:59:52] hoo: Cool.. one sec and I'll merge and deploy it [23:00:12] I can also jump in, am on tin still anyway [23:00:14] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Tue Apr 8 23:00:04 UTC 2014 [23:02:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 11:00:04 PM UTC [23:05:24] stupid puppet [23:06:33] * Jasper_Deng always wondered what Puppet does anyways [23:07:09] pulls the strings ;) [23:07:20] (or, probably better 'is the strings' ) [23:07:26] Jasper_Deng: Playing with the servers :D [23:08:20] Technically, the sysadmins are a puppet in the WMFs plans, right? :p [23:08:37] !log csteipp synchronized php-1.23wmf21/extensions/CentralAuth/maintenance 'Push maintenance script for token reset' [23:08:39] or we're all just puppets in their plans, duh [23:08:41] Logged the message, Master [23:09:04] Jamesofur: You're the past of the puppets :p [23:09:09] *master of the [23:09:57] greg-g: CentralAuth updates are out, so swat can go ahead if they were waiting on me [23:10:01] ;) the user with said name may dislike me claiming the title [23:10:40] mwalker: ori ebernhardson ^ [23:10:46] also, what the heck, oit_display ? [23:10:54] :) [23:11:10] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [23:11:10] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [23:11:10] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [23:11:10] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [23:11:51] oh [23:11:54] yes; it's 4! [23:13:25] SUL doesn't work? [23:14:02] csteipp, ^ [23:14:03] Danny_B: We are logging out all users [23:14:10] see http://lists.wikimedia.org/pipermail/wikitech-ambassadors/2014-April/000666.html [23:14:32] csteipp, warn ppl with a site notice? [23:14:35] hoo: you know that this isn't merged? https://gerrit.wikimedia.org/r/124756 [23:15:00] se4598: not this important at the very moments [23:15:03] * moment [23:15:23] Danny_B: SUL should work... You should just be logged out. 
If you can't login, let me know [23:15:53] csteipp: will we get logged out each time you hit a wiki we've visited recently? or just the once per user in theory [23:16:15] If you're a global user, just once (right now as I logout all the centralauth users) [23:16:32] If you have multiple ununified local accounts, each will get logged out [23:16:51] csteipp: i have to log in on every single project although i have a central username [23:16:54] [23:17:30] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 135.300003 [23:17:55] marktraceur, MaxSem I'm going to +2 and confirm https://gerrit.wikimedia.org/r/#/c/124036/2 , https://gerrit.wikimedia.org/r/#/c/121874/2 , https://gerrit.wikimedia.org/r/#/c/124747/ [23:18:30] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 173.666672 [23:18:32] it would be wonderful if you all could +1 that so that I know you've looked and said this is good to me [23:18:35] 'kay [23:18:53] csteipp: +1 to notice ppl with central notice [23:18:57] (03CR) 10MarkTraceur: [C: 031] Add setting to show a survey for MediaViewer users on some sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124036 (owner: 10Gergő Tisza) [23:19:00] +1 ourselves? [23:19:16] doesn't sound very assuring:) [23:19:21] nah; you're probably OK MaxSem :p [23:19:27] but I don't know who Gergo is [23:19:44] but mark was sponsoring the patch [23:19:53] he's tgr :P [23:20:00] (03CR) 10Mwalker: [C: 032] Put a safeguard on GeoData's usage of CirrusSearch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121874 (owner: 10MaxSem) [23:20:08] (03CR) 10Mwalker: [C: 032] Enable $wgGeoDataDebug on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124747 (owner: 10MaxSem) [23:20:21] (03CR) 10Mwalker: [C: 032] Add setting to show a survey for MediaViewer users on some sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124036 (owner: 10Gergő Tisza) [23:20:27] greg-g: missed your ping; still need me? [23:21:00] don't think so [23:23:33] interesting; sync-common doesn't log to IRC? [23:23:34] Danny_B: That doesn't sound right.. At the risk of sounding cliche, can you log out and log back in, and see if that helps? [23:23:55] marktraceur, MaxSem can you tell if your configuration stuff got pushed? [23:24:15] mwalker, mine's noop on prod [23:24:25] Ditto, but will check on beta [23:24:26] checking if prod still works... [23:24:35] also; marktraceur I presume you want https://gerrit.wikimedia.org/r/#/c/124510/ to go to wmf20 and wmf21? [23:24:38] Danny_B, hoo : we're still thinking about massmessage instead (more for the password changing advice) [23:24:43] mwalker: Sorry, only 21 [23:25:24] mwalker: Confirmed, beta has the configuration we wanted [23:26:36] mwalker, lgtm [23:27:40] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [23:28:34] csteipp: log out from any currently logged project, log back to it and then try if sul works on other? [23:29:14] Danny_B: Yeah [23:29:22] csteipp: ok, sec [23:29:38] Hmm... Danny_B What's your wiki username? [23:30:51] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Tue Apr 8 23:30:43 UTC 2014 [23:30:55] csteipp: Danny B.
[23:31:17] csteipp: seems to work now, will let you know if i'll spot another disconnection [23:31:27] Danny_B: Cool, thanks [23:32:03] yw [23:32:15] thanks for care [23:33:30] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [23:34:31] !log mwalker synchronized php-1.23wmf21/extensions/MultimediaViewer/ 'Updating MultimediaViewer for {{gerrit|124510}}' [23:34:35] Logged the message, Master [23:35:16] marktraceur, ^ if you would test what you need to test for that [23:35:26] I'm not seeing any fatals or exceptions which is good :) [23:35:31] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [23:35:32] mwalker: Works [23:35:32] Ta [23:35:39] cool; greg-g SWAT done [23:58:30] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 179.666672 [23:59:04] "Firefox can't find the server at en.wikipedia.beta.wmflabs.org." [23:59:08] why? [23:59:14] (03CR) 10Aaron Schulz: [C: 031] Create symlink for compile-wikiversions in /usr/local/bin [operations/puppet] - 10https://gerrit.wikimedia.org/r/124763 (owner: 10BryanDavis) [23:59:31] jackmcbarn: https://bugzilla.wikimedia.org/show_bug.cgi?id=63709 probably
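The "Firefox can't find the server at en.wikipedia.beta.wmflabs.org" failures here, like the earlier warning to marktraceur, trace back to intermittent DNS resolution on labs (bug 63709); the only workaround offered in channel was to retry or wait. Below is a small stand-alone Python sketch of that retry-until-it-resolves check; the hostname comes from the log, while the retry count and delay are arbitrary, and this is not part of any deployed tooling.

```python
"""Minimal sketch of the "retry or wait" workaround for the intermittent
en.wikipedia.beta.wmflabs.org lookup failures noted above (bug 63709)."""
import socket
import time


def resolves(hostname, attempts=5, delay=2.0):
    """Return True as soon as the name resolves, retrying a few times."""
    for attempt in range(1, attempts + 1):
        try:
            # getaddrinfo raises socket.gaierror when the lookup fails.
            addr = socket.getaddrinfo(hostname, 443)[0][4][0]
            print('attempt %d: %s -> %s' % (attempt, hostname, addr))
            return True
        except socket.gaierror as err:
            print('attempt %d: lookup failed (%s)' % (attempt, err))
            time.sleep(delay)
    return False


if __name__ == '__main__':
    resolves('en.wikipedia.beta.wmflabs.org')
```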