2014-04-08 00:00:35
Sure thing
2014-04-08 00:00:42
I think that's the SWAT all done
2014-04-08 00:00:44
Sorry for the slowness everyone
2014-04-08 00:01:16
RoanKattouw: If it makes my mailbox less full of debate about font faces...
2014-04-08 00:01:36
is sure that muting those threads will continue
2014-04-08 00:02:28
PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 7.426 second response time
2014-04-08 00:08:52
looks for a python reviewer for: https://gerrit.wikimedia.org/r/#/c/124500/
2014-04-08 00:09:10
I think that will fix the 1.23wmf21 l10n problems
2014-04-08 00:09:30
Because … mystery action at a distance!
2014-04-08 00:12:27
RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 219118 bytes in 8.455 second response time
2014-04-08 00:24:56
!log catrope synchronized php-1.23wmf20/extensions/VisualEditor 'it helps if you run git submodule update first'
2014-04-08 00:25:02
Logged the message, Master
2014-04-08 00:25:05
!log catrope synchronized php-1.23wmf21/extensions/VisualEditor 'it helps if you run git submodule update first'
2014-04-08 00:25:11
Logged the message, Master
2014-04-08 00:27:34
('PS1') 'BryanDavis': test2wiki to 1.23wmf21 [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124505'
2014-04-08 00:28:54
RoanKattouw_away: Are you {{done}} done now? I'd like to run some more scap tests
2014-04-08 00:38:27
('Abandoned') 'BryanDavis': l10nupdate: Add temporary debugging captures [operations/puppet] - 'https://gerrit.wikimedia.org/r/124467' (owner: 'BryanDavis')
2014-04-08 00:38:40
('PS2') 'BryanDavis': test2wiki to 1.23wmf21 [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124505'
2014-04-08 00:39:44
('Abandoned') 'BryanDavis': test2wiki to 1.23wmf21 [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124505' (owner: 'BryanDavis')
2014-04-08 00:41:34
('PS1') 'BryanDavis': Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124506'
2014-04-08 00:43:55
greg-g: Are you still on a bus? I'd like to scap group0 to 1.23wmf21 to test my band aid fix. I would be on the hook to revert immediately following if ExtensionMessages looks like it will cause a problem for l10nupdate.
2014-04-08 00:44:03
bd808: Yes, sorry
2014-04-08 00:44:43
RoanKattouw_away: :) thanks. I watched your idle time on tin climb until I felt safe.
2014-04-08 00:45:28
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
2014-04-08 00:46:57
decides that greg-g won't have changed his mind in the last 1:30 and proceeds
2014-04-08 00:48:38
('CR') 'BryanDavis': [C: '2'] "Approving to test band aid fix for ExtensionMessages generation problem. Will revert if ExtensionMessages doesn't look right after scap." [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124506' (owner: 'BryanDavis')
2014-04-08 00:48:45
('Merged') 'jenkins-bot': Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124506' (owner: 'BryanDavis')
2014-04-08 00:50:53
!log bd808 Started scap: group0 to 1.23wmf21 (testing python change for mwversionsinuse)
2014-04-08 00:50:58
Logged the message, Master
2014-04-08 00:53:12
sees l10n cache updating yet again for 1.23wmf21 and loses all confidence in his "fix"
2014-04-08 00:53:51
!log bd808 scap aborted: group0 to 1.23wmf21 (testing python change for mwversionsinuse) (duration: 02m 57s)
2014-04-08 00:53:56
Logged the message, Master
2014-04-08 00:54:30
!log bd808 Started scap: group0 to 1.23wmf21 (testing python change for mwversionsinuse) (again)
2014-04-08 00:54:35
Logged the message, Master
2014-04-08 00:54:56
!log bd808 scap aborted: group0 to 1.23wmf21 (testing python change for mwversionsinuse) (again) (duration: 00m 25s)
2014-04-08 00:55:01
Logged the message, Master
2014-04-08 00:55:12
('PS1') 'BryanDavis': Revert "Group0 wikis to 1.23wmf21" [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124507'
2014-04-08 00:55:34
('CR') 'BryanDavis': [C: '2'] Revert "Group0 wikis to 1.23wmf21" [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124507' (owner: 'BryanDavis')
2014-04-08 00:55:42
('Merged') 'jenkins-bot': Revert "Group0 wikis to 1.23wmf21" [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124507' (owner: 'BryanDavis')
2014-04-08 00:56:51
!log bd808 Started scap: revert group0 to 1.23wmf21 (testwiki still on 1.23wmf21)
2014-04-08 00:56:55
Logged the message, Master
2014-04-08 01:01:33
('PS3') 'Ori.livneh': Add EventLogging Kafka writer plug-in [operations/puppet] - 'https://gerrit.wikimedia.org/r/85337'
2014-04-08 01:06:45
!log bd808 Finished scap: revert group0 to 1.23wmf21 (testwiki still on 1.23wmf21) (duration: 09m 54s)
2014-04-08 01:06:53
Logged the message, Master
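The revert scap above was logged as started at 00:56:51 and finished at 01:06:45, which agrees with the reported duration. A minimal Python sketch of that arithmetic, assuming this log's timestamp format; the helper name `format_duration` is hypothetical and is not part of scap:

```python
from datetime import datetime

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # timestamp format used in this log


def format_duration(start: str, end: str) -> str:
    """Return the elapsed time between two log timestamps as 'MMm SSs'."""
    elapsed = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    minutes, seconds = divmod(int(elapsed.total_seconds()), 60)
    return f"{minutes:02d}m {seconds:02d}s"


# Start and finish times of the revert scap, copied from the log lines above.
print(format_duration("2014-04-08 00:56:51", "2014-04-08 01:06:45"))
```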
2014-04-08 01:22:25
ori: working now
2014-04-08 02:07:07
PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 02:07:07
PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 02:07:08
PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 02:07:08
PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
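The date in these alerts is the Unix epoch (timestamp 0). That suggests the check had no recorded run time for these hosts rather than a real run in 1970, though the log does not confirm this. A small Python sketch showing that timestamp 0 renders as the date in the alert; the format string is a guess at the check's output format and assumes the default C locale:

```python
from datetime import datetime, timezone

# Timestamp 0 is the Unix epoch; a missing "last run" value often shows up as this.
epoch_zero = datetime.fromtimestamp(0, tz=timezone.utc)

# Assumed format, chosen to mirror the alert text above.
rendered = epoch_zero.strftime("%a %d %b %Y %I:%M:%S %p UTC")
print(rendered)
```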
2014-04-08 02:15:58
!log LocalisationUpdate completed (1.23wmf20) at 2014-04-08 02:15:58+00:00
2014-04-08 02:16:06
Logged the message, Master
2014-04-08 02:34:57
!log LocalisationUpdate completed (1.23wmf21) at 2014-04-08 02:34:56+00:00
2014-04-08 02:35:02
Logged the message, Master
2014-04-08 02:45:57
PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2014-04-08 02:48:37
PROBLEM - MySQL InnoDB on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2014-04-08 02:48:57
RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
2014-04-08 02:49:06
springle_: db1047 has been very sad lately
2014-04-08 02:49:27
RECOVERY - MySQL InnoDB on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds
2014-04-08 03:00:17
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx
2014-04-08 03:08:06
With 1.23wmf21 not getting deployed to mediawiki.org last thursday, does that mean the deployment schedule for 1.23wmf22 will be off by a week?
2014-04-08 03:11:07
!log LocalisationUpdate ResourceLoader cache refresh completed at Tue Apr 8 03:11:04 UTC 2014 (duration 11m 3s)
2014-04-08 03:11:11
Logged the message, Master
2014-04-08 03:31:47
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
2014-04-08 03:38:12
greg-g: still around?
2014-04-08 03:53:36
greg-g: check your mail
2014-04-08 04:03:35
!log upgrading libssl on ssl1001,ssl1002,ssl1003,ssl1004,ssl1005,ssl1006,ssl1007,ssl1008,ssl1009,ssl3001.esams.wikimedia.org,ssl3002.esams.wikimedia.org,ssl3003.esams.wikimedia.org
2014-04-08 04:03:41
Logged the message, Master
2014-04-08 04:03:57
TimStarling: is this the heartbleed.com thing?
2014-04-08 04:04:07
didn't know we used openssl
2014-04-08 04:15:22
Jasper_Deng: yes
2014-04-08 04:15:47
!log also upgraded libssl on cp4001-4019. Restarted nginx on these servers and also the previous list.
2014-04-08 04:15:51
Logged the message, Master
2014-04-08 04:37:40
!log upgrading libssl on virt1000
2014-04-08 04:37:44
Logged the message, Master
2014-04-08 04:38:21
!log upgrading libssl on virt0
2014-04-08 04:38:26
Logged the message, Master
2014-04-08 04:41:03
!log upgraded libssl on zirconium.wikimedia.org,neon.wikimedia.org,netmon1001.wikimedia.org,iodine.wikimedia.org,ytterbium.wikimedia.org,gerrit.wikimedia.org,virt1000.wikimedia.org,labs-ns1.wikimedia.org,stat1001.wikimedia.org
2014-04-08 04:43:13
!log restarted apache on the above list, failed on labs-ns1, virt1000, ytterbium
2014-04-08 04:43:18
Logged the message, Master
2014-04-08 04:43:47
TimStarling: I'll poke ytterbium
2014-04-08 04:44:00
Keep moving on to other boxes if you need.
2014-04-08 04:44:35
Seems up now.
2014-04-08 04:45:04
yeah, labs-ns1 and virt1000 are actually the same server
2014-04-08 04:45:19
and apache is running there with stime after the upgrade
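A process start time later than the upgrade matters because a long-running process keeps the old library mapped until it restarts. A minimal Python sketch of another way to check this, assuming Linux's `/proc/<pid>/maps` format, where a replaced library file is listed with a `(deleted)` suffix; the function name and sample lines are hypothetical:

```python
def maps_show_stale_libssl(maps_text: str) -> bool:
    """Return True if a /proc/<pid>/maps dump lists a deleted libssl mapping."""
    for line in maps_text.splitlines():
        if "libssl" in line and line.rstrip().endswith("(deleted)"):
            return True
    return False


# Illustrative sample lines, not taken from any real host.
stale = "7f1c2a000000-7f1c2a055000 r-xp 00000000 08:01 1234 /usr/lib/libssl.so.1.0.0 (deleted)"
fresh = "7f1c2a000000-7f1c2a055000 r-xp 00000000 08:01 5678 /usr/lib/libssl.so.1.0.0"
print(maps_show_stale_libssl(stale), maps_show_stale_libssl(fresh))
```

On a real host one would read `/proc/<pid>/maps` for each process of interest and pass its contents to the function.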
2014-04-08 04:46:30
!log on dataset1001: upgraded libssl and restarted lighttpd
2014-04-08 04:46:34
Logged the message, Master
2014-04-08 04:53:47
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
2014-04-08 05:08:07
PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 05:08:07
PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 05:08:07
PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 05:08:07
PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 05:25:10
('PS1') 'Aude': Enable Wikibase on Wikiquote [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124516'
2014-04-08 05:26:24
('CR') 'Aude': [C: '-2'] "requires sites and site_identifiers tables to be added and populated on wikiquote" [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124516' (owner: 'Aude')
2014-04-08 05:31:00
!log upgraded openssl on cp10* and cp30* servers as well
2014-04-08 05:31:06
Logged the message, Master
2014-04-08 05:39:29
!log restarted apache on fenari magnesium yterrbium antimony
2014-04-08 05:39:33
Logged the message, Master
2014-04-08 05:39:51
with some mispellings but people will get the point
2014-04-08 05:47:01
!log shot many old apache processes running as stats user from 2013, on stat1001 (restarting apache runs it as www-data user)
2014-04-08 05:47:06
Logged the message, Master
2014-04-08 06:34:37
('PS3') 'Matanya': dataset: fix module path [operations/puppet] - 'https://gerrit.wikimedia.org/r/119212'
2014-04-08 06:37:44
('PS3') 'Matanya': exim: fix scoping [operations/puppet] - 'https://gerrit.wikimedia.org/r/119496'
2014-04-08 06:43:48
springle: did you hear from otto regarding https://gerrit.wikimedia.org/r/#/c/122406/ ?
2014-04-08 06:45:27
matanya: no
2014-04-08 06:45:41
:/ i need to chase him down, thanks
2014-04-08 06:46:04
not sure otto knows about it? i emailed analytics lists directly
2014-04-08 06:46:29
so far the answer is: probably fine to decom db67, but let's wait for everyone to chime in
2014-04-08 06:46:43
i'll bump it this week
2014-04-08 06:47:05
thank you
2014-04-08 07:30:44
('PS1') 'Faidon Liambotis': base: add debian-goodies [operations/puppet] - 'https://gerrit.wikimedia.org/r/124524'
2014-04-08 07:47:07
!log restarted nginx on cp1044 and cp1043
2014-04-08 07:47:12
Logged the message, Master
2014-04-08 07:53:07
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx
2014-04-08 07:53:07
('CR') 'coren': [C: '2'] base: add debian-goodies [operations/puppet] - 'https://gerrit.wikimedia.org/r/124524' (owner: 'Faidon Liambotis')
2014-04-08 08:02:57
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx
2014-04-08 08:09:07
PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 08:09:07
PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 08:09:07
PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 08:09:07
PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 08:11:47
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx
2014-04-08 08:15:17
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
2014-04-08 08:36:30
ori: still working?
2014-04-08 09:03:47
PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2014-04-08 09:04:07
hashar: help with setting up zuul for the apps? https://gerrit.wikimedia.org/r/#/c/124539/
2014-04-08 09:08:37
PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2014-04-08 09:08:47
RECOVERY - RAID on labstore3 is OK: OK: optimal, 12 logical, 12 physical
2014-04-08 09:08:57
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
2014-04-08 09:11:47
PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2014-04-08 09:16:55
('PS1') 'RobH': Replacing the unified certificate [operations/puppet] - 'https://gerrit.wikimedia.org/r/124542'
2014-04-08 09:24:34
('CR') 'RobH': [C: '2'] Replacing the unified certificate [operations/puppet] - 'https://gerrit.wikimedia.org/r/124542' (owner: 'RobH')
2014-04-08 09:29:47
RECOVERY - RAID on labstore3 is OK: OK: optimal, 12 logical, 12 physical
2014-04-08 09:33:47
PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2014-04-08 09:36:37
RECOVERY - RAID on labstore3 is OK: OK: optimal, 12 logical, 12 physical
2014-04-08 09:37:37
RECOVERY - Disk space on labstore3 is OK: DISK OK
2014-04-08 09:39:19
YuviPanda: hello
2014-04-08 09:39:25
hashar: hello!
2014-04-08 09:40:00
PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2014-04-08 09:40:37
PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
2014-04-08 09:40:57
('PS1') 'Andrew Bogott': Add eth1 checks to nova compute hosts. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124560'
2014-04-08 09:44:12
and we lost YuviPanda
2014-04-08 09:45:10
Noooo not our panda. :(
2014-04-08 09:46:25
panda \O/
2014-04-08 09:46:28
PROBLEM - SSH on labstore3 is CRITICAL: Connection refused
2014-04-08 09:46:28
PROBLEM - DPKG on labstore3 is CRITICAL: Connection refused by host
2014-04-08 09:46:47
PROBLEM - puppet disabled on labstore3 is CRITICAL: Connection refused by host
2014-04-08 09:47:00
mutante: https://gerrit.wikimedia.org/r/#/c/124560/
2014-04-08 09:47:43
ACKNOWLEDGEMENT - DPKG on labstore3 is CRITICAL: Connection refused by host daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44.
2014-04-08 09:47:44
ACKNOWLEDGEMENT - Disk space on labstore3 is CRITICAL: Connection refused by host daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44.
2014-04-08 09:47:44
ACKNOWLEDGEMENT - RAID on labstore3 is CRITICAL: Connection refused by host daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44.
2014-04-08 09:47:44
ACKNOWLEDGEMENT - SSH on labstore3 is CRITICAL: Connection refused daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44.
2014-04-08 09:47:44
ACKNOWLEDGEMENT - puppet disabled on labstore3 is CRITICAL: Connection refused by host daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44.
2014-04-08 09:49:57
so nice to see all ops in a European time zone :)
2014-04-08 09:50:37
PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100%
2014-04-08 09:57:12
('CR') 'Dzahn': [C: '-1'] Add eth1 checks to nova compute hosts. ('3' comments) [operations/puppet] - 'https://gerrit.wikimedia.org/r/124560' (owner: 'Andrew Bogott')
2014-04-08 10:00:49
ori: what is udpprofile::collector, and can i move it from db1014 to... somewhere else?
2014-04-08 10:02:47
springle: oh, wow. is there any indication that continues to see activity? mediawiki's profiler class can be configured to write to a database, but i didn't know anyone was using it in production. is it not ancient?
2014-04-08 10:04:56
mutante, cmjohnson: https://wikitech.wikimedia.org/wiki/Help:Git_rebase#Don.27t_panic
2014-04-08 10:05:21
andrewbogott: 42
2014-04-08 10:05:57
springle: it can go away
2014-04-08 10:06:34
springle: it was added in this commit: <https://gerrit.wikimedia.org/r/#/c/83953/>. the message reads: "testing graphite 0.910 on db1014".
2014-04-08 10:07:04
yeah, asher stole db1014 for graphite
2014-04-08 10:07:12
trying to steal it back :)
2014-04-08 10:07:20
ori: thanks
2014-04-08 10:07:46
springle: it's not in any way implicated in our current graphite setup, which exists solely on tungsten.eqiad.wmnet (and labs)
2014-04-08 10:08:13
('PS2') 'Andrew Bogott': Add eth1 checks to nova compute hosts. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124560'
2014-04-08 10:08:18
mutante: ^
2014-04-08 10:09:24
('PS1') 'Cmjohnson': adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 'https://gerrit.wikimedia.org/r/124572'
2014-04-08 10:11:07
('CR') 'jenkins-bot': [V: '-1'] adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 'https://gerrit.wikimedia.org/r/124572' (owner: 'Cmjohnson')
2014-04-08 10:12:49
('CR') 'Dzahn': [C: ''] Add eth1 checks to nova compute hosts. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124560' (owner: 'Andrew Bogott')
2014-04-08 10:15:34
!log update & reboot samarium
2014-04-08 10:15:38
Logged the message, Master
2014-04-08 10:15:48
('CR') 'Andrew Bogott': [C: '2'] Add eth1 checks to nova compute hosts. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124560' (owner: 'Andrew Bogott')
2014-04-08 10:16:26
('PS1') 'Springle': Remove unused db1014 block. db1014 was renamed tungsten rt5871. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124575'
2014-04-08 10:18:19
('CR') 'Springle': [C: '2'] Remove unused db1014 block. db1014 was renamed tungsten rt5871. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124575' (owner: 'Springle')
2014-04-08 10:21:04
!log update & reboot barium
2014-04-08 10:21:09
Logged the message, Master
2014-04-08 10:23:09
('PS1') 'Dzahn': add nrpe to base [operations/puppet] - 'https://gerrit.wikimedia.org/r/124576'
2014-04-08 10:24:10
('CR') 'jenkins-bot': [V: '-1'] add nrpe to base [operations/puppet] - 'https://gerrit.wikimedia.org/r/124576' (owner: 'Dzahn')
2014-04-08 11:09:28
PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 11:09:28
PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 11:09:28
PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 11:09:28
PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 11:32:05
('PS20') 'Matanya': etherpad: convert into a module [operations/puppet] - 'https://gerrit.wikimedia.org/r/107567'
2014-04-08 11:32:32
akosiaris: in a meeting, or can this ^ be handled?
2014-04-08 11:39:18
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
2014-04-08 12:32:58
('PS2') 'Dzahn': add nrpe to base [operations/puppet] - 'https://gerrit.wikimedia.org/r/124576'
2014-04-08 12:39:13
matanya: in ops meeting
2014-04-08 12:39:27
and please tell me you did not resubmit from your local repo
2014-04-08 12:39:48
rebase* sorry
2014-04-08 12:39:50
('PS2') 'Cmjohnson': adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 'https://gerrit.wikimedia.org/r/124572'
2014-04-08 12:40:26
('CR') 'Andrew Bogott': [V: ''] "This looks good -- we'll see if it makes new alarms go off :)" [operations/puppet] - 'https://gerrit.wikimedia.org/r/124576' (owner: 'Dzahn')
2014-04-08 12:46:38
('PS3') 'Cmjohnson': adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 'https://gerrit.wikimedia.org/r/124572'
2014-04-08 12:48:28
PROBLEM - DPKG on strontium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
2014-04-08 12:49:28
RECOVERY - DPKG on strontium is OK: All packages OK
2014-04-08 12:49:35
('CR') 'Matanya': [C: ''] add nrpe to base [operations/puppet] - 'https://gerrit.wikimedia.org/r/124576' (owner: 'Dzahn')
2014-04-08 12:50:21
paravoid: can you review please https://gerrit.wikimedia.org/r/124572
2014-04-08 12:50:38
mutante: https://rt.wikimedia.org/Ticket/Display.html?id=5064
2014-04-08 12:51:29
('CR') 'Dzahn': [C: ''] "yep, if we want to monitor this on everything, then standard-packages sounds good to me" [operations/puppet] - 'https://gerrit.wikimedia.org/r/124572' (owner: 'Cmjohnson')
2014-04-08 12:52:38
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
2014-04-08 12:53:10
('CR') 'Alexandros Kosiaris': [C: '2'] adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 'https://gerrit.wikimedia.org/r/124572' (owner: 'Cmjohnson')
2014-04-08 12:55:34
can anyone around update Elasticsearch in apt?
2014-04-08 12:55:55
and ack nagios errors (so they don't spam to irc) for a couple hours?
2014-04-08 12:56:39
!log reedy updated /a/common to {{Gerrit|Id15ddc665}}: Revert "Group0 wikis to 1.23wmf21"
2014-04-08 12:56:44
Logged the message, Master
2014-04-08 12:57:23
('PS1') 'Reedy': Non wikipedias to 1.23wmf21 [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124591'
2014-04-08 12:59:03
pokes qchris_away and ^d
2014-04-08 13:01:42
Any idea why https://gerrit.wikimedia.org/changes/?q=status:merged+age%3A0d&o=DETAILED_ACCOUNTS&n=100 doesn't work?
2014-04-08 13:02:00
('CR') 'Cmjohnson': [C: '2'] adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 'https://gerrit.wikimedia.org/r/124572' (owner: 'Cmjohnson')
2014-04-08 13:07:41
('PS3') 'Dzahn': add nrpe to base [operations/puppet] - 'https://gerrit.wikimedia.org/r/124576'
2014-04-08 13:12:48
('PS4') 'Dzahn': add nrpe to base [operations/puppet] - 'https://gerrit.wikimedia.org/r/124576'
2014-04-08 13:15:42
test akosiaris
2014-04-08 13:15:43
apergos: :-)
2014-04-08 13:16:54
already pinged
2014-04-08 13:17:06
('PS1') 'coren': Tool Labs: forcibly upgrade libssl [operations/puppet] - 'https://gerrit.wikimedia.org/r/124594'
2014-04-08 13:19:25
('CR') 'Dzahn': [C: '2'] "RT #80 :)" [operations/puppet] - 'https://gerrit.wikimedia.org/r/124576' (owner: 'Dzahn')
2014-04-08 13:21:58
ori: If you're here, please let me know :)
2014-04-08 13:26:57
_joe_: Couple of hours from now
2014-04-08 13:27:05
Though, he is around early sometimes
2014-04-08 13:27:31
Reedy: thanks
2014-04-08 13:30:38
('CR') 'RobH': [C: ''] Tool Labs: forcibly upgrade libssl [operations/puppet] - 'https://gerrit.wikimedia.org/r/124594' (owner: 'coren')
2014-04-08 13:31:20
ottomata: welcome!
2014-04-08 13:31:34
can you help me get started today?
2014-04-08 13:31:42
('CR') 'coren': [C: '2'] Tool Labs: forcibly upgrade libssl [operations/puppet] - 'https://gerrit.wikimedia.org/r/124594' (owner: 'coren')
2014-04-08 13:31:50
manybubbles: We have an extension for that
2014-04-08 13:31:57
Reedy: thanks!
2014-04-08 13:32:01
I totally used it a while ago
2014-04-08 13:32:27
Reedy: Because we're using /r/ to mark the reverse proxy ...
2014-04-08 13:32:33
Reedy: https://gerrit.wikimedia.org/r/changes/?q=status:merged+age%3A0d&o=DETAILED_ACCOUNTS&n=100
2014-04-08 13:32:37
Reedy: ^ should work
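As noted above, the REST endpoint on this host lives under the `/r/` prefix. A hedged Python sketch that builds an equivalent query URL and strips the `)]}'` guard line that Gerrit's REST API prepends to JSON responses; the helper names are hypothetical, and no request is actually sent:

```python
import json
from urllib.parse import urlencode

GERRIT_BASE = "https://gerrit.wikimedia.org/r"  # note the /r prefix


def changes_url(query: str, option: str, limit: int) -> str:
    """Build a /changes/ query URL; urlencode percent-encodes ':' and turns spaces into '+'."""
    params = urlencode([("q", query), ("o", option), ("n", limit)])
    return f"{GERRIT_BASE}/changes/?{params}"


def parse_gerrit_json(body: str):
    """Drop Gerrit's anti-XSSI prefix, if present, and decode the remaining JSON."""
    prefix = ")]}'"
    if body.startswith(prefix):
        body = body[len(prefix):]
    return json.loads(body)


print(changes_url("status:merged age:0d", "DETAILED_ACCOUNTS", 100))
```

The generated URL percent-encodes both colons, which should be equivalent to the hand-written URL above.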
2014-04-08 13:32:47
Aha, sweet!
2014-04-08 13:33:43
('PS1') 'RobH': replace blog.wikimedia.org certificate [operations/puppet] - 'https://gerrit.wikimedia.org/r/124595'
2014-04-08 13:35:07
ottomata: I need Elasticsearch 1.1.0 shoved into apt
2014-04-08 13:35:37
('PS2') 'RobH': replace blog.wikimedia.org certificate [operations/puppet] - 'https://gerrit.wikimedia.org/r/124595'
2014-04-08 13:36:15
qchris: thanks
2014-04-08 13:37:04
PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds
2014-04-08 13:37:33
!log restarting gitblit
2014-04-08 13:37:33
('CR') 'RobH': [C: '2'] replace blog.wikimedia.org certificate [operations/puppet] - 'https://gerrit.wikimedia.org/r/124595' (owner: 'RobH')
2014-04-08 13:37:37
Logged the message, Master
2014-04-08 13:39:00
!log replacing the blog cert, if holmium crashes I didn't do it correctly.
2014-04-08 13:39:01
('PS1') 'Faidon Liambotis': Revert "Giving Nik shell access to analytics1004 to do some elasticsearch load testing" [operations/puppet] - 'https://gerrit.wikimedia.org/r/124597'
2014-04-08 13:39:03
manybubbles: ok!
2014-04-08 13:39:03
Logged the message, RobH
2014-04-08 13:39:04
RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 305803 bytes in 9.337 second response time
2014-04-08 13:39:28
!log update & reboot tellurium
2014-04-08 13:39:33
Logged the message, Master
2014-04-08 13:39:47
('CR') 'jenkins-bot': [V: '-1'] Revert "Giving Nik shell access to analytics1004 to do some elasticsearch load testing" [operations/puppet] - 'https://gerrit.wikimedia.org/r/124597' (owner: 'Faidon Liambotis')
2014-04-08 13:41:14
PROBLEM - Host tellurium is DOWN: PING CRITICAL - Packet loss = 100%
2014-04-08 13:42:38
('PS2') 'Faidon Liambotis': Revert "Giving Nik shell access to analytics1004 to do some elasticsearch load testing" [operations/puppet] - 'https://gerrit.wikimedia.org/r/124597'
2014-04-08 13:43:27
('CR') 'Faidon Liambotis': [C: '2' V: '2'] Revert "Giving Nik shell access to analytics1004 to do some elasticsearch load testing" [operations/puppet] - 'https://gerrit.wikimedia.org/r/124597' (owner: 'Faidon Liambotis')
2014-04-08 13:44:28
('CR') 'Manybubbles': "Is there a better place to run this?" [operations/puppet] - 'https://gerrit.wikimedia.org/r/124597' (owner: 'Faidon Liambotis')
2014-04-08 13:45:14
RECOVERY - Host tellurium is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms
2014-04-08 13:46:13
!log upgraded libssl on holmium
2014-04-08 13:46:18
Logged the message, RobH
2014-04-08 13:48:49
ottomata: kafka upgrade doesn't work on an1004
2014-04-08 13:49:41
paravoid, analytics1004 (and analytics1003) were kafka test brokers, and were never productionized or puppetized
2014-04-08 13:49:50
i thought I had removed kafka from analytics1004, actually
2014-04-08 13:50:38
ottomata: can you install git fat on tin?
2014-04-08 13:50:42
I cannot
2014-04-08 13:50:46
hm, sure, why do you need git-fat there?
2014-04-08 13:50:55
to git deploy
2014-04-08 13:50:58
to Elasticsearch
2014-04-08 13:51:07
the plugins
2014-04-08 13:51:14
or is there another server
2014-04-08 13:51:17
you don't need git-fat on tin though
2014-04-08 13:51:23
the git-fat commands are run on deploy hosts
2014-04-08 13:51:27
on the targets
2014-04-08 13:51:46
huh, I'm used to running it on the server to check the jars got there. I'll just do it without and see
2014-04-08 13:53:21
ottomata: that worked as you said it would
2014-04-08 13:53:35
!log synced first Elasticsearch plugin to production Elasticsearch servers
2014-04-08 13:53:39
Logged the message, Master
2014-04-08 13:54:01
!log they'll pick it up during the rolling restart today to upgrade to 1.1.0
2014-04-08 13:54:05
Logged the message, Master
2014-04-08 13:54:18
manybubbles: i was going to start reinstalling an elastic search server today
2014-04-08 13:54:33
ottomata: not a _great_ day for it
2014-04-08 13:54:37
because I'm upgrading to 1.1.0
2014-04-08 13:54:45
that is on the deployment calendar and everything
2014-04-08 13:55:05
maybe tomorrow?
2014-04-08 14:04:07
ottomata: please ping me when you get a chance to update apt
2014-04-08 14:04:35
i was about to do it, but am in standup now
2014-04-08 14:04:41
q for akosiaris, if you are around
2014-04-08 14:04:54
I should change VerifyRelease, right?
2014-04-08 14:04:54
PROBLEM - DPKG on labstore4 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
2014-04-08 14:04:59
i'm trying to find the right thing to change it to
2014-04-08 14:05:14
i downloaded 1.1's Release.gpg and am doing what the reprepro man page says to do
2014-04-08 14:05:17
but am not sure
2014-04-08 14:05:23
the output doesn't look like what you have
2014-04-08 14:05:54
RECOVERY - DPKG on labstore4 is OK: All packages OK
2014-04-08 14:09:44
PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 14:09:44
PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 14:09:44
PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 14:09:44
PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 14:11:17
('PS1') 'Andrew Bogott': Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124601'
2014-04-08 14:18:13
('PS2') 'Andrew Bogott': Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124601'
2014-04-08 14:19:21
('PS1') 'Ottomata': reprepro/updates - upgrading elasticsearch to 1.1 [operations/puppet] - 'https://gerrit.wikimedia.org/r/124603'
2014-04-08 14:20:08
('CR') 'Ottomata': [C: '2' V: '2'] reprepro/updates - upgrading elasticsearch to 1.1 [operations/puppet] - 'https://gerrit.wikimedia.org/r/124603' (owner: 'Ottomata')
2014-04-08 14:23:54
PROBLEM - HTTPS on ssl1002 is CRITICAL: Connection refused
2014-04-08 14:24:06
manybubbles: http://apt.wikimedia.org/wikimedia/pool/main/e/elasticsearch/
2014-04-08 14:24:09
look ok?
2014-04-08 14:28:54
RECOVERY - HTTPS on ssl1002 is OK: OK - Certificate will expire on 01/20/2016 12:00.
2014-04-08 14:29:45
ottomata: looks good - let me try elastic1001
2014-04-08 14:30:35
('PS3') 'Andrew Bogott': Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124601'
2014-04-08 14:30:57
mutante, ^ pls?
2014-04-08 14:31:37
!log upgrading elastic1001
2014-04-08 14:31:42
Logged the message, Master
2014-04-08 14:32:38
!log woops, just restarted elastic1002. silly me
2014-04-08 14:32:42
Logged the message, Master
2014-04-08 14:32:46
!log no harm done, just lost time
2014-04-08 14:32:50
Logged the message, Master
2014-04-08 14:33:53
ottomata: can you make nagios not bother us about Elasticsearch warning over the next few hours?
2014-04-08 14:33:56
I'm paying attention
2014-04-08 14:34:25
uh hm
2014-04-08 14:35:43
i think so, how long manybubbles
2014-04-08 14:35:45
4 hours?
2014-04-08 14:36:14
PROBLEM - NTP peers on linne is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown
2014-04-08 14:38:14
RECOVERY - NTP peers on linne is OK: NTP OK: Offset 0.016747 secs
2014-04-08 14:44:43
andrewbogott: https://gerrit.wikimedia.org/r/#/c/77332/7/modules/base/manifests/monitoring/host.pp
2014-04-08 14:44:51
('PS4') 'Andrew Bogott': Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124601'
2014-04-08 14:54:18
('PS5') 'Andrew Bogott': Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124601'
2014-04-08 14:54:59
('PS3') 'Cmjohnson': add interface speed check for all hosts [operations/puppet] - 'https://gerrit.wikimedia.org/r/124606'
2014-04-08 15:01:42
mutante: can you review https://gerrit.wikimedia.org/r/124606
2014-04-08 15:02:06
('CR') 'Alexandros Kosiaris': [C: '-1'] "Great idea. Minor stuff here and there like making it parameterizable but looks nice." ('6' comments) [operations/puppet] - 'https://gerrit.wikimedia.org/r/124606' (owner: 'Cmjohnson')
2014-04-08 15:03:10
manybubbles: i think I just scheduled downtime in icinga for elastic search for the next ~4 hours
2014-04-08 15:03:19
never done that before, so not sure what it will do
2014-04-08 15:03:47
('PS1') 'Rush': module to manage new python-diamond package [operations/puppet] - 'https://gerrit.wikimedia.org/r/124608'
2014-04-08 15:04:54
ottomata: it's cool!
2014-04-08 15:07:45
('CR') 'Ottomata': module to manage new python-diamond package ('5' comments) [operations/puppet] - 'https://gerrit.wikimedia.org/r/124608' (owner: 'Rush')
2014-04-08 15:08:18
('CR') 'Dzahn': [C: ''] Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124601' (owner: 'Andrew Bogott')
2014-04-08 15:12:34
('PS2') 'Rush': module to manage new python-diamond package [operations/puppet] - 'https://gerrit.wikimedia.org/r/124608'
2014-04-08 15:13:35
('CR') 'jenkins-bot': [V: '-1'] module to manage new python-diamond package [operations/puppet] - 'https://gerrit.wikimedia.org/r/124608' (owner: 'Rush')
2014-04-08 15:15:36
('PS3') 'Rush': module to manage new python-diamond package [operations/puppet] - 'https://gerrit.wikimedia.org/r/124608'
2014-04-08 15:16:34
PROBLEM - Host virt1000 is DOWN: CRITICAL - Host Unreachable (
2014-04-08 15:16:42
!log all ssl servers in eqiad have been updated with new cert and restarted
2014-04-08 15:16:51
!log rolling updates on ssl3001-3003 presently
2014-04-08 15:17:10
('PS1') 'Dzahn': enable base monitoring for ALL hosts [operations/puppet] - 'https://gerrit.wikimedia.org/r/124609'
2014-04-08 15:17:24
PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Host Unreachable (
2014-04-08 15:18:04
RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms
2014-04-08 15:19:03
('CR') 'Andrew Bogott': [C: '2'] Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 'https://gerrit.wikimedia.org/r/124601' (owner: 'Andrew Bogott')
2014-04-08 15:19:04
RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms
2014-04-08 15:19:07
apergos: https://gerrit.wikimedia.org/r/#/c/124609/1
2014-04-08 15:19:46
ugly, eh.. since i have to change all those lines because of indentation :p
2014-04-08 15:22:25
('CR') 'ArielGlenn': [C: ''] enable base monitoring for ALL hosts [operations/puppet] - 'https://gerrit.wikimedia.org/r/124609' (owner: 'Dzahn')
2014-04-08 15:22:39
('CR') 'Dzahn': [C: '2'] enable base monitoring for ALL hosts [operations/puppet] - 'https://gerrit.wikimedia.org/r/124609' (owner: 'Dzahn')
2014-04-08 15:23:46
('CR') 'Ottomata': module to manage new python-diamond package ('2' comments) [operations/puppet] - 'https://gerrit.wikimedia.org/r/124608' (owner: 'Rush')
2014-04-08 15:27:31
PROBLEM - HTTPS on cp4009 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:41
PROBLEM - HTTPS on ssl3003 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:41
PROBLEM - HTTPS on ssl1006 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:41
PROBLEM - HTTPS on cp4014 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:51
PROBLEM - HTTPS on ssl1004 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:51
PROBLEM - HTTPS on ssl1005 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:51
PROBLEM - HTTPS on cp4008 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:51
PROBLEM - HTTPS on cp4004 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:51
PROBLEM - HTTPS on cp4015 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:52
PROBLEM - HTTPS on cp4001 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:52
PROBLEM - HTTPS on cp4017 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:53
PROBLEM - HTTPS on amssq47 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:53
PROBLEM - HTTPS on ssl1002 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:54
PROBLEM - HTTPS on ssl1001 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:54
PROBLEM - HTTPS on cp4005 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:27:55
PROBLEM - HTTPS on cp4012 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:01
PROBLEM - HTTPS on cp4016 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:01
PROBLEM - HTTPS on sodium is CRITICAL: SSL_CERT CRITICAL lists.wikimedia.org: invalid CN (lists.wikimedia.org does not match *.wikimedia.org)
2014-04-08 15:28:11
PROBLEM - HTTPS on ssl1007 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:11
PROBLEM - HTTPS on iodine is CRITICAL: SSL_CERT CRITICAL ticket.wikimedia.org: invalid CN (ticket.wikimedia.org does not match *.wikimedia.org)
2014-04-08 15:28:11
PROBLEM - HTTPS on ssl3002 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:11
PROBLEM - HTTPS on ssl3001 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:11
PROBLEM - HTTPS on cp4018 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:12
PROBLEM - HTTPS on ssl1008 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:12
PROBLEM - HTTPS on ssl1009 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:13
PROBLEM - HTTPS on ssl1003 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:13
PROBLEM - HTTPS on cp4013 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:14
PROBLEM - HTTPS on cp4003 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:14
PROBLEM - HTTPS on cp4007 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:15
PROBLEM - HTTPS on cp4011 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:15
PROBLEM - HTTPS on cp4010 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:21
PROBLEM - HTTPS on cp4020 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:21
PROBLEM - HTTPS on cp4006 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:31
PROBLEM - HTTPS on cp4002 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:28:31
PROBLEM - HTTPS on cp4019 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org)
2014-04-08 15:30:02
holy fun :)
2014-04-08 15:32:08
aude: getting to your email :)
2014-04-08 15:32:25
want to see if it's ok to do today
2014-04-08 15:32:35
anytime works for us, i suppose
2014-04-08 15:34:45
aude: tl;dr of email: yep, looks good
2014-04-08 15:35:07
we were smart to put i18n stuff a while ago :)
2014-04-08 15:35:42
PROBLEM - RAID on holmium is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded)
2014-04-08 15:35:52
PROBLEM - DPKG on fenari is CRITICAL: NRPE: Command check_dpkg not defined
2014-04-08 15:36:01
the https failures are me mucking with monitoring, nothing to worry about
2014-04-08 15:36:02
PROBLEM - Disk space on fenari is CRITICAL: NRPE: Command check_disk_space not defined
2014-04-08 15:36:12
PROBLEM - RAID on fenari is CRITICAL: NRPE: Command check_raid not defined
2014-04-08 15:36:22
PROBLEM - puppet disabled on fenari is CRITICAL: NRPE: Command check_puppet_disabled not defined
2014-04-08 15:36:57
mutante: fenari is not happy :-D
2014-04-08 15:38:21
hashar: thanks, that's cause we just added more monitoring
2014-04-08 15:38:33
RT #80 :)
2014-04-08 15:38:48
mutante: yeah I noticed your puppet change. Guess fenari is missing some bits
2014-04-08 15:41:12
hashar: wasn't running nagios-nrpe-server
2014-04-08 15:41:52
greg-g: re: SSL certs, andrewbogott is on that one
2014-04-08 15:41:57
ops monitoring sprint over here
2014-04-08 15:42:11
mutante: ahh, good to know who's on point for that, thanks
2014-04-08 15:42:23
wasn't sure if it'd be a opsen party thing or not
2014-04-08 15:42:44
it is. ops in Athens
2014-04-08 15:43:05
that check is new, in that it checks for validity of cert, not just expiry
2014-04-08 15:43:18
and wikimedia vs. wikipedia thing
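The difference between an expiry-only check and a validity check can be sketched roughly as below. This is a minimal illustration, not the actual check_ssl_cert logic: the function name `check_cert`, its parameters, and the message wording are hypothetical, and the CN comparison is a plain string match rather than real hostname verification.

```python
from datetime import datetime


def check_cert(cert_cn, not_after, expected_cn, now):
    """Return (status, message) for a cert described by its CN and expiry.

    Hypothetical helper: compares the CN as a plain string and then
    looks at the expiry date. Real tools do considerably more.
    """
    # Validity part: the certificate must be for the name we expect.
    if cert_cn != expected_cn:
        return ("CRITICAL",
                f"invalid CN ({cert_cn} does not match {expected_cn})")
    # Expiry part: the only thing an expiry-only check would look at.
    if not_after <= now:
        return ("CRITICAL",
                f"certificate for {cert_cn} expired on {not_after:%Y-%m-%d}")
    days_left = (not_after.date() - now.date()).days
    return ("OK", f"certificate for {cert_cn} expires in {days_left} days")


now = datetime(2014, 4, 8, 15, 43)
expiry = datetime(2016, 1, 20, 12, 0)

# Monitoring expects *.wikimedia.org but the host serves *.wikipedia.org.
print(check_cert("*.wikipedia.org", expiry, "*.wikimedia.org", now))
# Same certificate once the expected name is corrected.
print(check_cert("*.wikipedia.org", expiry, "*.wikipedia.org", now))
```

Under this sketch a certificate that is far from expiry still goes CRITICAL when the expected name is wrong, which is the behaviour the alerts above show.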
2014-04-08 15:44:52
PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 438.266663
2014-04-08 15:45:02
('PS1') 'Andrew Bogott': When checking unified certs, check for *.wikipedia.org [operations/puppet] - 'https://gerrit.wikimedia.org/r/124616'
2014-04-08 15:45:32
PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 434.533325
2014-04-08 15:46:21
('CR') 'Andrew Bogott': [C: '2'] When checking unified certs, check for *.wikipedia.org [operations/puppet] - 'https://gerrit.wikimedia.org/r/124616' (owner: 'Andrew Bogott')
2014-04-08 15:46:22
PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 12:45:20 PM UTC
2014-04-08 15:53:10
RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0
2014-04-08 15:53:17
hashar: ^ :)
2014-04-08 15:53:20
RECOVERY - puppet disabled on fenari is OK: OK
2014-04-08 15:53:40
RECOVERY - Disk space on fenari is OK: DISK OK
2014-04-08 15:53:41
RT #80 ftw
2014-04-08 15:53:48
With any luck there'll be another flood of OKs in a minute...
2014-04-08 15:53:50
RECOVERY - DPKG on fenari is OK: All packages OK
2014-04-08 15:54:10
PROBLEM - puppet disabled on bast1001 is CRITICAL: NRPE: Command check_puppet_disabled not defined
2014-04-08 15:54:10
PROBLEM - Disk space on cp3003 is CRITICAL: NRPE: Command check_disk_space not defined
2014-04-08 15:54:10
PROBLEM - Disk space on dobson is CRITICAL: Connection refused by host
2014-04-08 15:54:10
PROBLEM - DPKG on pdf2 is CRITICAL: Connection refused by host
2014-04-08 15:54:20
PROBLEM - puppet disabled on iron is CRITICAL: NRPE: Command check_puppet_disabled not defined
2014-04-08 15:54:20
PROBLEM - RAID on dobson is CRITICAL: Connection refused by host
2014-04-08 15:54:20
PROBLEM - RAID on cp3003 is CRITICAL: NRPE: Command check_raid not defined
2014-04-08 15:54:20
PROBLEM - Disk space on pdf2 is CRITICAL: Connection refused by host
2014-04-08 15:54:30
PROBLEM - puppet disabled on dobson is CRITICAL: Connection refused by host
2014-04-08 15:54:30
PROBLEM - RAID on pdf2 is CRITICAL: Connection refused by host
2014-04-08 15:54:30
PROBLEM - DPKG on iodine is CRITICAL: NRPE: Command check_dpkg not defined
2014-04-08 15:54:30
PROBLEM - puppet disabled on pdf2 is CRITICAL: Connection refused by host
2014-04-08 15:54:40
PROBLEM - Disk space on iodine is CRITICAL: NRPE: Command check_disk_space not defined
2014-04-08 15:54:40
PROBLEM - puppet disabled on cp3003 is CRITICAL: NRPE: Command check_puppet_disabled not defined
2014-04-08 15:54:40
PROBLEM - DPKG on pdf3 is CRITICAL: Connection refused by host
2014-04-08 15:54:48
that's not what I meant
2014-04-08 15:54:50
PROBLEM - RAID on iodine is CRITICAL: NRPE: Command check_raid not defined
2014-04-08 15:54:50
PROBLEM - Disk space on pdf3 is CRITICAL: Connection refused by host
2014-04-08 15:54:50
PROBLEM - DPKG on tridge is CRITICAL: NRPE: Command check_dpkg not defined
2014-04-08 15:54:50
PROBLEM - DPKG on bast1001 is CRITICAL: NRPE: Command check_dpkg not defined
2014-04-08 15:54:51
PROBLEM - puppet disabled on iodine is CRITICAL: NRPE: Command check_puppet_disabled not defined
2014-04-08 15:54:51
PROBLEM - RAID on pdf3 is CRITICAL: Connection refused by host
2014-04-08 15:54:51
PROBLEM - Disk space on tridge is CRITICAL: NRPE: Command check_disk_space not defined
2014-04-08 15:55:00
PROBLEM - Disk space on bast1001 is CRITICAL: NRPE: Command check_disk_space not defined
2014-04-08 15:55:00
PROBLEM - puppet disabled on pdf3 is CRITICAL: Connection refused by host
2014-04-08 15:55:10
PROBLEM - Disk space on iron is CRITICAL: NRPE: Command check_disk_space not defined
2014-04-08 15:55:10
PROBLEM - RAID on bast1001 is CRITICAL: NRPE: Command check_raid not defined
2014-04-08 15:55:10
PROBLEM - DPKG on dobson is CRITICAL: Connection refused by host
2014-04-08 15:55:10
PROBLEM - DPKG on cp3003 is CRITICAL: NRPE: Command check_dpkg not defined
2014-04-08 15:55:10
PROBLEM - DPKG on virt1000 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
2014-04-08 15:55:10
PROBLEM - puppet disabled on tridge is CRITICAL: NRPE: Command check_puppet_disabled not defined
2014-04-08 15:55:41
ahhh, so today is going to be a worthless -operations channel day, more than normal, due to the sprint? :)
2014-04-08 15:56:03
We're about to all go to dinner though.
2014-04-08 15:56:09
So things should quiet down shortly.
2014-04-08 15:56:10
PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 12:55:50 PM UTC
2014-04-08 15:56:19
But the channel will still be useless if you want to talk to ops :)
2014-04-08 15:56:50
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx
2014-04-08 15:57:03
will start nagios-nrpe-server on those
2014-04-08 15:57:10
PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 12:56:15 PM UTC
2014-04-08 15:58:42
RECOVERY - HTTPS on ssl3001 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 15:58:42
RECOVERY - HTTPS on ssl1006 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 15:58:52
RECOVERY - HTTPS on ssl1007 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 15:58:52
RECOVERY - HTTPS on ssl1002 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 15:59:32
RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 15:59:52
RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 16:00:04
back in 5 min or so
2014-04-08 16:00:06
('Abandoned') 'Physikerwelt': WIP: Enable orthogonal MathJax config [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/110240' (owner: 'Physikerwelt')
2014-04-08 16:00:42
PROBLEM - DPKG on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
2014-04-08 16:00:42
PROBLEM - Disk space on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
2014-04-08 16:00:52
PROBLEM - RAID on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
2014-04-08 16:01:02
PROBLEM - puppet disabled on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
2014-04-08 16:02:22
PROBLEM - Puppet freshness on ms6 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:02:03 PM UTC
2014-04-08 16:08:22
PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:07:31 PM UTC
2014-04-08 16:09:27
PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:09:07 PM UTC
2014-04-08 16:09:27
PROBLEM - Puppet freshness on lvs4003 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:08:32 PM UTC
2014-04-08 16:09:27
RECOVERY - HTTPS on cp4020 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:09:27
RECOVERY - HTTPS on cp4006 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:09:27
RECOVERY - HTTPS on cp4013 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:09:37
RECOVERY - HTTPS on cp4009 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:09:37
RECOVERY - HTTPS on cp4010 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:09:37
RECOVERY - HTTPS on ssl3003 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:09:47
RECOVERY - HTTPS on ssl3002 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:09:47
RECOVERY - HTTPS on ssl1004 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:09:56
ottomata: ping
2014-04-08 16:09:57
RECOVERY - HTTPS on cp4012 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:10:07
RECOVERY - HTTPS on cp4016 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:10:07
RECOVERY - HTTPS on ssl1008 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:10:07
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx
2014-04-08 16:10:07
RECOVERY - HTTPS on cp4018 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:10:17
RECOVERY - HTTPS on ssl1009 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:11:23
ottomata: ping ping
2014-04-08 16:12:47
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx
2014-04-08 16:12:49
pong pong
2014-04-08 16:13:14
what's with stat1's puppet?
2014-04-08 16:13:18
why is it admin disabled?
2014-04-08 16:13:47
because it is going to be decomed very soon
2014-04-08 16:13:56
and i wanted to make puppet changes that would apply to stat1003 but not mess with what was on stat1
2014-04-08 16:14:05
and I didn't want to re-write a bunch of statistics.pp stuff :/
2014-04-08 16:14:07
ori: are you around? seems like graphite is *not* working
2014-04-08 16:14:24
ottomata: that's bad
2014-04-08 16:14:27
PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:13:54 PM UTC
2014-04-08 16:14:35
paravoid: even if we are going to decom it soon?
2014-04-08 16:14:36
ottomata: can you remove the "include statistics*" stuff and enable it again?
2014-04-08 16:14:42
yeah probably can
2014-04-08 16:14:47
because it's messing with monitoring and all that
2014-04-08 16:15:06
ah i see it
2014-04-08 16:15:20
paravoid, what is the difference between the 3 numbers in each severity category in icinga?
2014-04-08 16:15:25
ottomata: disabling puppet for more than a few hours max is almost always a really bad idea
2014-04-08 16:15:31
mark, ok, noted.
2014-04-08 16:16:27
PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:16:04 PM UTC
2014-04-08 16:16:27
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx
2014-04-08 16:17:27
PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:16:39 PM UTC
2014-04-08 16:18:10
mark, can you help with the current network ACL problems?
2014-04-08 16:18:22
sorry, what's that?
2014-04-08 16:18:25
analytics nodes can't talk to apt
2014-04-08 16:18:27
PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:17:50 PM UTC
2014-04-08 16:18:30
nor statsd.eqiad.wmnet
2014-04-08 16:18:37
I added to the bottom of that ticket
2014-04-08 16:18:59
i think vanadium was having the same trouble, is it on the vlan too?
2014-04-08 16:19:27
PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:19:10 PM UTC
2014-04-08 16:19:31
still working on wikiquote
2014-04-08 16:19:35
we can look at getting rid of those ACLs perhaps
2014-04-08 16:19:41
but we'll need to discuss what you're doing with firewalling
2014-04-08 16:20:18
('PS1') 'Ottomata': Disabling statistics roles on stat1 [operations/puppet] - 'https://gerrit.wikimedia.org/r/124621'
2014-04-08 16:20:18
the fingerprint of the wikis' SSL cert apparently changed, but it is not a newly issued cert: it has the same dates as the previous one that I saved. Is it okay that the fingerprint changed?
2014-04-08 16:20:34
mark, yeah, hm, not sure, i kind of like them
2014-04-08 16:20:35
se4598: yes
2014-04-08 16:20:45
especially since anyone with hadoop access can launch whatever mapreduce jobs they want
2014-04-08 16:21:37
('CR') 'Ottomata': [C: '2' V: '2'] Disabling statistics roles on stat1 [operations/puppet] - 'https://gerrit.wikimedia.org/r/124621' (owner: 'Ottomata')
2014-04-08 16:21:37
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx
2014-04-08 16:21:48
that's weird
2014-04-08 16:21:59
checking on that 5xx thing in a sec
2014-04-08 16:22:05
that's surely my fault...
2014-04-08 16:22:27
PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:21:21 PM UTC
2014-04-08 16:22:27
PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:21:26 PM UTC
2014-04-08 16:22:27
PROBLEM - Puppet freshness on lvs4002 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:22:07 PM UTC
2014-04-08 16:22:53
hmm, graphite down?
2014-04-08 16:23:04
ottomata: statsd access for analytics seems already there
2014-04-08 16:23:07
maybe that 5xx thing is not my fault!
2014-04-08 16:23:26
yeah, mark, i think we already had these set up too
2014-04-08 16:23:27
PROBLEM - Puppet freshness on virt2 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:22:28 PM UTC
2014-04-08 16:23:37
RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Tue Apr 8 16:23:30 UTC 2014
2014-04-08 16:23:43
but it seems that they aren't working right now, starting yesterday when I tried
2014-04-08 16:24:02
('PS1') 'Hashar': beta: reenable fatalmonitor script on eqiad [operations/puppet] - 'https://gerrit.wikimedia.org/r/124624'
2014-04-08 16:24:13
and carbon is in there already too
2014-04-08 16:24:15
mark, unless pings just aren't allowed and i'm checking wrong?
2014-04-08 16:24:24
pings may not be allowed no
2014-04-08 16:24:27
ori and I both had trouble running apt-get update because we couldn't talk to carbon
2014-04-08 16:24:31
check again?
2014-04-08 16:24:35
yeah checking
2014-04-08 16:24:48
and i was trying to run sqstat on analytics1003
2014-04-08 16:24:52
so we can decom emery
2014-04-08 16:24:59
but it couldn't talk to statsd
2014-04-08 16:25:44
yeah totally working now
2014-04-08 16:26:00
ottomata: graphite is borked
2014-04-08 16:26:04
i think faidon did it earlier
2014-04-08 16:26:05
('CR') 'Hashar': "puppet is broken on deployment-bastion.eqiad.wmflabs, can't deploy the change right now :-/" [operations/puppet] - 'https://gerrit.wikimedia.org/r/124624' (owner: 'Hashar')
2014-04-08 16:26:21
oh, fixed the acl problem?
2014-04-08 16:26:33
maybe something else was just not working, and I assumed because I couldn't ping it was an ACL thing?
2014-04-08 16:26:55
ping is not a good way to test that
2014-04-08 16:27:10
yeah, i just saw the packets being filtered from ping
2014-04-08 16:27:11
we allow specific protocols/ports, ping uses different ones
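Since the ACLs allow specific protocols and ports, a more telling test than ping is to try the service port itself. A minimal sketch for TCP services, assuming a hypothetical helper name `tcp_port_open`; it does not cover UDP services, where there is no handshake to observe:

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established.

    Hypothetical helper for illustration. A False result means the
    connection was refused, filtered, timed out, or the name did not
    resolve; it does not say which.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refusal, timeout and name resolution failures.
        return False
```

For example, `tcp_port_open("apt.example.org", 80)` (placeholder host name) would exercise the same path an HTTP-based apt-get update uses, whereas a failed ping only shows that ICMP is being filtered.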
2014-04-08 16:27:30
yeah, just figured if i couldn't at least ping then probably other stuff was blocked too, but ja
2014-04-08 16:27:57
but yeah, ori couldn't use apt on vanadium either, so dunno...
2014-04-08 16:28:10
and sqstat couldn't talk to tungsten, so hm
2014-04-08 16:28:12
but ok!
2014-04-08 16:28:22
we're going for dinner in a bit
2014-04-08 16:28:53
so sqstat is trying to talk to tungsten on 2003
2014-04-08 16:28:56
!log Jenkins: killed jenkins-slave java process on gallium and repooled gallium slave. It was no more registered in Zuul :-/
2014-04-08 16:28:57
RECOVERY - puppet disabled on iron is OK: OK
2014-04-08 16:28:57
is that open?
2014-04-08 16:29:01
Logged the message, Master
2014-04-08 16:29:07
RECOVERY - Disk space on iron is OK: DISK OK
2014-04-08 16:29:09
can't seem to reach it from an03
2014-04-08 16:29:34
ganglia seems upset
2014-04-08 16:29:40
protocol udp;
2014-04-08 16:29:40
destination-port 8125;
2014-04-08 16:29:45
tables added
2014-04-08 16:29:51
so port 2003 isn't
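The reasoning here (8125/udp is listed, 2003 is not, so traffic to 2003 is dropped) can be modelled as a lookup in an allow list. This is a deliberately simplified sketch and not router configuration: the names `is_allowed`, `before` and `after` are hypothetical, and treating port 2003 as UDP is an assumption made for the example.

```python
def is_allowed(protocol, port, rules):
    """Return True if (protocol, port) is explicitly listed in rules.

    Toy model of a default-deny filter: anything not listed is dropped.
    """
    return (protocol, port) in rules


# Only the term quoted in the channel: protocol udp, destination-port 8125.
before = {("udp", 8125)}
# The same list once port 2003 has been added (protocol assumed to be udp).
after = before | {("udp", 2003)}

print(is_allowed("udp", 8125, before))   # listed, so it passes
print(is_allowed("udp", 2003, before))   # not listed yet
print(is_allowed("udp", 2003, after))    # passes after the rule is added
print(is_allowed("icmp", None, after))   # ping matches neither rule
```

In this model a ping never matches a port rule, so it says nothing about whether 8125 or 2003 is reachable.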
2014-04-08 16:29:54
ah ok
2014-04-08 16:30:03
that's why then, could you add?
2014-04-08 16:30:40
i'm going to see if reqstats gets flaky when we move it to analytics1003
2014-04-08 16:30:51
it was either flaky because erbium is busy
2014-04-08 16:30:57
or because the multicast firehose is just too lossy
2014-04-08 16:31:37
!log added sites and site_identifiers core tables on wikiquote
2014-04-08 16:31:41
Logged the message, Master
2014-04-08 16:32:22
2003 should work now
2014-04-08 16:33:36
RECOVERY - DPKG on iodine is OK: All packages OK
2014-04-08 16:33:36
RECOVERY - Disk space on iodine is OK: DISK OK
2014-04-08 16:33:36
RECOVERY - puppet disabled on cp3003 is OK: OK
2014-04-08 16:33:39
ah just noticed it is udp, mark, will that work still?
2014-04-08 16:33:46
RECOVERY - HTTPS on cp4014 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:33:46
RECOVERY - RAID on cp3003 is OK: OK: optimal, 2 logical, 2 physical
2014-04-08 16:33:46
RECOVERY - RAID on iodine is OK: OK: no disks configured for RAID
2014-04-08 16:33:46
RECOVERY - HTTPS on ssl1005 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:33:46
RECOVERY - HTTPS on cp4003 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:33:51
ok cool
2014-04-08 16:33:53
ok go eat
2014-04-08 16:33:55
thank you!
2014-04-08 16:33:56
RECOVERY - DPKG on bast1001 is OK: All packages OK
2014-04-08 16:33:56
RECOVERY - puppet disabled on iodine is OK: OK
2014-04-08 16:33:56
RECOVERY - HTTPS on cp4002 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:33:56
RECOVERY - HTTPS on amssq47 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:33:56
RECOVERY - HTTPS on cp4004 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:33:57
RECOVERY - HTTPS on cp4001 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:33:57
RECOVERY - HTTPS on cp4017 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:33:58
RECOVERY - HTTPS on cp4015 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:33:58
RECOVERY - HTTPS on cp4008 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:33:59
RECOVERY - HTTPS on ssl1001 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:33:59
RECOVERY - HTTPS on cp4005 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:34:00
RECOVERY - Disk space on bast1001 is OK: DISK OK
2014-04-08 16:34:00
RECOVERY - HTTPS on cp4019 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:34:06
RECOVERY - RAID on bast1001 is OK: OK: no RAID installed
2014-04-08 16:34:06
RECOVERY - DPKG on cp3003 is OK: All packages OK
2014-04-08 16:34:06
RECOVERY - HTTPS on ssl1003 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:34:06
RECOVERY - HTTPS on cp4007 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:34:16
RECOVERY - puppet disabled on bast1001 is OK: OK
2014-04-08 16:34:16
RECOVERY - Disk space on cp3003 is OK: DISK OK
2014-04-08 16:34:16
RECOVERY - HTTPS on cp4011 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days)
2014-04-08 16:35:36
PROBLEM - Puppet freshness on lvs4004 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:35:09 PM UTC
2014-04-08 16:35:46
PROBLEM - HTTPS on cp1044 is CRITICAL: SSL_CERT CRITICAL *.wikimedia.org: invalid CN (*.wikimedia.org does not match *.wikipedia.org)
2014-04-08 16:35:56
PROBLEM - HTTPS on cp1043 is CRITICAL: SSL_CERT CRITICAL *.wikimedia.org: invalid CN (*.wikimedia.org does not match *.wikipedia.org)
2014-04-08 16:36:48
('PS1') 'Ottomata': Putting sqstat back on analytics1003 [operations/puppet] - 'https://gerrit.wikimedia.org/r/124630'
2014-04-08 16:37:16
('CR') 'Ottomata': [C: '2' V: '2'] Putting sqstat back on analytics1003 [operations/puppet] - 'https://gerrit.wikimedia.org/r/124630' (owner: 'Ottomata')
2014-04-08 16:38:30
('PS1') 'Springle': invalid MariaDB variable name: user_stat [operations/puppet] - 'https://gerrit.wikimedia.org/r/124632'
2014-04-08 16:40:40
('CR') 'Springle': [C: '2'] invalid MariaDB variable name: user_stat [operations/puppet] - 'https://gerrit.wikimedia.org/r/124632' (owner: 'Springle')
2014-04-08 16:46:50
('PS1') 'RobH': replace misc-web-lb cert [operations/puppet] - 'https://gerrit.wikimedia.org/r/124634'
2014-04-08 16:48:11
('CR') 'RobH': [C: '2' V: '2'] replace misc-web-lb cert [operations/puppet] - 'https://gerrit.wikimedia.org/r/124634' (owner: 'RobH')
2014-04-08 16:49:09
sorry, being slow... populating sites table
2014-04-08 16:49:20
('PS1') 'Alexandros Kosiaris': Removing ethtool package from other places [operations/puppet] - 'https://gerrit.wikimedia.org/r/124637'
2014-04-08 16:49:22
suppose no hurry
2014-04-08 16:50:08
('CR') 'Dzahn': [C: ''] Removing ethtool package from other places [operations/puppet] - 'https://gerrit.wikimedia.org/r/124637' (owner: 'Alexandros Kosiaris')
2014-04-08 16:52:03
('CR') 'Dzahn': [C: '2'] "now included in base" [operations/puppet] - 'https://gerrit.wikimedia.org/r/124637' (owner: 'Alexandros Kosiaris')
2014-04-08 16:53:08
('CR') 'Cmcmahon': [C: ''] "Thanks for putting this back." [operations/puppet] - 'https://gerrit.wikimedia.org/r/124624' (owner: 'Hashar')
2014-04-08 16:53:36
RECOVERY - Puppet freshness on virt2 is OK: puppet ran at Tue Apr 8 16:53:29 UTC 2014
2014-04-08 16:53:46
RECOVERY - Puppet freshness on dataset1001 is OK: puppet ran at Tue Apr 8 16:53:39 UTC 2014
2014-04-08 16:55:06
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
2014-04-08 16:56:36
RECOVERY - Puppet freshness on amslvs2 is OK: puppet ran at Tue Apr 8 16:56:30 UTC 2014
2014-04-08 16:56:46
RECOVERY - Puppet freshness on lvs1003 is OK: puppet ran at Tue Apr 8 16:56:45 UTC 2014
2014-04-08 16:59:04
waiting for jenkins
2014-04-08 17:01:46
RECOVERY - Puppet freshness on ms6 is OK: puppet ran at Tue Apr 8 17:01:37 UTC 2014
2014-04-08 17:01:48
('PS2') 'Manybubbles': Turn on experimental highlighting in beta [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124003'
2014-04-08 17:03:06
!log aude synchronized php-1.23wmf20/extensions/Wikidata 'Update Wikidata build, to allow populating sites table on wikiquote'
2014-04-08 17:03:10
Logged the message, Master
2014-04-08 17:05:20
RECOVERY - Puppet freshness on lvs4004 is OK: puppet ran at Tue Apr 8 17:05:14 UTC 2014
2014-04-08 17:05:30
PROBLEM - RAID on dataset1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded)
2014-04-08 17:06:40
PROBLEM - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection refused
2014-04-08 17:07:40
RECOVERY - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 226 bytes in 0.012 second response time
2014-04-08 17:08:20
RECOVERY - Puppet freshness on amslvs3 is OK: puppet ran at Tue Apr 8 17:08:15 UTC 2014
2014-04-08 17:08:30
RECOVERY - Puppet freshness on lvs4003 is OK: puppet ran at Tue Apr 8 17:08:25 UTC 2014
2014-04-08 17:08:44
('CR') 'Chad': [C: '2'] Turn on experimental highlighting in beta [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124003' (owner: 'Manybubbles')
2014-04-08 17:08:53
('Merged') 'jenkins-bot': Turn on experimental highlighting in beta [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124003' (owner: 'Manybubbles')
2014-04-08 17:09:40
RECOVERY - Puppet freshness on lvs1006 is OK: puppet ran at Tue Apr 8 17:09:30 UTC 2014
2014-04-08 17:10:10
PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 17:10:10
PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 17:10:10
PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 17:10:10
PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 17:10:19
('CR') 'QChris': "Prerequisite got merged." [operations/puppet] - 'https://gerrit.wikimedia.org/r/121546' (owner: 'Ottomata')
2014-04-08 17:10:52
^demon|away: are you deploying stuff?
2014-04-08 17:11:14
i'll need to sneak in some point for a config change, but not yet
2014-04-08 17:11:29
('PS1') 'Ottomata': Moving sqstat back to emery :/ [operations/puppet] - 'https://gerrit.wikimedia.org/r/124641'
2014-04-08 17:11:38
('PS2') 'Ottomata': Moving sqstat back to emery :/ [operations/puppet] - 'https://gerrit.wikimedia.org/r/124641'
2014-04-08 17:11:40
('CR') 'jenkins-bot': [V: '-1'] Moving sqstat back to emery :/ [operations/puppet] - 'https://gerrit.wikimedia.org/r/124641' (owner: 'Ottomata')
2014-04-08 17:11:50
('CR') 'Ottomata': [C: '2' V: '2'] Moving sqstat back to emery :/ [operations/puppet] - 'https://gerrit.wikimedia.org/r/124641' (owner: 'Ottomata')
2014-04-08 17:12:28
aude: no, he just merged something for beta
2014-04-08 17:12:41
probably need 10 more minutes
2014-04-08 17:12:50
done populating tables, now checking they are ok
2014-04-08 17:13:00
then can do the config change and then done :)
2014-04-08 17:13:19
aude: Nope, just merged that for Nik for beta.
2014-04-08 17:13:21
Like he said :)
2014-04-08 17:13:22
going slow and careful since i'm still newish
2014-04-08 17:13:25
doing this stuff
2014-04-08 17:13:32
Someone should sync it eventually for consistency, but no biggie.
2014-04-08 17:13:53
i can do
2014-04-08 17:14:04
so can I
2014-04-08 17:14:29
hoo: want to check the sites tables and site_identifiers for wikiquote?
2014-04-08 17:14:30
RECOVERY - Puppet freshness on lvs1002 is OK: puppet ran at Tue Apr 8 17:14:22 UTC 2014
2014-04-08 17:14:36
they look ok to me
2014-04-08 17:15:30
RECOVERY - Puppet freshness on lvs1005 is OK: puppet ran at Tue Apr 8 17:15:22 UTC 2014
2014-04-08 17:16:02
('CR') 'Aude': "sites table and site_identifiers are added and populated" [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124516' (owner: 'Aude')
2014-04-08 17:16:10
RECOVERY - Puppet freshness on lvs1004 is OK: puppet ran at Tue Apr 8 17:16:02 UTC 2014
2014-04-08 17:16:28
!log finished upgrading elastic1001-1006. starting on 1007. yay progress.
2014-04-08 17:16:32
Logged the message, Master
2014-04-08 17:16:34
enwikiquote looks good to me
2014-04-08 17:16:40
sites and site_identifiers
2014-04-08 17:16:44
strip protocols and all
2014-04-08 17:16:58
https://gerrit.wikimedia.org/r/#/c/124516/ want to merge
2014-04-08 17:17:07
i can deploy it and sync the cirrus thing
2014-04-08 17:17:22
ok, also looks good on WD
2014-04-08 17:17:45
let me sync cirrus
2014-04-08 17:17:52
go ahead
2014-04-08 17:17:53
Oh, today is the day
2014-04-08 17:18:06
it's *the* day :)
2014-04-08 17:18:10
RECOVERY - Puppet freshness on lvs4001 is OK: puppet ran at Tue Apr 8 17:18:03 UTC 2014
2014-04-08 17:19:18
aude: You also sorted the wikidataclient dblist? :P
2014-04-08 17:20:04
Ok, looks good to me, can approve whenever you want
2014-04-08 17:20:05
they will get sorted eventually
2014-04-08 17:20:13
doing chad's thing
2014-04-08 17:20:30
RECOVERY - Puppet freshness on amslvs1 is OK: puppet ran at Tue Apr 8 17:20:23 UTC 2014
2014-04-08 17:21:30
RECOVERY - Puppet freshness on lvs1001 is OK: puppet ran at Tue Apr 8 17:21:24 UTC 2014
2014-04-08 17:21:50
RECOVERY - Puppet freshness on amslvs4 is OK: puppet ran at Tue Apr 8 17:21:45 UTC 2014
2014-04-08 17:22:30
RECOVERY - Puppet freshness on lvs4002 is OK: puppet ran at Tue Apr 8 17:22:21 UTC 2014
2014-04-08 17:22:43
!log aude synchronized wmf-config/CirrusSearch-labs.php 'config change for beta, to enable highlighting'
2014-04-08 17:22:47
Logged the message, Master
2014-04-08 17:23:06
hoo: ready
2014-04-08 17:23:45
('CR') 'Hoo man': [C: '2'] "Preparation finished, so do this! \o/" [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124516' (owner: 'Aude')
2014-04-08 17:23:51
there you go ;)
2014-04-08 17:23:53
('Merged') 'jenkins-bot': Enable Wikibase on Wikiquote [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124516' (owner: 'Aude')
2014-04-08 17:27:20
aude: About to sync or shall I take it?
2014-04-08 17:27:21
sync dblist then wmf-config?
2014-04-08 17:27:43
no other way
2014-04-08 17:27:52
other way round sounds sane
2014-04-08 17:28:02
wmf-config then dblist is good
2014-04-08 17:28:06
wmf-config changes will work w/o the rest
2014-04-08 17:28:20
that's what ree-dy did for wikisource
2014-04-08 17:28:59
!log aude synchronized wmf-config 'config changes to enable Wikibase on Wikiquote'
2014-04-08 17:29:04
Logged the message, Master
2014-04-08 17:29:12
('PS1') 'Matthias Mullie': Increase Flow cache version [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124646'
2014-04-08 17:29:52
!log aude synchronized wikidataclient.dblist 'Enable Wikibase on Wikiquote'
2014-04-08 17:29:57
Logged the message, Master
2014-04-08 17:30:12
alright time to check it's all good
2014-04-08 17:30:17
on that
2014-04-08 17:31:13
oh well... I think we have to bump wgCacheEpoch once again
2014-04-08 17:31:14
aude: ^
2014-04-08 17:31:45
ah, yes
2014-04-08 17:32:00
shall I patch or will you?
2014-04-08 17:32:34
Nemo_bis: Yes, the usual stuff
2014-04-08 17:32:34
go ahead
2014-04-08 17:33:06
it says list of values is complete
2014-04-08 17:33:09
i assume caching
2014-04-08 17:33:16
on Q60
2014-04-08 17:33:57
debug=true, i can add wikiquote
2014-04-08 17:34:23
yep, I did action=purge
2014-04-08 17:34:23
('PS1') 'Hoo man': Bump wgCacheEpoch for Wikidata after enabling Wikiquote langlinks [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124648'
2014-04-08 17:34:31
aude: ^
2014-04-08 17:35:21
!log restarted gmetad on nickel to fix ganglia
2014-04-08 17:35:26
Logged the message, Master
2014-04-08 17:35:33
('CR') 'Aude': [C: '2'] Bump wgCacheEpoch for Wikidata after enabling Wikiquote langlinks [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124648' (owner: 'Hoo man')
2014-04-08 17:35:40
('Merged') 'jenkins-bot': Bump wgCacheEpoch for Wikidata after enabling Wikiquote langlinks [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124648' (owner: 'Hoo man')
2014-04-08 17:37:00
aude: Syncing? I have to sync a touch out
2014-04-08 17:37:18
!log aude synchronized wmf-config/Wikibase.php 'bump wgCacheEpoch for wikidata after enabling wikiquote site links'
2014-04-08 17:37:19
just being careful
2014-04-08 17:37:22
Logged the message, Master
2014-04-08 17:37:28
!log hoo synchronized php-1.23wmf20/extensions/Wikidata/extensions/Wikibase/lib/resources/wikibase.Site.js 'touch'
2014-04-08 17:37:32
Logged the message, Master
2014-04-08 17:37:34
that should purge the sites cache
2014-04-08 17:37:43
"13:37 < aude> just being careful" +1 ;)
2014-04-08 17:37:44
in resource loader
2014-04-08 17:38:25
still says complete
2014-04-08 17:38:30
mh :/
2014-04-08 17:38:45
sites module has always been a pain
2014-04-08 17:40:24
maybe php-1.23wmf20/extensions/Wikidata/extensions/Wikibase/lib/includes/modules/SitesModule.php ?
2014-04-08 17:40:43
aude: Won't help, RL does timestamps based on the JS scripts
2014-04-08 17:40:50
hmmm, ok
2014-04-08 17:41:13
works for me
2014-04-08 17:41:16
now at least
2014-04-08 17:41:35
trying in firefox
2014-04-08 17:41:39
might be my caching
2014-04-08 17:41:42
\o/ Just added the first link
2014-04-08 17:41:48
already did one :)
2014-04-08 17:41:54
with debug=true
2014-04-08 17:41:59
Cheating :D
2014-04-08 17:42:23
looks good in firefox
2014-04-08 17:42:30
i have to assume it's my cache
2014-04-08 17:42:31
I did one ten minutes ago already :P
2014-04-08 17:42:45
Nemo_bis: with debug true, I guess?!
2014-04-08 17:42:50
lol Heisenberg
2014-04-08 17:42:55
19.34 < Nemo_bis> yep, I did action=purge
2014-04-08 17:43:50
Is there a procedure to delete gerrit repositories?
2014-04-08 17:45:00
i can add links in wikidata now in chrome
2014-04-08 17:45:09
aude: https://en.wikiquote.org/w/index.php?title=Werner_Heisenberg&action=info mh
2014-04-08 17:45:14
why is it not showing up?
2014-04-08 17:45:34
Guest64226 / krinkle : probably you can ask on the same gerrit queue page as usual
2014-04-08 17:45:53
ah, I see
2014-04-08 17:45:57
unless it's not "your" repository, in which case maybe a bug is better
2014-04-08 17:46:11
dispatching is ... :S
2014-04-08 17:47:44
i did action=purge on https://en.wikiquote.org/wiki/New_York_City
2014-04-08 17:47:46
aude: Can we safely skip these changes? If not just waiting is also fine
2014-04-08 17:47:54
it's catching up rather quickly AFAIS
2014-04-08 17:47:55
removed dewikiquote
2014-04-08 17:48:08
we can wait
2014-04-08 17:48:16
waits in line to do a group0 to 1.23wmf21 scap
2014-04-08 17:48:28
give us 5 more minutes to poke
2014-04-08 17:48:43
aude: Sounds good
2014-04-08 17:48:59
i think we're ok though...
2014-04-08 17:49:32
or nothing we solve in 5 min, but didn't break anything
2014-04-08 17:50:51
aude: I can bump the chd_seen fields
2014-04-08 17:52:05
Just looking for the right change id
2014-04-08 17:53:43
got that
2014-04-08 17:54:37
something is weird with wikiquote... like it's not actually enabled now
2014-04-08 17:54:45
but sure i saw it was
2014-04-08 17:55:29
thinks this happened with wikisource
2014-04-08 17:56:19
!log changed the Wikidata wb_changes_dispatch position of all wikiquote wikis to 118158153
2014-04-08 17:56:23
Logged the message, Master
2014-04-08 17:56:39
enwikiquote is in wikidataclient.dblist
2014-04-08 17:57:03
that was the timestamp, should be a few moments before anything happened regarding wikiquote
2014-04-08 17:57:39
PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 540.333313
2014-04-08 17:58:28
still https://en.wikiquote.org/w/index.php?title=Werner_Heisenberg&action=info
2014-04-08 17:58:56
Wikidata is not even loaded there... wtf
2014-04-08 17:58:59
PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 645.666687
2014-04-08 17:59:05
i'm sure it was
2014-04-08 17:59:25
do i have to sync dblist again?
2014-04-08 17:59:37
did we somehow undo it?
2014-04-08 18:00:58
no, looks good on a random mw* machine
2014-04-08 18:01:09
PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 1694 MB (2% inode=86%):
2014-04-08 18:01:50
!log hoo synchronized wmf-config/InitialiseSettings.php 'Touch to clear config. cache'
2014-04-08 18:01:54
Logged the message, Master
2014-04-08 18:02:09
it's back!
2014-04-08 18:02:11
Sorry, I forgot about that
2014-04-08 18:02:33
was about to try that
2014-04-08 18:02:41
touch all the wikidata things :)
2014-04-08 18:02:43
wants to fix https://bugzilla.wikimedia.org/show_bug.cgi?id=58618 so that's automatic
2014-04-08 18:02:56
i think we are done!
2014-04-08 18:03:19
i am sure this happened on wikisource or previously where it was enabled and then not
2014-04-08 18:03:38
puzzled but we're good now
2014-04-08 18:04:13
Yep, looks good to me
2014-04-08 18:04:23
aude, hoo: All clear for me to mess with /a/common on tin and then scap?
2014-04-08 18:04:37
Yep, go ahead... we're done for now :)
2014-04-08 18:06:11
('PS1') 'BryanDavis': Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124655'
2014-04-08 18:06:50
crosses fingers and knocks on wood
2014-04-08 18:07:03
('CR') 'BryanDavis': [C: '2'] Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124655' (owner: 'BryanDavis')
2014-04-08 18:07:46
greg-g: Aaron merged my fix so in theory I should only need one scap. I'll verify the file after the first scap to be certain
2014-04-08 18:08:28
('Merged') 'jenkins-bot': Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124655' (owner: 'BryanDavis')
2014-04-08 18:10:36
!log bd808 Started scap: group0 wikis to 1.23wmf21 (with patch for bug 63659)
2014-04-08 18:10:39
RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 18:10:41
Logged the message, Master
2014-04-08 18:11:25
l10n cache did not rebuild which is a great sign
2014-04-08 18:11:58
Unable to open /usr/local/apache/common-local/wikiversions.cdb.
2014-04-08 18:12:01
i get a "Unable to open /usr/local/apache/common-local/wikiversions.cdb."
2014-04-08 18:12:10
...and same here.
2014-04-08 18:12:12
[2014-04-08 18:11:37] Fatal error: Unable to open /usr/local/apache/common-local/wikiversions.cdb.
2014-04-08 18:12:19
Yeah. fuck
2014-04-08 18:12:21
yeah, you got it
2014-04-08 18:12:22
here the same
2014-04-08 18:12:26
It will be fixed in a few moments
2014-04-08 18:12:30
that's everything
2014-04-08 18:12:31
well shit
2014-04-08 18:12:49
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
2014-04-08 18:12:57
There's my first crash all of the wikis
2014-04-08 18:12:59
RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 18:13:13
down on wm
2014-04-08 18:13:21
damn it, I was actually reading an article and I reloaded it to test
2014-04-08 18:13:23
It was my "fix" for the scap problem
2014-04-08 18:13:25
now I can't read it while I wait
2014-04-08 18:13:29
PROBLEM - Apache HTTP on mw1190 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.007 second response time
2014-04-08 18:13:29
PROBLEM - Apache HTTP on mw1055 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.013 second response time
2014-04-08 18:13:29
PROBLEM - Apache HTTP on mw1150 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.004 second response time
2014-04-08 18:13:29
PROBLEM - Apache HTTP on mw1101 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.005 second response time
2014-04-08 18:13:29
PROBLEM - Apache HTTP on mw1177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.009 second response time
2014-04-08 18:13:29
PROBLEM - Apache HTTP on mw1138 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.003 second response time
2014-04-08 18:13:30
PROBLEM - Apache HTTP on mw1187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.006 second response time
2014-04-08 18:13:30
PROBLEM - Apache HTTP on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.006 second response time
2014-04-08 18:13:31
PROBLEM - Apache HTTP on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.013 second response time
2014-04-08 18:13:31
PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors -
2014-04-08 18:13:39
PROBLEM - Apache HTTP on mw1213 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.018 second response time
2014-04-08 18:13:39
PROBLEM - Apache HTTP on mw1113 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.012 second response time
2014-04-08 18:13:39
PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.008 second response time
2014-04-08 18:13:42
PROBLEM - Apache HTTP on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.006 second response time
2014-04-08 18:13:42
PROBLEM - Apache HTTP on mw1035 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.022 second response time
2014-04-08 18:13:42
PROBLEM - Apache HTTP on mw1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.011 second response time
2014-04-08 18:13:42
PROBLEM - Apache HTTP on mw1090 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.010 second response time
2014-04-08 18:13:42
PROBLEM - Apache HTTP on mw1154 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.007 second response time
2014-04-08 18:13:52
It will be fixed soon… scap will fix it at the end
2014-04-08 18:13:54
!log bd808 Finished scap: group0 wikis to 1.23wmf21 (with patch for bug 63659) (duration: 03m 18s)
2014-04-08 18:13:59
Logged the message, Master
2014-04-08 18:14:01
Should be fixed now
2014-04-08 18:14:15
breathes again
2014-04-08 18:14:22
can whoever's in charge of icinga-wm bring it back to life?
2014-04-08 18:14:35
Damn it. :P
2014-04-08 18:14:37
jackmcbarn: it'll again automatically, I *believe*
2014-04-08 18:14:39
so what happened?
2014-04-08 18:14:47
Oh, you know about it?
2014-04-08 18:14:48
greg-g: You accidentally a verb.
2014-04-08 18:14:50
RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.066 second response time
2014-04-08 18:14:50
RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.073 second response time
2014-04-08 18:14:51
RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.084 second response time
2014-04-08 18:14:51
RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.111 second response time
2014-04-08 18:14:51
Patch https://gerrit.wikimedia.org/r/#/c/124627/
2014-04-08 18:14:52
RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.062 second response time
2014-04-08 18:14:52
RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.059 second response time
2014-04-08 18:15:07
Marybelle: :)
2014-04-08 18:15:16
I'll write up the email. I know exactly what I fucked up
2014-04-08 18:15:21
bd808|deploy: thanks, I was just about to report "Unable to open /usr/local/apache/common-local/wikiversions.cdb." - glad to see it's under control
2014-04-08 18:15:54
what's going on?
2014-04-08 18:16:08
we are all at dinner
2014-04-08 18:16:23
fixed now
2014-04-08 18:16:24
it's ok
2014-04-08 18:16:25
paravoid: My fault. Should be fixed now
2014-04-08 18:16:35
paravoid: go back to dinner, all's ok again :)
2014-04-08 18:16:36
scap temporarily broke everything though
2014-04-08 18:16:36
do you need anything?
2014-04-08 18:16:39
PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 183.266663
2014-04-08 18:16:44
manual page us if something happens
2014-04-08 18:16:52
paravoid: nope, known ef up
2014-04-08 18:16:57
paravoid: will do, enjoy!
2014-04-08 18:18:17
('PS2') 'Gergő Tisza': Add setting to show a survey for MediaViewer users on some sites [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124036'
2014-04-08 18:18:56
('CR') 'Gergő Tisza': "Updated to display feedback survey on beta enwiki." [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124036' (owner: 'Gergő Tisza')
2014-04-08 18:19:29
greg-g: I just reverted my patch to scap that caused that cascade of horribleness
2014-04-08 18:19:44
On the plus side, group0 is on wmf21 now
2014-04-08 18:20:09
scared to change it back
2014-04-08 18:20:20
"Don't. Touch. Any. Thing."
2014-04-08 18:20:25
i suppose if bd808|deploy 's patch is reverted then ok
2014-04-08 18:20:39
well, we still have the previous issue which it was trying to fix ;)
2014-04-08 18:20:59
1 step forward, 1 step back
2014-04-08 18:21:23
So yes we are temporarily back to needing to double-scap, but I'll make a patch that doesn't melt the world after lunch
2014-04-08 18:22:25
bd808|deploy: :)
2014-04-08 18:23:15
wikiquote etc all looks fine, so i'm going home / eating
2014-04-08 18:23:20
back in hour
2014-04-08 18:23:26
k, I'll do the same
2014-04-08 18:23:33
quite late dinner for berlin
2014-04-08 18:23:47
so I told my wife we broke the internet. she told me facebook was working....
2014-04-08 18:24:18
Nemo_bis: It's never too late for food :P
2014-04-08 18:28:38
hoo: well, I'd call death for starvation, pellagra etc. "too late" :P
2014-04-08 18:29:07
Nemo_bis: :P Too late as in time of the day...
2014-04-08 18:30:17
hoo: http://p.defau.lt/?md_cbLJuORDNsGkhY6_NAg :P
2014-04-08 18:30:55
at least the other errors are gone now, I guess
2014-04-08 18:31:28
manybubbles: :(
2014-04-08 18:31:42
goes to lunch for real
2014-04-08 18:32:34
hoo: yeah, i submitted a patch for hhvm to fix that other issue btw
2014-04-08 18:32:49
PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - Could not connect to server
2014-04-08 18:34:15
ori: Oh... nice that it's actually done in PHP :)
2014-04-08 18:35:34
yeah yeah yeah, elasticsearch 1012 is being upgraded
2014-04-08 18:37:56
hoo: which component should that be filed under?
2014-04-08 18:39:25
ori: already done https://bugzilla.wikimedia.org/show_bug.cgi?id=63691
2014-04-08 18:39:39
PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 639.299988
2014-04-08 18:39:40
oh cool, thanks!
2014-04-08 18:42:09
PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 530.733337
2014-04-08 18:42:20
ori: Any idea who to poke about https://gerrit.wikimedia.org/r/121709 ?
2014-04-08 18:43:46
('CR') 'Matanya': add interface speed check for all hosts ('2' comments) [operations/puppet] - 'https://gerrit.wikimedia.org/r/124606' (owner: 'Cmjohnson')
2014-04-08 18:44:08
('PS2') 'Ori.livneh': Change wgServer and wgCanonicalServer for arbcom wikis [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/121709' (owner: 'Hoo man')
2014-04-08 18:44:53
('CR') 'Ori.livneh': [C: '2'] Change wgServer and wgCanonicalServer for arbcom wikis [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/121709' (owner: 'Hoo man')
2014-04-08 18:45:06
!log ori updated /a/common to {{Gerrit|I4b18e4ce8}}: Change wgServer and wgCanonicalServer for arbcom wikis
2014-04-08 18:45:11
Logged the message, Master
2014-04-08 18:45:28
heh :)
2014-04-08 18:45:39
RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 18:45:50
!log ori synchronized wmf-config/InitialiseSettings.php 'I4b18e4ce8: Change wgServer and wgCanonicalServer for arbcom wikis'
2014-04-08 18:45:55
Logged the message, Master
2014-04-08 18:53:40
RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 18:56:09
RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 18:57:39
PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 172.800003
2014-04-08 18:58:59
RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 18:59:00
PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409
2014-04-08 18:59:00
PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409
2014-04-08 18:59:00
PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409
2014-04-08 18:59:00
PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409
2014-04-08 18:59:09
PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server
2014-04-08 18:59:09
PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409
2014-04-08 18:59:09
PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409
2014-04-08 18:59:10
PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409
2014-04-08 18:59:29
PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409
2014-04-08 19:00:03
2014-04-08 19:00:11
it recovered in a few seconds
2014-04-08 19:00:16
not sure why it did that
2014-04-08 19:07:39
PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 341.200012
2014-04-08 19:12:00
RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:12:00
RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:12:00
RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:12:00
RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:12:10
RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:12:11
RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:12:11
RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:12:11
RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:13:16
that's right
2014-04-08 19:13:18
horrible check
2014-04-08 19:13:36
no errors in the logs associated with those warnings
2014-04-08 19:18:49
RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
2014-04-08 19:20:55
2014-04-08 19:23:39
PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 531.166687
2014-04-08 19:24:29
PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server
2014-04-08 19:24:49
PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414
2014-04-08 19:24:50
PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414
2014-04-08 19:24:50
PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414
2014-04-08 19:24:59
PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414
2014-04-08 19:24:59
PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414
2014-04-08 19:24:59
PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414
2014-04-08 19:24:59
PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414
2014-04-08 19:24:59
PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414
2014-04-08 19:24:59
PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414
2014-04-08 19:25:09
PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 635.799988
2014-04-08 19:25:11
kicks icinga-wm
2014-04-08 19:26:39
PROBLEM - DPKG on elastic1015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
2014-04-08 19:28:39
RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 19:29:38
huh: it is being fixed by ops
2014-04-08 19:31:39
RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 19:36:39
RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 19:37:49
PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:37:49
PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:37:50
PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:37:59
PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:37:59
PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:37:59
PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:37:59
PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:37:59
PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:37:59
PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:38:00
PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:38:07
2014-04-08 19:38:09
RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 19:38:10
PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:38:10
PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:38:10
PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server
2014-04-08 19:38:10
PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:38:29
PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407
2014-04-08 19:38:39
PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 224.199997
2014-04-08 19:39:39
RECOVERY - DPKG on elastic1015 is OK: All packages OK
2014-04-08 19:40:19
oh shut up
2014-04-08 19:40:52
I'm doing rolling restarts
2014-04-08 19:41:47
got it: labswiki_content_1394813391
2014-04-08 19:41:53
that thing is configured without replicas
2014-04-08 19:46:40
PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 341.066681
2014-04-08 19:48:00
RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:48:01
RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:48:01
RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:48:01
RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:48:10
RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:48:10
RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:48:10
RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:48:10
RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:48:30
RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:48:43
and, more noise!
2014-04-08 19:48:49
RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:48:49
RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:48:49
RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:48:59
PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5308: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 303
2014-04-08 19:48:59
PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5308: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 303
2014-04-08 19:48:59
PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5308: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 303
2014-04-08 19:49:22
bite me labswiki!
2014-04-08 19:52:34
cheers manybubbles on
2014-04-08 19:52:53
it'll spam us again in a few minutes
2014-04-08 19:52:59
labswiki recovered a long time ago
2014-04-08 19:53:05
it was only out for ~30 seconds each time
2014-04-08 19:53:20
but ganglia wants all the shards on all the wikis to be recovered before it is happy
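For readers following along: the red status above comes from how Elasticsearch rolls health up. An index is red if any primary shard is unassigned, yellow if only replicas are unassigned, and the cluster reports the worst status of any index. A minimal sketch of that rule, assuming made-up shard counts; the function names and the enwiki_content entry are hypothetical, and this is not the actual monitoring check:

```python
# Sketch of Elasticsearch's health roll-up rule. Function names and the
# shard counts are illustrative assumptions, not taken from the real check.

SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def index_status(unassigned_primaries, unassigned_replicas):
    """Red if a primary is missing, yellow if only replicas are missing."""
    if unassigned_primaries > 0:
        return "red"
    if unassigned_replicas > 0:
        return "yellow"
    return "green"


def cluster_status(indices):
    """The cluster is as unhealthy as its least healthy index."""
    worst = "green"
    for unassigned_primaries, unassigned_replicas in indices.values():
        status = index_status(unassigned_primaries, unassigned_replicas)
        if SEVERITY[status] > SEVERITY[worst]:
            worst = status
    return worst


# During a node restart: an index with replicas only loses copies (yellow),
# but an index with no replicas loses its only copy of a shard (red).
during_restart = {
    "enwiki_content": (0, 3),               # hypothetical index with replicas
    "labswiki_content_1394813391": (1, 0),  # the index without replicas
}
print(cluster_status(during_restart))
```

With replicas configured on the labswiki index, the same restart would leave a replica to promote, and the cluster would be expected to go yellow rather than red.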
2014-04-08 19:53:59
RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:53:59
RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:53:59
RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
2014-04-08 19:56:15
!log upgraded all elasticsearch servers except elastic1008. that is coming now.
2014-04-08 19:56:20
Logged the message, Master
2014-04-08 19:58:20
!log finished upgrading to Elasticsearch 1.1.0. The process went well with no issues other than knocking out search in labs 3 times for 30 seconds apiece, and logging lots of nasty warnings to irc. I've started the process to fix search in labs so it won't happen again.
2014-04-08 19:58:25
Logged the message, Master
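The health-check warnings logged to the channel all share one colon-separated layout, so they can be compared mechanically. A minimal sketch, assuming the message format stays exactly as pasted in this log; parse_health is a hypothetical helper, not part of icinga or Elasticsearch:

```python
# Hypothetical helper for the alert text seen in this log. It assumes the
# "key: value: key: value" layout that follows "is running. ".

def parse_health(message):
    """Turn the colon-separated health summary into a dict."""
    _, _, fields = message.partition("is running. ")
    parts = [part.strip() for part in fields.split(":")]
    pairs = dict(zip(parts[0::2], parts[1::2]))
    # Convert purely numeric values to int; leave "red", "false", etc. as text.
    return {key: int(value) if value.isdigit() else value
            for key, value in pairs.items()}


alert = (
    "CRITICAL - elasticsearch (production-search-eqiad) is running. "
    "status: red: timed_out: false: number_of_nodes: 15: "
    "number_of_data_nodes: 15: active_primary_shards: 1895: "
    "active_shards: 5202: relocating_shards: 2: initializing_shards: 0: "
    "unassigned_shards: 409"
)
health = parse_health(alert)
total_shards = health["active_shards"] + health["unassigned_shards"]
print(health["status"], total_shards)
```

In the alerts above, active plus unassigned shards adds up to 5611 each time, which matches the active_shards figure in the green recovery messages.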
2014-04-08 20:05:39
PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 420.066681
2014-04-08 20:08:09
PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 539.900024
2014-04-08 20:10:29
PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 20:10:29
PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 20:10:29
PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 20:10:29
PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 20:10:39
RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 20:12:39
RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 20:16:56
Does someone here know about dns issues with wmflabs-domains or related stuff that happened recently?
2014-04-08 20:19:39
RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 20:20:41
PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 176.399994
2014-04-08 20:22:09
RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 20:26:39
PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 368.466675
2014-04-08 20:28:02
re:heartbleed, I think we'll be wanting a new corp certificate... do you guys have a favorite vendor for star certs these days?
2014-04-08 20:28:21
it's almost due for a re-up anyway, so it's worth the effort
2014-04-08 20:29:53
2014-04-08 20:48:39
PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 642.700012
2014-04-08 20:51:39
RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 20:51:39
RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 20:52:09
PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 537.099976
2014-04-08 20:59:46
greg-g: don't believe you
2014-04-08 20:59:58
2014-04-08 21:00:04
This is the work of the Beast
2014-04-08 21:00:11
greg-g: Do you still want to try group1 to 1.23wmf21 today or have we had enough excitement?
2014-04-08 21:00:53
reminds folks that all ops are out at a bar except for those who are about to go to sleep :-D
2014-04-08 21:01:06
bd808: we're back to "if you run scap, run it twice" world, right?
2014-04-08 21:01:10
apergos: :)
2014-04-08 21:01:23
odder: which part? :)
2014-04-08 21:01:36
greg-g: Yes, but for group1 to 1.23wmf21 we only need to run sync-wikiversions
2014-04-08 21:01:49
2014-04-08 21:02:09
the world looks sane on phase0?
2014-04-08 21:02:11
2014-04-08 21:02:34
greg-g: all of it - notice the number immediately preceding .html
2014-04-08 21:02:39
PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 232.46666
2014-04-08 21:02:48
odder: haha
2014-04-08 21:03:39
RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 21:03:54
this is neat: https://graphite.wikimedia.org/render/…
2014-04-08 21:04:36
I think that's what ori told me yesterday not to worry about
2014-04-08 21:05:09
RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 21:05:25
bd808: if we do, we do it now, so we have 2 hours of settle/bug-report time before SWAT. May I take your whole day?
2014-04-08 21:06:36
greg-g: I'm yours to command. :)
2014-04-08 21:06:39
PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 269.866669
2014-04-08 21:06:42
2014-04-08 21:06:48
2014-04-08 21:06:55
2014-04-08 21:07:09
bd808: go forth, please
2014-04-08 21:09:36
('PS1') 'BryanDavis': Group1 wikis to 1.23wmf21 [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124744'
2014-04-08 21:11:12
('CR') 'BryanDavis': [C: '2'] Group1 wikis to 1.23wmf21 [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124744' (owner: 'BryanDavis')
2014-04-08 21:11:20
('Merged') 'jenkins-bot': Group1 wikis to 1.23wmf21 [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124744' (owner: 'BryanDavis')
2014-04-08 21:12:17
!log bd808 rebuilt wikiversions.cdb and synchronized wikiversions files: group1 to 1.23wmf21
2014-04-08 21:12:23
Logged the message, Master
2014-04-08 21:12:47
greg-g: Have you guys already killed all user sessions?
2014-04-08 21:12:52
Can't see a server admin log entry
2014-04-08 21:15:44
greg-g: I did a https://commons.wikimedia.org/wiki/Commons:Village_pump#Users_are_being_forced_to_log_out
2014-04-08 21:18:21
Thanks odder, I left a note about it on en VPT since I saw a question about the bug in general
2014-04-08 21:18:48
Maybe I'll cross-post that to Meta too
2014-04-08 21:19:59
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx
2014-04-08 21:20:14
!log bd808 Purged l10n cache for 1.23wmf18
2014-04-08 21:20:19
Logged the message, Master
2014-04-08 21:21:46
!log bd808 Purged l10n cache for 1.23wmf19
2014-04-08 21:21:50
Logged the message, Master
2014-04-08 21:21:54
hoo: in process
2014-04-08 21:22:55
2014-04-08 21:23:09
hoo: it takes longer than you'd imagine, maybe :)
2014-04-08 21:23:37
greg-g: group1 to 1.23wmf21 is {{done}}
2014-04-08 21:23:40
greg-g: just change the cookie name? (like last time)
2014-04-08 21:24:09
se4598: I'm deferring to chris on it (not sure what his exact process is, honestly)
2014-04-08 21:24:14
bd808|deploy: ty
2014-04-08 21:24:53
mh, the tokens will still be valid I think, wasn't a good idea
2014-04-08 21:25:14
se4598: Yeah I think that's why it takes a while
2014-04-08 21:26:45
greg-g: Well, given how many users we have and that we probably don't want to hammer the DBs too much, I can imagine this taking some time
2014-04-08 21:26:52
2014-04-08 21:28:16
csteipp: Why not run one process per shard?
2014-04-08 21:29:24
Jamesofur: if you're keeping track of things, I alerted Commons and Meta; perhaps someone would need to alert the other big Wikipedias
2014-04-08 21:29:35
Dunno if the message to tech-ambassadors will be enough; may be.
2014-04-08 21:30:35
('PS2') 'MaxSem': Put a safeguard on GeoData's usage of CirrusSearch [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/121874'
2014-04-08 21:30:37
('PS1') 'MaxSem': Enable $wgGeoDataDebug on labs [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124747'
2014-04-08 21:30:39
RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 21:30:39
PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 535.0
2014-04-08 21:30:54
('CR') 'jenkins-bot': [V: '-1'] Enable $wgGeoDataDebug on labs [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124747' (owner: 'MaxSem')
2014-04-08 21:31:32
se4598: Assuming attacker has the login token, they could use the new name and again spoof the user
2014-04-08 21:31:39
RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 21:31:46
('PS2') 'MaxSem': Enable $wgGeoDataDebug on labs [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124747'
2014-04-08 21:32:09
odder: yeah, I'll see if we can poke people, we're going to send out SM messages as well in a couple minutes
2014-04-08 21:32:19
with a recommendation to password reset
2014-04-08 21:33:09
2014-04-08 21:33:22
sorry, Social Media (Twitter/Facebook/G+ etc)
2014-04-08 21:33:42
TMA, Too Many Abbreviations
2014-04-08 21:33:45
2014-04-08 21:33:59
yup lol
2014-04-08 21:34:09
PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 539.133362
2014-04-08 21:34:10
I abuse them, I even make up my own and forget that they are just in my head
2014-04-08 21:34:23
2014-04-08 21:34:49
Jamesofur: EUS IAA. TA IANAL.
2014-04-08 21:34:58
*EYS :p
2014-04-08 21:35:42
thanks HaeB, retweeted
2014-04-08 21:40:46
woah, new code on wikidata?
2014-04-08 21:40:46
Jamesofur: using mass-message might be a good idea
2014-04-08 21:41:15
aude: yep, all ok?
2014-04-08 21:41:26
HaeB: ^ what do you think? (about MM)
2014-04-08 21:41:48
2014-04-08 21:42:08
greg-g: itjdi
2014-04-08 21:42:12
so we're confident?
2014-04-08 21:42:39
PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 187.866669
2014-04-08 21:42:53
aude: in that it won't break at 2:00 utc? yeah
2014-04-08 21:43:06
aude: the only thing we're still not confident about is scap on thursday
2014-04-08 21:44:19
2014-04-08 21:44:39
RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 21:44:40
PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 320.200012
2014-04-08 21:44:55
Jamesofur, matanya: i think for the session ending, massmessage would be overkill. regarding the password reset, it's a judgment call (how high one estimates the risk for users who don't change it)
2014-04-08 21:45:24
HaeB: it depends on user rights as well
2014-04-08 21:45:27
aude: The bug that caused all the 1.23wmf21 l10n issues is https://bugzilla.wikimedia.org/show_bug.cgi?id=63659
2014-04-08 21:46:31
are there any other major sites who notified all users?
2014-04-08 21:46:54
not that I've seen yet, but I have a feeling some are still going through the fixing process
2014-04-08 21:46:55
2014-04-08 21:46:59
(to recommend a password change)
2014-04-08 21:47:10
eg. just got stuff from CloudBees
2014-04-08 21:47:15
github also logged me out
2014-04-08 21:47:37
would also be interesting to know how quickly the wikis were fixed after the news broke yesterday
2014-04-08 21:47:40
latimes has an article about resetting your password, but that's different
2014-04-08 21:48:09
RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 21:48:13
last night (PT) i filed an RT ticket for the blog, which was vulnerable at the time, but at that point the wikis tested ok already
2014-04-08 21:48:36
The wikis auto update OpenSSL via puppet
2014-04-08 21:49:00
hoo: well ya ;) the question is when we updated puppet ;)
2014-04-08 21:49:24
Jamesofur: The servers do that themselves
2014-04-08 21:49:39
per https://wikitech.wikimedia.org/wiki/Server_admin_log , the blog (holmium) was pretty late in the game
2014-04-08 21:49:50
The timeline is all in SAL from last night
2014-04-08 21:49:51
Yesterday I posted about that to the internal ops list, but forgot to poke a root to do an apt-cache clean and force a puppet run
2014-04-08 21:50:08
"04:03 Tim: upgrading libssl on ssl1001,ssl1002,ssl1003,ssl1004,ssl1005,ssl1006,ssl1007,ssl1008,ssl1009,ssl3001.esams.wikimedia.org,ssl3002.esams.wikimedia.org,ssl3003.esams.wikimedia.org" - is that the entry for the wikis?
2014-04-08 21:50:37
Mostly yes
2014-04-08 21:53:39
RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 21:53:39
RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 21:53:59
RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
2014-04-08 21:54:55
('PS1') 'Jean-Frédéric': Add Musées de la Haute-Saône to wgCopyUploadsDomains [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124754'
2014-04-08 22:01:11
greg-g, poking you because I'm not sure who's on point for the i18n / scap stuff -- but I recall getting pinged a couple of days ago (on a centralnotice keyword) saying that the i18n update was failing due to exceptions on CN (and others). I'm wondering if CN's fail was due to being on a deployment branch that did not have the JSON updates (until just now).
2014-04-08 22:01:46
shouldn't be
2014-04-08 22:01:57
there's backward compat in l10nupdate
2014-04-08 22:02:17
mwalker: see https://bugzilla.wikimedia.org/show_bug.cgi?id=63659 for all the gory details
2014-04-08 22:02:33
puts on tyvek suit
2014-04-08 22:30:59
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx
2014-04-08 22:33:06
greg-g: Could I push a small centralauth update soon?
2014-04-08 22:33:44
yeah, now is fine, 30 minutes until swat
2014-04-08 22:34:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:36:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:37:04
PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 34.533333
2014-04-08 22:37:34
PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 260.733337
2014-04-08 22:38:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:40:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:42:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:44:14
PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 625.166687
2014-04-08 22:44:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:45:36
marktraceur: I see in deploy-calendar that you have a changeset which especially activates MediaViewer on en-beta. You(r pc) may get hit by https://bugzilla.wikimedia.org/show_bug.cgi?id=63709
2014-04-08 22:46:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:47:22
se4598: Is there a fix?
2014-04-08 22:47:50
I'm guessing it's an SSL problem
2014-04-08 22:48:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:48:43
se4598: Replied on bug
2014-04-08 22:49:09
('PS1') 'BryanDavis': Create symlink for compile-wikiversions in /usr/local/bin [operations/puppet] - 'https://gerrit.wikimedia.org/r/124763'
2014-04-08 22:49:23
marktraceur: We in #wikimedia-labs don't have one. And that's not about https but DNS resolution, so I don't understand what you mean by https?
2014-04-08 22:49:35
Oh, hm
2014-04-08 22:49:37
Never mind, sorry
2014-04-08 22:50:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:52:04
RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 22:52:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:52:34
RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 22:52:56
marktraceur: currently the fix is.....: it may work if you try multiple times or wait some time (minutes, hours) ;P
2014-04-08 22:54:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:56:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:56:54
RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
2014-04-08 22:58:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC
2014-04-08 22:58:41
greg-g: csteipp: got both core changes ready
2014-04-08 22:58:53
I mean changes to the deploy branch
2014-04-08 22:59:52
hoo: Cool.. one sec and I'll merge and deploy it
2014-04-08 23:00:12
I can also jump in, am on tin still anyway
2014-04-08 23:00:14
RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Tue Apr 8 23:00:04 UTC 2014
2014-04-08 23:02:24
PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 11:00:04 PM UTC
2014-04-08 23:05:24
stupid puppet
2014-04-08 23:06:33
always wondered what Puppet does anyways
2014-04-08 23:07:09
pulls the strings ;)
2014-04-08 23:07:20
(or, probably better 'is the strings' )
2014-04-08 23:07:26
Jasper_Deng: Playing with the servers :D
2014-04-08 23:08:20
Technically, the sysadmins are a puppet in the WMF's plans, right? :p
2014-04-08 23:08:37
!log csteipp synchronized php-1.23wmf21/extensions/CentralAuth/maintenance 'Push maintenance script for token reset'
2014-04-08 23:08:39
or we're all just puppets in their plans, duh
2014-04-08 23:08:41
Logged the message, Master
2014-04-08 23:09:04
Jamesofur: You're the past of the puppets :p
2014-04-08 23:09:09
*master of the
2014-04-08 23:09:57
greg-g: CentralAuth updates are out, so swat can go ahead if they were waiting on me
2014-04-08 23:10:01
;) the user with said name may dislike me claiming the title
2014-04-08 23:10:40
mwalker: ori ebernhardson ^
2014-04-08 23:10:46
also, what the heck, oit_display ?
2014-04-08 23:11:10
PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 23:11:10
PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 23:11:10
PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 23:11:10
PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC
2014-04-08 23:11:54
yes; it's 4!
2014-04-08 23:13:25
SUL doesn't work?
2014-04-08 23:14:02
csteipp, ^
2014-04-08 23:14:03
Danny_B: We are logging out all users
2014-04-08 23:14:10
see http://lists.wikimedia.org/pipermail/wikitech-ambassadors/2014-April/000666.html
2014-04-08 23:14:32
csteipp, warn ppl with a site notice?
2014-04-08 23:14:35
hoo: you know that this isn't merged? https://gerrit.wikimedia.org/r/124756
2014-04-08 23:15:00
se4598: not this important at the very moments
2014-04-08 23:15:03
* moment
2014-04-08 23:15:23
Danny_B: SUL should work... You should just be logged out. If you can't login, let me know
2014-04-08 23:15:53
csteipp: will we get logged out each time you hit a wiki we've visited recently? or just the once per user in theory
2014-04-08 23:16:15
If you're a global user, just once (right now as I logout all the centralauth users)
2014-04-08 23:16:32
If you have multiple ununified local accounts, each will get logged out
2014-04-08 23:16:51
csteipp: i have to log in on every single project although i have central username
2014-04-08 23:16:54
<grumbles about that><waves fist impotently at it.wp>
2014-04-08 23:17:30
PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 135.300003
2014-04-08 23:17:55
marktraceur, MaxSem I'm going to +2 and confirm https://gerrit.wikimedia.org/r/#/c/124036/2 , https://gerrit.wikimedia.org/r/#/c/121874/2 , https://gerrit.wikimedia.org/r/#/c/124747/
2014-04-08 23:18:30
PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 173.666672
2014-04-08 23:18:32
it would be wonderful if you all could +1 that so that I know you've looked and said this is good to me
2014-04-08 23:18:53
csteipp: +1 to notice ppl with central notice
2014-04-08 23:18:57
('CR') 'MarkTraceur': [C: ''] Add setting to show a survey for MediaViewer users on some sites [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124036' (owner: 'Gergő Tisza')
2014-04-08 23:19:00
+1 ourselves?
2014-04-08 23:19:16
doesn't sound very assuring:)
2014-04-08 23:19:21
nah; you're probably OK MaxSem :p
2014-04-08 23:19:27
but I don't know who Gergo is
2014-04-08 23:19:44
but mark was sponsoring the patch
2014-04-08 23:19:53
he's tgr :P
2014-04-08 23:20:00
('CR') 'Mwalker': [C: '2'] Put a safeguard on GeoData's usage of CirrusSearch [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/121874' (owner: 'MaxSem')
2014-04-08 23:20:08
('CR') 'Mwalker': [C: '2'] Enable $wgGeoDataDebug on labs [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124747' (owner: 'MaxSem')
2014-04-08 23:20:21
('CR') 'Mwalker': [C: '2'] Add setting to show a survey for MediaViewer users on some sites [operations/mediawiki-config] - 'https://gerrit.wikimedia.org/r/124036' (owner: 'Gergő Tisza')
2014-04-08 23:20:27
greg-g: missed your ping; still need me?
2014-04-08 23:21:00
dont think so
2014-04-08 23:23:33
interesting; sync-common doesn't log to IRC?
2014-04-08 23:23:34
Danny_B: That doesn't sound right.. At the risk of sounding cliche, can you log out and log back in, and see if that helps?
2014-04-08 23:23:55
marktraceur, MaxSem can you tell if your configuration stuff got pushed?
2014-04-08 23:24:15
mwalker, mine's noop on prod
2014-04-08 23:24:25
Ditto, but will check on beta
2014-04-08 23:24:26
checking if prod still works...
2014-04-08 23:24:35
also; marktraceur I presume you want https://gerrit.wikimedia.org/r/#/c/124510/ to go to wmf20 and wmf21?
2014-04-08 23:24:38
Danny_B, hoo : we're still thinking about massmessage instead (more for the password changing advice)
2014-04-08 23:24:43
mwalker: Sorry, only 21
2014-04-08 23:25:24
mwalker: Confirmed, beta has the configuration we wanted
2014-04-08 23:26:36
mwalker, lgtm
2014-04-08 23:27:40
PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx
2014-04-08 23:28:34
csteipp: log out from any currently logged project, log back to it and then try if sul works on other?
2014-04-08 23:29:14
Danny_B: Yeah
2014-04-08 23:29:22
csteipp: ok, sec
2014-04-08 23:29:38
Hmm... Danny_B What's your wiki username?
2014-04-08 23:30:51
RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Tue Apr 8 23:30:43 UTC 2014
2014-04-08 23:30:55
csteipp: Danny B.
2014-04-08 23:31:17
csteipp: seems to work now, will let you know if i spot another disconnection
2014-04-08 23:31:27
Danny_B: Cool, thanks
2014-04-08 23:32:15
thanks for care
2014-04-08 23:33:30
RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 23:34:31
!log mwalker synchronized php-1.23wmf21/extensions/MultimediaViewer/ 'Updating MultimediaViewer for {{gerrit|124510}}'
2014-04-08 23:34:35
Logged the message, Master
2014-04-08 23:35:16
marktraceur, ^ if you would test what you need to test for that
2014-04-08 23:35:26
I'm not seeing any fatals or exceptions which is good :)
2014-04-08 23:35:31
RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
2014-04-08 23:35:32
mwalker: Works
2014-04-08 23:35:39
cool; greg-g SWAT done
2014-04-08 23:58:30
PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 179.666672
2014-04-08 23:59:04
"Firefox can't find the server at en.wikipedia.beta.wmflabs.org."
2014-04-08 23:59:14
('CR') 'Aaron Schulz': [C: ''] Create symlink for compile-wikiversions in /usr/local/bin [operations/puppet] - 'https://gerrit.wikimedia.org/r/124763' (owner: 'BryanDavis')
2014-04-08 23:59:31
jackmcbarn: https://bugzilla.wikimedia.org/show_bug.cgi?id=63709 probably