[00:00:35] Sure thing [00:00:42] I think that's the SWAT all done [00:00:44] Sorry for the slowness everyone [00:01:16] RoanKattouw: If it makes my mailbox less full of debate about font faces... [00:01:36] * bd808 is sure that muting those threads will continue [00:02:28] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 7.426 second response time [00:08:52] * bd808 looks for a python reviewer for: https://gerrit.wikimedia.org/r/#/c/124500/ [00:09:10] I think that will fix the 1.23wmf21 l10n problems [00:09:30] Because … mystery action at a distance! [00:12:27] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 219118 bytes in 8.455 second response time [00:24:56] !log catrope synchronized php-1.23wmf20/extensions/VisualEditor 'it helps if you run git submodule update first' [00:25:02] Logged the message, Master [00:25:05] !log catrope synchronized php-1.23wmf21/extensions/VisualEditor 'it helps if you run git submodule update first' [00:25:11] Logged the message, Master [00:27:34] (03PS1) 10BryanDavis: test2wiki to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124505 [00:28:54] RoanKattouw_away: Are you {{done}} done now? I'd like to run some more scap tests [00:38:27] (03Abandoned) 10BryanDavis: l10nupdate: Add temporary debugging captures [operations/puppet] - 10https://gerrit.wikimedia.org/r/124467 (owner: 10BryanDavis) [00:38:40] (03PS2) 10BryanDavis: test2wiki to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124505 [00:39:44] (03Abandoned) 10BryanDavis: test2wiki to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124505 (owner: 10BryanDavis) [00:41:34] (03PS1) 10BryanDavis: Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124506 [00:43:55] greg-g: Are you still on a bus? I'd like to scap group0 to 1.23wmf21 to test my band aid fix. I would be on the hook to revert immediately following if ExtensionMessages looks like it will cause a problem for l10nupdate. [00:44:03] bd808: Yes, sorry [00:44:43] RoanKattouw_away: :) thanks. I watched your idle time on tin climb until I felt safe. [00:45:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [00:46:57] * bd808 decides that greg-g won't have changed his mind in the last 1:30 and proceeds [00:48:38] (03CR) 10BryanDavis: [C: 032] "Approving to test band aid fix for ExtensionMessages generation problem. Will revert if ExtensionMessages doesn't look right after scap." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124506 (owner: 10BryanDavis) [00:48:45] (03Merged) 10jenkins-bot: Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124506 (owner: 10BryanDavis) [00:50:53] !log bd808 Started scap: group0 to 1.23wmf21 (testing python change for mwversionsinuse) [00:50:58] Logged the message, Master [00:53:12] * bd808 sees l10n cache updating yet again for 1.23wmf21 and loses all confidence in his "fix" [00:53:51] !log bd808 scap aborted: group0 to 1.23wmf21 (testing python change for mwversionsinuse) (duration: 02m 57s) [00:53:56] Logged the message, Master [00:54:30] !log bd808 Started scap: group0 to 1.23wmf21 (testing python change for mwversionsinuse) (again) [00:54:35] Logged the message, Master [00:54:56] !log bd808 scap aborted: group0 to 1.23wmf21 (testing python change for mwversionsinuse) (again) (duration: 00m 25s) [00:55:01] Logged the message, Master [00:55:12] (03PS1) 10BryanDavis: Revert "Group0 wikis to 1.23wmf21" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124507 [00:55:34] (03CR) 10BryanDavis: [C: 032] Revert "Group0 wikis to 1.23wmf21" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124507 (owner: 10BryanDavis) [00:55:42] (03Merged) 10jenkins-bot: Revert "Group0 wikis to 1.23wmf21" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124507 (owner: 10BryanDavis) [00:56:51] !log bd808 Started scap: revert group0 to 1.23wmf21 (testwiki still on 1.23wmf21) [00:56:55] Logged the message, Master [01:01:33] (03PS3) 10Ori.livneh: Add EventLogging Kafka writer plug-in [operations/puppet] - 10https://gerrit.wikimedia.org/r/85337 [01:06:45] !log bd808 Finished scap: revert group0 to 1.23wmf21 (testwiki still on 1.23wmf21) (duration: 09m 54s) [01:06:53] Logged the message, Master [01:22:25] ori: working now [01:22:29] \o/ [02:07:07] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [02:07:07] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [02:07:08] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [02:07:08] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [02:15:58] !log LocalisationUpdate completed (1.23wmf20) at 2014-04-08 02:15:58+00:00 [02:16:06] Logged the message, Master [02:34:57] !log LocalisationUpdate completed (1.23wmf21) at 2014-04-08 02:34:56+00:00 [02:35:02] Logged the message, Master [02:45:57] PROBLEM - MySQL Idle Transactions on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:37] PROBLEM - MySQL InnoDB on db1047 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:48:57] RECOVERY - MySQL Idle Transactions on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [02:49:06] springle_: db1047 has been very sad lately [02:49:27] RECOVERY - MySQL InnoDB on db1047 is OK: OK longest blocking idle transaction sleeps for 0 seconds [03:00:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [03:08:06] With 1.23wmf21 not getting deployed to mediawiki.org last thursday, does that mean the deployment schedule for 1.23wmf22 will be off by a week? 
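For context on the VisualEditor syncs above: a rough sketch of the step behind "it helps if you run git submodule update first", assuming the usual /a/common staging checkout on tin as it appears elsewhere in this log. The submodule pointer has to be checked out in the staging copy before the directory is synced, otherwise the old extension code gets pushed out again.

    cd /a/common/php-1.23wmf20                              # staging checkout (path assumed from the log)
    git submodule update --init extensions/VisualEditor     # check out the commit the superproject now points at
    sync-dir php-1.23wmf20/extensions/VisualEditor 'update VisualEditor submodule'   # produces the "synchronized" !log entries above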
[03:11:07] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Apr 8 03:11:04 UTC 2014 (duration 11m 3s) [03:11:11] Logged the message, Master [03:31:47] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [03:38:12] greg-g: still around? [03:53:36] greg-g: check your mail [04:03:35] !log upgrading libssl on ssl1001,ssl1002,ssl1003,ssl1004,ssl1005,ssl1006,ssl1007,ssl1008,ssl1009,ssl3001.esams.wikimedia.org,ssl3002.esams.wikimedia.org,ssl3003.esams.wikimedia.org [04:03:41] Logged the message, Master [04:03:57] TimStarling: is this the heartbleed.com thing? [04:04:07] * Jasper_Deng didn't know we used openssl [04:15:22] Jasper_Deng: yes [04:15:47] !log also upgraded libssl on cp4001-4019. Restarted nginx on these servers and also the previous list. [04:15:51] Logged the message, Master [04:37:40] !log upgrading libssl on virt1000 [04:37:44] Logged the message, Master [04:38:21] !log upgrading libssl on virt0 [04:38:26] Logged the message, Master [04:41:03] !log upgraded libssl on zirconium.wikimedia.org,neon.wikimedia.org,netmon1001.wikimedia.org,iodine.wikimedia.org,ytterbium.wikimedia.org,gerrit.wikimedia.org,virt1000.wikimedia.org,labs-ns1.wikimedia.org,stat1001.wikimedia.org [04:43:13] !log restarted apache on the above list, failed on labs-ns1, virt1000, ytterbium [04:43:18] Logged the message, Master [04:43:47] <^d> TimStarling: I'll poke ytterbium [04:44:00] <^d> Keep moving on to other boxes if you need. [04:44:35] <^d> Seems up now. [04:45:04] yeah, labs-ns1 and virt1000 are actually the same server [04:45:19] and apache is running there with stime after the upgrade [04:46:30] !log on dataset1001: upgraded libssl and restarted lighttpd [04:46:34] Logged the message, Master [04:53:47] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [05:08:07] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [05:08:07] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [05:08:07] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [05:08:07] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [05:25:10] (03PS1) 10Aude: Enable Wikibase on Wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124516 [05:26:24] (03CR) 10Aude: [C: 04-2] "requires sites and site_identifiers tables to be added and populated on wikiquote" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124516 (owner: 10Aude) [05:31:00] <_joe_> !log upgraded openssl on cp10* and cp30* servers as well [05:31:06] Logged the message, Master [05:39:29] !log restarted apache on fenari magnesium yterrbium antimony [05:39:33] Logged the message, Master [05:39:51] with some mispellings but people will get the point [05:47:01] !log shot many old apache processes running as stats user from 2013, on stat1001 (restarting apache runs it as www-data user) [05:47:06] Logged the message, Master [06:34:37] (03PS3) 10Matanya: dataset: fix module path [operations/puppet] - 10https://gerrit.wikimedia.org/r/119212 [06:37:44] (03PS3) 10Matanya: exim: fix scoping [operations/puppet] - 10https://gerrit.wikimedia.org/r/119496 [06:43:48] springle: did you hear from otto regarding https://gerrit.wikimedia.org/r/#/c/122406/ ? 
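The libssl upgrades being logged above are the standard response to the heartbleed OpenSSL bug; a rough per-host sketch, assuming Ubuntu hosts where nginx (or apache2/lighttpd, as noted above) terminates SSL and only picks up the patched library after a restart:

    apt-get update
    apt-get install -y --only-upgrade libssl1.0.0 openssl   # pull in the patched packages
    service nginx restart                                   # running daemons keep the old library mapped until restarted
    checkrestart                                            # (from debian-goodies) lists daemons still using the deleted library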
[06:45:27] matanya: no [06:45:41] :/ i need to chase him down, thanks [06:46:04] not sure otto knows about it? i emailed analytics lists directly [06:46:29] so far the answer is: probably fine to decom db67, but let's wait for everyone to chime in [06:46:43] i'll bump it this week [06:47:05] thank you [07:30:44] (03PS1) 10Faidon Liambotis: base: add debian-goodies [operations/puppet] - 10https://gerrit.wikimedia.org/r/124524 [07:47:07] <_joe|away> !log restarted nginx on cp1044 and cp1043 [07:47:12] Logged the message, Master [07:53:07] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [07:53:07] (03CR) 10coren: [C: 032] base: add debian-goodies [operations/puppet] - 10https://gerrit.wikimedia.org/r/124524 (owner: 10Faidon Liambotis) [08:02:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [08:09:07] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [08:09:07] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [08:09:07] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [08:09:07] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [08:11:47] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [08:15:17] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [08:36:30] ori: still working? [09:03:47] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:04:07] hashar: help with setting up zuul for the apps? https://gerrit.wikimedia.org/r/#/c/124539/ [09:08:37] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:08:47] RECOVERY - RAID on labstore3 is OK: OK: optimal, 12 logical, 12 physical [09:08:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [09:11:47] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:16:55] (03PS1) 10RobH: Replacing the unified certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/124542 [09:24:34] (03CR) 10RobH: [C: 032] Replacing the unified certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/124542 (owner: 10RobH) [09:29:47] RECOVERY - RAID on labstore3 is OK: OK: optimal, 12 logical, 12 physical [09:33:47] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:36:37] RECOVERY - RAID on labstore3 is OK: OK: optimal, 12 logical, 12 physical [09:37:37] RECOVERY - Disk space on labstore3 is OK: DISK OK [09:39:19] YuviPanda: hello [09:39:25] hashar: hello! [09:40:00] PROBLEM - RAID on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:37] PROBLEM - Disk space on labstore3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:40:57] (03PS1) 10Andrew Bogott: Add eth1 checks to nova compute hosts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124560 [09:44:12] and we lost YuviPanda [09:45:10] Noooo not our panda.
:( [09:46:25] panda \O/ [09:46:28] PROBLEM - SSH on labstore3 is CRITICAL: Connection refused [09:46:28] PROBLEM - DPKG on labstore3 is CRITICAL: Connection refused by host [09:46:47] PROBLEM - puppet disabled on labstore3 is CRITICAL: Connection refused by host [09:47:00] mutante: https://gerrit.wikimedia.org/r/#/c/124560/ [09:47:43] ACKNOWLEDGEMENT - DPKG on labstore3 is CRITICAL: Connection refused by host daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44. [09:47:44] ACKNOWLEDGEMENT - Disk space on labstore3 is CRITICAL: Connection refused by host daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44. [09:47:44] ACKNOWLEDGEMENT - RAID on labstore3 is CRITICAL: Connection refused by host daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44. [09:47:44] ACKNOWLEDGEMENT - SSH on labstore3 is CRITICAL: Connection refused daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44. [09:47:44] ACKNOWLEDGEMENT - puppet disabled on labstore3 is CRITICAL: Connection refused by host daniel_zahn will be decomed - The acknowledgement expires at: 2014-04-09 09:46:44. [09:49:57] so nice to see all ops in a European time zone :) [09:50:37] PROBLEM - Host labstore3 is DOWN: PING CRITICAL - Packet loss = 100% [09:57:12] (03CR) 10Dzahn: [C: 04-1] Add eth1 checks to nova compute hosts. (033 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/124560 (owner: 10Andrew Bogott) [10:00:49] ori: what is udpprofile::collector, and can i move it from db1014 to... somewhere else? [10:02:47] springle: oh, wow. is there any indication that continues to see activity? mediawiki's profiler class can be configured to write to a database, but i didn't know anyone was using it in production. is it not ancient? [10:04:56] mutante, cmjohnson: https://wikitech.wikimedia.org/wiki/Help:Git_rebase#Don.27t_panic [10:05:21] andrewbogott: 42 [10:05:57] springle: it can go away [10:06:34] springle: it was added in this commit: . the message reads: "testing graphite 0.910 on db1014". [10:07:04] yeah, asher stole db1014 for graphite [10:07:12] trying to steal it back :) [10:07:20] ori: thanks [10:07:46] springle: it's not in any way implicated in our current graphite setup, which exists solely on tungsten.eqiad.wmnet (and labs) [10:08:13] (03PS2) 10Andrew Bogott: Add eth1 checks to nova compute hosts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124560 [10:08:18] mutante: ^ [10:09:24] (03PS1) 10Cmjohnson: adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 [10:11:07] (03CR) 10jenkins-bot: [V: 04-1] adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 (owner: 10Cmjohnson) [10:12:49] (03CR) 10Dzahn: [C: 031] Add eth1 checks to nova compute hosts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124560 (owner: 10Andrew Bogott) [10:15:34] !log update & reboot samarium [10:15:38] Logged the message, Master [10:15:48] (03CR) 10Andrew Bogott: [C: 032] Add eth1 checks to nova compute hosts. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124560 (owner: 10Andrew Bogott) [10:16:26] (03PS1) 10Springle: Remove unused db1014 block. db1014 was renamed tungsten rt5871. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124575 [10:18:19] (03CR) 10Springle: [C: 032] Remove unused db1014 block. db1014 was renamed tungsten rt5871.
[operations/puppet] - 10https://gerrit.wikimedia.org/r/124575 (owner: 10Springle) [10:21:04] !log update & reboot barium [10:21:09] Logged the message, Master [10:23:09] (03PS1) 10Dzahn: add nrpe to base [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 [10:24:10] (03CR) 10jenkins-bot: [V: 04-1] add nrpe to base [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 (owner: 10Dzahn) [11:09:28] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [11:09:28] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [11:09:28] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [11:09:28] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [11:32:05] (03PS20) 10Matanya: etherpad: convert into a module [operations/puppet] - 10https://gerrit.wikimedia.org/r/107567 [11:32:32] akosiaris: in a meeting or this ^ can be handled ? [11:39:18] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [12:32:58] (03PS2) 10Dzahn: add nrpe to base [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 [12:39:13] matanya: in ops meeting [12:39:19] sorry [12:39:27] and please tell me you did not resubmit from your local repo [12:39:48] rebase* sorry [12:39:50] (03PS2) 10Cmjohnson: adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 [12:40:26] (03CR) 10Andrew Bogott: [V: 031] "This looks good -- we'll see if it makes new alarms go off :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 (owner: 10Dzahn) [12:46:38] (03PS3) 10Cmjohnson: adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 [12:48:28] PROBLEM - DPKG on strontium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [12:49:28] RECOVERY - DPKG on strontium is OK: All packages OK [12:49:35] (03CR) 10Matanya: [C: 031] add nrpe to base [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 (owner: 10Dzahn) [12:50:21] paravoid: can you review please https://gerrit.wikimedia.org/r/124572 [12:50:38] mutante: https://rt.wikimedia.org/Ticket/Display.html?id=5064 [12:51:29] (03CR) 10Dzahn: [C: 031] "yep, if we want to monitor this on everything, then standard-packages sounds good to me" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 (owner: 10Cmjohnson) [12:52:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [12:53:10] (03CR) 10Alexandros Kosiaris: [C: 032] adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 (owner: 10Cmjohnson) [12:55:34] can anyone around update Elasticsearch in apt? [12:55:55] and ack nagios errors (so they don't spam to irc) for a couple hours? [12:56:39] !log reedy updated /a/common to {{Gerrit|Id15ddc665}}: Revert "Group0 wikis to 1.23wmf21" [12:56:44] Logged the message, Master [12:57:23] (03PS1) 10Reedy: Non wikipedias to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124591 [12:59:03] * Reedy pokes qchris_away and ^d [13:01:42] Any idea why https://gerrit.wikimedia.org/changes/?q=status:merged+age%3A0d&o=DETAILED_ACCOUNTS&n=100 doesn't work?
[13:02:00] (03CR) 10Cmjohnson: [C: 032] adding ethtool to standard-packages.pp to be able to monitor interface speed [operations/puppet] - 10https://gerrit.wikimedia.org/r/124572 (owner: 10Cmjohnson) [13:03:24] versus [13:03:24] http://review.cyanogenmod.org/changes/?q=status:open+age%3A0d&o=DETAILED_ACCOUNTS&n=100 [13:07:41] (03PS3) 10Dzahn: add nrpe to base [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 [13:12:48] (03PS4) 10Dzahn: add nrpe to base [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 [13:15:18] test [13:15:42] test akosiaris [13:15:43] apergos: :-) [13:15:51] manybubbles: [13:16:54] already pinged [13:17:06] (03PS1) 10coren: Tool Labs: forcibly upgrade libssl [operations/puppet] - 10https://gerrit.wikimedia.org/r/124594 [13:19:25] (03CR) 10Dzahn: [C: 032] "RT #80 :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124576 (owner: 10Dzahn) [13:21:58] <_joe_> ori: If you're here, please let me know :) [13:26:57] _joe_: Couple of hours from now [13:27:05] Though, he is around early sometimes [13:27:31] <_joe_> Reedy: thanks [13:30:38] (03CR) 10RobH: [C: 031] Tool Labs: forcibly upgrade libssl [operations/puppet] - 10https://gerrit.wikimedia.org/r/124594 (owner: 10coren) [13:31:20] ottomata: welcome! [13:31:34] can you help me get started today? [13:31:42] (03CR) 10coren: [C: 032] Tool Labs: forcibly upgrade libssl [operations/puppet] - 10https://gerrit.wikimedia.org/r/124594 (owner: 10coren) [13:31:50] manybubbles: We have an extension for that [13:31:51] * Reedy grins [13:31:57] Reedy: thanks! [13:32:01] I totally used it a while ago [13:32:27] Reedy: Because we're using /r/ to mark the reverse proxy ... [13:32:33] Reedy: https://gerrit.wikimedia.org/r/changes/?q=status:merged+age%3A0d&o=DETAILED_ACCOUNTS&n=100 [13:32:37] Reedy: ^ should work [13:32:47] Aha, sweet! [13:33:43] (03PS1) 10RobH: replace blog.wikimedia.org certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/124595 [13:35:07] ottomata: I need Elasticsearch 1.1.0 shoved into apt [13:35:37] (03PS2) 10RobH: replace blog.wikimedia.org certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/124595 [13:36:15] qchris: thanks [13:36:22] yw [13:37:04] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:37:33] !log restarting gitblit [13:37:33] (03CR) 10RobH: [C: 032] replace blog.wikimedia.org certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/124595 (owner: 10RobH) [13:37:37] Logged the message, Master [13:39:00] !log replacing the blog cert, if holmium crashes I didn't do it correctly. [13:39:01] (03PS1) 10Faidon Liambotis: Revert "Giving Nik shell access to analytics1004 to do some elasticsearch load testing" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124597 [13:39:03] manybubbles: ok! [13:39:03] Logged the message, RobH [13:39:04] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 305803 bytes in 9.337 second response time [13:39:08] thanks! 
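For reference, the working form of the Gerrit query discussed above: on this install the REST API lives under the /r/ prefix (which is why the un-prefixed URL fails), and Gerrit prepends an XSSI guard line that has to be stripped before the JSON parses. A hedged example:

    curl -s 'https://gerrit.wikimedia.org/r/changes/?q=status:merged+age:0d&o=DETAILED_ACCOUNTS&n=100' \
      | tail -n +2 | python -m json.tool | head     # drop the )]}' guard line, then pretty-print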
[13:39:28] !log update & reboot tellurium [13:39:33] Logged the message, Master [13:39:47] (03CR) 10jenkins-bot: [V: 04-1] Revert "Giving Nik shell access to analytics1004 to do some elasticsearch load testing" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124597 (owner: 10Faidon Liambotis) [13:41:14] PROBLEM - Host tellurium is DOWN: PING CRITICAL - Packet loss = 100% [13:42:38] (03PS2) 10Faidon Liambotis: Revert "Giving Nik shell access to analytics1004 to do some elasticsearch load testing" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124597 [13:43:27] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Revert "Giving Nik shell access to analytics1004 to do some elasticsearch load testing" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124597 (owner: 10Faidon Liambotis) [13:44:28] (03CR) 10Manybubbles: "Is there a better place to run this?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124597 (owner: 10Faidon Liambotis) [13:45:14] RECOVERY - Host tellurium is UP: PING OK - Packet loss = 0%, RTA = 1.11 ms [13:46:13] !log upgraded libssl on holmium [13:46:18] Logged the message, RobH [13:48:49] ottomata: kafka upgrade doesn't work on an1004 [13:49:41] paravoid, analytics1004 (and analytics1003) were kafka test brokers, and were never productionized or puppetized [13:49:50] i thought I had removed kafka from analytics1004, actually [13:50:38] ottomata: can you install git fat on tin? [13:50:42] I cannot [13:50:46] hm, sure, why do you need git-fat there? [13:50:55] to git deploy [13:50:58] to Elasticsearch [13:51:07] the plugins [13:51:14] or is there another server [13:51:17] you don't need git-fat on tin though [13:51:23] the git-fat commands are run on deploy hosts [13:51:27] on the targets [13:51:46] huh, I'm used to running it on the server to check the jars got there. I'll just do it without and see [13:53:21] ottomata: that worked as you said it would [13:53:35] !log synced first Elasticsearch plugin to production Elasticsearch servers [13:53:39] Logged the message, Master [13:54:01] !log they'll pick it up during the rolling restart today to upgrade to 1.1.0 [13:54:05] Logged the message, Master [13:54:08] cool [13:54:18] manybubbles: i was going to start reinstalling an elasticsearch server today [13:54:33] ottomata: not a _great_ day for it [13:54:37] because I'm upgrading to 1.1.0 [13:54:43] ok [13:54:45] that is on the deployment calendar and everything [13:55:05] maybe tomorrow? [13:57:09] sure [14:04:07] ottomata: please ping me when you get a chance to update apt [14:04:35] i was about to do it, but am in standup now [14:04:36] um [14:04:41] q for akosiaris, if you are around [14:04:54] I should change VerifyRelease, right?
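On the VerifyRelease question: reprepro's conf/updates entry wants the ID of the key that signed the upstream Release file, so one way to find it (a sketch, with the URL assumed from the elasticsearch 1.1 apt repo layout) is to verify the downloaded Release.gpg by hand and read the key ID off gpg's output:

    wget -q http://packages.elasticsearch.org/elasticsearch/1.1/debian/dists/stable/Release
    wget -q http://packages.elasticsearch.org/elasticsearch/1.1/debian/dists/stable/Release.gpg
    gpg --verify Release.gpg Release     # the "using RSA key ..." line gives the ID to put in VerifyRelease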
[14:04:54] PROBLEM - DPKG on labstore4 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:04:59] i'm trying to find the right thing to change it to [14:05:14] i downloaded 1.1's Release.gpg and am doing what the reprepro man page says to do [14:05:17] but am not sure [14:05:23] the output doesn't look like what you have [14:05:54] RECOVERY - DPKG on labstore4 is OK: All packages OK [14:09:44] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:09:44] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:09:44] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:09:44] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [14:11:17] (03PS1) 10Andrew Bogott: Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 [14:18:13] (03PS2) 10Andrew Bogott: Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 [14:19:21] (03PS1) 10Ottomata: reprepro/updates - upgrading elasticsearch to 1.1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/124603 [14:20:08] (03CR) 10Ottomata: [C: 032 V: 032] reprepro/updates - upgrading elasticsearch to 1.1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/124603 (owner: 10Ottomata) [14:23:54] PROBLEM - HTTPS on ssl1002 is CRITICAL: Connection refused [14:24:06] manybubbles: http://apt.wikimedia.org/wikimedia/pool/main/e/elasticsearch/ [14:24:09] look ok? [14:28:54] RECOVERY - HTTPS on ssl1002 is OK: OK - Certificate will expire on 01/20/2016 12:00. [14:29:45] ottomata: looks good - let me try elastic1001 [14:30:35] (03PS3) 10Andrew Bogott: Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 [14:30:57] mutante, ^ pls? [14:31:37] !log upgrading elastic1001 [14:31:42] Logged the message, Master [14:32:38] !log woops, just restarted elastic1002. silly me [14:32:42] Logged the message, Master [14:32:46] !log no harm done, just lost time [14:32:50] Logged the message, Master [14:33:53] ottomata: can you make nagios not bother us about Elasticsearch warning over the next few hours? [14:33:56] I'm paying attention [14:34:25] uh hm [14:35:43] i think so, how long manybubbles [14:35:45] 4 hours? [14:35:48] sure! [14:36:14] PROBLEM - NTP peers on linne is CRITICAL: NTP CRITICAL: Server not synchronized, Offset unknown [14:38:14] RECOVERY - NTP peers on linne is OK: NTP OK: Offset 0.016747 secs [14:44:43] andrewbogott: https://gerrit.wikimedia.org/r/#/c/77332/7/modules/base/manifests/monitoring/host.pp [14:44:51] (03PS4) 10Andrew Bogott: Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 [14:54:18] (03PS5) 10Andrew Bogott: Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 [14:54:59] (03PS3) 10Cmjohnson: add interface speed check for all hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/124606 [15:01:42] mutante: can you review https://gerrit.wikimedia.org/r/124606 [15:02:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Great idea. Minor stuff here and there like making it parameterizable but looks nice." 
(036 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/124606 (owner: 10Cmjohnson) [15:03:10] manybubbles: i think I just scheduled downtime in icinga for elastic search for the next ~4 hours [15:03:19] never done that before, so not sure what it will do [15:03:47] (03PS1) 10Rush: module to manage new python-diamond package [operations/puppet] - 10https://gerrit.wikimedia.org/r/124608 [15:04:54] ottomata: its cool! [15:04:56] thanks [15:07:45] (03CR) 10Ottomata: module to manage new python-diamond package (035 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/124608 (owner: 10Rush) [15:08:18] (03CR) 10Dzahn: [C: 031] Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 (owner: 10Andrew Bogott) [15:12:34] (03PS2) 10Rush: module to manage new python-diamond package [operations/puppet] - 10https://gerrit.wikimedia.org/r/124608 [15:13:35] (03CR) 10jenkins-bot: [V: 04-1] module to manage new python-diamond package [operations/puppet] - 10https://gerrit.wikimedia.org/r/124608 (owner: 10Rush) [15:15:36] (03PS3) 10Rush: module to manage new python-diamond package [operations/puppet] - 10https://gerrit.wikimedia.org/r/124608 [15:16:34] PROBLEM - Host virt1000 is DOWN: CRITICAL - Host Unreachable (208.80.154.18) [15:16:42] !log all ssl servers in eqiad have been updated with new cert and restarted [15:16:51] !log rolling updates on ssl3001-3003 presently [15:17:10] (03PS1) 10Dzahn: enable base monitoring for ALL hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/124609 [15:17:24] PROBLEM - Host labs-ns1.wikimedia.org is DOWN: CRITICAL - Host Unreachable (208.80.154.19) [15:18:04] RECOVERY - Host virt1000 is UP: PING OK - Packet loss = 0%, RTA = 0.55 ms [15:19:03] (03CR) 10Andrew Bogott: [C: 032] Install and use check_ssl_cert tool to validate certs. [operations/puppet] - 10https://gerrit.wikimedia.org/r/124601 (owner: 10Andrew Bogott) [15:19:04] RECOVERY - Host labs-ns1.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [15:19:07] apergos: https://gerrit.wikimedia.org/r/#/c/124609/1 [15:19:46] ugly, eh.. 
since i have to change all those lines because of indentation :p [15:22:25] (03CR) 10ArielGlenn: [C: 031] enable base monitoring for ALL hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/124609 (owner: 10Dzahn) [15:22:39] (03CR) 10Dzahn: [C: 032] enable base monitoring for ALL hosts [operations/puppet] - 10https://gerrit.wikimedia.org/r/124609 (owner: 10Dzahn) [15:23:46] (03CR) 10Ottomata: module to manage new python-diamond package (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/124608 (owner: 10Rush) [15:27:31] PROBLEM - HTTPS on cp4009 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:41] PROBLEM - HTTPS on ssl3003 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:41] PROBLEM - HTTPS on ssl1006 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:41] PROBLEM - HTTPS on cp4014 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:51] PROBLEM - HTTPS on ssl1004 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:51] PROBLEM - HTTPS on ssl1005 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:51] PROBLEM - HTTPS on cp4008 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:51] PROBLEM - HTTPS on cp4004 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:51] PROBLEM - HTTPS on cp4015 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:52] PROBLEM - HTTPS on cp4001 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:52] PROBLEM - HTTPS on cp4017 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:53] PROBLEM - HTTPS on amssq47 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:53] PROBLEM - HTTPS on ssl1002 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:54] PROBLEM - HTTPS on ssl1001 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:54] PROBLEM - HTTPS on cp4005 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:27:55] PROBLEM - HTTPS on cp4012 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:01] PROBLEM - HTTPS on cp4016 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:01] PROBLEM - HTTPS on sodium is CRITICAL: SSL_CERT CRITICAL lists.wikimedia.org: invalid CN (lists.wikimedia.org does not match *.wikimedia.org) [15:28:11] PROBLEM - HTTPS on ssl1007 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:11] PROBLEM - HTTPS on iodine is CRITICAL: SSL_CERT CRITICAL ticket.wikimedia.org: invalid CN (ticket.wikimedia.org does not match *.wikimedia.org) [15:28:11] PROBLEM - HTTPS on ssl3002 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: 
invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:11] PROBLEM - HTTPS on ssl3001 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:11] PROBLEM - HTTPS on cp4018 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:12] PROBLEM - HTTPS on ssl1008 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:12] PROBLEM - HTTPS on ssl1009 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:13] PROBLEM - HTTPS on ssl1003 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:13] PROBLEM - HTTPS on cp4013 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:14] PROBLEM - HTTPS on cp4003 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:14] PROBLEM - HTTPS on cp4007 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:15] PROBLEM - HTTPS on cp4011 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:15] PROBLEM - HTTPS on cp4010 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:21] PROBLEM - HTTPS on cp4020 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:21] PROBLEM - HTTPS on cp4006 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:31] PROBLEM - HTTPS on cp4002 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:28:31] PROBLEM - HTTPS on cp4019 is CRITICAL: SSL_CERT CRITICAL *.wikipedia.org: invalid CN (*.wikipedia.org does not match *.wikimedia.org) [15:30:02] holy fun :) [15:30:37] :o [15:32:08] aude: getting to your email :) [15:32:13] ok [15:32:25] want to see if it's ok to do today [15:32:35] anytime works for us, i suppose [15:34:45] aude: tl;dr of email: yep, looks good [15:34:50] ok [15:35:07] we were smart to put i18n stuff a while ago :) [15:35:42] PROBLEM - RAID on holmium is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [15:35:52] PROBLEM - DPKG on fenari is CRITICAL: NRPE: Command check_dpkg not defined [15:36:01] the https failures are me mucking with monitoring, nothing to worry about [15:36:02] PROBLEM - Disk space on fenari is CRITICAL: NRPE: Command check_disk_space not defined [15:36:12] PROBLEM - RAID on fenari is CRITICAL: NRPE: Command check_raid not defined [15:36:22] PROBLEM - puppet disabled on fenari is CRITICAL: NRPE: Command check_puppet_disabled not defined [15:36:57] mutante: fenari is not happy :-D [15:38:21] hashar: thanks, that's cause we just added more monitoring [15:38:33] RT #80 :) [15:38:48] mutante: yeah I noticed your puppet change. Guess fenari is missing some bits [15:41:12] hashar: wasn't running nagios-nrpe-server [15:41:52] greg-g: re: SSL certs, andrewbogott is on that one [15:41:57] ops monitoring sprint over here [15:42:11] mutante: ahh, good to know who's on point for that, thanks [15:42:23] wasn't sure if it'd be an opsen party thing or not [15:42:44] it is.
ops in Athens [15:43:05] that check is new, in that it checks for validity of cert, not just expiry [15:43:18] and wikimedia vs. wikipedia thing [15:43:30] * greg-g nods [15:44:52] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 438.266663 [15:45:02] (03PS1) 10Andrew Bogott: When checking unified certs, check for *.wikipedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/124616 [15:45:32] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 434.533325 [15:46:21] (03CR) 10Andrew Bogott: [C: 032] When checking unified certs, check for *.wikipedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/124616 (owner: 10Andrew Bogott) [15:46:22] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 12:45:20 PM UTC [15:53:10] RECOVERY - RAID on fenari is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 [15:53:17] hashar: ^ :) [15:53:20] RECOVERY - puppet disabled on fenari is OK: OK [15:53:26] nice [15:53:40] RECOVERY - Disk space on fenari is OK: DISK OK [15:53:41] RT #80 ftw [15:53:48] With any luck there'll be another flood of OKs in a minute... [15:53:50] RECOVERY - DPKG on fenari is OK: All packages OK [15:54:10] PROBLEM - puppet disabled on bast1001 is CRITICAL: NRPE: Command check_puppet_disabled not defined [15:54:10] PROBLEM - Disk space on cp3003 is CRITICAL: NRPE: Command check_disk_space not defined [15:54:10] PROBLEM - Disk space on dobson is CRITICAL: Connection refused by host [15:54:10] PROBLEM - DPKG on pdf2 is CRITICAL: Connection refused by host [15:54:20] PROBLEM - puppet disabled on iron is CRITICAL: NRPE: Command check_puppet_disabled not defined [15:54:20] PROBLEM - RAID on dobson is CRITICAL: Connection refused by host [15:54:20] PROBLEM - RAID on cp3003 is CRITICAL: NRPE: Command check_raid not defined [15:54:20] PROBLEM - Disk space on pdf2 is CRITICAL: Connection refused by host [15:54:30] PROBLEM - puppet disabled on dobson is CRITICAL: Connection refused by host [15:54:30] PROBLEM - RAID on pdf2 is CRITICAL: Connection refused by host [15:54:30] PROBLEM - DPKG on iodine is CRITICAL: NRPE: Command check_dpkg not defined [15:54:30] PROBLEM - puppet disabled on pdf2 is CRITICAL: Connection refused by host [15:54:40] PROBLEM - Disk space on iodine is CRITICAL: NRPE: Command check_disk_space not defined [15:54:40] PROBLEM - puppet disabled on cp3003 is CRITICAL: NRPE: Command check_puppet_disabled not defined [15:54:40] PROBLEM - DPKG on pdf3 is CRITICAL: Connection refused by host [15:54:48] that's not what I meant [15:54:50] PROBLEM - RAID on iodine is CRITICAL: NRPE: Command check_raid not defined [15:54:50] PROBLEM - Disk space on pdf3 is CRITICAL: Connection refused by host [15:54:50] PROBLEM - DPKG on tridge is CRITICAL: NRPE: Command check_dpkg not defined [15:54:50] PROBLEM - DPKG on bast1001 is CRITICAL: NRPE: Command check_dpkg not defined [15:54:51] PROBLEM - puppet disabled on iodine is CRITICAL: NRPE: Command check_puppet_disabled not defined [15:54:51] PROBLEM - RAID on pdf3 is CRITICAL: Connection refused by host [15:54:51] PROBLEM - Disk space on tridge is CRITICAL: NRPE: Command check_disk_space not defined [15:55:00] PROBLEM - Disk space on bast1001 is CRITICAL: NRPE: Command check_disk_space not defined [15:55:00] PROBLEM - puppet disabled on pdf3 is CRITICAL: Connection refused by host [15:55:10] PROBLEM - Disk space on iron is CRITICAL: NRPE: Command 
check_disk_space not defined [15:55:10] PROBLEM - RAID on bast1001 is CRITICAL: NRPE: Command check_raid not defined [15:55:10] PROBLEM - DPKG on dobson is CRITICAL: Connection refused by host [15:55:10] PROBLEM - DPKG on cp3003 is CRITICAL: NRPE: Command check_dpkg not defined [15:55:10] PROBLEM - DPKG on virt1000 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:55:10] PROBLEM - puppet disabled on tridge is CRITICAL: NRPE: Command check_puppet_disabled not defined [15:55:41] ahhh, so today is going to be a worthless -operations channel day, more than normal, due to the sprint? :) [15:56:03] We're about to all go to dinner though. [15:56:09] So things should quiet down shortly. [15:56:10] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 12:55:50 PM UTC [15:56:19] But the channel will still be useless if you want to talk to ops :) [15:56:50] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [15:57:03] will start nagios-nrpe-server on those [15:57:10] PROBLEM - Puppet freshness on lvs1003 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 12:56:15 PM UTC [15:58:42] RECOVERY - HTTPS on ssl3001 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [15:58:42] RECOVERY - HTTPS on ssl1006 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [15:58:52] RECOVERY - HTTPS on ssl1007 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [15:58:52] RECOVERY - HTTPS on ssl1002 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [15:59:32] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:59:52] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [16:00:04] back in 5 min or so [16:00:06] (03Abandoned) 10Physikerwelt: WIP: Enable orthogonal MathJax config [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/110240 (owner: 10Physikerwelt) [16:00:42] PROBLEM - DPKG on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:00:42] PROBLEM - Disk space on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:00:52] PROBLEM - RAID on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [16:01:02] PROBLEM - puppet disabled on mchenry is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
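Once nagios-nrpe-server is running on a host, the NRPE alerts above can be reproduced by hand from the monitoring side; a sketch, with the plugin path of a standard Debian/Ubuntu install and a host name picked from the alerts:

    /usr/lib/nagios/plugins/check_nrpe -H fenari.wikimedia.org -c check_disk_space
    # "Connection refused"               -> the nrpe daemon isn't running on the target
    # "Command ... not defined"          -> daemon is up but the nrpe command snippet is missing
    # "Could not complete SSL handshake" -> often the caller isn't in allowed_hosts, or SSL options mismatch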
[16:02:22] PROBLEM - Puppet freshness on ms6 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:02:03 PM UTC [16:04:37] back [16:08:22] PROBLEM - Puppet freshness on amslvs3 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:07:31 PM UTC [16:09:27] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:09:07 PM UTC [16:09:27] PROBLEM - Puppet freshness on lvs4003 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:08:32 PM UTC [16:09:27] RECOVERY - HTTPS on cp4020 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:27] RECOVERY - HTTPS on cp4006 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:27] RECOVERY - HTTPS on cp4013 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:37] RECOVERY - HTTPS on cp4009 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:37] RECOVERY - HTTPS on cp4010 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:37] RECOVERY - HTTPS on ssl3003 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:47] RECOVERY - HTTPS on ssl3002 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:47] RECOVERY - HTTPS on ssl1004 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:09:56] ottomata: ping [16:09:57] RECOVERY - HTTPS on cp4012 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:10:07] RECOVERY - HTTPS on cp4016 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:10:07] RECOVERY - HTTPS on ssl1008 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:10:07] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [16:10:07] RECOVERY - HTTPS on cp4018 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:10:17] RECOVERY - HTTPS on ssl1009 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:11:23] ottomata: ping ping [16:12:47] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [16:12:49] pong pong [16:13:05] paravoid [16:13:08] wassupp [16:13:14] what's with stat1's puppet? [16:13:18] why is it admin disabled? 
[16:13:47] because it is going to be decomed very soon [16:13:56] and i wanted to make puppet changes that would apply to stat1003 but not mess with what was on stat1 [16:14:05] and I didn't want to re-write a bunch of statistics.pp stuff :/ [16:14:07] <_joe_> ori: are you around? seems like graphite is *not* working [16:14:24] ottomata: that's bad [16:14:27] PROBLEM - Puppet freshness on lvs1002 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:13:54 PM UTC [16:14:35] paravoid: even if we are going to decom it soon? [16:14:36] ottomata: can you remove the "include statistics*" stuff and enable it again? [16:14:40] yes [16:14:42] yeah probably can [16:14:47] because it's messing with monitoring and all that [16:15:06] ah i see it [16:15:20] paravoid, what is the difference between the 3 numbers in each severity category in icinga? [16:15:25] ottomata: disabling puppet for more than a few hours max is almost always a really bad idea [16:15:31] mark, ok, noted. [16:15:36] thanks [16:16:27] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:16:04 PM UTC [16:16:27] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [16:17:07] <_joe_> :/ [16:17:27] PROBLEM - Puppet freshness on dataset1001 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:16:39 PM UTC [16:18:10] mark, can you help with the current network ACL problems? [16:18:22] sorry, what's that? [16:18:25] analytics nodes can't talk to apt [16:18:27] PROBLEM - Puppet freshness on lvs4001 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:17:50 PM UTC [16:18:30] nor statsd.eqiad.wmnet [16:18:32] https://rt.wikimedia.org/Ticket/Display.html?id=4433 [16:18:37] I added to the bottom of that ticket [16:18:51] ok [16:18:59] i think vanadium was having the same trouble, is it on the vlan too? [16:19:27] PROBLEM - Puppet freshness on amslvs1 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:19:10 PM UTC [16:19:31] still working on wikiquote [16:19:35] we can look at getting rid of those ACLs perhaps [16:19:41] but we'll need to discuss what you're doing with firewalling [16:20:18] (03PS1) 10Ottomata: Disabling statistics roles on stat1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/124621 [16:20:18] the fingerprint of the wikis SSL cert apparently changed, but it is not a newly issued cert but with the same dates as the previous one that i saved. Is that okay that the fingerprint changed? [16:20:34] mark, yeah, hm, not sure, i kind of like them [16:20:45] especially since anyone with hadoop access can launch whatever mapreduce jobs they want [16:21:37] (03CR) 10Ottomata: [C: 032 V: 032] Disabling statistics roles on stat1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/124621 (owner: 10Ottomata) [16:21:37] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [16:21:44] hmmmm [16:21:48] that's weird [16:21:59] checking on that 5xx thing in a sec [16:22:05] that's surely my fault... [16:22:27] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:21:21 PM UTC [16:22:27] PROBLEM - Puppet freshness on lvs1001 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:21:26 PM UTC [16:22:27] PROBLEM - Puppet freshness on lvs4002 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:22:07 PM UTC [16:22:53] hmm, graphite down?
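On the recurring "No output from Graphite for target(s): reqstats.5xx" alerts and the "graphite down?" question: a quick way to tell an empty metric from a dead graphite-web is to hit the render API directly; a sketch, assuming the tungsten graphite instance is reachable as graphite.wikimedia.org:

    curl -s 'https://graphite.wikimedia.org/render?target=reqstats.5xx&from=-1hour&format=json' | python -m json.tool | tail
    # all-null datapoints -> nothing is feeding the metric; an HTTP error -> graphite-web itself is down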
[16:23:04] ottomata: statsd access for analytics seems already there [16:23:07] maybe that 5xx thing is not my fault! [16:23:26] yeah, mark, i think we already had these set up too [16:23:27] PROBLEM - Puppet freshness on virt2 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:22:28 PM UTC [16:23:37] RECOVERY - Puppet freshness on stat1 is OK: puppet ran at Tue Apr 8 16:23:30 UTC 2014 [16:23:43] but it seems that they aren't working right now, starting yesterday when I tried [16:24:02] (03PS1) 10Hashar: beta: reenable fatalmonitor script on eqiad [operations/puppet] - 10https://gerrit.wikimedia.org/r/124624 [16:24:13] and carbon is in there already too [16:24:15] mark, unless pings just aren't allowed and i'm checking wrong? [16:24:24] pings may not be allowed no [16:24:27] ori and I both had trouble running apt-get update because we couldn't talk to carbon [16:24:31] check again? [16:24:35] yeah checking [16:24:48] and i was trying to run sqstat on analytics1003 [16:24:52] so we can decom emery [16:24:59] but it couldn't talk to statsd [16:25:38] hm. [16:25:44] yeah totally working now [16:25:57] ooooook. [16:25:59] weird. [16:26:00] <_joe_> ottomata: graphite is borked [16:26:04] i think faidon did it earlier [16:26:05] (03CR) 10Hashar: "puppet is broken on deployment-bastion.eqiad.wmflabs, can't deploy the change right now :-/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124624 (owner: 10Hashar) [16:26:21] oh, fixed the acl problem? [16:26:33] maybe something else was just not working, and I assumed because I couldn't ping it was an ACL thing? [16:26:55] ping is not a good way to test that [16:27:10] yeah, i just saw the packets being filtered from ping [16:27:11] we allow specific protocols/ports, ping uses different ones [16:27:14] aye [16:27:30] yeah, just figured if i couldn't at least ping then probably other stuff was blocked too, but ja [16:27:57] but yeah, ori couldn't use apt on vanadium either, so dunno... [16:28:10] and sqstat couldn't talk to tungsten, so hm [16:28:12] but ok! [16:28:16] :) [16:28:22] we're going for dinner in a bit [16:28:44] mark [16:28:45] hm [16:28:53] so sqstat is trying to talk to tungsten on 2003 [16:28:56] !log Jenkins: killed jenkins-slave java process on gallium and repooled gallium slave. It was no more registered in Zuul :-/ [16:28:57] RECOVERY - puppet disabled on iron is OK: OK [16:29:01] is that open? [16:29:07] RECOVERY - Disk space on iron is OK: DISK OK [16:29:09] can't seem to reach it from an03 [16:29:34] ganglia seems upset [16:29:40] protocol udp; [16:29:40] destination-port 8125; [16:29:45] tables added [16:29:51] so port 2003 isn't [16:29:54] ah ok [16:30:03] that's why then, could you add? [16:30:13] ok [16:30:40] i'm going to see if reqstats gets flaky when we move it to analytics1003 [16:30:51] it was either flaky because erbium is busy [16:30:57] or because the multicast firehose is just too lossy [16:31:37] !log added sites and site_identifiers core tables on wikiquote [16:31:41] Logged the message, Master [16:32:22] 2003 should work now [16:33:36] RECOVERY - DPKG on iodine is OK: All packages OK [16:33:36] RECOVERY - Disk space on iodine is OK: DISK OK [16:33:36] RECOVERY - puppet disabled on cp3003 is OK: OK [16:33:39] ah just noticed it is udp, mark, will that work still?
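Since the analytics VLAN ACL only passes specific protocol/port pairs, ping is a poor reachability test for the holes just opened; a hedged sketch of exercising the two paths from an analytics host instead (hostnames as used above; UDP gives no feedback, so confirm arrival with tcpdump on the far end):

    echo "test.metric 1 $(date +%s)" | nc -u -w 1 tungsten.eqiad.wmnet 2003    # carbon plaintext line over UDP
    echo "test.counter:1|c" | nc -u -w 1 statsd.eqiad.wmnet 8125               # statsd counter datagram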
[16:33:46] RECOVERY - HTTPS on cp4014 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:46] RECOVERY - RAID on cp3003 is OK: OK: optimal, 2 logical, 2 physical [16:33:46] RECOVERY - RAID on iodine is OK: OK: no disks configured for RAID [16:33:46] RECOVERY - HTTPS on ssl1005 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:46] RECOVERY - HTTPS on cp4003 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:47] yes [16:33:51] ok cool [16:33:52] thanks [16:33:53] ok go eat [16:33:55] thank you! [16:33:56] RECOVERY - DPKG on bast1001 is OK: All packages OK [16:33:56] RECOVERY - puppet disabled on iodine is OK: OK [16:33:56] RECOVERY - HTTPS on cp4002 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:56] RECOVERY - HTTPS on amssq47 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:56] RECOVERY - HTTPS on cp4004 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:57] RECOVERY - HTTPS on cp4001 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:57] RECOVERY - HTTPS on cp4017 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:58] RECOVERY - HTTPS on cp4015 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:58] RECOVERY - HTTPS on cp4008 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:59] RECOVERY - HTTPS on ssl1001 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:33:59] RECOVERY - HTTPS on cp4005 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:34:00] RECOVERY - Disk space on bast1001 is OK: DISK OK [16:34:00] RECOVERY - HTTPS on cp4019 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:34:06] RECOVERY - RAID on bast1001 is OK: OK: no RAID installed [16:34:06] RECOVERY - DPKG on cp3003 is OK: All packages OK [16:34:06] RECOVERY - HTTPS on ssl1003 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:34:06] RECOVERY - HTTPS on cp4007 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:34:16] RECOVERY - puppet disabled on bast1001 is OK: OK [16:34:16] RECOVERY - Disk space on cp3003 is OK: DISK OK [16:34:16] RECOVERY - 
HTTPS on cp4011 is OK: SSL_CERT OK - X.509 certificate for *.wikipedia.org from DigiCert High Assurance CA-3 valid until Jan 20 12:00:00 2016 GMT (expires in 652 days) [16:35:36] PROBLEM - Puppet freshness on lvs4004 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 01:35:09 PM UTC [16:35:46] PROBLEM - HTTPS on cp1044 is CRITICAL: SSL_CERT CRITICAL *.wikimedia.org: invalid CN (*.wikimedia.org does not match *.wikipedia.org) [16:35:56] PROBLEM - HTTPS on cp1043 is CRITICAL: SSL_CERT CRITICAL *.wikimedia.org: invalid CN (*.wikimedia.org does not match *.wikipedia.org) [16:36:48] (03PS1) 10Ottomata: Putting sqstat back on analytics1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/124630 [16:37:16] (03CR) 10Ottomata: [C: 032 V: 032] Putting sqstat back on analytics1003 [operations/puppet] - 10https://gerrit.wikimedia.org/r/124630 (owner: 10Ottomata) [16:38:30] (03PS1) 10Springle: invalid MariaDB variable name: user_stat [operations/puppet] - 10https://gerrit.wikimedia.org/r/124632 [16:40:40] (03CR) 10Springle: [C: 032] invalid MariaDB variable name: user_stat [operations/puppet] - 10https://gerrit.wikimedia.org/r/124632 (owner: 10Springle) [16:46:50] (03PS1) 10RobH: replace misc-web-lb cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/124634 [16:48:11] (03CR) 10RobH: [C: 032 V: 032] replace misc-web-lb cert [operations/puppet] - 10https://gerrit.wikimedia.org/r/124634 (owner: 10RobH) [16:49:09] sorry, being slow... populating sites table [16:49:20] (03PS1) 10Alexandros Kosiaris: Removing ethtool package from other places [operations/puppet] - 10https://gerrit.wikimedia.org/r/124637 [16:49:22] suppose no hurry [16:50:08] (03CR) 10Dzahn: [C: 031] Removing ethtool package from other places [operations/puppet] - 10https://gerrit.wikimedia.org/r/124637 (owner: 10Alexandros Kosiaris) [16:52:03] (03CR) 10Dzahn: [C: 032] "now included in base" [operations/puppet] - 10https://gerrit.wikimedia.org/r/124637 (owner: 10Alexandros Kosiaris) [16:53:08] (03CR) 10Cmcmahon: [C: 031] "Thanks for putting this back." 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/124624 (owner: 10Hashar) [16:53:36] RECOVERY - Puppet freshness on virt2 is OK: puppet ran at Tue Apr 8 16:53:29 UTC 2014 [16:53:46] RECOVERY - Puppet freshness on dataset1001 is OK: puppet ran at Tue Apr 8 16:53:39 UTC 2014 [16:55:06] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [16:55:28] rats [16:56:36] RECOVERY - Puppet freshness on amslvs2 is OK: puppet ran at Tue Apr 8 16:56:30 UTC 2014 [16:56:46] RECOVERY - Puppet freshness on lvs1003 is OK: puppet ran at Tue Apr 8 16:56:45 UTC 2014 [16:59:04] waiting for jenkins [17:01:46] RECOVERY - Puppet freshness on ms6 is OK: puppet ran at Tue Apr 8 17:01:37 UTC 2014 [17:01:48] (03PS2) 10Manybubbles: Turn on experimental highlighting in beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124003 [17:03:06] !log aude synchronized php-1.23wmf20/extensions/Wikidata 'Update Wikidata build, to allow populating sites table on wikiquote' [17:03:10] Logged the message, Master [17:05:20] RECOVERY - Puppet freshness on lvs4004 is OK: puppet ran at Tue Apr 8 17:05:14 UTC 2014 [17:05:30] PROBLEM - RAID on dataset1001 is CRITICAL: CRITICAL: 1 failed LD(s) (Partially Degraded) [17:06:40] PROBLEM - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection refused [17:07:40] RECOVERY - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 226 bytes in 0.012 second response time [17:08:20] RECOVERY - Puppet freshness on amslvs3 is OK: puppet ran at Tue Apr 8 17:08:15 UTC 2014 [17:08:30] RECOVERY - Puppet freshness on lvs4003 is OK: puppet ran at Tue Apr 8 17:08:25 UTC 2014 [17:08:44] (03CR) 10Chad: [C: 032] Turn on experimental highlighting in beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124003 (owner: 10Manybubbles) [17:08:53] (03Merged) 10jenkins-bot: Turn on experimental highlighting in beta [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124003 (owner: 10Manybubbles) [17:09:40] RECOVERY - Puppet freshness on lvs1006 is OK: puppet ran at Tue Apr 8 17:09:30 UTC 2014 [17:10:10] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:10:10] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:10:10] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:10:10] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [17:10:19] (03CR) 10QChris: "Prerequisite got merged." [operations/puppet] - 10https://gerrit.wikimedia.org/r/121546 (owner: 10Ottomata) [17:10:52] ^demon|away: are you deploying stuff? 
[17:11:14] i'll need to sneak in some point for a config change, but not yet [17:11:29] (03PS1) 10Ottomata: Moving sqstat back to emery :/ [operations/puppet] - 10https://gerrit.wikimedia.org/r/124641 [17:11:38] (03PS2) 10Ottomata: Moving sqstat back to emery :/ [operations/puppet] - 10https://gerrit.wikimedia.org/r/124641 [17:11:40] (03CR) 10jenkins-bot: [V: 04-1] Moving sqstat back to emery :/ [operations/puppet] - 10https://gerrit.wikimedia.org/r/124641 (owner: 10Ottomata) [17:11:50] (03CR) 10Ottomata: [C: 032 V: 032] Moving sqstat back to emery :/ [operations/puppet] - 10https://gerrit.wikimedia.org/r/124641 (owner: 10Ottomata) [17:12:28] aude: no, he just merged something for beta [17:12:34] ok [17:12:41] probably need 10 more minutes [17:12:50] done populating tables, now checking they are ok [17:13:00] then can do the config change and then done :) [17:13:19] <^demon|away> aude: Nope, just merged that for Nik for beta. [17:13:21] <^demon|away> Like he said :) [17:13:22] going slow and careful since i'm still newish [17:13:25] doign this stuff [17:13:32] <^demon|away> Someone should sync it eventually for consistency, but no biggie. [17:13:53] i can do [17:14:04] so can I [17:14:29] hoo: want to check the sites tables and site_identifiers for wikiquote? [17:14:30] RECOVERY - Puppet freshness on lvs1002 is OK: puppet ran at Tue Apr 8 17:14:22 UTC 2014 [17:14:36] they look ok to me [17:15:30] RECOVERY - Puppet freshness on lvs1005 is OK: puppet ran at Tue Apr 8 17:15:22 UTC 2014 [17:16:02] (03CR) 10Aude: "sites table and site_identifiers are added and populated" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124516 (owner: 10Aude) [17:16:10] RECOVERY - Puppet freshness on lvs1004 is OK: puppet ran at Tue Apr 8 17:16:02 UTC 2014 [17:16:28] !log finished upgrading elastic1001-1006. starting on 1007. yay progress. [17:16:32] Logged the message, Master [17:16:34] enwikiqoute looks good to me [17:16:39] alright [17:16:40] sites and site_identifiers [17:16:44] strip protocals and all [17:16:52] yep [17:16:58] https://gerrit.wikimedia.org/r/#/c/124516/ want to merge [17:17:07] i can deploy it and sync the cirrus thing [17:17:19] thanks1 [17:17:22] ok, also looks good on WD [17:17:30] ok [17:17:45] let me sync cirrus [17:17:52] go ahead [17:17:53] Oh, today is the day [17:18:06] it's *the* day :) [17:18:10] RECOVERY - Puppet freshness on lvs4001 is OK: puppet ran at Tue Apr 8 17:18:03 UTC 2014 [17:19:18] aude: You also sorted the wikidataclient dblist? :P [17:19:53] yes [17:20:04] Ok, looks good to me, can approve whenever you want [17:20:05] they will get sorted eventually [17:20:13] doing chad's thing [17:20:30] RECOVERY - Puppet freshness on amslvs1 is OK: puppet ran at Tue Apr 8 17:20:23 UTC 2014 [17:21:30] RECOVERY - Puppet freshness on lvs1001 is OK: puppet ran at Tue Apr 8 17:21:24 UTC 2014 [17:21:50] RECOVERY - Puppet freshness on amslvs4 is OK: puppet ran at Tue Apr 8 17:21:45 UTC 2014 [17:22:30] RECOVERY - Puppet freshness on lvs4002 is OK: puppet ran at Tue Apr 8 17:22:21 UTC 2014 [17:22:43] !log aude synchronized wmf-config/CirrusSearch-labs.php 'config change for beta, to enable highlighting' [17:22:47] Logged the message, Master [17:23:06] hoo: ready [17:23:45] (03CR) 10Hoo man: [C: 032] "Preparation finished, so do this! \o/" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124516 (owner: 10Aude) [17:23:49] yay! 
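For anyone repeating this for another wiki group, the steps aude walked through boil down to running the Wikibase sites-table population script against each new client wiki and then spot-checking the result. A hedged sketch, assuming the usual mwscript wrapper and that the script sits at the same path inside the Wikidata build as the rest of Wikibase lib; the path and column names are my assumptions, not copied from what was actually run:

    # populate sites / site_identifiers for one of the new client wikis
    mwscript extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --wiki=enwikiquote
    # then spot-check from a SQL prompt that the wikiquote group is present, e.g.
    #   SELECT site_global_key, site_group FROM sites WHERE site_group = 'wikiquote' LIMIT 5;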
[17:23:51] there you go ;) [17:23:53] (03Merged) 10jenkins-bot: Enable Wikibase on Wikiquote [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124516 (owner: 10Aude) [17:27:20] aude: About to sync or shall I take it? [17:27:21] sync dblist then wmf-config? [17:27:31] * Nemo_bis waiting [17:27:43] no other way [17:27:52] other way round sounds sane [17:28:02] wmf-config then dblist is good [17:28:06] wmf-config changes will work w/o the rest [17:28:10] right [17:28:20] that' what ree-dy did for wikisource [17:28:52] doing [17:28:55] :) [17:28:59] !log aude synchronized wmf-config 'config changes to enable Wikibase on Wikiquote' [17:29:04] Logged the message, Master [17:29:12] (03PS1) 10Matthias Mullie: Increase Flow cache version [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124646 [17:29:52] !log aude synchronized wikidataclient.dblist 'Enable Wikibase on Wikiquote' [17:29:57] Logged the message, Master [17:30:01] oO [17:30:02] :) [17:30:12] alright time to check it's all good [17:30:17] on that [17:31:13] oh well... I think we have to bump wgCacheEpoch once again [17:31:14] aude: ^ [17:31:36] huh [17:31:45] ah, yes [17:32:00] shall I patch or will you? [17:32:26] https://www.wikidata.org/wiki/Q189119#sitelinks-wikiquote [17:32:34] Nemo_bis: Yes, the usual stuff [17:32:34] go ahead [17:33:06] it says list of values is complete [17:33:09] i assume caching [17:33:16] on Q60 [17:33:57] debug=true, i can add wikiquote [17:34:23] yep, I did action=purge [17:34:23] (03PS1) 10Hoo man: Bump wgCacheEpoch for Wikidata after enabling Wikiquote langlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124648 [17:34:24] yep [17:34:31] aude: ^ [17:34:35] ok [17:35:21] !log restarted gmetad on nickel to fix ganglia [17:35:26] Logged the message, Master [17:35:33] (03CR) 10Aude: [C: 032] Bump wgCacheEpoch for Wikidata after enabling Wikiquote langlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124648 (owner: 10Hoo man) [17:35:40] (03Merged) 10jenkins-bot: Bump wgCacheEpoch for Wikidata after enabling Wikiquote langlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124648 (owner: 10Hoo man) [17:37:00] aude: Syncing? I have to sync a touch out [17:37:10] doing [17:37:12] ok [17:37:18] !log aude synchronized wmf-config/Wikibase.php 'bump wgCacheEpoch for wikidata after enabling wikiquote site links' [17:37:19] just being careful [17:37:22] Logged the message, Master [17:37:28] !log hoo synchronized php-1.23wmf20/extensions/Wikidata/extensions/Wikibase/lib/resources/wikibase.Site.js 'touch' [17:37:32] Logged the message, Master [17:37:34] that should purge the sites cache [17:37:43] "13:37 < aude> just being careful" +1 ;) [17:37:44] in resource loader [17:37:47] :) [17:38:25] still says complete [17:38:30] mh :/ [17:38:45] sites module has always been a pain [17:40:24] maybe php-1.23wmf20/extensions/Wikidata/extensions/Wikibase/lib/includes/modules/SitesModule.php ? 
[17:40:43] aude: Wont help, RL does timestamps based on the JS scripts [17:40:50] hmmm, ok [17:41:13] works for me [17:41:16] now at least [17:41:35] trying in firefox [17:41:39] might be my caching [17:41:42] \o/ Just added the first link [17:41:46] https://www.wikidata.org/wiki/Q40904#sitelinks-wikiquote [17:41:48] already did one :) [17:41:54] with debug=true [17:41:59] Cheating :D [17:42:11] heh [17:42:23] looks good in firefox [17:42:30] i have to assume it's my cache [17:42:31] I did one ten minutes ago already :P [17:42:35] :P [17:42:36] yay [17:42:45] Nemo_bis: with debug true, I guess?! [17:42:50] lol Heisenberg [17:42:55] 19.34 < Nemo_bis> yep, I did action=purge [17:43:01] :P [17:43:01] ah [17:43:50] Is there a procedure to delete gerrit repositories? [17:45:00] i can add links in wikidata now in chrome [17:45:09] aude: https://en.wikiquote.org/w/index.php?title=Werner_Heisenberg&action=info mh [17:45:14] why is it not showing up? [17:45:34] Guest64226 / krinkle : probably you can ask on the same gerrit queue page as usual [17:45:53] ah, I see [17:45:57] unless it's not "your" repository, in which case maybe a bug is better [17:46:11] dispatching is ... :S [17:47:21] hmmm [17:47:28] https://www.wikidata.org/wiki/Special:DispatchStats [17:47:44] i did action=purge on https://en.wikiquote.org/wiki/New_York_City [17:47:46] aude: Can we safely skip theses changes? If not just waiting is also fine [17:47:54] it's catching up rather quickly AFAIS [17:47:55] removed dewikiquote [17:48:08] we can wait [17:48:16] * bd808|deploy waits in line to do a group0 to 1.23wmf21 scap [17:48:28] give us 5 more minutes to poke [17:48:43] aude: Sounds good [17:48:59] i think we're ok though... [17:49:32] or nothing we solve in 5 min, but didn't break anything [17:50:51] aude: I can bump the chd_seen fields [17:51:12] ok [17:52:05] Just looking for the right change id [17:53:43] got that [17:54:37] something is weird with wikiquote... like it's not actually enabled now [17:54:45] but sure i saw it was [17:55:29] * aude thinks this happened with wikisource [17:56:19] !log changed the Wikidata wb_changes_dispatch position of all wikiquote wikis to 118158153 [17:56:23] Logged the message, Master [17:56:39] enwikiquote is in wikidataclient.dblist [17:56:42] 20140408172900 [17:57:03] that was the timestamp, should be a few moments before anything happened regarding wikiquote [17:57:12] ok [17:57:39] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 540.333313 [17:58:28] still https://en.wikiquote.org/w/index.php?title=Werner_Heisenberg&action=info [17:58:56] Wikidata is not even loaded there... wtf [17:58:59] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 645.666687 [17:59:03] right, [17:59:05] i'm sure it was [17:59:25] do i have to sync dblist again? [17:59:37] did we somehow undo it? [18:00:58] no, looks good on a random mw* machine [18:01:09] PROBLEM - Disk space on virt1000 is CRITICAL: DISK CRITICAL - free space: / 1694 MB (2% inode=86%): [18:01:14] ah [18:01:50] !log hoo synchronized wmf-config/InitialiseSettings.php 'Touch to clear config. cache' [18:01:54] Logged the message, Master [18:01:55] ok [18:02:09] it's back! 
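The touch hoo synced is the standard trick for the local configuration cache: as far as I can tell, the per-host wmf-config cache is treated as fresh while it is newer than InitialiseSettings.php, so a dblist-only change can leave stale settings behind until that file's mtime is bumped and re-synced (automating this is bug 58618, mentioned just below). A hedged sketch of the sequence, run from the deployment host:

    # bump the mtime so every apache rebuilds its local configuration cache on the next request
    touch wmf-config/InitialiseSettings.php
    sync-file wmf-config/InitialiseSettings.php 'Touch to clear config cache'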
[18:02:11] Sorry, I forgot about that [18:02:33] was about to try that [18:02:37] :) [18:02:41] touch all the wikidata things :) [18:02:43] * bd808|deploy wants to fix https://bugzilla.wikimedia.org/show_bug.cgi?id=58618 so that's automatic [18:02:56] i think we are done! [18:03:19] i am sure this happened on wikisource or previously where it was enabled and then not [18:03:38] * aude puzzled but we're good now [18:04:13] Yep, looks good to me [18:04:23] aude, hoo: All clear for me to mess with /a/common on tin and then scap? [18:04:37] Yep, go ahead... we're done for now :) [18:04:47] Cool [18:05:08] done [18:06:11] (03PS1) 10BryanDavis: Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124655 [18:06:50] * greg-g crosses fingers and knocks on wood [18:07:03] (03CR) 10BryanDavis: [C: 032] Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124655 (owner: 10BryanDavis) [18:07:05] * aude too! [18:07:46] greg-g: Aaron merged my fix so in theory I should only need one scap. I'll verify the file after the first scap to be certain [18:08:21] * greg-g nods [18:08:28] (03Merged) 10jenkins-bot: Group0 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124655 (owner: 10BryanDavis) [18:10:36] !log bd808 Started scap: group0 wikis to 1.23wmf21 (with patch for bug 63659) [18:10:39] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:10:41] Logged the message, Master [18:11:25] l10n cache did not rebuild which is a great sign [18:11:58] Unable to open /usr/local/apache/common-local/wikiversions.cdb. [18:11:58] https://pl.wikipedia.org/w/index.php?title=Dyskusja_wikiprojektu:%C5%9Ar%C3%B3dziemie&oldid=prev&diff=39218000 [18:12:01] i get a "Unable to open /usr/local/apache/common-local/wikiversions.cdb." [18:12:10] ...and same here. [18:12:12] [2014-04-08 18:11:37] Fatal error: Unable to open /usr/local/apache/common-local/wikiversions.cdb. [18:12:15] uh-oh [18:12:19] Yeah. fuck [18:12:21] yeah, you got it [18:12:22] here the same [18:12:26] It will be fixed in a few moments [18:12:30] thats everything [18:12:31] well shit [18:12:45] fuuuuck [18:12:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000 [18:12:57] There's my first crash all of the wikis [18:12:59] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:13:00] SNAFU? 
[18:13:05] wtf [18:13:13] down on wm [18:13:21] damn it, I was actually reading an article and I reloaded it to test [18:13:23] It was my "fix" for the scap problem [18:13:25] now I can't read it while I wait [18:13:29] PROBLEM - Apache HTTP on mw1190 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.007 second response time [18:13:29] PROBLEM - Apache HTTP on mw1055 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.013 second response time [18:13:29] PROBLEM - Apache HTTP on mw1150 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.004 second response time [18:13:29] PROBLEM - Apache HTTP on mw1101 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.005 second response time [18:13:29] PROBLEM - Apache HTTP on mw1177 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.009 second response time [18:13:29] PROBLEM - Apache HTTP on mw1138 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.003 second response time [18:13:30] PROBLEM - Apache HTTP on mw1187 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.006 second response time [18:13:30] PROBLEM - Apache HTTP on mw1220 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.006 second response time [18:13:31] PROBLEM - Apache HTTP on mw1197 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.013 second response time [18:13:31] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - check plugin (check_job_queue) or PHP errors - [18:13:33] Whoa [18:13:34] * aude cries [18:13:39] PROBLEM - Apache HTTP on mw1213 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.018 second response time [18:13:39] PROBLEM - Apache HTTP on mw1113 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.012 second response time [18:13:39] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.008 second response time [18:13:42] PROBLEM - Apache HTTP on mw1200 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.006 second response time [18:13:42] PROBLEM - Apache HTTP on mw1035 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.022 second response time [18:13:42] PROBLEM - Apache HTTP on mw1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.011 second response time [18:13:42] PROBLEM - Apache HTTP on mw1090 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.010 second response time [18:13:42] PROBLEM - Apache HTTP on mw1154 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal server error - 50485 bytes in 0.007 second response time [18:13:52] It will be fixed soon… scap will fix it at the end [18:13:54] !log bd808 Finished scap: group0 wikis to 1.23wmf21 (with patch for bug 63659) (duration: 03m 18s) [18:13:59] Logged the message, Master [18:14:00] alright [18:14:01] Should be fixed now [18:14:04] fixed [18:14:15] * greg-g breathes again [18:14:22] can whoever's in charge of icinga-wm bring it back to life? [18:14:35] Damn it. :P [18:14:37] jackmcbarn: it'll again automatically, I *believe* [18:14:38] Someone [18:14:39] so what happened? [18:14:47] Oh, you know about it? [18:14:48] greg-g: You accidentally a verb. 
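For context on why this one file takes down every wiki at once: the multiversion entry point looks each request's wiki up in wikiversions.cdb to decide which php-1.23wmfNN tree to load, so if the cdb is missing or half-written mid-sync there is nothing to fall back to, and every request dies with the fatal quoted above. The narrow tool for shipping just that mapping (used later in the afternoon for group1) is sync-wikiversions; a hedged sketch with an illustrative log message:

    # rebuild the dbname -> php-1.23wmfNN map from the wikiversions files and push it to the apaches
    sync-wikiversions 'group1 to 1.23wmf21'
    # paranoia check on an apache afterwards: the cdb should exist and be non-trivially sized
    ls -l /usr/local/apache/common-local/wikiversions.cdb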
[18:14:49] ok [18:14:50] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.066 second response time [18:14:50] RECOVERY - Apache HTTP on mw1092 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.073 second response time [18:14:51] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.084 second response time [18:14:51] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.111 second response time [18:14:51] Patch https://gerrit.wikimedia.org/r/#/c/124627/ [18:14:52] RECOVERY - Apache HTTP on mw1163 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.062 second response time [18:14:52] RECOVERY - Apache HTTP on mw1217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 809 bytes in 0.059 second response time [18:15:07] Marybelle: :) [18:15:16] I'll write up the email. I know exactly what I fucked up [18:15:21] bd808|deploy: thanks, I was just about to report "Unable to open /usr/local/apache/common-local/wikiversions.cdb." - glad to see it's under control [18:15:29] * aude breathes [18:15:54] what's going on? [18:16:08] we are all at dinner [18:16:23] fixed now [18:16:24] it's ok [18:16:25] paravoid: My fault. Should be fixed now [18:16:31] okay [18:16:35] paravoid: go back to dinner, all's ok again :) [18:16:36] scap temporarily broke everything though [18:16:36] do you need anything? [18:16:39] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 183.266663 [18:16:39] ok [18:16:44] manual page us if something happens [18:16:52] paravoid: nope, known ef up [18:16:57] paravoid: will do, enjoy! [18:17:05] ciao [18:18:17] (03PS2) 10Gergő Tisza: Add setting to show a survey for MediaViewer users on some sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124036 [18:18:56] (03CR) 10Gergő Tisza: "Updated to display feedback survey on beta enwiki." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124036 (owner: 10Gergő Tisza) [18:19:29] greg-g: I just reverted my patch to scap that caused that cascade of horribleness [18:19:36] :) [18:19:44] One the plus side, group0 is on wmf21 now [18:19:50] lol [18:19:58] literal-lol [18:20:09] * aude scared to change it back [18:20:20] "Don't. Touch. Any. Thing." [18:20:25] i suppose if bd808|deploy 's patch is reverted then ok [18:20:39] well, we still have the previous issue which it was trying to fix ;) [18:20:59] 1 step forward, 1 step back [18:21:23] So yes we are temporarily back to needing to double-scap, but I'll make a patch that doesn't melt the world after lunch [18:22:25] bd808|deploy: :) [18:23:15] wikiquote etc all looks fine, so i'm going home / eating [18:23:20] back in hour [18:23:26] k, I'll do the same [18:23:33] quite late dinner for berlin [18:23:47] so I told my wife we broke the internet. she told me facebook was working.... [18:24:18] Nemo_bis: It's never to late for food :P [18:24:41] ^ [18:28:38] hoo: well, I'd call death for starvation, pellagra etc. "too late" :P [18:29:07] Nemo_bis: :P To late as in time of the day... 
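As I read bd808's summary, the interim state is back to the known workaround: run the sync twice, so that the second pass ships the ExtensionMessages/l10n files the first pass only regenerates (bug 63659 has the details). A rough sketch with illustrative log messages, not the exact invocations used:

    scap 'group1 to 1.23wmf21 (first pass, regenerates ExtensionMessages/l10n)'
    scap 'group1 to 1.23wmf21 (second pass, ships what the first pass rebuilt)'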
[18:29:08] :D [18:30:17] hoo: http://p.defau.lt/?md_cbLJuORDNsGkhY6_NAg :P [18:30:55] at least the other errors are gone now, I guess [18:31:28] manybubbles: :( [18:31:42] * greg-g goes to lunch for real [18:32:34] hoo: yeah, i submitted a patch for hhvm to fix that other issue btw [18:32:49] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.144 [18:34:15] ori: Oh... nice that it's actually done in PHP :) [18:35:34] yeah yeah yeah, elasticsearch 1012 is being upgraded [18:37:56] hoo: which component should that be filed under? [18:39:25] ori: already done https://bugzilla.wikimedia.org/show_bug.cgi?id=63691 [18:39:39] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 639.299988 [18:39:40] oh cool, thanks! [18:42:09] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 530.733337 [18:42:20] ori: Any idea who to poke about https://gerrit.wikimedia.org/r/121709 ? [18:43:46] (03CR) 10Matanya: add interface speed check for all hosts (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/124606 (owner: 10Cmjohnson) [18:44:08] (03PS2) 10Ori.livneh: Change wgServer and wgCanonicalServer for arbcom wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121709 (owner: 10Hoo man) [18:44:53] (03CR) 10Ori.livneh: [C: 032] Change wgServer and wgCanonicalServer for arbcom wikis [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121709 (owner: 10Hoo man) [18:45:06] !log ori updated /a/common to {{Gerrit|I4b18e4ce8}}: Change wgServer and wgCanonicalServer for arbcom wikis [18:45:11] Logged the message, Master [18:45:28] heh :) [18:45:39] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:45:50] !log ori synchronized wmf-config/InitialiseSettings.php 'I4b18e4ce8: Change wgServer and wgCanonicalServer for arbcom wikis' [18:45:55] Logged the message, Master [18:53:40] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:56:09] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:57:39] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 172.800003 [18:58:59] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:59:00] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:00] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:00] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:00] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:09] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.10 [18:59:09] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:09] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:10] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [18:59:29] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1895: active_shards: 5202: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 409 [19:00:03] blhe [19:00:11] it recovered in a few seconds [19:00:16] not sure why it did that [19:07:39] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 341.200012 [19:12:00] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:00] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:00] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:00] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:10] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:11] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:11] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:12:11] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:13:16] thats right [19:13:18] horrible check [19:13:36] no errors in the logs associated with those warnings [19:18:49] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [19:20:55] https://en.wikipedia.org/wiki/Wikipedia:VPT#Heartbleed_bug.3F [19:23:39] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 531.166687 [19:24:29] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.12 [19:24:49] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:50] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:50] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:59] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:59] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:59] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:59] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:59] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:24:59] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5197: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 414 [19:25:09] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 635.799988 [19:25:11] * Jamesofur kicks icinga-wm [19:26:39] PROBLEM - DPKG on elastic1015 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:28:39] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:29:38] huh: it is being fixed by ops [19:31:39] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:36:39] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:37:49] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:49] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:50] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:59] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:59] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:59] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:59] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:59] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:37:59] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:38:00] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:38:07] again? [19:38:09] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [19:38:10] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:38:10] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:38:10] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.13 [19:38:10] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:38:29] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1894: active_shards: 5204: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 407 [19:38:39] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 224.199997 [19:39:39] RECOVERY - DPKG on elastic1015 is OK: All packages OK [19:40:19] oh shut up [19:40:52] I'm doing rolling restarts [19:41:47] got it: labswiki_content_1394813391 [19:41:53] that thing is configured without replicas [19:46:40] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 341.066681 [19:48:00] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:01] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:01] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:01] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:10] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:10] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:10] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:10] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:30] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:43] and, more noise! [19:48:49] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:49] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:49] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:48:59] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5308: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 303 [19:48:59] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5308: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 303 [19:48:59] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 15: number_of_data_nodes: 15: active_primary_shards: 1894: active_shards: 5308: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 303 [19:49:22] bit me labswiki! 
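The red/green flapping above is a side effect of the rolling 1.1.0 upgrade: each restarting node leaves its shards unassigned for a moment, and the labswiki index manybubbles identified carries no replicas at all, so its primaries briefly vanish and the whole cluster reports red. A hedged sketch of the checks and the replica fix; the host name and curl usage are illustrative, the index name is the one from the log:

    # cluster state, the same data the icinga check parses
    curl -s 'http://elastic1001:9200/_cluster/health?pretty'
    # confirm the problem index really has number_of_replicas: 0
    curl -s 'http://elastic1001:9200/labswiki_content_1394813391/_settings?pretty'
    # give it a replica so one restarting node can no longer take it offline
    curl -s -XPUT 'http://elastic1001:9200/labswiki_content_1394813391/_settings' -d '{"index": {"number_of_replicas": 1}}'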
[19:52:34] * bd808|LUNCH cheers manybubbles on [19:52:53] it'll spam us again in a few minutes [19:52:59] labswiki recovered a long time ago [19:53:05] it was only out for ~30 seconds each time [19:53:20] but ganglia wants all the shards on all the wikis to be recovered before it is happy [19:53:59] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:53:59] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:53:59] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 1896: active_shards: 5611: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [19:56:15] !log upgraded all elasticsearch servers except elastic1008. that is coming now. [19:56:20] Logged the message, Master [19:58:20] !log finished upgrading to Elasticsearch 1.1.0. The process went well with no issues other then some knocking out search in labs 3 times for 30 seconds a piece. And logging lots of nasty warnings to irc. I've started to the process to fix search in labs so it won't happen again. [19:58:25] Logged the message, Master [20:05:39] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 420.066681 [20:08:09] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 539.900024 [20:10:29] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [20:10:29] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [20:10:29] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [20:10:29] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [20:10:39] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:12:39] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:16:56] Does someone here know about dns issues with wmflabs-domains or related stuff that happened recently? [20:19:39] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:20:41] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 176.399994 [20:22:09] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:26:39] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 368.466675 [20:28:02] re:heartbleed, I think we'll be wanting a new corp certificate... do you guys have a favorite vendor for star certs these days? 
[20:28:21] it's almost due for a re-up anyway, so it's worth the effort [20:29:53] r [20:48:39] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 642.700012 [20:51:39] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:51:39] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [20:52:09] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 537.099976 [20:59:46] greg-g: don't believe you [20:59:58] http://lists.wikimedia.org/pipermail/wikitech-ambassadors/2014-April/000666.html [21:00:04] This is the work of the Beast [21:00:11] greg-g: Do you still want to try group1 to 1.23wmf21 today or have we had enough excitement? [21:00:53] * apergos reminds folks that all ops are out at a bar except for those who are about to go to sleep :-D [21:01:06] bd808: we're back to "if you run scap, run it twice" world, right? [21:01:10] apergos: :) [21:01:23] odder: which part? :) [21:01:36] greg-g: Yes, but for group1 to 1.23wmf21 we only need to run sync-wikiversions [21:01:49] right [21:02:09] the world looks sane on phase0? [21:02:11] * greg-g looks [21:02:34] greg-g: all of it - notice the number immediately preceding .html [21:02:39] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 232.46666 [21:02:48] odder: haha [21:03:39] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:03:54] this is neat: https://graphite.wikimedia.org/render/?title=HTTP%204xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.4xx,%224xxx%20resp/min%22%29%29,%22blue%22%29 [21:04:36] I think that's what ori told me yesterdayt to not worry about [21:05:09] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:05:25] bd808: if we do, we do now, so we have 2 hours before SWAT of settle bug report time. May I take your whole day? [21:06:36] greg-g: I'm yours to command. :) [21:06:39] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 269.866669 [21:06:42] http://heartbleed.com/ [21:06:48] Q&A [21:06:55] :-P [21:07:09] bd808: go forth, please [21:09:36] (03PS1) 10BryanDavis: Group1 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124744 [21:11:12] (03CR) 10BryanDavis: [C: 032] Group1 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124744 (owner: 10BryanDavis) [21:11:20] (03Merged) 10jenkins-bot: Group1 wikis to 1.23wmf21 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124744 (owner: 10BryanDavis) [21:12:17] !log bd808 rebuilt wikiversions.cdb and synchronized wikiversions files: group1 to 1.23wmf21 [21:12:23] Logged the message, Master [21:12:47] greg-g: Have you guys already killed all user sessions? 
[21:12:52] Can't see a server admin log entry [21:15:44] greg-g: I did a https://commons.wikimedia.org/wiki/Commons:Village_pump#Users_are_being_forced_to_log_out [21:18:21] Thanks odder, I left a note about it on en VPT since I saw a question about the bug in general [21:18:48] Maybe I'll cross-post that to Meta too [21:19:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [21:20:14] !log bd808 Purged l10n cache for 1.23wmf18 [21:20:19] Logged the message, Master [21:21:46] !log bd808 Purged l10n cache for 1.23wmf19 [21:21:50] Logged the message, Master [21:21:54] hoo: in process [21:22:55] :) [21:23:09] hoo: it takes longer than you'd imagine, maybe :) [21:23:37] greg-g: group1 to 1.23wmf21 is {{done}} [21:23:40] greg-g: just change the cookie name? (like last time) [21:24:09] se4598: I'm defering to chris on it (not sure what his exact process is, honestly) [21:24:14] bd808|deploy: ty [21:24:53] mh, the tokens will be still valid I think, wasn't a good idea [21:25:14] se4598: Yeah I think that's why it takes a while [21:26:45] greg-g: Well given how many users we have and that we probably don't want to hammer the DBs to much, I can imagine this to take some time [21:26:52] * greg-g nods [21:28:16] csteipp: Why not run one process per shard? [21:29:24] Jamesofur: if you're keeping track of things, I alerted Commons and Meta; perhaps someone would need to alert the other big Wikipedias [21:29:35] Dunno if the message to tech-ambassadors will be enough; may be. [21:30:35] (03PS2) 10MaxSem: Put a safeguard on GeoData's usage of CirrusSearch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121874 [21:30:37] (03PS1) 10MaxSem: Enable $wgGeoDataDebug on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124747 [21:30:39] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:30:39] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 535.0 [21:30:54] (03CR) 10jenkins-bot: [V: 04-1] Enable $wgGeoDataDebug on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124747 (owner: 10MaxSem) [21:31:32] se4598: Assuming attacker has the login token, they could use the new name and again spoof the user [21:31:39] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:31:46] (03PS2) 10MaxSem: Enable $wgGeoDataDebug on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124747 [21:32:09] odder: yeah, I'll see if we can poke people, we're going to send out SM messages as well in a couple minutes [21:32:19] with a recommendation to password reset [21:33:09] SM? [21:33:22] sorry, Social Media (Twitter/Facebook/G+ etc) [21:33:42] TMA, Too Many Abbreviations [21:33:45] :) [21:33:59] yup lol [21:34:09] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 539.133362 [21:34:10] I abuse them, I even make up my own and forget that they are just in my head [21:34:23] https://twitter.com/Wikimedia/status/453646877397757953 [21:34:49] Jamesofur: EUS IAA. TA IANAL. [21:34:58] *EYS :p [21:35:42] thanks HaeB, retweeted [21:40:46] woah, new code on wikidata? [21:40:46] Jamesofur: using mass-message might be a good idea [21:41:15] aude: yep, all ok? [21:41:26] HaeB: ^ what do you think? (about MM) [21:41:48] wdyt? 
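One caveat on the libssl timeline being pieced together above: upgrading the package is only half of the Heartbleed fix, because long-running services keep the old, vulnerable library mapped in memory until they are restarted. A hedged sketch of the sort of check involved, not the exact commands ops ran; the package names are the Ubuntu ones of that era:

    # pull in the patched packages and confirm the versions
    sudo apt-get update && sudo apt-get install --only-upgrade openssl libssl1.0.0
    dpkg -l openssl libssl1.0.0
    # list processes still mapping the old, now-deleted libssl so they can be restarted
    sudo lsof -n | grep libssl | grep DEL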
[21:42:08] greg-g: itjdi [21:42:12] so we're confident? [21:42:39] PROBLEM - Varnishkafka Delivery Errors on cp3012 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 187.866669 [21:42:53] aude: in that it won't break at 2:00 utc? yeah [21:43:06] aude: the only thing we're still not confident about is scap on thursday [21:44:19] alright [21:44:39] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:44:40] PROBLEM - Varnishkafka Delivery Errors on cp3011 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 320.200012 [21:44:55] Jamesofur, matanya: i think for the session ending, massmessage would be overkill. regarding the password reset, it's a judgment call (how high one estimates the risk for users who don't change it) [21:45:24] HaeB: it depends on user rights as well [21:45:27] aude: The bug that caused all the 1.23wmf21 l10n issues is https://bugzilla.wikimedia.org/show_bug.cgi?id=63659 [21:46:31] are there any other major sites who notified all users? [21:46:54] not that I've seen yet, but I have a feeling some are still going through the fixing process [21:46:55] interesting [21:46:59] (to recommend a password change) [21:47:10] e.g. just got stuff from CloudBees [21:47:15] github also logged me out [21:47:37] would also be interesting to know how quickly the wikis were fixed after the news broke yesterday [21:47:40] latimes has an article about resetting your password, but that's different [21:48:09] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:48:13] last night (PT) i filed an RT ticket for the blog, which was vulnerable at the time, but at that point the wikis tested ok already [21:48:36] The wikis auto-update OpenSSL via puppet [21:49:00] hoo: well ya ;) the question is when we updated puppet ;) [21:49:24] Jamesofur: The servers do that themselves [21:49:39] per https://wikitech.wikimedia.org/wiki/Server_admin_log , the blog (holmium) was pretty late in the game [21:49:50] The timeline is all in SAL from last night [21:49:51] Yesterday I posted about that to the internal ops list, but forgot to poke a root to do an apt-cache clean and force puppet run [21:50:08] "04:03 Tim: upgrading libssl on ssl1001,ssl1002,ssl1003,ssl1004,ssl1005,ssl1006,ssl1007,ssl1008,ssl1009,ssl3001.esams.wikimedia.org,ssl3002.esams.wikimedia.org,ssl3003.esams.wikimedia.org" - is that the entry for the wikis? [21:50:37] Mostly yes [21:53:39] RECOVERY - Varnishkafka Delivery Errors on cp3012 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:53:39] RECOVERY - Varnishkafka Delivery Errors on cp3011 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [21:53:59] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [21:54:55] (03PS1) 10Jean-Frédéric: Add Musées de la Haute-Saône to wgCopyUploadsDomains [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124754 [22:01:11] greg-g, poking you because I'm not sure who's on point for the i18n / scap stuff -- but I recall getting pinged a couple of days ago (on a centralnotice keyword) saying that the i18n update was failing due to exceptions on CN (and others). I'm wondering if CN's fail was due to being on a deployment branch that did not have the JSON updates (until just now).
[22:01:46] shouldn't be [22:01:57] there's backward compat in l10nupdate [22:02:17] mwalker: see https://bugzilla.wikimedia.org/show_bug.cgi?id=63659 for all the gory details [22:02:33] * mwalker puts on tyvek suit [22:02:38] :) [22:30:59] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [22:33:06] greg-g: Could I push a small centralauth update soon? [22:33:44] yeah, now is fine, 30 minutes until swat [22:34:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:36:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:37:04] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 34.533333 [22:37:34] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 260.733337 [22:38:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:40:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:42:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:44:14] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 625.166687 [22:44:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:45:36] marktraceur: I see in deploy-calendar that you have a changeset which specifically activates MediaViewer on en-beta. You(r pc) may get hit by https://bugzilla.wikimedia.org/show_bug.cgi?id=63709 [22:46:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:47:22] se4598: Is there a fix? [22:47:50] I'm guessing it's an SSL problem [22:48:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:48:43] se4598: Replied on bug [22:49:09] (03PS1) 10BryanDavis: Create symlink for compile-wikiversions in /usr/local/bin [operations/puppet] - 10https://gerrit.wikimedia.org/r/124763 [22:49:23] marktraceur: We in #wikimedia-labs don't have one. And that's not about https but DNS resolution, so I don't understand what you mean by https.
[22:49:35] Oh, hm [22:49:37] Never mind, sorry [22:50:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:52:04] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [22:52:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:52:34] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [22:52:56] marktraceur: currently the fix is.....: it may work if you try multiple times or wait some time (minutes, hours) ;P [22:54:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:56:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:56:54] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000 [22:58:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 10:30:08 PM UTC [22:58:41] greg-g: csteipp: got both core changes ready [22:58:53] I mean changes to the deploy branch [22:59:52] hoo: Cool.. one sec and I'll merge and deploy it [23:00:12] I can also jump in, am on tin still anyway [23:00:14] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Tue Apr 8 23:00:04 UTC 2014 [23:02:24] PROBLEM - Puppet freshness on mw1109 is CRITICAL: Last successful Puppet run was Tue 08 Apr 2014 11:00:04 PM UTC [23:05:24] stupid puppet [23:06:33] * Jasper_Deng always wondered what Puppet does anyways [23:07:09] pulls the strings ;) [23:07:20] (or, probably better 'is the strings' ) [23:07:26] Jasper_Deng: Playing with the servers :D [23:08:20] Technically, the sysadmins are a puppet in the WMFs plans, right? :p [23:08:37] !log csteipp synchronized php-1.23wmf21/extensions/CentralAuth/maintenance 'Push maintenance script for token reset' [23:08:39] or we're all just puppets in their plans, duh [23:08:41] Logged the message, Master [23:09:04] Jamesofur: You're the past of the puppets :p [23:09:09] *master of the [23:09:57] greg-g: CentralAuth updates are out, so swat can go ahead if they were waiting on me [23:10:01] ;) the user with said name may dislike me claiming the title [23:10:40] mwalker: ori ebernhardson ^ [23:10:46] also, what the heck, oit_display ? [23:10:54] :) [23:11:10] PROBLEM - Puppet freshness on lvs3001 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [23:11:10] PROBLEM - Puppet freshness on lvs3002 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [23:11:10] PROBLEM - Puppet freshness on lvs3003 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [23:11:10] PROBLEM - Puppet freshness on lvs3004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 12:00:00 AM UTC [23:11:51] oh [23:11:54] yes; it's 4! [23:13:25] SUL doesn't work? [23:14:02] csteipp, ^ [23:14:03] Danny_B: We are logging out all users [23:14:10] see http://lists.wikimedia.org/pipermail/wikitech-ambassadors/2014-April/000666.html [23:14:32] csteipp, warn ppl with a site notice? [23:14:35] hoo: you know that this isn't merged? https://gerrit.wikimedia.org/r/124756 [23:15:00] se4598: not this important at the very moments [23:15:03] * moment [23:15:23] Danny_B: SUL should work... You should just be logged out. 
If you can't login, let me know [23:15:53] csteipp: will we get logged out each time you hit a wiki we've visited recently? or just the once per user in theory [23:16:15] If you're a global user, just once (right now as I logout all the centralauth users) [23:16:32] If you have multiple ununified local accounts, each will get logged out [23:16:51] csteipp: i have to log in on every single project although i have a central username [23:16:54] [23:17:30] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 135.300003 [23:17:55] marktraceur, MaxSem I'm going to +2 and confirm https://gerrit.wikimedia.org/r/#/c/124036/2 , https://gerrit.wikimedia.org/r/#/c/121874/2 , https://gerrit.wikimedia.org/r/#/c/124747/ [23:18:30] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 173.666672 [23:18:32] it would be wonderful if you all could +1 that so that I know you've looked and said this is good to me [23:18:35] 'kay [23:18:53] csteipp: +1 to notice ppl with central notice [23:18:57] (03CR) 10MarkTraceur: [C: 031] Add setting to show a survey for MediaViewer users on some sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124036 (owner: 10Gergő Tisza) [23:19:00] +1 ourselves? [23:19:16] doesn't sound very assuring:) [23:19:21] nah; you're probably OK MaxSem :p [23:19:27] but I don't know who Gergo is [23:19:44] but mark was sponsoring the patch [23:19:53] he's tgr :P [23:20:00] (03CR) 10Mwalker: [C: 032] Put a safeguard on GeoData's usage of CirrusSearch [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/121874 (owner: 10MaxSem) [23:20:08] (03CR) 10Mwalker: [C: 032] Enable $wgGeoDataDebug on labs [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124747 (owner: 10MaxSem) [23:20:21] (03CR) 10Mwalker: [C: 032] Add setting to show a survey for MediaViewer users on some sites [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/124036 (owner: 10Gergő Tisza) [23:20:27] greg-g: missed your ping; still need me? [23:21:00] don't think so [23:23:33] interesting; sync-common doesn't log to IRC? [23:23:34] Danny_B: That doesn't sound right.. At the risk of sounding cliche, can you log out and log back in, and see if that helps? [23:23:55] marktraceur, MaxSem can you tell if your configuration stuff got pushed? [23:24:15] mwalker, mine's noop on prod [23:24:25] Ditto, but will check on beta [23:24:26] checking if prod still works... [23:24:35] also; marktraceur I presume you want https://gerrit.wikimedia.org/r/#/c/124510/ to go to wmf20 and wmf21? [23:24:38] Danny_B, hoo : we're still thinking about massmessage instead (more for the password changing advice) [23:24:43] mwalker: Sorry, only 21 [23:25:24] mwalker: Confirmed, beta has the configuration we wanted [23:26:36] mwalker, lgtm [23:27:40] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: No output from Graphite for target(s): reqstats.5xx [23:28:34] csteipp: log out from any currently logged project, log back to it and then try if sul works on other? [23:29:14] Danny_B: Yeah [23:29:22] csteipp: ok, sec [23:29:38] Hmm... Danny_B What's your wiki username? [23:30:51] RECOVERY - Puppet freshness on mw1109 is OK: puppet ran at Tue Apr 8 23:30:43 UTC 2014 [23:30:55] csteipp: Danny B.
[23:31:17] csteipp: seems to work now, will let you know if i'll spot another disconnection [23:31:27] Danny_B: Cool, thanks [23:32:03] yw [23:32:15] thanks for care [23:33:30] RECOVERY - Varnishkafka Delivery Errors on cp3019 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [23:34:31] !log mwalker synchronized php-1.23wmf21/extensions/MultimediaViewer/ 'Updating MultimediaViewer for {{gerrit|124510}}' [23:34:35] Logged the message, Master [23:35:16] marktraceur, ^ if you would test what you need to test for that [23:35:26] I'm not seeing any fatals or exceptions which is good :) [23:35:31] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [23:35:32] mwalker: Works [23:35:32] Ta [23:35:39] cool; greg-g SWAT done [23:58:30] PROBLEM - Varnishkafka Delivery Errors on cp3019 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 179.666672 [23:59:04] "Firefox can't find the server at en.wikipedia.beta.wmflabs.org." [23:59:08] why? [23:59:14] (03CR) 10Aaron Schulz: [C: 031] Create symlink for compile-wikiversions in /usr/local/bin [operations/puppet] - 10https://gerrit.wikimedia.org/r/124763 (owner: 10BryanDavis) [23:59:31] jackmcbarn: https://bugzilla.wikimedia.org/show_bug.cgi?id=63709 probably
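The "Firefox can't find the server at en.wikipedia.beta.wmflabs.org" failures here, like the earlier warning to marktraceur, trace back to intermittent DNS resolution on labs (bug 63709); the only workaround offered in channel was to retry or wait. Below is a small stand-alone Python sketch of that retry-until-it-resolves check; the hostname comes from the log, while the retry count and delay are arbitrary, and this is not part of any deployed tooling.

```python
"""Minimal sketch of the "retry or wait" workaround for the intermittent
en.wikipedia.beta.wmflabs.org lookup failures noted above (bug 63709)."""
import socket
import time


def resolves(hostname, attempts=5, delay=2.0):
    """Return True as soon as the name resolves, retrying a few times."""
    for attempt in range(1, attempts + 1):
        try:
            # getaddrinfo raises socket.gaierror when the lookup fails.
            addr = socket.getaddrinfo(hostname, 443)[0][4][0]
            print('attempt %d: %s -> %s' % (attempt, hostname, addr))
            return True
        except socket.gaierror as err:
            print('attempt %d: lookup failed (%s)' % (attempt, err))
            time.sleep(delay)
    return False


if __name__ == '__main__':
    resolves('en.wikipedia.beta.wmflabs.org')
```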