[00:30:41] (PS2) Ori.livneh: apache: when sourcing env-enabled/*, redirect stdout to stderr [operations/puppet] - https://gerrit.wikimedia.org/r/156060
[00:30:43] (PS1) Ori.livneh: mediawiki::sync: run sync-common unless InitialiseSettings.php exists [operations/puppet] - https://gerrit.wikimedia.org/r/156064
[00:58:24] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:59:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[01:00:14] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[01:54:26] (PS1) Yuvipanda: tools: Add texlive to exec_environ [operations/puppet] - https://gerrit.wikimedia.org/r/156067
[02:10:55] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[02:11:25] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl
[02:11:54] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[02:12:25] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl
[02:15:17] !log LocalisationUpdate completed (1.24wmf17) at 2014-08-25 02:14:14+00:00
[02:15:26] Logged the message, Master
[02:21:14] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[02:21:54] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0]
[02:27:01] !log LocalisationUpdate completed (1.24wmf18) at 2014-08-25 02:25:58+00:00
[02:27:11] Logged the message, Master
[02:30:54] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[02:31:34] PROBLEM - mailman_ctl on
sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl
[02:33:25] RECOVERY - mailman_ctl on sodium is OK: PROCS OK: 1 process with UID = 38 (list), regex args /mailman/bin/mailmanctl
[02:33:55] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[02:35:14] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[02:35:54] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0]
[02:50:54] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[02:51:54] RECOVERY - mailman_qrunner on sodium is OK: PROCS OK: 8 processes with UID = 38 (list), regex args /mailman/bin/qrunner
[02:57:24] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:58:15] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[03:00:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[03:02:14] PROBLEM - MySQL Processlist on db1068 is CRITICAL: CRIT 82 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[03:03:14] RECOVERY - MySQL Processlist on db1068 is OK: OK 1 unauthenticated, 0 locked, 0 copy to table, 1 statistics
[03:15:19] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Aug 25 03:14:13 UTC 2014 (duration 14m 12s)
[03:15:31] Logged the message, Master
[04:14:15] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: / 1067 MB (3% inode=96%):
[04:14:17] URGENT: what's the code at the office?
[04:52:24] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[04:59:14] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[05:01:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[05:26:44] (PS1) Legoktm: Revoke centralnotice-admin right from mlwiki, mlwikisource, mlwiktionary sysops [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156075
[05:28:05] (PS1) Hoo man: Remove centralnotice-admin rights from 3 wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156076
[05:28:24] (CR) Legoktm: [C: -1] "Apparently all local sysops have this right? Why is it explicitly assigned to these groups?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156075 (owner: Legoktm)
[05:32:01] (PS2) Hoo man: Remove centralnotice-admin right assignments on 3 wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156076
[05:32:57] (CR) Legoktm: [C: 1] Remove centralnotice-admin right assignments on 3 wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156076 (owner: Hoo man)
[05:33:13] (Abandoned) Legoktm: Revoke centralnotice-admin right from mlwiki, mlwikisource, mlwiktionary sysops [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156075 (owner: Legoktm)
[05:34:21] (CR) Hoo man: [C: 2] "Basically a no-op (the functionality isn't active on these wikis at all)."
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156076 (owner: Hoo man)
[05:34:25] (Merged) jenkins-bot: Remove centralnotice-admin right assignments on 3 wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156076 (owner: Hoo man)
[05:35:43] !log hoo Synchronized wmf-config/InitialiseSettings.php: {{gerrit|156076}} - Remove centralnotice-admin right assignments on 3 wikis - Basically a noop (duration: 00m 06s)
[05:35:49] Logged the message, Master
[06:29:44] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:32] (PS1) Withoutaname: Remove unused variables and commented-out code from CommonSettings [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156078 (https://bugzilla.wikimedia.org/29902)
[06:45:34] PROBLEM - puppet last run on es1001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:46:44] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[07:02:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[07:02:34] RECOVERY - puppet last run on es1001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:24:09] (PS1) KartikMistry: Fix Parsoid API variable for ContentTranslation [operations/puppet] - https://gerrit.wikimedia.org/r/156079
[07:30:22] * _joe_ back
[07:35:32] (CR) Santhosh: Fix Parsoid API variable for ContentTranslation (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/156079 (owner: KartikMistry)
[08:16:44] (PS1) Withoutaname: Move some permissions from abusefilter.php to InitialiseSettings.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156081
[08:17:47] (PS2) Withoutaname: Move some permissions from abusefilter.php to InitialiseSettings.php [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156081
(https://bugzilla.wikimedia.org/58247)
[08:18:34] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Mon 25 Aug 2014 06:17:36 UTC
[08:32:58] (Abandoned) Giuseppe Lavagetto: Revert "Move ulsfo public traffic to eqiad temporarily for net maintenance" [operations/dns] - https://gerrit.wikimedia.org/r/153574 (owner: Giuseppe Lavagetto)
[08:47:56] (CR) Giuseppe Lavagetto: [C: -1] "As stated in the debian policy manual, "Provides" will not work when a package declares an explicit version relationship:" [operations/puppet] - https://gerrit.wikimedia.org/r/153772 (owner: Giuseppe Lavagetto)
[08:48:54] (PS4) Giuseppe Lavagetto: mediawiki: HAT appserver should turn off mod_php [operations/puppet] - https://gerrit.wikimedia.org/r/153577
[08:58:24] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:59:24] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[09:03:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[09:09:08] (CR) Filippo Giunchedi: [C: 1] exim templates - deprecated variable syntax [operations/puppet] - https://gerrit.wikimedia.org/r/154371 (owner: Dzahn)
[09:16:23] (CR) Giuseppe Lavagetto: "I would go on with this patch given the difficulties we have in separating the php packages appropriately."
[operations/puppet] - https://gerrit.wikimedia.org/r/153577 (owner: Giuseppe Lavagetto)
[09:16:42] (PS5) Giuseppe Lavagetto: ssl proxies: use ssl_ciphersuite [operations/puppet] - https://gerrit.wikimedia.org/r/152248
[09:17:14] (CR) Filippo Giunchedi: apache: when sourcing env-enabled/*, redirect stdout to stderr (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/156060 (owner: Ori.livneh)
[09:33:39] (CR) Filippo Giunchedi: mediawiki: HAT appserver should turn off mod_php (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/153577 (owner: Giuseppe Lavagetto)
[09:34:08] so if we upgrade from trusty, we'll rename HAT I guess?
[09:34:56] HAX!
[09:34:59] awesome
[09:35:03] <_joe_> well, once it's deployed, I think it's HA!
[09:35:04] i like it much better now
[09:35:29] <_joe_> and in a not-too-distant-future, HN I hope
[09:35:45] after hax, must have hax
[09:38:14] good morning
[09:42:32] (CR) Filippo Giunchedi: mediawiki::sync: run sync-common unless InitialiseSettings.php exists (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/156064 (owner: Ori.livneh)
[09:44:00] (CR) Filippo Giunchedi: remove HTTPS config from gitblit template (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/154973 (owner: Dzahn)
[09:50:13] (PS2) Alexandros Kosiaris: wikimedia.community: Add Google webmasters tools verification [operations/dns] - https://gerrit.wikimedia.org/r/152269
[09:53:54] (CR) Alexandros Kosiaris: "Sharing Daniel's thought I decided to split wikimedia.community to its own file and put the google webmasters tools there."
[operations/dns] - https://gerrit.wikimedia.org/r/152269 (owner: Alexandros Kosiaris)
[09:58:12] (CR) Filippo Giunchedi: git buildpackage basic configuration (1 comment) [operations/debs/vips] - https://gerrit.wikimedia.org/r/113098 (owner: Hashar)
[10:15:34] !log setup cross-confederation BGP sessions from AS65001 (eqiad) to AS65002 (codfw)
[10:15:40] Logged the message, Master
[10:18:36] PROBLEM - puppet last run on elastic1016 is CRITICAL: CRITICAL: Puppet last ran 14444 seconds ago, expected 14400
[10:19:34] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Mon 25 Aug 2014 06:17:36 UTC
[10:23:14] (CR) Hashar: [C: -1] "Per Santhosh." [operations/puppet] - https://gerrit.wikimedia.org/r/156079 (owner: KartikMistry)
[10:24:53] (CR) Hashar: "I like the idea. For what it is worth, /var/log/puppet.log has been made to only be readable by root, I guess because of security concerns" [operations/puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: BryanDavis)
[11:04:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[11:10:35] PROBLEM - Number of mediawiki jobs queued on tungsten is CRITICAL: CRITICAL: Anomaly detected: 37 data above and 0 below the confidence bounds
[11:11:04] PROBLEM - Number of mediawiki jobs running on tungsten is CRITICAL: CRITICAL: Anomaly detected: 36 data above and 0 below the confidence bounds
[11:25:15] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 1 failures
[11:33:00] (PS2) KartikMistry: Fix Parsoid API variable for ContentTranslation [operations/puppet] - https://gerrit.wikimedia.org/r/156079
[11:36:11] (CR) KartikMistry: Fix Parsoid API variable for ContentTranslation (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/156079 (owner: KartikMistry)
[11:37:43] <_joe_> I'm off to lunch, bbl
[11:40:32] (PS1)
Mark Bergsma: Add codfw subnets [operations/puppet] - https://gerrit.wikimedia.org/r/156090
[11:41:16] (CR) jenkins-bot: [V: -1] Add codfw subnets [operations/puppet] - https://gerrit.wikimedia.org/r/156090 (owner: Mark Bergsma)
[11:42:31] (PS2) Mark Bergsma: Add codfw subnets [operations/puppet] - https://gerrit.wikimedia.org/r/156090
[11:43:13] (CR) jenkins-bot: [V: -1] Add codfw subnets [operations/puppet] - https://gerrit.wikimedia.org/r/156090 (owner: Mark Bergsma)
[11:43:15] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[11:43:55] (PS3) Mark Bergsma: Add codfw subnets [operations/puppet] - https://gerrit.wikimedia.org/r/156090
[11:45:55] (CR) Jalexander: [C: 1] wikimedia.community: Add Google webmasters tools verification [operations/dns] - https://gerrit.wikimedia.org/r/152269 (owner: Alexandros Kosiaris)
[12:18:53] (PS3) Aklapper: Create copy of upstream file (for followup custom change) [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/155732 (https://bugzilla.wikimedia.org/69747)
[12:20:34] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Mon 25 Aug 2014 06:17:36 UTC
[12:23:07] (PS1) Aklapper: Work around Bugzilla XML RPC bug with special Unicode characters [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/156100 (https://bugzilla.wikimedia.org/69747)
[12:27:05] !log restarting zuul
[12:27:11] Logged the message, Master
[12:33:35] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 819.39585195
[12:33:43] I wanted to have emails POSTed to the bouncehandler API ( which is now in beta ) and have added a new router to the prod config ( https://gerrit.wikimedia.org/r/#/c/155753/6/templates/exim/exim4.conf.SMTP_IMAP_MM.erb ).
I am still unsure to which 'url' I should be POSTing to.
[12:33:53] It would be great if someone can take a look ?
[12:35:09] (PS1) Hashar: gerrit: allow . in Jenkins jobs names [operations/puppet] - https://gerrit.wikimedia.org/r/156103
[12:45:32] !log hard stopped/restarted Zuul (workflow config error)
[12:45:38] Logged the message, Master
[12:47:32] (PS2) Andrew Bogott: tools: Add texlive to exec_environ [operations/puppet] - https://gerrit.wikimedia.org/r/156067 (owner: Yuvipanda)
[12:58:49] * YuviPanda waves at chasemp
[12:58:53] how're you doing?
[13:03:39] (CR) Andrew Bogott: [C: 2] tools: Add texlive to exec_environ [operations/puppet] - https://gerrit.wikimedia.org/r/156067 (owner: Yuvipanda)
[13:05:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[13:08:52] (CR) Nemo bis: "The bug was extremely precise." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/154448 (https://bugzilla.wikimedia.org/69055) (owner: TTO)
[13:10:43] (PS2) Nemo bis: Set wgUploadNavigationUrl for eowiki to Commons [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/154448 (https://bugzilla.wikimedia.org/69055) (owner: TTO)
[13:11:48] (CR) Nemo bis: [C: 1] Set wgUploadNavigationUrl for eowiki to Commons [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/154448 (https://bugzilla.wikimedia.org/69055) (owner: TTO)
[13:13:04] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[13:28:26] (PS6) Giuseppe Lavagetto: ssl proxies: use ssl_ciphersuite [operations/puppet] - https://gerrit.wikimedia.org/r/152248
[13:29:15] <_joe_> hashar: is zuul/jenkins working?
[13:29:27] <_joe_> it is, nevermind :)
[13:29:34] <_joe_> I've seen you hard restarting it
[13:29:37] looking
[13:29:40] (CR) Giuseppe Lavagetto: [C: 2] ssl proxies: use ssl_ciphersuite [operations/puppet] - https://gerrit.wikimedia.org/r/152248 (owner: Giuseppe Lavagetto)
[13:29:46] <_joe_> it works :)
[13:29:55] _joe_: yeah I did a bunch of hard restarts sorry :(
[13:30:03] <_joe_> np
[13:34:23] mark: anything i can do with codfw buildup, or shell is required for it all ?
[13:34:54] preparing puppet for codfw is something you can submit patchsets for ;)
[13:38:45] <_joe_> something tells me it's the right time to complete my hiera patch
[13:39:12] i wonder why?
[13:40:25] <_joe_> I had another situation in which I would've loved to have it live
[13:41:02] <_joe_> it was not related to codfw puppet changes, if that was the impression :)
[13:41:02] (CR) Manybubbles: [C: 1] WIP: Collection of fun bash scripts for managing elasticsearch [operations/puppet] - https://gerrit.wikimedia.org/r/155679 (owner: Chad)
[13:41:15] hehe ok ;)
[13:41:58] * YuviPanda wonders whom to poke for https://rt.wikimedia.org/Ticket/Display.html?id=8163
[13:42:43] andrew bogott or coren i'd say :)
[13:43:39] mark: I asked both of them and they told me to file an RT ticket and a 'magical RT fairy' would descend and figure out how to do it :)
[13:43:49] although I'm unsure if I poke andrewbogott_afk, Coren was the one I poked
[13:44:07] that's entirely an openstack thing :)
[13:44:13] oh
[13:44:29] oh no it's not
[13:44:31] hmm
[13:44:32] mark: Coren thought it involved some network config elsewhere as well, since labmon is elsewhere
[13:44:40] yeah right
[13:45:07] well that would probably be me then
[13:45:10] but i won't have time for that this week
[13:45:31] mark: cool, as long as it is somewhere on your radar :)
[13:45:36] i'll take the ticket
[13:45:47] mark: so you're the magical RT fairy :)
[13:46:01] well really the rt duty person is
[13:46:19] * YuviPanda looks
at /topic, then looks at _joe_
[13:46:24] (CR) Ottomata: "I think we can abandon this. matanya and I talked about this in IRC a long time ago. The way to make misc/statistics.pp into a module wo" [operations/puppet] - https://gerrit.wikimedia.org/r/123601 (owner: Matanya)
[13:46:26] no longer needed now
[13:47:09] :)
[13:47:44] (Abandoned) Matanya: [WIP] stats.wikimedia.org: strip out from misc into a module [operations/puppet] - https://gerrit.wikimedia.org/r/123601 (owner: Matanya)
[13:48:36] (CR) Giuseppe Lavagetto: [C: 1] mediawiki::sync: run sync-common unless InitialiseSettings.php exists (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/156064 (owner: Ori.livneh)
[13:50:22] (CR) Giuseppe Lavagetto: apache: when sourcing env-enabled/*, redirect stdout to stderr (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/156060 (owner: Ori.livneh)
[13:50:47] (CR) Giuseppe Lavagetto: [C: 1] Allow puppetmaster to send reports to logstash [operations/puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: BryanDavis)
[13:52:50] (CR) Mark Bergsma: [C: -1] "Yeah, what are the security implications here? I'd like to see an analysis before letting this go in..." [operations/puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: BryanDavis)
[13:53:24] (PS4) Giuseppe Lavagetto: Don't include ::diamond in apache::monitoring [operations/puppet] - https://gerrit.wikimedia.org/r/152943 (owner: BryanDavis)
[13:54:16] (CR) Giuseppe Lavagetto: [C: 2] "Since include ::diamond is the first line in the diamond::collector define, this is surely redundant."
[operations/puppet] - https://gerrit.wikimedia.org/r/152943 (owner: BryanDavis)
[13:56:37] (PS4) Filippo Giunchedi: WIP: Collection of fun bash scripts for managing elasticsearch [operations/puppet] - https://gerrit.wikimedia.org/r/155679 (owner: Chad)
[13:57:34] (CR) Filippo Giunchedi: [C: 2 V: 2] "changed a couple of minor things, LGTM!" [operations/puppet] - https://gerrit.wikimedia.org/r/155679 (owner: Chad)
[13:58:15] (PS1) Ottomata: Run refinery-drop-webrequest-partitions every 4 hours [operations/puppet] - https://gerrit.wikimedia.org/r/156111
[14:05:11] (PS1) Nemo bis: Enable DynamicPageList on mediawiki.org [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156113 (https://bugzilla.wikimedia.org/69974)
[14:07:09] (CR) Aklapper: "CC'ing bblack and jgreen as I'm an idiot when it comes to Perl. That patch is supposed to just replace some specific problematic Unicode c" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/156100 (https://bugzilla.wikimedia.org/69747) (owner: Aklapper)
[14:21:34] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Mon 25 Aug 2014 06:17:36 UTC
[14:30:17] (PS1) Cmjohnson: Adding mgmt dns for pdu's in codfw [operations/dns] - https://gerrit.wikimedia.org/r/156116
[14:30:19] (PS7) ArielGlenn: data retention audit script for logs, /root and /home dirs [operations/software] - https://gerrit.wikimedia.org/r/141473
[14:30:26] (CR) jenkins-bot: [V: -1] data retention audit script for logs, /root and /home dirs [operations/software] - https://gerrit.wikimedia.org/r/141473 (owner: ArielGlenn)
[14:33:53] (CR) BryanDavis: "Currently the security implication is moot.
This patch set implements a puppet class that provisions the reporter but it is only applied b" [operations/puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: BryanDavis)
[14:39:09] ^d: confirmed that noatime can be turned on on the fly with mount -o remount,noatime
[14:39:42] (CR) Cmjohnson: [C: 2] Adding mgmt dns for pdu's in codfw [operations/dns] - https://gerrit.wikimedia.org/r/156116 (owner: Cmjohnson)
[14:47:20] manybubbles: re elasticsearch disk i/o were the machines struggling on disk bandwidth or iops?
[14:48:00] godog: bandwidth mostly - we were maxing out around 130MB/sec per disk
[14:48:46] manybubbles: ah, reads or writes?
[14:48:52] reads
[14:49:14] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 1 failures
[14:51:16] <^d> godog: \o/
[15:00:28] Is there anyone who can change a mailing list password for me?
[15:02:07] manybubbles: ack, I'll probably bug you later about it, I'm assuming there isn't a practical way to reproduce it?
[15:02:27] (CR) Alexandros Kosiaris: [C: 2] syslog: qualify vars [operations/puppet] - https://gerrit.wikimedia.org/r/156022 (owner: Matanya)
[15:02:43] godog: I can probably reproduce it - but not with the a node on command
[15:02:49] I can replay enwiki's search traffic and see what that does
[15:03:39] (PS1) Mark Bergsma: Fix tab alignment [operations/dns] - https://gerrit.wikimedia.org/r/156124
[15:04:04] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: There are 8 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[15:04:38] (CR) Mark Bergsma: [C: 2] Fix tab alignment [operations/dns] - https://gerrit.wikimedia.org/r/156124 (owner: Mark Bergsma)
[15:05:15] manybubbles: it'd be nice to see it in action yeah!
[15:05:31] what is in action?
:)
[15:06:06] grrr, i'm trying to pxe boot elastic1019
[15:06:09] godog: well, sure - one moment
[15:06:10] its dhcp-ing just fine
[15:06:13] won't netboot though...
[15:06:24] No boot device available.
[15:06:24] Current boot mode is set to BIOS.
[15:06:24] Please ensure compatible bootable media is available.
[15:06:24] Use the system setup program to change the boot mode as needed.
[15:06:25] Strike F1 to retry boot, F2 for system setup, F11 for BIOS boot manager.
[15:06:34] PROBLEM - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC
[15:06:35] don't see anything in bios or system setup that is relevant...
[15:06:43] pssh, duhh :p
[15:07:14] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[15:07:20] ACKNOWLEDGEMENT - Puppet freshness on analytics1003 is CRITICAL: Last successful Puppet run was Fri 22 Aug 2014 20:33:50 UTC ottomata qchris and I are experimenting with kafkatee and webstatscollector here.
[15:07:51] mark: elasticsearch machines maxing out the ssds doing reads at 130MB/s :))
[15:07:59] that's the old SSDs?
[15:08:17] yeah, 130MB/s for a worn intel 320 or older sounds about right
[15:08:26] how many iops?
[15:08:31] manybubbles: elastic1016 /var/log/elasticsearch is filling up /
[15:08:39] can I delete old rotated log files?
[15:08:46] e.g. 6.6G Aug 23 06:27 production-search-eqiad_index_indexing_slowlog.log.3
[15:09:02] <^d> Yes.
[15:09:09] <^d> I cleared those logs from 13/15 over the weekend.
[15:09:20] yeah, probably just weekend craziness logs filling up
[15:09:21] <^d> Ran out of / space
[15:09:22] we should check all of them
[15:09:34] freed some space
[15:10:01] oh you just did
[15:10:02] ok
[15:10:15] RECOVERY - Disk space on elastic1016 is OK: DISK OK
[15:10:24] cmjohnson1:
[15:10:32] ever seen this before during attempt to pxe boot?
[15:10:32] [iBoot-09]: No Target information.
[15:10:45] yes and I don't know what it means
[15:10:53] i ran into it with elastic1019
[15:11:20] haa, that's the one i'm trying right now
[15:11:29] is that an HP?
[15:11:47] no
[15:12:01] R420
[15:12:30] try disabling iSCSI boot if that's enabled
[15:14:00] looking...
[15:14:12] or anything iscsi really
[15:14:13] don't need that
[15:15:02] i don't see anything iscsi related...
[15:15:19] somehow it's related to the new ssds. that's about all I have concluded. We didn't have this problem before
[15:15:24] we swapped them
[15:15:56] we don't do this in other elastics, but I could try creating a virtual disk...?
[15:15:57] dunno.
[15:16:15] in PERC H310 Mini configs
[15:16:48] did you check both the main bios and the raid controller?
[15:16:50] wait
[15:16:52] these are H310s?
[15:16:53] argh
[15:17:15] can you put in a procurement ticket to get all H310s replaced by H710s?
[15:17:21] we can't expect any decent performance from H310s
[15:17:56] (PS1) Manybubbles: Turn down slow logging for updates [operations/puppet] - https://gerrit.wikimedia.org/r/156126
[15:18:41] ok,
[15:18:51] elastic1017-1019 are the repurposed old solr boxes
[15:18:54] so maybe that's why they have these?
[15:18:58] not sure if the other elastics do or not
[15:19:00] cmjohnson1: do you know?
[15:19:05] no they're new boxes
[15:19:08] oh
[15:19:11] they are?
[15:19:21] i think...lemme dbl check
[15:19:28] Hi all, are there up-to-date notes on how you deploy MW and supporting infrastructure such as twemproxy, memcache, redis?
[15:19:34] i really hope we didn't buy any H310s new
[15:19:35] ... mostly what is the current way to configure :)
[15:20:04] ottomata you're right...so yeah that would explain the controller
[15:20:12] We have two memcached, configured through php settings.
I heard that twemproxy would solve my two caches racing problems
[15:20:34] hm, ok, well 1017 is in production
[15:20:48] i'll put in a ticket to get new controllers for those 3 cmjohnson1
[15:20:52] before trying to continue with 1019
[15:22:26] mark: are they a lot worse btw? currently elasticsearch boxes don't use any hw raid
[15:23:18] (possibly for that reason? I don't know)
[15:24:50] i don't know the reason for that either, many of our boxes don't use hw raid...perhaps for ease of setup (partman is automatic...virtual disks are not?)
[15:25:04] cmjohnson1: https://rt.wikimedia.org/Ticket/Display.html?id=8191
[15:25:45] godog: worse than that
[15:25:50] <_joe_> H310 is a terrible raid controller
[15:25:55] ottomata: I will make sure this gets ordered asap
[15:26:03] H310s make all writes sequential whether they're raided or not
[15:26:10] <_joe_> I've read it doesn't support readahead
[15:26:11] so no parallel operations across disks
[15:26:27] making them useless for anything that needs any i/o performance
[15:26:45] <_joe_> how can they ever ship that shit is beyond my comprehension.
[15:26:50] indeed
[15:27:00] cmjohnson1: thanks, just to double check, are we sure that the other es nodes are not H310s?
[15:27:23] haha wat
[15:27:32] cmjohnson1: can you also make a ticket to check for _all_ remaining H310s we have?
[15:27:35] <_joe_> that's probably a marketing stunt: are you whining about PERCs?
Now we show you what a truly crappy controller is like
[15:27:49] not that we want to replace them all, but i certainly want to replace them for all boxes that need performant io
[15:27:52] <_joe_> so now we all love PERCs
[15:28:01] it is a PERC
[15:28:07] ottomata: I will check
[15:28:09] <_joe_> well, not a REAL one
[15:28:27] so if you order the CHEAPER server with just on-board SATA
[15:28:28] it will be much faster
[15:28:38] <_joe_> that is completely braindead
[15:28:49] we had H310s in varnish at some point, that was horrible
[15:29:00] thanks cmjohnson1
[15:29:01] <_joe_> I can imagine...
[15:29:06] we replaced those with H710s, that's better
[15:29:13] but since we buy them with onboard SATA, much faster still ;)
[15:29:23] <_joe_> dell servers used to have no such surprises once
[15:30:31] <_joe_> but I'm not buying hardware directly since 2011 so I'm completely in the dark about recent servers
[15:30:52] ottomata: yeah it'd be a nice addition to have partman create virtual disks too :( that involves megacli tho
[15:30:59] yeah we've had 2 large issues with dell
[15:31:07] one of them was the H310s
[15:33:18] (CR) Filippo Giunchedi: [C: 1] Turn down slow logging for updates [operations/puppet] - https://gerrit.wikimedia.org/r/156126 (owner: Manybubbles)
[15:34:45] (CR) Rush: "note for reviewers, this crashes the api interface itself so a local workaround that doesn't patch bugzilla itself isn't possible. bleh" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/156100 (https://bugzilla.wikimedia.org/69747) (owner: Aklapper)
[15:58:35] !log enabled elasticsearch shard allocation row awareness (via rest api)
[15:58:41] Logged the message, Master
[16:00:59] !log deploying tiny OpenStackManager upgrade on wikitech
[16:01:06] Logged the message, Master
[16:02:40] what raid controller does zinc have?
[16:04:06] <^d> ottomata: Stuff's moving about :)
[16:04:13] see it ja,
[16:04:15] 28 relocating shards
[16:04:28] why's this only show 2?
[16:04:28] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=es_relocating_shards&s=by+name&c=Elasticsearch+cluster+eqiad&h=&host_regex=&max_graphs=0&tab=m&vn=&hide-hf=false&sh=1&z=small&hc=4
[16:05:01] hm, zinc's quite a different beast
[16:05:12] ottomata: probably an effect of the api getting slower? we had to reduce the timeout otherwise it'd lock up gmond
[16:05:32] <^d> ottomata: No clue on that. But state API agrees it's 28 :)
[16:05:34] hm
[16:05:48] <^d> curl -s localhost:9200/_cluster/state | jq -c '.nodes as $nodes | .routing_table.indices[].shards[][] | select(.relocating_node) | {index, shard, from: $nodes[.node].name, to: $nodes[.relocating_node].name}'
[16:06:01] ottomata: essentially this https://github.com/elasticsearch/elasticsearch/issues/7385
[16:06:50] !log wikitech deployment finished. Note that the OpenStackManager submodule is off of the MediaWiki branch because… the whole submodule setup there is a bit broken on account of a git bug that uses absolute paths to manage submodules.
[16:06:56] Logged the message, Master
[16:07:35] * andrewbogott is pretty embarrassed about that last SAL
[16:07:58] * yuvipanda pats andrewbogott
[16:08:35] I am pretty sure I know how to fix it, and pretty sure that it'll result in downtime. Since I'm trying to get wikitech on the deployment train anyway, best to save the downtime for that.
[16:09:14] <^d> wikitech on the train?
[16:09:17] * ^d does a happy dance
[16:09:32] https://gerrit.wikimedia.org/r/#/c/155789/ reviews welcome!
[16:10:25] I also sort of want to upgrade git everywhere to fix that absolute-path bug, but surely the deployment system already works around it.
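Editor's sketch: the jq filter ^d pasted at 16:05:48 can be mirrored in Python for anyone without jq at hand. The function below takes an already-parsed `/_cluster/state` response and returns one record per shard copy that has a `relocating_node`. The function name, the sample node and index names, and the trimmed sample data are assumptions for illustration; the field layout is inferred only from the jq expression in the log, so treat this as a rough equivalent rather than what was actually run.

```python
import json


def relocating_shards(state):
    """Return one record per shard copy that is currently relocating.

    `state` is the parsed JSON of GET /_cluster/state. The layout is
    assumed from the jq filter in the log: nodes keyed by node id,
    indices keyed by name, shards keyed by shard number.
    """
    nodes = state["nodes"]
    moves = []
    for index in state["routing_table"]["indices"].values():
        for copies in index["shards"].values():
            for copy in copies:
                # Mirrors jq's select(.relocating_node): skip null/missing.
                if not copy.get("relocating_node"):
                    continue
                moves.append({
                    "index": copy["index"],
                    "shard": copy["shard"],
                    "from": nodes[copy["node"]]["name"],
                    "to": nodes[copy["relocating_node"]]["name"],
                })
    return moves


# Hypothetical two-node state, trimmed to the fields the filter touches.
SAMPLE_STATE = {
    "nodes": {
        "id-a": {"name": "node-a"},
        "id-b": {"name": "node-b"},
    },
    "routing_table": {
        "indices": {
            "example_index": {
                "shards": {
                    "0": [
                        {"index": "example_index", "shard": 0,
                         "node": "id-a", "relocating_node": "id-b"},
                        {"index": "example_index", "shard": 0,
                         "node": "id-b", "relocating_node": None},
                    ],
                },
            },
        },
    },
}

if __name__ == "__main__":
    for move in relocating_shards(SAMPLE_STATE):
        print(json.dumps(move))
```

Feeding it the output of the same curl call, parsed with `json.load`, should give records comparable to the jq one-liner, and `len(relocating_shards(state))` is the count being compared against the ganglia graph.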
[16:14:50] andrewbogott: I'm wondering why you don't put half of those config stuff into the actual initialise file instead of it's own [16:15:07] Perhaps some; but most could live in that main file instead of an override. [16:15:50] (03CR) 10Chad: "I like us moving towards having it on the train :)" (037 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [16:16:08] <^d> JohnLewis: That was the jist of what I just commented :p [16:16:33] ^d :p [16:17:50] ^d, JohnLewis, two reasons, neither of which I'm married to: [16:18:05] 1) Adding 100 new settings to initialise just so I can customize them for wikitech seems rude [16:18:29] <^d> I'm not saying all of them, just the ones that are already managed elsewhere :) [16:18:37] <^d> Still probably need a specific file for the weird stuff. [16:18:38] 1) I meant existing ones for now. :) [16:18:42] 2) Mingling all those settings into a different file and different format will make it more or less impossible to evaluate whether or not the old config and the new config are the same. [16:19:04] So, I'm tempted to have one big ugly pile of settings for the first pass, and then have incremental patches that move a few at a time, thus being reviewable. [16:19:12] <^d> I don't think (2) is a big deal. [16:19:22] <^d> This is what review and version control are for. [16:19:35] andrewbogott: at least make it a wgConf. That'll make me happy at least :) [16:19:41] bd808: mark tells me that beta runs on its own puppetmaster - so our exim patch for prod can be implemented only on beta right ? [16:19:46] You're volunteering to check each of 100 settings and verify that they correspond? [16:20:13] <^d> andrewbogott: It's not 100 settings. See my response to (1). [16:20:19] <^d> It's like...10 or 15. [16:20:27] (03CR) 10BryanDavis: "> I think we should move more stuff to the main InitialiseSettings where it belongs rather than its own file." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [16:20:33] andrewbogott: hey I have spare time :p I'm volunteering to do anything [16:21:00] <^d> bd808: Yeah, no realm. I dislike that idea. [16:21:01] ok, fair enough. [16:21:29] If no realm then how do I switch on the settings that aren't used elsewhere? [16:21:37] (which, the realm idea won't work as written anyway) [16:21:47] (03CR) 10John F. Lewis: "The only real reason I can see labs existing is we're running two separate clusters off the same config but labs is obviously different th" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [16:22:22] tonythomas: Beta does have it's own puppetmaster where we can apply and test puppet patches. I have no idea about if the exim that is managed by that actually gets messages from the outside world routed in to it. [16:22:27] why Gerrit not have an 'edit comment' :( [16:22:34] PROBLEM - Puppet freshness on elastic1016 is CRITICAL: Last successful Puppet run was Mon 25 Aug 2014 06:17:36 UTC [16:22:56] (03CR) 10Nemo bis: "Doesn't it need to be added to the .dblist files? And shouldn't it be called wikitechwiki?" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [16:23:03] <^d> andrewbogott: We could do something in CommonSettings to include those settings [16:23:07] <^d> no big deal. [16:23:12] ok, great. [16:23:35] (03CR) 10John F. Lewis: "I should probably clarify; I mean merging all the current cluster config (user rights that stuff) to the main file but leaving the 'wikite" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [16:23:42] I will try to implement these suggestions after I get my morning pastry :) [16:23:47] bd808: oh. anyway, - the patches will go to operations/puppet itself right? 
[16:23:59] tonythomas: yes [16:24:37] tonythomas: We can cherry-pick patches that are proposed in gerrit into the beta puppetmaster for testing [16:25:11] oh. great. and any thoughts on how we will be refrering beta API post URL ? [16:25:29] just http://deployment.wikimedia.beta.wmflabs.org/w/api.php will be good ? [16:25:39] !log Jenkins: disconnecting and reconnecting Gearman plugin from https://integration.wikimedia.org/ci/configure [16:25:46] Logged the message, Master [16:26:18] tonythomas: Probably? That's a live api endpoint in the beta cluster. Where will you be pointing it in prod? At meta? [16:26:45] hoo told me something like http://text-lb.wikimedia.org/api.php [16:27:15] We will need the extension installed in one of the wiki (meta, I think ) and point it to there [16:27:56] text-lb/api is a 404 so I think that's probably not right [16:28:42] well [16:28:45] there's the host header [16:28:51] which needs to the wiki hostname [16:28:55] and there's the server to connect to [16:29:03] which can be text-lb. [16:29:14] (03CR) 10Alexandros Kosiaris: [C: 04-2] "I don't see the config template being referenced from within the module so I don't see why it should be in the module. Instead it is being" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155111 (owner: 10Dzahn) [16:29:15] ah sure [16:29:36] or even the internal apache cluster directly, bypassing varnish [16:29:59] something like appservers.svc.$mw_primary.net [16:31:05] (03PS7) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [operations/puppet] - 10https://gerrit.wikimedia.org/r/155753 [16:32:07] mark: we will need something like appservers.svc.$mw_primary.net for beta too ? [16:32:29] The beta equivalent of text-lb would be deployment-cache-text02. We don't have pybal in beta to give a better internal address. [16:32:30] or just the beta URL as in https://gerrit.wikimedia.org/r/#/c/155753/7/templates/exim/exim4.conf.SMTP_IMAP_MM.erb ? 
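(Editor's note) mark's point above is that the server you connect to and the Host header are two separate things: the connection can go to text-lb while the Host header names the wiki. A minimal Python sketch of that idea follows. It only builds the request and does not send it. The choice of `meta.wikimedia.org` as the wiki hostname, the `/w/api.php` path and the `siteinfo` query are assumptions for illustration, and `build_api_request` is a hypothetical helper, not part of the patch under review.

```python
from urllib.request import Request


def build_api_request(server, wiki_host, path="/w/api.php",
                      query="action=query&meta=siteinfo&format=json"):
    """Build a request that connects to `server` but asks for `wiki_host`."""
    url = f"http://{server}{path}?{query}"
    # The explicit Host header replaces the one urllib would derive from the URL.
    return Request(url, headers={"Host": wiki_host})


req = build_api_request("text-lb.wikimedia.org", "meta.wikimedia.org")
print("connect to:", req.host)
print("Host header:", req.get_header("Host"))
```

The same split applies to the beta case discussed next: the server would be the beta cache host and the Host header the beta wiki's hostname.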
[16:32:57] indeed [16:34:33] * mark is off for dinner [16:47:20] bd808: strange. but http://deployment-cache-text02/api.php 404's :( [16:47:37] !log reboot ms-be1011, xfsaild errors in dmesg [16:47:43] Logged the message, Master [16:48:14] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2019: active_shards: 6055: relocating_shards: 29: initializing_shards: 1: unassigned_shards: 3 [16:48:14] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2019: active_shards: 6055: relocating_shards: 29: initializing_shards: 1: unassigned_shards: 3 [16:48:15] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2019: active_shards: 6055: relocating_shards: 29: initializing_shards: 1: unassigned_shards: 3 [16:48:15] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2019: active_shards: 6055: relocating_shards: 29: initializing_shards: 1: unassigned_shards: 3 [16:48:17] !log Jenkins pooled in a new slave [https://integration.wikimedia.org/ci/computer/wdjenkins-node1/ wdjenkins-node1] that will be used to run Wikidata jenkins jobs. Work in progress with addshore. It is not running jobs yet. 
[16:48:23] Logged the message, Master [16:49:34] PROBLEM - swift-account-replicator on ms-be1011 is CRITICAL: Connection refused by host [16:49:34] PROBLEM - swift-object-replicator on ms-be1011 is CRITICAL: Connection refused by host [16:49:41] tonythomas: That's going to be the same as http://text-lb.wikimedia.org/api.php ; Both need a host header to tell them how to actually handle the request [16:49:55] PROBLEM - swift-object-updater on ms-be1011 is CRITICAL: Connection refused by host [16:50:15] PROBLEM - swift-account-reaper on ms-be1011 is CRITICAL: Connection refused by host [16:50:15] PROBLEM - swift-account-auditor on ms-be1011 is CRITICAL: Connection refused by host [16:50:24] PROBLEM - SSH on ms-be1011 is CRITICAL: Connection refused [16:50:52] bd808: the page says Did you mean to type http://text-lb.wikimedia.org/wiki/api.php? :D [16:50:54] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2022: active_shards: 6061: relocating_shards: 32: initializing_shards: 2: unassigned_shards: 5 [16:50:54] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2022: active_shards: 6061: relocating_shards: 32: initializing_shards: 2: unassigned_shards: 5 [16:50:54] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2022: active_shards: 6061: relocating_shards: 32: initializing_shards: 2: unassigned_shards: 5 [16:50:55] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2022: active_shards: 6061: relocating_shards: 32: initializing_shards: 2: unassigned_shards: 5 [16:50:58] wiki/ is not there right ? [16:52:24] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2024: active_shards: 6069: relocating_shards: 30: initializing_shards: 2: unassigned_shards: 3 [16:52:25] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2024: active_shards: 6069: relocating_shards: 30: initializing_shards: 2: unassigned_shards: 3 [16:52:25] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2024: active_shards: 6069: relocating_shards: 30: initializing_shards: 2: unassigned_shards: 3 [16:53:14] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2025: active_shards: 6071: relocating_shards: 29: initializing_shards: 2: unassigned_shards: 4 [16:53:14] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2025: active_shards: 6071: relocating_shards: 29: initializing_shards: 2: unassigned_shards: 4 [16:53:14] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2025: active_shards: 6071: relocating_shards: 29: initializing_shards: 2: unassigned_shards: 4 [16:53:14] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2025: active_shards: 6071: relocating_shards: 29: initializing_shards: 2: unassigned_shards: 4 [16:53:58] (03CR) 10BryanDavis: "I'm not sure about the wiki name (wikitech vs wikitechwiki) I think that depends on the backing database name." (032 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [16:54:32] !log stopping puppet on cp3021.  Testing an increase of http://kafka.queue.buffering.max.ms/ in order to avoid dropping messages during broker metadata  change (e.g. leader elections) [16:54:39] Logged the message, Master [16:55:42] (03PS8) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [operations/puppet] - 10https://gerrit.wikimedia.org/r/155753 [16:55:44] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2028: active_shards: 6082: relocating_shards: 30: initializing_shards: 1: unassigned_shards: 3 [16:55:44] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2028: active_shards: 6082: relocating_shards: 30: initializing_shards: 1: unassigned_shards: 3 [16:55:44] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2028: active_shards: 6082: relocating_shards: 30: initializing_shards: 1: unassigned_shards: 3 [16:55:44] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2028: active_shards: 6082: relocating_shards: 30: initializing_shards: 1: unassigned_shards: 3 [16:55:44] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2028: active_shards: 6082: relocating_shards: 30: initializing_shards: 1: unassigned_shards: 3 [16:55:45] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2028: active_shards: 6082: relocating_shards: 30: initializing_shards: 1: unassigned_shards: 3 [16:55:51] uh oh [16:56:16] manybubbles: ^ [16:56:26] huh [16:56:57] its at yellow now [16:57:07] unassigned shards? [16:57:42] its that we're rebuilding indexes, I think [16:58:02] when a new index is created the cluster is red for a few seconds while it is fully created [16:58:10] hm ok [16:58:12] that's annoying [16:58:21] this is that open bug about using better monitoring [16:58:33] I think godog had a solution for it. 
i remember reviewing it [16:59:54] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2029: active_shards: 6087: relocating_shards: 30: initializing_shards: 0: unassigned_shards: 0 [16:59:54] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2029: active_shards: 6087: relocating_shards: 30: initializing_shards: 0: unassigned_shards: 0 [16:59:54] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2029: active_shards: 6087: relocating_shards: 30: initializing_shards: 0: unassigned_shards: 0 [16:59:54] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2029: active_shards: 6087: relocating_shards: 30: initializing_shards: 0: unassigned_shards: 0 [17:00:14] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2029: active_shards: 6087: relocating_shards: 29: initializing_shards: 0: unassigned_shards: 0 [17:00:14] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2029: active_shards: 6087: relocating_shards: 29: initializing_shards: 0: unassigned_shards: 0 [17:00:14] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2029: active_shards: 6087: relocating_shards: 29: initializing_shards: 0: unassigned_shards: 0 [17:00:14] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2029: active_shards: 6087: relocating_shards: 29: initializing_shards: 0: unassigned_shards: 0 [17:00:15] RECOVERY - swift-account-reaper on ms-be1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [17:00:15] RECOVERY - swift-account-auditor on ms-be1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [17:00:25] RECOVERY - SSH on ms-be1011 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.4 (protocol 2.0) [17:00:25] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2028: active_shards: 6084: relocating_shards: 28: initializing_shards: 0: unassigned_shards: 0 [17:00:25] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2028: active_shards: 6084: relocating_shards: 28: initializing_shards: 0: unassigned_shards: 0 [17:00:25] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2028: active_shards: 6084: relocating_shards: 28: initializing_shards: 0: unassigned_shards: 0 [17:00:34] RECOVERY - swift-account-replicator on ms-be1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [17:00:35] RECOVERY - swift-object-replicator on ms-be1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [17:00:54] RECOVERY - swift-object-updater on ms-be1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [17:02:21] yep that'd be alarming on percentage of non-active shards, pending deployment :( [17:02:29] likely this week tho [17:03:30] ok phew [17:04:43] (03PS9) 1001tonythomas: Added the bouncehandler router to catch in all bounce emails [operations/puppet] - 10https://gerrit.wikimedia.org/r/155753 [17:05:03] ^d, bd808|MEETING, Nemo_bis, does wgDBname need to == the wiki name in the config? [17:05:08] Because unfortunately wgDBname is 'labswiki' [17:06:21] <^d> The config key is the db name. [17:06:33] bd808|MEETING: done https://gerrit.wikimedia.org/r/#/c/155753/ [17:06:37] <^d> So those should all be labswiki, not wikitech. [17:06:44] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 5176.39775079 [17:06:53] ok, thanks. [17:08:48] Um… 'deployment.wikimedia.beta.wmflabs.org' => 'labswiki', <- what's that about? [17:08:58] hashar, is there something in beta that's also called 'labswiki'? [17:09:06] <^d> not a clue. 
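(Editor's note) The 17:02:21 message describes alarming on the percentage of non-active shards rather than on the red/yellow/green status alone. The sketch below shows one way that calculation could look, using the status text format printed by the existing check. It is not the pending patch: the thresholds are made up, `parse_health`, `inactive_percent` and `classify` are hypothetical names, and it assumes relocating shards are already counted inside `active_shards`.

```python
def parse_health(text):
    """Turn 'key: value: key: value' check output into a dict of strings."""
    parts = [p.strip() for p in text.split(":")]
    return dict(zip(parts[0::2], parts[1::2]))


def inactive_percent(health):
    """Percentage of shard copies that are initializing or unassigned."""
    inactive = int(health["initializing_shards"]) + int(health["unassigned_shards"])
    total = int(health["active_shards"]) + inactive
    return 100.0 * inactive / total if total else 0.0


def classify(health, warn=1.0, crit=5.0):
    """Map the inactive percentage to a check state (thresholds are assumptions)."""
    pct = inactive_percent(health)
    if pct >= crit:
        return "CRITICAL"
    if pct >= warn:
        return "WARNING"
    return "OK"


# Status text copied from one of the 16:48:14 alerts in this log.
sample = ("status: red: timed_out: false: number_of_nodes: 17: "
          "number_of_data_nodes: 17: active_primary_shards: 2019: "
          "active_shards: 6055: relocating_shards: 29: "
          "initializing_shards: 1: unassigned_shards: 3")

health = parse_health(sample)
print(classify(health), round(inactive_percent(health), 3))
```

With 4 inactive copies out of 6059, the sample stays well under either threshold, which matches the complaint in the channel that a briefly red cluster during index creation should not page.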
[17:13:29] !log rebooting ms-be1002 to pick up updated kernel [17:13:31] (03CR) 10Yuvipanda: "Ticket: https://rt.wikimedia.org/Ticket/Display.html?id=8195" [operations/puppet] - 10https://gerrit.wikimedia.org/r/155137 (owner: 10Yuvipanda) [17:13:35] Logged the message, Master [17:16:53] ^d: when bryan says 'config based on the wiki name' is the wiki name $site? [17:17:03] * andrewbogott is still unclear what order all this loads [17:17:56] <^d> wiki name is almost always the db name [17:18:12] <^d> always even? [17:18:55] <^d> We reference config by things like "enwiki" and "frwikisource", not their domain name. [17:21:20] (03CR) 10Chad: "Let's go ahead and get this live, logs are filling up disks." [operations/puppet] - 10https://gerrit.wikimedia.org/r/156126 (owner: 10Manybubbles) [17:22:37] (03CR) 10Manybubbles: "I'm reasonably sure this is caused by my change to spend more time indexing that _should_ save time querying. I'm still slowly rolling it" [operations/puppet] - 10https://gerrit.wikimedia.org/r/156126 (owner: 10Manybubbles) [17:23:19] (03CR) 10Manybubbles: "I tried to apply this directly to the running cluster but I wasn't able to - Chad or Andrew or Filippo or whoever - can you take a look at" [operations/puppet] - 10https://gerrit.wikimedia.org/r/156126 (owner: 10Manybubbles) [17:23:56] (03PS1) 10Ottomata: Set varnishkafka queue_buffering_max_ms to 5 seconds [operations/puppet] - 10https://gerrit.wikimedia.org/r/156148 [17:24:34] !log reboot ms-be1004 to pick up kernel upgrade [17:24:36] (03CR) 10Ottomata: [C: 032 V: 032] Turn down slow logging for updates [operations/puppet] - 10https://gerrit.wikimedia.org/r/156126 (owner: 10Manybubbles) [17:24:40] Logged the message, Master [17:25:15] (03PS2) 10Ottomata: Set varnishkafka queue_buffering_max_ms to 5 seconds [operations/puppet] - 10https://gerrit.wikimedia.org/r/156148 [17:25:16] manybubbles: done [17:25:49] (03CR) 10Ottomata: [C: 032 V: 032] Set varnishkafka queue_buffering_max_ms to 5 seconds 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/156148 (owner: 10Ottomata) [17:26:13] ottomata: so without a restart we'll have to find a way to apply it using curl [17:26:39] oh, what was not working? [17:26:48] the setting wasn't taking via curl? [17:27:04] RECOVERY - Unmerged changes on repository puppet on virt0 is OK: No changes to merge. [17:29:12] ottomata: yeah - I tried it before I made the puppet change and I couldn't figure it out and moved on [17:29:20] <^d> manybubbles: I've still got to restart 1, 3 and 8 as it is. [17:29:43] ^d: let me know before you do that so I can stop my rebuild [17:32:22] <^d> I should probably wait for the shuffling from awareness to settle down anyway. [17:32:31] <^d> It's still got 31 relocating. [17:33:20] (03PS1) 10Filippo Giunchedi: filippo: set prompt on interactive shells [operations/puppet] - 10https://gerrit.wikimedia.org/r/156151 [17:33:54] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] filippo: set prompt on interactive shells [operations/puppet] - 10https://gerrit.wikimedia.org/r/156151 (owner: 10Filippo Giunchedi) [17:34:46] godog: do you know https://github.com/nojhan/liquidprompt ? [17:34:51] i like! [17:35:12] <^d> manybubbles: You know, for a 33mb shard, I seem to see pcdwiki_content awfully often. [17:35:29] ^d: in which log? [17:35:41] <^d> Not in logs, in allocation and so forth. [17:35:45] <^d> Seems to like moving around a fair bit, [17:36:01] <^d> And it's rather slow initializing. [17:36:07] odd that you'd see it a lot - well odd you'd see it more then anything else small [17:36:18] <^d> Confirmation bias perhaps. [17:36:54] ^d: mayhaps. not too many recent edits: https://pcd.wikipedia.org/wiki/Sp%C3%A9cial:Modifications_r%C3%A9centes [17:37:21] ottomata: ah that looks very nice, especially the last command's exit code [17:37:38] *starred* [17:43:04] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[17:44:44] RECOVERY - Puppet freshness on elastic1016 is OK: puppet ran at Mon Aug 25 17:44:36 UTC 2014 [17:45:35] RECOVERY - puppet last run on elastic1016 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [17:48:55] PROBLEM - Varnishkafka log producer on cp4014 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka [17:49:14] (03PS2) 10Matanya: hadoop: Give deskana access to Hive [operations/puppet] - 10https://gerrit.wikimedia.org/r/155137 (owner: 10Yuvipanda) [17:49:50] matanya: ty [17:49:55] :) [17:50:36] Hey Coren, if a tools-lab tool talks to the cluster, do you know what the cluster things their source IP is? [17:50:59] *thinks* [17:51:04] csteipp: *possibly* 10.0.0.x? [17:51:16] when I was doing stuff with checkuser that is where a bunch of 'em seemed to be coming from, but I didn't verify [17:51:21] err, 10.0.0.0/24 [17:51:53] it should be that I'd think ^ [17:52:22] Cool, I'll try that out. Thansk! [17:52:38] (03CR) 10Andrew Bogott: Random stab at getting wikitech config in here. (038 comments) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [17:54:04] yuvipanda: it is possible to set up a uploadwizcampaign to allow upload by url for all whitelisted urls? [17:54:12] if i am right you have devloaped the extension :D [17:54:43] Steinsplitter: no, it isn't, sadly. And I'm unsure how much work it would be to add them. [17:55:45] (03PS4) 10Andrew Bogott: Random stab at getting wikitech config in here. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 [17:55:54] (03CR) 10jenkins-bot: [V: 04-1] Random stab at getting wikitech config in here. 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [17:58:17] ottomata: Hey, on the ssd testing, can we just put the new ssds into elastic1016 [17:58:24] rather than block that testing for the new controllers [17:58:37] (we will still get updated controller quotes, i just rather get the ssd testing done sooner than later) [17:58:59] (03PS5) 10Andrew Bogott: Random stab at getting wikitech config in here. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 [17:59:07] (03CR) 10jenkins-bot: [V: 04-1] Random stab at getting wikitech config in here. [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/155789 (owner: 10Andrew Bogott) [18:02:42] csteipp: It should be one of of the -exec nodes; there is no natting as a rule unless you're leaving 10.0.0.0/8 [18:03:16] csteipp: If it's from a public IP, then it's also probably one of the labs public IPs (exec nodes have public IPs) [18:04:44] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2033: active_shards: 6099: relocating_shards: 28: initializing_shards: 0: unassigned_shards: 0 [18:04:44] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2033: active_shards: 6099: relocating_shards: 28: initializing_shards: 0: unassigned_shards: 0 [18:04:44] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2033: active_shards: 6099: relocating_shards: 28: initializing_shards: 0: unassigned_shards: 0 [18:04:44] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2033: active_shards: 6099: relocating_shards: 28: initializing_shards: 0: unassigned_shards: 0 [18:04:44] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2033: active_shards: 6099: relocating_shards: 28: initializing_shards: 0: unassigned_shards: 0 [18:04:45] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2033: active_shards: 6099: relocating_shards: 28: initializing_shards: 0: unassigned_shards: 0 [18:06:54] RECOVERY - Varnishkafka log producer on cp4014 is OK: PROCS OK: 1 process with command name varnishkafka [18:17:04] RECOVERY - Unmerged changes on repository puppet on virt0 is OK: No changes to merge. [18:19:44] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2033: active_shards: 6097: relocating_shards: 26: initializing_shards: 1: unassigned_shards: 3 [18:19:44] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2033: active_shards: 6097: relocating_shards: 26: initializing_shards: 1: unassigned_shards: 3 [18:19:44] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2033: active_shards: 6097: relocating_shards: 26: initializing_shards: 1: unassigned_shards: 3 [18:19:44] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2033: active_shards: 6097: relocating_shards: 26: initializing_shards: 1: unassigned_shards: 3 [18:19:44] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2033: active_shards: 6097: relocating_shards: 26: initializing_shards: 1: unassigned_shards: 3 [18:19:44] PROBLEM - ElasticSearch health check on elastic1016 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2033: active_shards: 6097: relocating_shards: 26: initializing_shards: 1: unassigned_shards: 3 [18:21:27] i think gangllia is going weird...or check_ganglia in icinga [18:21:31] in ops meeting, will check it out asap... 
[18:23:07] <^d> pdcwiki_content_1408990772 0 r UNASSIGNED [18:23:08] <^d> pdcwiki_content_1408990772 0 r UNASSIGNED [18:23:08] <^d> pcdwiki_general_1408989893 0 r UNASSIGNED [18:23:09] <^d> pcdwiki_general_1408989893 0 r UNASSIGNED [18:23:11] <^d> manybubbles: ^^^ [18:23:14] <^d> I don't think I'm crazy [18:24:09] ^d: I see the trouble, or something [18:31:52] <^d> pcdwiki is so small. Could it just be nuked and recreated? [18:32:01] <^d> manybubbles: ^ [18:32:19] ^d: we're having some trouble with the rebuilds I'm doing now [18:32:27] some operations are timing out and failing [18:32:29] not sure why [18:32:44] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 0: unassigned_shards: 0 [18:32:44] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 0: unassigned_shards: 0 [18:32:44] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 0: unassigned_shards: 0 [18:32:44] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 0: unassigned_shards: 0 [18:32:44] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 0: unassigned_shards: 0 [18:32:44] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 0: unassigned_shards: 0 [18:32:54] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 1: unassigned_shards: 1 [18:32:54] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 1: unassigned_shards: 1 [18:32:54] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 1: unassigned_shards: 1 [18:32:54] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 1: unassigned_shards: 1 [18:33:04] <^d> manybubbles: Well it's still moving a ton of shards about from rack awareness. [18:33:05] ^d: but I'm basically just rebuilding pcdwiki [18:33:20] and that is why half of those just decided they were critical - the index was building [18:35:27] <^d> 1 unassigned shard is not critical. Stupid script. [18:35:32] ^d: yeah [18:36:00] also - the reason it unassigned is because we enabled rack awareness a little while ago and everything is still shuffling [18:36:09] and ganglia is down [18:36:42] ganglia dooown, ah hm, restarting... [18:38:25] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 1: unassigned_shards: 1 [18:38:25] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 1: unassigned_shards: 1 [18:38:25] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 21: initializing_shards: 1: unassigned_shards: 1 [18:38:53] ok - I'mma leave elasticsearch alone and let it settle after the rack awareness [18:39:00] the rebuilds can wait [18:40:25] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 22: initializing_shards: 0: unassigned_shards: 0 [18:40:25] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 22: initializing_shards: 0: unassigned_shards: 0 [18:40:25] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2032: active_shards: 6096: relocating_shards: 22: initializing_shards: 0: unassigned_shards: 0 [18:40:55] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2031: active_shards: 6093: relocating_shards: 24: initializing_shards: 0: unassigned_shards: 0 [18:40:55] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2031: active_shards: 6093: relocating_shards: 24: initializing_shards: 0: unassigned_shards: 0 [18:40:55] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2031: active_shards: 6093: relocating_shards: 24: initializing_shards: 0: unassigned_shards: 0 [18:40:55] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 17: number_of_data_nodes: 17: active_primary_shards: 2031: active_shards: 6093: relocating_shards: 24: initializing_shards: 0: unassigned_shards: 0 [18:51:34] PROBLEM - RAID on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:51:55] PROBLEM - Disk space on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:52:04] PROBLEM - check configured eth on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:52:14] PROBLEM - check if dhclient is running on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:52:24] PROBLEM - puppet last run on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:52:24] PROBLEM - DPKG on rhenium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:57:01] (CR) BBlack: "Should this be on the main cluster or the misc cluster?" [operations/dns] - https://gerrit.wikimedia.org/r/154222 (https://bugzilla.wikimedia.org/44731) (owner: Jeremyb) [18:58:18] (CR) Hashar: "I have mentioned the potential security based on https://gerrit.wikimedia.org/r/#/c/150273/ which made /var/log/puppet.log root only.
Re" [operations/puppet] - https://gerrit.wikimedia.org/r/143788 (https://bugzilla.wikimedia.org/60690) (owner: BryanDavis) [19:01:06] (PS10) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [operations/puppet] - https://gerrit.wikimedia.org/r/155753 [19:04:24] PROBLEM - NTP on rhenium is CRITICAL: NTP CRITICAL: No response from NTP server [19:17:21] (PS2) Ori.livneh: mediawiki::sync: run sync-common unless InitialiseSettings.php exists [operations/puppet] - https://gerrit.wikimedia.org/r/156064 [19:17:49] (CR) Ori.livneh: [C: 2 V: 2] mediawiki::sync: run sync-common unless InitialiseSettings.php exists [operations/puppet] - https://gerrit.wikimedia.org/r/156064 (owner: Ori.livneh) [19:24:37] (PS6) Andrew Bogott: Random stab at getting wikitech config in here. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 [19:24:44] (CR) jenkins-bot: [V: -1] Random stab at getting wikitech config in here. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 (owner: Andrew Bogott) [19:25:21] (CR) Ori.livneh: [C: 2] hhvm: dedupe mysql config key [operations/puppet] - https://gerrit.wikimedia.org/r/156016 (owner: Ori.livneh) [19:26:45] (PS7) Andrew Bogott: Random stab at getting wikitech config in here.
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 [19:29:08] (CR) Ori.livneh: apache: when sourcing env-enabled/*, redirect stdout to stderr (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/156060 (owner: Ori.livneh) [19:29:44] RECOVERY - Disk space on rhenium is OK: DISK OK [19:29:54] RECOVERY - check configured eth on rhenium is OK: NRPE: Unable to read output [19:30:04] RECOVERY - check if dhclient is running on rhenium is OK: PROCS OK: 0 processes with command name dhclient [19:30:14] RECOVERY - puppet last run on rhenium is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:30:15] RECOVERY - DPKG on rhenium is OK: All packages OK [19:30:15] RECOVERY - NTP on rhenium is OK: NTP OK: Offset 0.003604888916 secs [19:30:22] (PS3) Ori.livneh: apache: when sourcing env-enabled/*, redirect stdout to stderr [operations/puppet] - https://gerrit.wikimedia.org/r/156060 [19:30:24] RECOVERY - RAID on rhenium is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [19:33:12] (CR) Ori.livneh: [C: 2] apache: when sourcing env-enabled/*, redirect stdout to stderr [operations/puppet] - https://gerrit.wikimedia.org/r/156060 (owner: Ori.livneh) [19:33:39] (PS11) 01tonythomas: Added the bouncehandler router to catch in all bounce emails [operations/puppet] - https://gerrit.wikimedia.org/r/155753 [19:35:04] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: There are 3 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [19:46:05] (CR) Ori.livneh: [C: 2] jobrunner: use trebuchet package provider [operations/puppet] - https://gerrit.wikimedia.org/r/155859 (owner: Ori.livneh) [20:01:01] time to deploy a fresh version of parsoid.
[20:01:02] (PS1) Ori.livneh: Fix deployment server override variable name [operations/puppet] - https://gerrit.wikimedia.org/r/156175 [20:01:38] * Nemo_bis read "french" [20:01:40] (CR) Ori.livneh: [C: 2 V: 2] "trivial" [operations/puppet] - https://gerrit.wikimedia.org/r/156175 (owner: Ori.livneh) [20:02:32] Nemo_bis: pronounced 'par-swah' [20:03:04] RECOVERY - Unmerged changes on repository puppet on virt0 is OK: No changes to merge. [20:03:04] :-) [20:06:42] !log deployed parsoid version 5b5a5ed5 [20:06:48] Logged the message, Master [20:11:04] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:11:04] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:15:11] (PS1) 20after4: add twentyafterfour to deployment group [operations/puppet] - https://gerrit.wikimedia.org/r/156177 [20:15:51] (CR) jenkins-bot: [V: -1] add twentyafterfour to deployment group [operations/puppet] - https://gerrit.wikimedia.org/r/156177 (owner: 20after4) [20:18:09] (CR) BryanDavis: add twentyafterfour to deployment group (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/156177 (owner: 20after4) [20:21:58] (PS2) Ori.livneh: mediawiki: avoid installing php packages on HAT servers [operations/puppet] - https://gerrit.wikimedia.org/r/153772 (owner: Giuseppe Lavagetto) [20:22:47] (PS2) 20after4: add twentyafterfour to deployment group [operations/puppet] - https://gerrit.wikimedia.org/r/156177 [20:23:26] (CR) jenkins-bot: [V: -1] add twentyafterfour to deployment group [operations/puppet] - https://gerrit.wikimedia.org/r/156177 (owner: 20after4) [20:25:33] (PS3) 20after4: add twentyafterfour to deployment group [operations/puppet] - https://gerrit.wikimedia.org/r/156177 [20:26:24] PROBLEM - Disk space on
elastic1014 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19777 MB (3% inode=99%): [20:27:04] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [20:27:04] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [20:33:21] (PS1) RobH: setup new mgmt ip for install2001 in codfw [operations/dns] - https://gerrit.wikimedia.org/r/156181 [20:35:38] (PS2) RobH: setup new mgmt ip for install2001 in codfw [operations/dns] - https://gerrit.wikimedia.org/r/156181 [20:36:15] bd808 and ^d, can I get another round of reviews for https://gerrit.wikimedia.org/r/#/c/155789/ before y'all head home? [20:36:35] (CR) RobH: [C: 1] "This looks right to me, but as I'm adding another mgmt subnet for servers, I'd like Mark to review." [operations/dns] - https://gerrit.wikimedia.org/r/156181 (owner: RobH) [20:36:45] andrewbogott: you bet [20:36:49] thx [20:38:57] <^d> Looking. [20:42:55] (CR) Chad: "Much better! Couple of more inline picks over config." (7 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 (owner: Andrew Bogott) [20:43:59] (CR) Chad: Random stab at getting wikitech config in here. (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 (owner: Andrew Bogott) [20:45:02] (CR) Dzahn: "since 10.193.0.0/16 = 10.193.0.1 thru 10.193.255.254 , and ifthat entire /16 is for mgmt, wouldn't that also include the server network "1" [operations/dns] - https://gerrit.wikimedia.org/r/156181 (owner: RobH) [20:45:26] oh damn it [20:48:24] PROBLEM - puppet last run on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[20:48:53] (Abandoned) Dzahn: etherpad - move Apache config into module [operations/puppet] - https://gerrit.wikimedia.org/r/155111 (owner: Dzahn) [20:49:52] (PS3) RobH: setup new mgmt ip for install2001 in codfw [operations/dns] - https://gerrit.wikimedia.org/r/156181 [20:53:18] ori, MaxSem: FYI I added a patch to SWAT, hope that's okay [20:53:24] PROBLEM - RAID on analytics1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:37] how dare you! [20:53:39] :P [20:54:14] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [20:54:54] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [20:55:10] (CR) Dzahn: [C: 2] remove HTTPS config from gitblit template [operations/puppet] - https://gerrit.wikimedia.org/r/154973 (owner: Dzahn) [20:55:34] Krenair, if it's a backport, of what? [20:55:43] (CR) Dzahn: "this shouldn't do anything because gitblit is proxied" [operations/puppet] - https://gerrit.wikimedia.org/r/154973 (owner: Dzahn) [20:55:45] eh, see it [20:55:50] click the change-id :p [20:56:25] there's a cherry-pick button for that, it leaves a nice link to original rev [20:56:55] I used that... [20:57:05] Patch Set 1: Cherry Picked [20:57:05] This patchset was cherry picked to change: Ie9a6b3f017912d0f3493da09a267cf32852392af [20:58:14] RECOVERY - RAID on analytics1003 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [20:59:11] (PS8) Andrew Bogott: Add wikitech config.
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 [21:02:18] (PS1) Gage: Remove special-case test, enable Hadoop->Logstash in the role [operations/puppet] - https://gerrit.wikimedia.org/r/156185 [21:05:01] (CR) Ottomata: [C: 2] Remove special-case test, enable Hadoop->Logstash in the role [operations/puppet] - https://gerrit.wikimedia.org/r/156185 (owner: Gage) [21:05:04] PROBLEM - Unmerged changes on repository puppet on virt0 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [21:06:39] (PS2) Gage: Remove special-case test, enable Hadoop->Logstash in the role [operations/puppet] - https://gerrit.wikimedia.org/r/156185 [21:07:01] (CR) BryanDavis: Add wikitech config. (6 comments) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 (owner: Andrew Bogott) [21:07:45] (PS4) Dzahn: gitblit apache template - retab [operations/puppet] - https://gerrit.wikimedia.org/r/155829 [21:08:14] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:08:25] (CR) Gage: [C: 2] Remove special-case test, enable Hadoop->Logstash in the role [operations/puppet] - https://gerrit.wikimedia.org/r/156185 (owner: Gage) [21:08:54] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:09:14] (CR) BryanDavis: Add wikitech config.
(1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 (owner: Andrew Bogott) [21:11:09] (CR) Dzahn: [C: 2] "just the clean-up on top of the recent change, as promised" [operations/puppet] - https://gerrit.wikimedia.org/r/155829 (owner: Dzahn) [21:13:24] RECOVERY - Disk space on elastic1014 is OK: DISK OK [21:23:30] (PS4) RobH: setup new mgmt ip for install2001 in codfw [operations/dns] - https://gerrit.wikimedia.org/r/156181 [21:26:37] (PS1) Dzahn: git,svn - remove duplicate NameVirtualHost *:80 [operations/puppet] - https://gerrit.wikimedia.org/r/156191 [21:28:42] (CR) Mark Bergsma: [C: 1] "It's not a separate subnet, just a separate $ORIGIN inside the same subnet. But that's fine." [operations/dns] - https://gerrit.wikimedia.org/r/156181 (owner: RobH) [21:29:18] (CR) RobH: [C: 2] "Indeed, my terminology is incorrect, thx for review!" [operations/dns] - https://gerrit.wikimedia.org/r/156181 (owner: RobH) [21:29:30] (CR) Dzahn: "root@antimony:/etc/apache2# grep -r "NameVirtualHost \*:80" *" [operations/puppet] - https://gerrit.wikimedia.org/r/156191 (owner: Dzahn) [21:34:21] (CR) Jeremyb: "you pick. just make sure it's somewhere with the /unified/ cert.
(not star)" [operations/dns] - https://gerrit.wikimedia.org/r/154222 (https://bugzilla.wikimedia.org/44731) (owner: Jeremyb) [21:36:41] (PS1) Rush: wmfusercontent.org is for user data [operations/dns] - https://gerrit.wikimedia.org/r/156196 [21:37:53] (CR) Dzahn: "usually (so far), redirects are on the main cluster, they are redirects.conf/redirects.dat which is now _not_ apache-config repo anymore, " [operations/dns] - https://gerrit.wikimedia.org/r/154222 (https://bugzilla.wikimedia.org/44731) (owner: Jeremyb) [21:39:49] (CR) Rush: "WIP need to consult bblack" [operations/dns] - https://gerrit.wikimedia.org/r/156196 (owner: Rush) [21:40:48] (CR) Dzahn: [C: -1] wmfusercontent.org is for user data (1 comment) [operations/dns] - https://gerrit.wikimedia.org/r/156196 (owner: Rush) [21:42:30] (CR) BBlack: "^ What he said :)" [operations/dns] - https://gerrit.wikimedia.org/r/156196 (owner: Rush) [21:43:45] (PS1) Ottomata: Temporarily disable webstatscollector puppetization on analytics1003 [operations/puppet] - https://gerrit.wikimedia.org/r/156198 [21:44:15] (CR) Ottomata: [C: 2 V: 2] Temporarily disable webstatscollector puppetization on analytics1003 [operations/puppet] - https://gerrit.wikimedia.org/r/156198 (owner: Ottomata) [22:06:27] (PS9) Andrew Bogott: Add wikitech config. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 [22:06:31] (CR) jenkins-bot: [V: -1] Add wikitech config. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 (owner: Andrew Bogott) [22:09:55] (CR) Dzahn: [C: 2] noc.wm.org - use ssl_ciphersuite [operations/puppet] - https://gerrit.wikimedia.org/r/153976 (owner: Dzahn) [22:13:19] (PS10) Andrew Bogott: Add wikitech config.
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 [22:13:24] (CR) Dzahn: "nop" [operations/puppet] - https://gerrit.wikimedia.org/r/153976 (owner: Dzahn) [22:13:47] (CR) jenkins-bot: [V: -1] Add wikitech config. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 (owner: Andrew Bogott) [22:16:31] (PS1) Ori.livneh: mediawiki: one action per line in rsyslog config [operations/puppet] - https://gerrit.wikimedia.org/r/156199 [22:16:39] (PS11) Andrew Bogott: Add wikitech config. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 [22:18:50] (CR) Greg Grossmeier: [C: 1] add twentyafterfour to deployment group [operations/puppet] - https://gerrit.wikimedia.org/r/156177 (owner: 20after4) [22:20:25] (PS12) Andrew Bogott: Add wikitech config. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/155789 [22:22:19] (CR) Dzahn: [C: 2] "yep, thanks! identical to existing upstream file" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/155732 (https://bugzilla.wikimedia.org/69747) (owner: Aklapper) [22:22:38] (PS1) RobH: setting public ip addresses for install2001 [operations/dns] - https://gerrit.wikimedia.org/r/156200 [22:23:53] (CR) RobH: [C: 1] "looks good to me but I need a second set of eyes" [operations/dns] - https://gerrit.wikimedia.org/r/156200 (owner: RobH) [22:28:02] (PS1) QChris: Add replication lag settings [operations/puppet/wikimetrics] - https://gerrit.wikimedia.org/r/156201 [22:30:26] (PS1) QChris: Configure wikimetrics' replication lag checking [operations/puppet] - https://gerrit.wikimedia.org/r/156202 [22:31:03] (CR) jenkins-bot: [V: -1] Configure wikimetrics' replication lag checking [operations/puppet] - https://gerrit.wikimedia.org/r/156202 (owner: QChris) [22:31:57] (CR) QChris: Add replication lag settings (1 comment) [operations/puppet/wikimetrics] -
https://gerrit.wikimedia.org/r/156201 (owner: QChris) [22:33:17] (CR) Dzahn: [V: 2] "yep, thanks! identical to existing upstream file" [wikimedia/bugzilla/modifications] - https://gerrit.wikimedia.org/r/155732 (https://bugzilla.wikimedia.org/69747) (owner: Aklapper) [22:33:32] (CR) QChris: "(Jenkins' V-1 is expected as the submodule's commit" [operations/puppet] - https://gerrit.wikimedia.org/r/156202 (owner: QChris) [22:35:21] (CR) QChris: "@milimetric:" [operations/puppet] - https://gerrit.wikimedia.org/r/156202 (owner: QChris) [22:37:43] (CR) Gage: [C: 1] setting public ip addresses for install2001 [operations/dns] - https://gerrit.wikimedia.org/r/156200 (owner: RobH) [22:38:10] (CR) RobH: [C: 2] setting public ip addresses for install2001 [operations/dns] - https://gerrit.wikimedia.org/r/156200 (owner: RobH) [22:39:04] (CR) Ori.livneh: [C: 2] "tested; straightforward" [operations/puppet] - https://gerrit.wikimedia.org/r/156199 (owner: Ori.livneh) [22:39:53] (CR) BryanDavis: mediawiki: one action per line in rsyslog config (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/156199 (owner: Ori.livneh) [22:40:30] bd808: i'll fix in a follow-up [22:41:21] ori: coolio. I think beta wants those routed to deployment-bastion but I may be wrong [22:42:02] bd808: i need to rebase https://gerrit.wikimedia.org/r/#/c/154710/ ; i provisioned a 'udplog' instance in labs so the udplog service alias should work on labs and production both [22:42:14] so i'll actually be hard-coding it [22:42:36] cheeky [22:44:07] ori: Do you think/know if that will fix the long standing bug in beta where udp2log comes up at boot on deployment-bastion running the wrong config and has to be manually restarted?
[22:44:35] ori: https://bugzilla.wikimedia.org/show_bug.cgi?id=38995 [22:45:15] no, but i might try to fix that anyway [22:45:17] thanks for flagging it [22:45:27] "no" as in: it won't fix that [22:45:30] cool! fix all the dumb things [22:47:17] I have a question or two regarding how you are configuring memcached and redis guys [22:47:33] bd808 told me that you are using both redis and memcached and nutcracker [22:47:45] I was wondering how you recommend to set them up [22:48:22] I'm in https://git.wikimedia.org/tree/operations%2Fpuppet/54b68e4bc24ee1f7724c9c46c89ba991780d6e27/modules and searching around for it :) [22:48:52] (CR) QChris: [C: -1] gerrit: allow . in Jenkins jobs names (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/156103 (owner: Hashar) [22:54:15] PROBLEM - HTTP 5xx req/min on labmon1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [22:54:54] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [22:58:22] (PS1) Vogone: Add 'movefile' to 'eliminator' user group on jawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156206 (https://bugzilla.wikimedia.org/70007) [23:00:11] I'll do the swat [23:00:31] thanks [23:00:55] renoirb: https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/memcached.pp & https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/redisdb.pp [23:01:52] (CR) Rxy: [C: 1] "Thanks for send a patch. :)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156206 (https://bugzilla.wikimedia.org/70007) (owner: Vogone) [23:05:58] awight, is there a core commit for your change? [23:06:13] MaxSem: ah thx, no lemme prepare one now. [23:06:54] MaxSem: wmf17 and wmf18, right?
[23:07:14] RECOVERY - HTTP 5xx req/min on labmon1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:07:17] awight, depends on where you want it:) [23:07:28] Krenair, about to deploy your change [23:07:48] ok [23:07:55] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [23:09:49] MaxSem: ok, I've prepared https://gerrit.wikimedia.org/r/#/c/156208/ and https://gerrit.wikimedia.org/r/#/c/156209/ [23:09:55] Shall I CR+2 them? [23:10:56] k thx [23:15:11] !log maxsem Synchronized php-1.24wmf18/includes/htmlform/HTMLCheckField.php: https://gerrit.wikimedia.org/r/#/c/156015/ (duration: 00m 05s) [23:15:17] Logged the message, Master [23:15:34] Krenair, ^^ please test [23:15:53] Looks good to me [23:16:23] thanks [23:17:05] (PS1) RobH: updating install-server module for new codfw rows and install2001 params [operations/puppet] - https://gerrit.wikimedia.org/r/156210 [23:17:08] !log maxsem Synchronized php-1.24wmf18/extensions/CentralNotice/: https://gerrit.wikimedia.org/r/#/c/156188/ (duration: 00m 04s) [23:17:15] Logged the message, Master [23:17:55] awight, ^^ please test while I'm doing wmf17 [23:18:15] (CR) Dzahn: [C: 2] Remove brion from the "dataset-admin" group [operations/puppet] - https://gerrit.wikimedia.org/r/153034 (owner: Hoo man) [23:19:04] !log maxsem Synchronized php-1.24wmf17/extensions/CentralNotice/: https://gerrit.wikimedia.org/r/#/c/156188/ (duration: 00m 05s) [23:19:10] Logged the message, Master [23:20:17] MaxSem: jfyi, I cannot test wmf18 because the deployed wikis don't have CentralNotice infrastructure mode. [23:20:46] awight, 17 is done too now [23:21:13] MaxSem: ok, testing now. [23:21:16] AndyRussG: ^^ [23:22:32] MaxSem: we need a i18n msg sync--I guess that happens automatically, later?
[23:23:05] eh, I can scap now [23:23:28] !log maxsem Started scap: SWAT: CentralNotice update [23:23:35] Logged the message, Master [23:25:16] (PS2) Rush: wmfusercontent.org is for user data [operations/dns] - https://gerrit.wikimedia.org/r/156196 [23:30:18] (CR) Dzahn: [C: 1] "lgtm, except you'll probably never need the "langlist" stuff, that is to create all the actual language names in wiki projects" [operations/dns] - https://gerrit.wikimedia.org/r/156196 (owner: Rush) [23:30:58] good point [23:31:35] (PS3) Rush: wmfusercontent.org is for user data [operations/dns] - https://gerrit.wikimedia.org/r/156196 [23:32:14] (PS2) RobH: updating install-server module for new codfw rows and install2001 params second ps: fixing two ip address mistakes/typos from bblack's review of patchset (thanks!) [operations/puppet] - https://gerrit.wikimedia.org/r/156210 [23:32:39] (CR) Dzahn: [C: 1] wmfusercontent.org is for user data [operations/dns] - https://gerrit.wikimedia.org/r/156196 (owner: Rush) [23:32:54] chasemp: do we actually have the domain already? [23:33:00] no [23:33:02] <^d> I thought we were going to do people.wm.o [23:33:13] people was for the public_html dirs ? [23:33:22] this is for attachments on bugtracker [23:33:22] <^d> This isn't the same? [23:33:29] i guess no, but it could? [23:33:37] <^d> Oh. Why does it have to be a new top level? [23:33:37] RobH: they emailed it was purchased? [23:33:41] no wait, the point is to have separate domain [23:33:42] <^d> We use bz-attachment.wm.o [23:33:46] not just subdomain [23:33:52] wmfusercontent.org, cool. (like google, flickr and github have, too). Will upload.* move there as well? [23:33:56] ^d: javascript being able to set the parent as origin [23:34:03] chasemp: not that i know of [23:34:03] <^d> hmm, ok [23:34:05] means all kinds of security bugaloo [23:34:14] RobH: I saw they did, did it not hit the ticket?
[23:34:31] yeah, javascript can set cookies and localStorage for *.parent.domain [23:34:31] oh, cool, i dunno [23:34:34] im in installs erver land [23:34:36] Hello, [23:34:36] This name is now registered. [23:34:36] Domain Name:WMFUSERCONTENT.ORG [23:34:52] chasemp: cool [23:34:55] ignore meeeee [23:34:59] i see that now,sorry [23:35:08] (I was just shocked cuz i sent the email a couple of hours ago) [23:35:11] ^d: I pestered csteipp about it :) basically this is why google uses a separate top level for this kind of thing, etc [23:35:18] yeah dude they were quick, love it [23:35:26] <^d> Yeah, I knew it's why we used a separate subdomain. [23:35:44] <^d> Guess a full domain is doubleplusgood. [23:35:46] ^d: per csteipp, completely separate domain is even more secure [23:35:49] <^d> :) [23:35:51] <^d> Yeah [23:35:52] ok yep so that was my thought as well, but I guess still leaves the door open for a bunch of attacks [23:36:26] Though *.domain cookies/storage is relatively low risk since those have lower presedence than more specific values (e.g. setting a cookie for *.wikimedia.org doesn't overwrite meta.wikimedia.org, and cookies for meta.wikimedia.org are not visible on different.wikimedia.org) - and we don't set auth cookies on *.wikimedia.org anyway because of private wikis and chapter wikis [23:36:27] I realized previouse job we used .net for content and .com for serving http, which served same purpose [23:36:43] <^d> All makes sense. I was also just confused with this and people.wm.o. Separate things! 
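Editorial note: the cookie-scoping point in the exchange above can be illustrated with a minimal sketch of cookie domain matching, loosely following RFC 6265 section 5.1.3. The function name `domain_matches` and the `host_only` flag are hypothetical, introduced only for this illustration; the sketch ignores IP-address hosts and public-suffix checks, and says nothing about which cookie wins when several match.

```python
def domain_matches(request_host: str, cookie_domain: str, host_only: bool) -> bool:
    """Return True if a cookie scoped to cookie_domain would be sent to request_host.

    host_only=True models a cookie set without a Domain attribute:
    it is returned only to the exact host that set it.
    """
    request_host = request_host.lower()
    cookie_domain = cookie_domain.lower().lstrip(".")
    if request_host == cookie_domain:
        return True
    if host_only:
        return False
    # A Domain cookie also matches every subdomain of cookie_domain.
    return request_host.endswith("." + cookie_domain)


# A cookie scoped to the parent domain is visible on every subdomain.
parent_on_meta = domain_matches("meta.wikimedia.org", "wikimedia.org", host_only=False)
# A host-only cookie for one subdomain is not visible on a sibling.
meta_on_sibling = domain_matches("different.wikimedia.org", "meta.wikimedia.org", host_only=True)
# A separate registered domain shares nothing with wikimedia.org.
parent_on_usercontent = domain_matches("phab.wmfusercontent.org", "wikimedia.org", host_only=False)
print(parent_on_meta, meta_on_sibling, parent_on_usercontent)
```

Under these assumptions, content served from a subdomain can still reach cookies scoped to the parent, while content on a separate domain such as wmfusercontent.org cannot.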
[23:36:51] ah yes [23:36:56] people.wm still cool for public_html, yep [23:37:03] the fenari replacement that is [23:37:09] (CR) BBlack: [C: 1] updating install-server module for new codfw rows and install2001 params second ps: fixing two ip address mistakes/typos from bblack's revie [operations/puppet] - https://gerrit.wikimedia.org/r/156210 (owner: RobH) [23:37:19] <^d> "This webpage is not available" :( [23:37:21] shell users should just not upload evil stuff :p [23:37:37] <^d> Not setup yet? [23:37:46] changeset is still pending [23:37:54] also, nothing planned for top level [23:37:54] wmfusercontent? that was just even the DNS change to add a zone on our side [23:38:00] (CR) RobH: [C: 2] updating install-server module for new codfw rows and install2001 params second ps: fixing two ip address mistakes/typos from bblack's revie [operations/puppet] - https://gerrit.wikimedia.org/r/156210 (owner: RobH) [23:38:31] (PS4) Rush: wmfusercontent.org is for user data [operations/dns] - https://gerrit.wikimedia.org/r/156196 [23:38:38] (CR) Rush: [C: 2 V: 2] wmfusercontent.org is for user data [operations/dns] - https://gerrit.wikimedia.org/r/156196 (owner: Rush) [23:38:48] <^d> chasemp: At least having a landing page saying what the subdomain is for would be nice. [23:39:01] 'tis true [23:39:20] "this web page is to satisfy the twitch in ^d's eye. please see subdomains" :) [23:39:25] meh, if they wanna know they can download the puppet repo and parse it! [23:39:45] <^d> I'm not even saying do anything fancy like account/directory listing. [23:40:00] <^d> Just a "This domain serves shell users' public_html directories" [23:40:04] i voted to kill the userdirectories html stuff [23:40:05] <^d> Better than a 404 :) [23:40:10] ;] [23:40:13] * RobH is no tron. [23:40:30] apergos uploaded puppet stuff for the public html's [23:40:55] hm, is SWAT still in progress?
[23:41:12] RobH: we got it done sub-1-day man thanks [23:41:19] (CR) Chad: [C: 1] git,svn - remove duplicate NameVirtualHost *:80 [operations/puppet] - https://gerrit.wikimedia.org/r/156191 (owner: Dzahn) [23:41:20] chasemp: did you run the update on ns0 already? [23:41:26] yes [23:42:05] <^d> https://gerrit.wikimedia.org/r/#/c/149890/ [23:42:06] (CR) TTO: "Forgive my ignorance, but how come we're not using uselang anymore? If I go to, say, frwikinews and click "Importer un fichier", I get sen" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/154448 (https://bugzilla.wikimedia.org/69055) (owner: TTO) [23:42:44] chasemp: eh, that won't work [23:42:49] phab.wmfusercontent.org. 3600 IN CNAME misc-web-lb.eqiad.wmfusercontent.org. [23:42:55] we did not mean that [23:43:05] dhcpd log is full of cruft now [23:43:05] hmmmm true [23:43:06] we meant misc-web-lb in eqiad [23:43:12] since codfw network joined [23:43:40] bd808|BUFFER: apache2 log on fluorine is back [23:44:12] hm, is SWAT still in progress? [23:44:15] * Vogone hides [23:44:19] jouncebot: status [23:46:03] (CR) Krinkle: [C: -1] public_html directory service, see RT #6862 (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn) [23:46:25] jouncebot is still dead? hah [23:46:26] I'm asking because the change I've scheduled (https://wikitech.wikimedia.org/wiki/Deployments#Week_of_Aug_25th) is the only one nobody has dealt with yet … from tomorrow on I'll be away for a couple of weeks, so someone else would need to take care of that change [23:46:36] (PS1) Dzahn: wmfusercontent.org - use misc-web-lb in eqiad [operations/dns] - https://gerrit.wikimedia.org/r/156212 [23:46:40] mutante: is there a better way to do that in the syntax? or resort to A record? [23:46:53] chasemp: i think it's valid like so ^ [23:46:58] bblack: ?
[23:46:59] ori: SWAT question ^ [23:47:02] MaxSem: SWAT question ^ [23:47:39] that won't end up as misc-web-lb.eqiad.wikimedia.org.wmfusercontent.org [23:47:54] does it look for a valid tld and then assume no magic? [23:47:56] (CR) Jeremyb: [C: -1] wmfusercontent.org - use misc-web-lb in eqiad (1 comment) [operations/dns] - https://gerrit.wikimedia.org/r/156212 (owner: Dzahn) [23:48:15] (PS2) Dzahn: wmfusercontent.org - use misc-web-lb in eqiad [operations/dns] - https://gerrit.wikimedia.org/r/156212 [23:48:50] (PS3) Dzahn: wmfusercontent.org - use misc-web-lb in eqiad [operations/dns] - https://gerrit.wikimedia.org/r/156212 [23:49:11] (PS3) Chad: gerrit - use apache::site [operations/puppet] - https://gerrit.wikimedia.org/r/153849 (owner: Dzahn) [23:49:13] (PS1) Chad: Gerrit: Remove no_apache parameter, unused [operations/puppet] - https://gerrit.wikimedia.org/r/156213 [23:49:49] jeremyb: was the dot -1 a syntax error? [23:50:06] MaxSem: ok, our CentralNotice testing is complete and everything looks good. Thanks! [23:50:31] chasemp, what do you mean? [23:51:06] the -1 you put on /r/156212 curious if that was a syntax error or a human error since the linter didn't complain [23:51:08] yea, trailing dot is true [23:51:22] it wouldn't stop the zone from being served [23:51:31] but it would point to the wrong place [23:51:45] gotcha, that was what I meant, I understand now thanks [23:51:46] misc-web-lb.eqiad.wikimedia.org.wmfusercontent.org vs. misc-web-lb.eqiad.wikimedia.org [23:51:52] the linter should catch, if we can [23:53:26] !log maxsem Finished scap: SWAT: CentralNotice update (duration: 29m 58s) [23:53:32] Logged the message, Master [23:54:13] MatmaRex: what was the swat question?
[23:54:22] (PS2) Chad: public_html directory service, see RT #6862 [operations/puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn) [23:54:30] (CR) Chad: public_html directory service, see RT #6862 (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn) [23:54:32] (CR) jenkins-bot: [V: -1] public_html directory service, see RT #6862 [operations/puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn) [23:54:34] ori: [01:44] hm, is SWAT still in progress? [23:54:37] ori: [01:46] I'm asking because the change I've scheduled (https://wikitech.wikimedia.org/wiki/Deployments#Week_of_Aug_25th) is the only one nobody has dealt with yet … from tomorrow on I'll be away for a couple of weeks, so someone else would need to take care of that change [23:54:48] oh sure, i'll deploy it [23:55:03] which change is it exactly? [23:55:16] https://gerrit.wikimedia.org/r/#/q/156206,n,z , got it [23:55:30] (PS1) Dzahn: add people.wm.org -> misc varnish, public_html's [operations/dns] - https://gerrit.wikimedia.org/r/156214 [23:55:41] (PS3) Chad: public_html directory service, see RT #6862 [operations/puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn) [23:55:43] (CR) Rush: [C: 1] "assuming this syntax works, seems like what we want :)" [operations/dns] - https://gerrit.wikimedia.org/r/156212 (owner: Dzahn) [23:56:00] mutante: I would roll out w/ it? worst case it's wrong again and revert and wait for bblack [23:56:04] <^d> mutante: PS3 has it rebased, with the index.html fixes Krinkle requested. [23:56:15] <^d> If you could do the apache::site change, I think it'll be good to go.
[23:56:45] (CR) Ori.livneh: [C: 2] Add 'movefile' to 'eliminator' user group on jawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156206 (https://bugzilla.wikimedia.org/70007) (owner: Vogone) [23:56:51] (Merged) jenkins-bot: Add 'movefile' to 'eliminator' user group on jawiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/156206 (https://bugzilla.wikimedia.org/70007) (owner: Vogone) [23:57:12] MatmaRex: thanks for the ping [23:57:28] chasemp: yea, agree [23:58:05] !log ori Synchronized wmf-config/InitialiseSettings.php: Add 'movefile' to 'eliminator' user group on jawiki (duration: 00m 03s) [23:58:11] Logged the message, Master [23:58:12] Vogone: ^^ [23:58:18] (CR) Krinkle: public_html directory service, see RT #6862 (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/149890 (owner: ArielGlenn) [23:58:18] chasemp: "The canonical name that a CNAME record points to can be anywhere in the DNS, whether local or on a remote server in a different DNS zone." [23:58:30] ori: thanks, works for me :) [23:58:38] mutante: seems good to me [23:59:03] (CR) Dzahn: [C: 2] ""The canonical name that a CNAME record points to can be anywhere in the DNS, whether local or on a remote server in a different DNS zone." [operations/dns] - https://gerrit.wikimedia.org/r/156212 (owner: Dzahn) [23:59:23] hrmm
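Note on the trailing-dot point from the 23:47-23:51 exchange: a minimal sketch of how a CNAME target gets qualified, assuming the standard zone-file convention (RFC 1035) that a name without a trailing dot is relative to the zone origin. The names `qualify` and `ORIGIN` are hypothetical and only for illustration; this is not the operations/dns tooling or its linter.

```python
def qualify(name: str, origin: str) -> str:
    """Return the fully qualified form of a zone-file name.

    Assumption: standard master-file semantics, where a name ending
    in "." is absolute and any other name has the origin appended.
    """
    if not origin.endswith("."):
        origin += "."
    if name == "@":
        # "@" stands for the origin itself.
        return origin
    if name.endswith("."):
        # Already absolute: leave untouched.
        return name
    # Relative name: the zone origin is appended.
    return f"{name}.{origin}"


# Hypothetical origin for the zone discussed in the log.
ORIGIN = "wmfusercontent.org."

without_dot = qualify("misc-web-lb.eqiad.wikimedia.org", ORIGIN)
with_dot = qualify("misc-web-lb.eqiad.wikimedia.org.", ORIGIN)

print(without_dot)
print(with_dot)
```

Both forms are syntactically valid, which would explain why the zone still loads and a linter may not complain; only the form with the trailing dot leaves the zone and points at the wikimedia.org name.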