[00:03:21] (CR) Dzahn: [C: -1] "not yet like this.. uhm" [puppet] - https://gerrit.wikimedia.org/r/169265 (owner: Dzahn)
[00:09:53] (PS1) Gage: hadoop: avoid ganglia diskstat duplicate delaration [puppet] - https://gerrit.wikimedia.org/r/169275
[00:15:00] paravoid: smtp is wiki-mail.wikimedia.org
[00:15:45] sorry, I jumped into a conversation
[00:15:53] ok, can we changed it to polonium.wikimedia.org?
[00:17:24] (PS2) Dzahn: Revert "contacts/outreach - move to module" [puppet] - https://gerrit.wikimedia.org/r/169261
[00:18:15] change*
[00:18:28] (CR) Dzahn: [C: 2] "yep, i don't see the reason for the mystery error, but can't just leave it broken.. shrug" [puppet] - https://gerrit.wikimedia.org/r/169261 (owner: Dzahn)
[00:18:54] and I suppose there's no easy to let puppet handle that...
[00:21:16] (PS1) Dzahn: Revert "contacts - fix template path for module" [puppet] - https://gerrit.wikimedia.org/r/169276
[00:21:20] (CR) jenkins-bot: [V: -1] Revert "contacts - fix template path for module" [puppet] - https://gerrit.wikimedia.org/r/169276 (owner: Dzahn)
[00:22:53] (PS2) Dzahn: Revert "contacts - fix template path for module" [puppet] - https://gerrit.wikimedia.org/r/169276
[00:22:58] (CR) jenkins-bot: [V: -1] Revert "contacts - fix template path for module" [puppet] - https://gerrit.wikimedia.org/r/169276 (owner: Dzahn)
[00:25:34] (PS1) Dzahn: Revert "contacts - fix template path for module" [puppet] - https://gerrit.wikimedia.org/r/169278
[00:25:40] (CR) jenkins-bot: [V: -1] Revert "contacts - fix template path for module" [puppet] - https://gerrit.wikimedia.org/r/169278 (owner: Dzahn)
[00:25:58] (CR) Dzahn: "'stab'!!!" [puppet] - https://gerrit.wikimedia.org/r/169276 (owner: Dzahn)
[00:28:53] (CR) Dzahn: "i wish we could just delete the entire thing!! it's not more than an Apache site anyways..
as if this is puppetized :p" [puppet] - https://gerrit.wikimedia.org/r/169278 (owner: Dzahn)
[00:30:38] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 10, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 105, initializing_shards: 1, number_of_data_nodes: 3
[00:31:11] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 10, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 105, initializing_shards: 1, number_of_data_nodes: 3
[00:31:18] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 10, timed_out: False, active_primary_shards: 42, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 105, initializing_shards: 1, number_of_data_nodes: 3
[00:31:39] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: Connection timed out
[00:32:58] <^d> pool5 is the only one that's been complaining today since falling back.
[00:33:26] (PS2) Dzahn: Revert "contacts - fix template path for module" [puppet] - https://gerrit.wikimedia.org/r/169278
[00:33:59] <^d> lol, there is no pool5?
[00:34:10] RECOVERY - LVS Lucene on search-pool5.svc.eqiad.wmnet is OK: TCP OK - 0.999 second response time on port 8123
[00:34:18] (CR) Dzahn: [C: 2] Revert "contacts - fix template path for module" [puppet] - https://gerrit.wikimedia.org/r/169278 (owner: Dzahn)
[00:35:19] <^d> pool5 half exists, 19 and 20 are in it.
[00:35:30] <^d> Also in pool4, if you believe lucene.pp
[00:36:38] RECOVERY - puppet last run on zirconium is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[00:38:11] (CR) Dzahn: [C: -2] "crap" [puppet] - https://gerrit.wikimedia.org/r/169276 (owner: Dzahn)
[00:38:14] (Abandoned) Dzahn: Revert "contacts - fix template path for module" [puppet] - https://gerrit.wikimedia.org/r/169276 (owner: Dzahn)
[00:40:11] (CR) Dzahn: "i think i changed my mind about touching fr logging classes" [puppet] - https://gerrit.wikimedia.org/r/168723 (owner: Dzahn)
[00:40:20] (Abandoned) Dzahn: fundraising logging - move out of misc [puppet] - https://gerrit.wikimedia.org/r/168723 (owner: Dzahn)
[00:41:28] (PS1) Spage: Allow wgMainCacheType setting to be overridden [puppet] - https://gerrit.wikimedia.org/r/169288 (https://bugzilla.wikimedia.org/72600)
[00:42:09] (PS2) Faidon Liambotis: exim: blackhole noreply@phabricator emails [puppet] - https://gerrit.wikimedia.org/r/169008 (owner: Rush)
[00:42:11] (PS2) Faidon Liambotis: mail: kill role::mail::oldmx, not used anymore [puppet] - https://gerrit.wikimedia.org/r/169264
[00:42:13] (PS6) Faidon Liambotis: mail: remove secondary MX role from sodium [puppet] - https://gerrit.wikimedia.org/r/143887
[00:42:15] (PS2) Faidon Liambotis: exim: kill mail.wikipedia.org/mail.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/169263
[00:42:17] (PS1) Faidon Liambotis: phabricator: s/noreply/no-reply/ [puppet] - https://gerrit.wikimedia.org/r/169290
[00:42:19] (PS1) Faidon Liambotis: phabricator: don't hardcode mail smarthost [puppet] - https://gerrit.wikimedia.org/r/169291
[00:42:24] paravoid: on no-reply vs noreply :) pedant to your hearts content my good sir. I have no real love for either / any
[00:42:42] but it has to match in role::phab::[main|?]
[00:42:49] see above
[00:42:54] both of those commits, please review :)
[00:43:35] !log ori Synchronized php-1.25wmf5/extensions/WikimediaEvents/WikimediaEventsHooks.php: I4adffaa26: Actually unset the HHVM cookie (duration: 00m 03s)
[00:43:39] !log ori Synchronized php-1.25wmf4/extensions/WikimediaEvents/WikimediaEventsHooks.php: I4adffaa26: Actually unset the HHVM cookie (duration: 00m 03s)
[00:43:41] Logged the message, Master
[00:43:46] Logged the message, Master
[00:44:09] (CR) Rush: [C: 1] "awesome" [puppet] - https://gerrit.wikimedia.org/r/169291 (owner: Faidon Liambotis)
[00:44:19] (PS3) Faidon Liambotis: exim: blackhole noreply@phabricator emails [puppet] - https://gerrit.wikimedia.org/r/169008 (owner: Rush)
[00:44:21] (PS3) Faidon Liambotis: mail: kill role::mail::oldmx, not used anymore [puppet] - https://gerrit.wikimedia.org/r/169264
[00:44:23] (PS7) Faidon Liambotis: mail: remove secondary MX role from sodium [puppet] - https://gerrit.wikimedia.org/r/143887
[00:44:25] (PS3) Faidon Liambotis: exim: kill mail.wikipedia.org/mail.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/169263
[00:44:27] (PS2) Faidon Liambotis: phabricator: don't hardcode mail smarthost [puppet] - https://gerrit.wikimedia.org/r/169291
[00:44:29] (PS2) Faidon Liambotis: phabricator: s/noreply/no-reply/ [puppet] - https://gerrit.wikimedia.org/r/169290
[00:45:05] (CR) Rush: [C: 1] "seems good" [puppet] - https://gerrit.wikimedia.org/r/169290 (owner: Faidon Liambotis)
[00:45:42] "An error has occurred while searching: HTTP request timed out." - search on Commons. Very flaky tonight.
[00:46:13] (CR) Faidon Liambotis: [C: 2] mail: kill role::mail::oldmx, not used anymore [puppet] - https://gerrit.wikimedia.org/r/169264 (owner: Faidon Liambotis)
[00:46:37] (CR) Faidon Liambotis: [C: 2] exim: kill mail.wikipedia.org/mail.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/169263 (owner: Faidon Liambotis)
[00:47:14] (CR) Faidon Liambotis: [C: 2] exim: blackhole noreply@phabricator emails [puppet] - https://gerrit.wikimedia.org/r/169008 (owner: Rush)
[00:47:34] (CR) Faidon Liambotis: [C: 2] phabricator: s/noreply/no-reply/ [puppet] - https://gerrit.wikimedia.org/r/169290 (owner: Faidon Liambotis)
[00:48:04] chasemp: I'm not sure at all that the "smtp1;smtp2" syntax will work
[00:48:22] https://github.com/PHPMailer/PHPMailer seems to suggest so
[00:48:31] I can merge, but can you test somehow?
[00:48:43] ah well we shall soon find out then
[00:48:47] sure
[00:48:51] (CR) Faidon Liambotis: [C: 2] phabricator: don't hardcode mail smarthost [puppet] - https://gerrit.wikimedia.org/r/169291 (owner: Faidon Liambotis)
[00:49:22] (CR) Dzahn: "same as class misc::statistics::researchdb_password ? wanna remove the old class then?" [puppet] - https://gerrit.wikimedia.org/r/168993 (owner: Ottomata)
[00:49:37] (PS1) Chad: Stop using lsearchd pool 5 [mediawiki-config] - https://gerrit.wikimedia.org/r/169294
[00:49:39] (PS1) Chad: Decom lsearchd pool 5 [puppet] - https://gerrit.wikimedia.org/r/169295
[00:51:08] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 310 seconds
[00:51:38] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 341 seconds
[00:52:05] paravoid: both changes landed, seems good I restarted mailer daemons (phab side) and tested delivery
[00:52:34] <^d> NotASpy: Search has been flakey all day, almost fixed I hope.
[00:52:46] <^d> Should have commons & others back on Cirrus shortly.
[00:52:47] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds
[00:53:03] ^d: brilliant, many thanks
[00:53:10] <^d> you're welcome
[00:53:15] (PS2) Spage: Allow wgMainCacheType setting to be overridden [puppet] - https://gerrit.wikimedia.org/r/169288 (https://bugzilla.wikimedia.org/72600)
[00:53:18] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: Connection timed out
[00:53:19] paravoid: thanks
[00:53:22] <^d> paravoid: my solution to pool 5 lvs warnings is to decom pool 5 ;-)
[00:53:24] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:54:48] gah, ordered_json does not indent
[00:55:00] csteipp_ooo: Could you have a quick look at https://gerrit.wikimedia.org/r/168922 please?
[00:55:03] +1 would be nice
[00:58:02] (CR) Manybubbles: [C: 1] Decom lsearchd pool 5 [puppet] - https://gerrit.wikimedia.org/r/169295 (owner: Chad)
[00:58:09] (CR) Manybubbles: [C: 1] Stop using lsearchd pool 5 [mediawiki-config] - https://gerrit.wikimedia.org/r/169294 (owner: Chad)
[00:58:19] (CR) CSteipp: [C: 1] Only add the "oauthadmin" group on the central wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/168922 (owner: Hoo man)
[00:58:36] RECOVERY - LVS Lucene on search-pool5.svc.eqiad.wmnet is OK: TCP OK - 0.003 second response time on port 8123
[00:58:41] thanks :)
[00:58:52] (CR) EBernhardson: [C: 1] "I'm wondering if anything should override the users LocalSettings.php?
If i tell the app to be configured a certain way i want it to just" [puppet] - https://gerrit.wikimedia.org/r/169288 (https://bugzilla.wikimedia.org/72600) (owner: Spage)
[01:00:50] (CR) Chad: [C: 2] Reenable cirrus for all users that used to have it [mediawiki-config] - https://gerrit.wikimedia.org/r/169219 (owner: Manybubbles)
[01:00:58] (Merged) jenkins-bot: Reenable cirrus for all users that used to have it [mediawiki-config] - https://gerrit.wikimedia.org/r/169219 (owner: Manybubbles)
[01:01:54] !log demon Synchronized wmf-config/InitialiseSettings.php: Turn Cirrus back on basically everywhere. If Elasticsearch freaks out again just revert I73ae276e to get back to lsearchd again (duration: 00m 04s)
[01:01:59] Logged the message, Master
[01:03:07] PROBLEM - puppet last run on ssl3003 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:03:52] (PS1) Chad: Remove search-pool5 LVS entries, exists no more [dns] - https://gerrit.wikimedia.org/r/169300
[01:14:14] (CR) Chad: "It's also worth noting that as we fell back to lsearchd today pool5 was the only one complaining via icinga. That further tells me that th" [puppet] - https://gerrit.wikimedia.org/r/169295 (owner: Chad)
[01:14:32] (PS1) Dzahn: add a phabricator check to LVS monitoring [puppet] - https://gerrit.wikimedia.org/r/169303
[01:14:45] can anyone think of other unpuppetized services that would use wiki-mail.wikimedia.org to send email besides Jenkins?
[01:16:32] <^d> Not offhand.
[01:18:30] how is gerrit configured? :)
[01:20:00] <^d> puppet.
[01:20:20] <^d> it's a crappy puppet implementation from early puppet days.
[01:20:22] <^d> but it's puppet.
[01:20:30] good
[01:20:50] oh look at that, I even fixed it
[01:21:00] the mail relay I mean
[01:21:26] RECOVERY - puppet last run on ssl3003 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[01:21:28] PROBLEM - MySQL Replication Heartbeat on db1015 is CRITICAL: CRIT replication delay 325 seconds
[01:22:06] PROBLEM - MySQL Slave Delay on db1015 is CRITICAL: CRIT replication delay 353 seconds
[01:23:06] RECOVERY - MySQL Slave Delay on db1015 is OK: OK replication delay 0 seconds
[01:23:26] 'smtpserver' => 'smtp.pmtpa.wmnet',
[01:23:32] heh, but this cant work
[01:23:46] RECOVERY - MySQL Replication Heartbeat on db1015 is OK: OK replication delay 5 seconds
[01:24:10] where is that?
[01:24:21] bugzilla/data/params
[01:24:28] where?
[01:24:32] but i doubt it's used, or we would have had no BZ mail
[01:24:36] <^d> bz is poorly puppetized still.
[01:24:47] i dont think i agree
[01:25:03] /srv/org/wikimedia/bugzilla/data/params on zirconium
[01:25:05] <^d> Heh, or maybe not.
[01:25:06] <^d> smtp.pmtpa.wmnet
[01:25:09] <^d> in UI
[01:25:22] can you fix in UI? i dont have admin
[01:25:36] <^d> what should it be now?
[01:26:04] i believe polonium.wikimedia.org
[01:26:22] how is it working now?
[01:26:26] <^d> good question
[01:26:28] that's what i wonder
[01:26:30] it probably just sendmail()s
[01:26:38] <^d> Yep
[01:26:43] <^d> That used to be SMTP
[01:26:49] <^d> I see it set to sendmail now
[01:27:00] i think this setting is way back from kaulen
[01:27:01] that's fine then
[01:27:06] and we changed it since zirconium
[01:27:23] ok :) just grepped on the server and found this :p
[01:27:38] :)
[01:28:55] we have a check if misc-web-lb is up in general
[01:29:07] does it make sense to check each backend configured on it as well
[01:29:12] via misc-web-lb itself
[01:30:24] or is that only duplacting things..
and what we should really do is check all the Apaches
[01:37:56] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:37:57] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:37:57] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:37:57] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:37:57] <^d> whee, elasticsearch is green for the first time since i woke up.
[01:37:57] RECOVERY - ElasticSearch health check on elastic1016 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:37:57] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:37:57] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:37:58] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:38:07] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:38:16] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:38:17] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:38:17] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:38:17] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:38:17] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:38:17] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:38:27] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:38:27] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[01:42:03] (PS1) Dzahn: wikistats - add update cron for wikivoyage [puppet] - https://gerrit.wikimedia.org/r/169306
[01:42:36] (CR) Dzahn: [C: 2] wikistats - add update cron for wikivoyage [puppet] - https://gerrit.wikimedia.org/r/169306 (owner: Dzahn)
[01:42:59] (CR) Faidon Liambotis: [C: -1] "Minor inline comment + what Mark said" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/164386 (owner: Mark Bergsma)
[01:43:11] (CR) Dzahn: [V: 2] "@wikistats-petcow:~$ sudo -u wikistatsuser /usr/bin/php /usr/lib/wikistats/update.php wy" [puppet] - https://gerrit.wikimedia.org/r/169306 (owner: Dzahn)
[01:45:04] (CR) Faidon Liambotis: "It could be handled in an upper layer, but I bet you it won't :) I'm also sure these heuristics will also fail badly on several occasions," [puppet] - https://gerrit.wikimedia.org/r/167645 (owner: Alexandros Kosiaris)
[01:45:09] PROBLEM - puppet last run on amssq54 is CRITICAL: CRITICAL: puppet fail
[01:50:16] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%):
[01:59:38] (CR) Faidon Liambotis: [C: -1] "First pass, see inline comments." (11 comments) [debs/python-diamond] - https://gerrit.wikimedia.org/r/168599 (owner: Filippo Giunchedi)
[02:02:29] (Abandoned) Faidon Liambotis: leave only one statsd/carbon-relay CNAME [dns] - https://gerrit.wikimedia.org/r/138568 (owner: Filippo Giunchedi)
[02:03:58] RECOVERY - puppet last run on amssq54 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[02:07:03] (CR) Faidon Liambotis: [C: -1] "This changeset surely has a lot of -1/-2s, we can probably abandon it, no?"
[puppet] - https://gerrit.wikimedia.org/r/151523 (https://bugzilla.wikimedia.org/60238) (owner: Tim Landscheidt)
[02:12:13] (PS3) Faidon Liambotis: Redirect c[sz].wikimedia.org to http://www.wikimedia.cz [puppet] - https://gerrit.wikimedia.org/r/147485 (owner: Reedy)
[02:12:41] (CR) Faidon Liambotis: [C: 2] Redirect c[sz].wikimedia.org to http://www.wikimedia.cz [puppet] - https://gerrit.wikimedia.org/r/147485 (owner: Reedy)
[02:17:06] (PS3) Faidon Liambotis: Point c[sz].wikimedia.org at wikimedia-lb rather than external site [dns] - https://gerrit.wikimedia.org/r/143086 (owner: Reedy)
[02:21:00] (PS1) Faidon Liambotis: Switch cs/cz.wikimedia.org redirect to funnel [puppet] - https://gerrit.wikimedia.org/r/169307
[02:21:29] (CR) Faidon Liambotis: [C: 2 V: 2] Switch cs/cz.wikimedia.org redirect to funnel [puppet] - https://gerrit.wikimedia.org/r/169307 (owner: Faidon Liambotis)
[02:22:33] (CR) Faidon Liambotis: [C: 2] Point c[sz].wikimedia.org at wikimedia-lb rather than external site [dns] - https://gerrit.wikimedia.org/r/143086 (owner: Reedy)
[02:27:17] (PS4) Faidon Liambotis: Switch *.{wap,mobile}.wikipedia.org to wikipedia-lb [dns] - https://gerrit.wikimedia.org/r/98055
[02:31:47] (CR) Faidon Liambotis: "Will also need a Varnish change to kill those redirects (after DNS expires obviously)" [dns] - https://gerrit.wikimedia.org/r/98055 (owner: Faidon Liambotis)
[03:55:46] (PS2) Springle: Increase wgQueryCacheLimit for enwiki and dewiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/168929 (https://bugzilla.wikimedia.org/44321)
[03:56:42] (CR) Springle: [C: 2] Increase wgQueryCacheLimit for enwiki and dewiki. [mediawiki-config] - https://gerrit.wikimedia.org/r/168929 (https://bugzilla.wikimedia.org/44321) (owner: Springle)
[03:56:49] (Merged) jenkins-bot: Increase wgQueryCacheLimit for enwiki and dewiki.
[mediawiki-config] - https://gerrit.wikimedia.org/r/168929 (https://bugzilla.wikimedia.org/44321) (owner: Springle)
[03:59:07] (PS2) Springle: Increase updatequerypages frequency to twice per month for all wikis. [puppet] - https://gerrit.wikimedia.org/r/168933
[04:15:09] (CR) Springle: [C: 2] Increase updatequerypages frequency to twice per month for all wikis. [puppet] - https://gerrit.wikimedia.org/r/168933 (owner: Springle)
[04:48:54] (CR) Springle: "Interested in hearing more about the passwords module. Would it be aimed at formalizing the use of private repo variables? Limited to DB s" [puppet] - https://gerrit.wikimedia.org/r/168993 (owner: Ottomata)
[04:54:55] (Abandoned) BryanDavis: Set wgMemCachedServers to point to nutcracker [mediawiki-config] - https://gerrit.wikimedia.org/r/161005 (owner: BryanDavis)
[06:16:38] (PS1) Faidon Liambotis: geoip: fetch all MaxMind products that we pay for [puppet] - https://gerrit.wikimedia.org/r/169320
[06:26:06] RECOVERY - Disk space on ocg1001 is OK: DISK OK
[06:26:45] RECOVERY - Disk space on ocg1003 is OK: DISK OK
[06:28:17] PROBLEM - puppet last run on db1023 is CRITICAL: CRITICAL: puppet fail
[06:28:27] PROBLEM - puppet last run on amssq47 is CRITICAL: CRITICAL: puppet fail
[06:28:35] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail
[06:28:46] PROBLEM - puppet last run on amssq60 is CRITICAL: CRITICAL: puppet fail
[06:28:55] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: puppet fail
[06:28:56] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: puppet fail
[06:29:16] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: puppet fail
[06:29:26] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: puppet fail
[06:29:37] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:07] PROBLEM - puppet last run on db1040 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:17] PROBLEM - puppet last run on
search1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:17] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:26] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:37] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:37] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:47] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:47] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:56] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:56] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:56] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:06] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:06] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:16] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:18] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:18] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:25] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:36] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:45] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:56] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:35] PROBLEM - puppet last run on ssl3002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:06] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:45:23] RECOVERY - puppet last run on ms-fe2001 is
OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:45:23] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:45:25] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:45:39] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[06:45:45] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[06:45:55] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:45:56] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[06:45:57] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[06:46:11] RECOVERY - puppet last run on db1040 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:46:15] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:46:15] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:15] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:46:16] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:46:25] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 60 seconds ago with 0 failures
[06:46:35] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:46:46] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:46:54] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[06:46:54] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:47:05] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:47:10] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:47:10] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:47:25] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:47:27] RECOVERY - puppet last run on db1023 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:47:35] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:47:45] RECOVERY - puppet last run on amssq47 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[06:47:52] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:48:06] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 66 seconds ago with 0 failures
[06:48:06] RECOVERY - puppet last run on amssq60 is OK: OK: Puppet is currently enabled, last run 68 seconds ago with 0 failures
[06:50:26] RECOVERY - puppet last run on ssl3002 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:51:05] (PS1) Florianschmidtwelzow: Remove obsolete mobile configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/169321
[06:51:27] (PS2) Florianschmidtwelzow: Remove obsolete mobile configuration [mediawiki-config] - https://gerrit.wikimedia.org/r/169321
[06:53:15] PROBLEM - puppet last run on hooft
is CRITICAL: CRITICAL: Puppet has 1 failures
[07:12:16] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[07:15:57] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: puppet fail
[07:23:51] (CR) Nemo bis: "> What does prevent this change from being deployed in November?" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) (owner: Nemo bis)
[07:24:44] (PS4) Nemo bis: Set $wgMFAnonymousEditing = true for Italian Wikipedia in November-December 2014 [mediawiki-config] - https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541)
[07:28:06] PROBLEM - Disk space on mw1114 is CRITICAL: DISK CRITICAL - free space: /run 47 MB (3% inode=99%):
[07:33:56] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected
[07:36:15] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[07:49:39] (PS4) 01tonythomas: Make BounceHandler extension work on en-wiki [puppet] - https://gerrit.wikimedia.org/r/168622
[08:39:32] (CR) Florianschmidtwelzow: [C: 1] "I'm happy with this :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) (owner: Nemo bis)
[09:03:48] (CR) Taueres: [C: 1] Set $wgMFAnonymousEditing = true for Italian Wikipedia in November-December 2014 [mediawiki-config] - https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) (owner: Nemo bis)
[09:46:58] (PS1) Matanya: backups: minor lint [puppet] - https://gerrit.wikimedia.org/r/169333
[09:59:19] (PS1) Matanya: access: give Kunal Mehta deployment access [puppet] - https://gerrit.wikimedia.org/r/169337
[10:04:57] PROBLEM - MySQL Processlist on db1056 is CRITICAL: CRIT 115 unauthenticated, 0 locked, 0 copy to table, 0 statistics
[10:05:59] RECOVERY - MySQL Processlist on
db1056 is OK: OK 1 unauthenticated, 0 locked, 0 copy to table, 1 statistics [10:42:29] (03PS1) 10Matanya: firewall: nitpicks [puppet] - 10https://gerrit.wikimedia.org/r/169340 [10:57:04] paravoid: hey, any comment on https://gerrit.wikimedia.org/r/#/c/167310/ ? LGTM, I think we can push it tomorrow [11:25:23] (03PS1) 10Aude: Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169344 [11:52:48] Assuming that SSL3 has been disabled on misc-web-lb.eqiad.wikimedia.org and Tool Labs, anybody knows why ssllabs.com says it hasn't? [12:00:04] Reedy, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141028T1200). [12:00:24] omg, so early... [12:03:43] (03PS2) 10Tim Landscheidt: compare-puppet-catalogs: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169260 [12:06:14] <_joe_> oh thanks :) [12:06:24] (03CR) 10Tim Landscheidt: "Stole the work-around from YuviPanda's operations/puppet:modules/diamond/files/collector/minimalpuppetagent.py." [software] - 10https://gerrit.wikimedia.org/r/169260 (owner: 10Tim Landscheidt) [12:27:50] (03CR) 10Giuseppe Lavagetto: [C: 032] "Thanks a lot!" 
[software] - 10https://gerrit.wikimedia.org/r/169260 (owner: 10Tim Landscheidt) [12:57:21] (03PS2) 10Filippo Giunchedi: import debian/ directory [debs/python-diamond] - 10https://gerrit.wikimedia.org/r/168599 [12:57:53] (03CR) 10Filippo Giunchedi: "fixed things from Faidon's good feedback + systemd unit file" (0310 comments) [debs/python-diamond] - 10https://gerrit.wikimedia.org/r/168599 (owner: 10Filippo Giunchedi) [13:27:37] (03PS1) 10Manybubbles: Enable faster regex searching and fix pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169355 [13:30:03] (03CR) 10Ottomata: [C: 032] access: give Kunal Mehta deployment access [puppet] - 10https://gerrit.wikimedia.org/r/169337 (owner: 10Matanya) [13:30:54] thanks matanya [13:30:59] sure [13:31:04] legoktm: don't break stuff :p [13:34:09] (03PS2) 10Ottomata: hadoop: avoid ganglia diskstat duplicate delaration [puppet] - 10https://gerrit.wikimedia.org/r/169275 (owner: 10Gage) [13:37:38] (03CR) 10Ottomata: "I can't remove the class, Daniel, as they ensure different group read permissions on the file." [puppet] - 10https://gerrit.wikimedia.org/r/168993 (owner: 10Ottomata) [13:40:09] (03CR) 10Ottomata: "To make this slighly more generic (and to not put another template into templates/misc), could I add this to some mysql module? There are" [puppet] - 10https://gerrit.wikimedia.org/r/168993 (owner: 10Ottomata) [13:40:25] ottomata: wanted to ask you, udp2log doesn't have ferm rules, do you plan on changing that ? [13:40:59] for the udp2log daemon? [13:41:24] (hadn't planned on changing it, as we are actively trying to turn off udp2log...which still maybe be months down the road due to the upcoming fundraiser) [13:44:54] I'm gonna sync so code to production now. anyone object? its a quick security bug. 
[13:45:09] (03CR) 10Ottomata: [C: 032] hadoop: avoid ganglia diskstat duplicate delaration [puppet] - 10https://gerrit.wikimedia.org/r/169275 (owner: 10Gage) [13:46:04] ori: I have a merged but unsynced rebased change of yours in wmf5 [13:47:17] !log manybubbles Synchronized php-1.25wmf4/extensions/CirrusSearch/: (no message) (duration: 00m 11s) [13:47:24] Logged the message, Master [13:48:15] !log manybubbles Synchronized php-1.25wmf5/extensions/CirrusSearch/: (no message) (duration: 00m 05s) [13:48:20] Logged the message, Master [13:53:36] (03PS1) 10Ottomata: Add halfak to research group [puppet] - 10https://gerrit.wikimedia.org/r/169365 [13:53:50] \o/ [13:54:44] yay [13:55:06] (03CR) 10Ottomata: [C: 032] Add halfak to research group [puppet] - 10https://gerrit.wikimedia.org/r/169365 (owner: 10Ottomata) [13:58:53] (03CR) 10QChris: Include research mysql user and password in file on stat1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168993 (owner: 10Ottomata) [13:59:36] SWAAT isn't due for a while, isn't it? [14:01:37] <^d> YuviPanda: Another hr. [14:01:45] ah, cool [14:01:51] * YuviPanda goes to prepare wikitech patches [14:04:24] !log demon Synchronized wmf-config/PrivateSettings.php: (no message) (duration: 00m 04s) [14:04:29] Logged the message, Master [14:04:49] !log reload swift frontend in eqiad after password rotation [14:04:54] Logged the message, Master [14:08:02] <^d> manybubbles, ottomata: Do we want to try adding 20-31 today? [14:08:12] yeahhahhHHH [14:10:51] (03CR) 10Faidon Liambotis: [C: 04-1] import debian/ directory (031 comment) [debs/python-diamond] - 10https://gerrit.wikimedia.org/r/168599 (owner: 10Filippo Giunchedi) [14:11:13] godog: yes let's deploy https://gerrit.wikimedia.org/r/#/c/167310/ [14:11:22] ^d, i can start shortly... 
[14:11:31] i have meetings today, but i can mostly put them in the background [14:11:47] <^d> Only meeting I have is our meeting :) [14:11:52] godog: doesn't it need a respective varnish change though? [14:12:09] (03PS2) 10Faidon Liambotis: Remove temp zone rewrite logic since that zone should be private [puppet] - 10https://gerrit.wikimedia.org/r/167310 (owner: 10Aaron Schulz) [14:12:22] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Remove temp zone rewrite logic since that zone should be private [puppet] - 10https://gerrit.wikimedia.org/r/167310 (owner: 10Aaron Schulz) [14:12:28] paravoid: no I don't think so, rewrite.py should be enough [14:15:10] (03PS2) 10Hashar: jenkins: use openjdk-7-jre-headless [puppet] - 10https://gerrit.wikimedia.org/r/153764 [14:15:29] <^d> manybubbles: welcome back. any objections? :) [14:15:42] ^d: huh? [14:16:03] <^d> Oh, must've missed original message. :) [14:16:08] <^d> manybubbles, ottomata: Do we want to try adding 20-31 today? [14:16:10] <^d> 20-31? [14:16:11] <^d> Yeah [14:17:05] yeah - lets do it! [14:17:26] ottomata: what is the best way to get it ready to verify? [14:17:38] (03CR) 10Hashar: "I finally took time to look at the impact caused by switching the default java version with Debian alternatives system. Seems there is no" [puppet] - 10https://gerrit.wikimedia.org/r/153764 (owner: 10Hashar) [14:18:07] (03CR) 10Andrew Bogott: [C: 032] Allow wgMainCacheType setting to be overridden [puppet] - 10https://gerrit.wikimedia.org/r/169288 (https://bugzilla.wikimedia.org/72600) (owner: 10Spage) [14:18:10] manybubbles: the OS is installed on them, and they have base puppet stuff applied [14:18:55] elasticsearch-roots is on them [14:18:57] you should be able to log in [14:19:02] if you want to check out settings first [14:19:13] i turned on hyperthreading on them all [14:19:23] but, i haven't checked after it sup if it is on (not sure how to...) 
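One possible way to answer the "is hyperthreading on" question, offered as a sketch and not as what was actually run on these hosts: on x86 Linux, /proc/cpuinfo reports `siblings` (logical CPUs per package) and `cpu cores` (physical cores per package), and hyperthreading is normally active when the first is larger than the second. The function name and the sample text are hypothetical.

```python
def hyperthreading_enabled(cpuinfo_text: str) -> bool:
    """Return True if any processor stanza reports more siblings than cores."""
    siblings = cores = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            # A blank line separates processor stanzas; start over.
            siblings = cores = None
            continue
        key = key.strip()
        if key == "siblings":
            siblings = int(value)
        elif key == "cpu cores":
            cores = int(value)
        if siblings is not None and cores is not None and siblings > cores:
            return True
    return False


# Illustrative stanza only; real files carry many more fields.
SAMPLE = "processor\t: 0\nsiblings\t: 16\ncpu cores\t: 8\n"

# On a real host one would pass open("/proc/cpuinfo").read() instead.
ht_on = hyperthreading_enabled(SAMPLE)
```

Reading the file this way avoids depending on extra tools being installed, at the cost of only working where the kernel exposes those two fields.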
[14:19:26] after reboot [14:21:02] ottomata: top claims 32 cpus - probably working [14:21:13] <^d> I checked HT and noatime yesterday [14:21:20] <^d> And was able to login on all new boxes [14:21:24] <^d> So I think they're good to go. [14:21:34] ottomata: so you wanna just set one of them up in puppet and see if puppet does all the right salt stuff? [14:22:03] Zuul seems to be stuck. Who knows how to unstick it? [14:23:18] (03PS4) 10John F. Lewis: Modify $wmgAddWikiNotify for use by notifyNewProjects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168701 [14:23:22] (03PS5) 10John F. Lewis: Modify $wmgAddWikiNotify for use by notifyNewProjects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168701 [14:23:31] i think puppet should, i don't trust it to auto deploy the plugins, we'll have to double cehck that [14:23:35] but if you are ready, I am ready! [14:23:39] shall I do 1020? [14:24:05] ottomata: ready. do 1020 [14:24:29] oh, btw, these have all been put in row D [14:24:32] two different racks [14:24:34] but all row D [14:24:37] ottomata: k [14:24:43] fine by me [14:24:45] ok [14:25:20] ^d: can you review https://gerrit.wikimedia.org/r/#/q/169355,n,z ? [14:25:25] I want to SWAT it today [14:25:31] (03PS1) 10Ottomata: Include elasticsearch role on new elastic1020 [puppet] - 10https://gerrit.wikimedia.org/r/169482 [14:26:13] (03CR) 10Manybubbles: [C: 031] Include elasticsearch role on new elastic1020 [puppet] - 10https://gerrit.wikimedia.org/r/169482 (owner: 10Ottomata) [14:26:44] (03CR) 10Chad: [C: 031] "swat away" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169355 (owner: 10Manybubbles) [14:26:52] ^d: thanks! [14:26:55] <^d> yw [14:27:27] cmon jenkins [14:27:32] * anomie finds https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues and tries that. And, disturbingly, it seems to have worked. 
[14:28:15] (03CR) 10Ottomata: [C: 032] Include elasticsearch role on new elastic1020 [puppet] - 10https://gerrit.wikimedia.org/r/169482 (owner: 10Ottomata) [14:28:51] (03CR) 10Jgreen: [C: 032 V: 031] Made the beta clusters use deployment-mx for outgoing mail delivery [puppet] - 10https://gerrit.wikimedia.org/r/169194 (owner: 1001tonythomas) [14:30:01] OO manybubbles, need to add nodes to the rack topology in the role... [14:30:47] (03CR) 10Stryn: [C: 031] Remove hardcoding from notifyNewProjects [puppet] - 10https://gerrit.wikimedia.org/r/168702 (owner: 10John F. Lewis) [14:31:00] PROBLEM - puppet last run on elastic1020 is CRITICAL: CRITICAL: puppet fail [14:31:08] ottomata: we aren't really using that to much but yeah - sorry. [14:31:16] I should have caught that [14:31:19] np, i forgot too [14:31:24] puppet needs it or it errors [14:31:26] chasemp: 503 on prod phabricator while trying to log in. generic varnish page... [14:31:33] Request: POST http://phabricator.wikimedia.org/auth/login/ldap:self/, from 10.64.0.172 via cp1044 cp1044 ([10.64.0.172]:80), Varnish XID 1581136779 [14:35:32] (03PS1) 10Ottomata: Add rack information for new elasticsearch nodes [puppet] - 10https://gerrit.wikimedia.org/r/169490 [14:36:55] (03PS2) 10Ottomata: Add rack information for new elasticsearch nodes [puppet] - 10https://gerrit.wikimedia.org/r/169490 [14:37:12] (03CR) 10Hashar: [C: 04-1] contint: Minor clean up (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/168629 (owner: 10Krinkle) [14:37:42] (03CR) 10Andrew Bogott: [C: 031] script to monitor, clean up salt keys of deleted labs instances [puppet] - 10https://gerrit.wikimedia.org/r/168601 (owner: 10ArielGlenn) [14:38:02] (03CR) 10Ottomata: [C: 032] Add rack information for new elasticsearch nodes [puppet] - 10https://gerrit.wikimedia.org/r/169490 (owner: 10Ottomata) [14:38:49] ottomata: puppet is silly. fail doesn't return a value. it fails! [14:39:31] aye! 
[14:39:33] puppet is running now [14:40:07] paravoid: thanks for merging the rewrite.py patch, I'll reload the proxies [14:40:20] RECOVERY - puppet last run on elastic1020 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:40:21] sorry, I context-switched :) [14:40:23] (03CR) 10Hashar: [C: 04-1] "The idea with the qunit_localhost class is to have all definition grouped together to avoid cluttering the already messed up manifests/rol" [puppet] - 10https://gerrit.wikimedia.org/r/168630 (owner: 10Krinkle) [14:40:34] but didn't forget [14:41:09] haha that's fine, I have everything already from earlier restart to rotate the search password [14:41:22] ah great :) [14:41:26] cool, manybubbles! it looked like the initial git deploy worked! [14:41:30] without me deploying! [14:42:15] and before elasticsearch was started! [14:42:18] 41 Matching Service Entries Displayed [14:42:19] :(( [14:42:20] so....COOL, i think that's it then [14:42:28] no elasticsearch stop, deploy, start needed [14:42:35] can you double check that plugins are loaded there? [14:42:54] ottomata: checking [14:43:40] ottomata: looks good to me [14:44:44] * aude wonders if jenkins is alive for wikidata stuff [14:44:54] ottomata: ok - so add the others? [14:45:02] cool [14:45:07] yeah! manybubbles, can I do them all at once? [14:45:13] or is it better one at a time? [14:45:16] or a few at a time? [14:45:39] (03CR) 10Hashar: [C: 031 V: 032] contint: ruby2.0 on Trusty slaves [puppet] - 10https://gerrit.wikimedia.org/r/166046 (owner: 10Hashar) [14:46:01] ottomata: all once should be fine - elasticsearch won't start if the plugins aren't isntalled [14:46:06] k [14:46:18] (03PS2) 10Hashar: contint: switch Zuul conf to new repository [puppet] - 10https://gerrit.wikimedia.org/r/166012 [14:46:27] (03CR) 10Hashar: [V: 032] "Deployed manually." [puppet] - 10https://gerrit.wikimedia.org/r/166012 (owner: 10Hashar) [14:46:45] <^d> manybubbles: You know about _cat/plugins right? 
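A rough sketch of how the text returned by `_cat/plugins` could be checked per node. It assumes whitespace-separated columns of node name, plugin name, version and type, and the sample rows, node list and helper names are illustrative, not taken from the cluster.

```python
from collections import defaultdict


def plugins_by_node(cat_output: str) -> dict:
    """Group plugin names by node from _cat/plugins style text."""
    nodes = defaultdict(set)
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        nodes[fields[0]].add(fields[1])
    return dict(nodes)


def missing_plugins(cat_output: str, expected, all_nodes) -> dict:
    """Map each node to the expected plugins it does not report."""
    found = plugins_by_node(cat_output)
    report = {}
    for node in all_nodes:
        absent = sorted(set(expected) - found.get(node, set()))
        if absent:
            report[node] = absent
    return report


# Hypothetical sample: the second node lacks one plugin.
SAMPLE = (
    "elastic1020 swift-repository 0.6 j\n"
    "elastic1020 wikimedia-extra 0.0.1 j\n"
    "elastic1021 swift-repository 0.6 j\n"
)
report = missing_plugins(
    SAMPLE,
    expected={"swift-repository", "wikimedia-extra"},
    all_nodes=["elastic1020", "elastic1021", "elastic1022"],
)
```

Passing the full node list explicitly matters because a node with no plugins at all would otherwise not appear in the output and would be silently skipped.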
[14:47:11] <^d> Way easier than parsing out stuff from _nodes [14:48:54] (03PS1) 10Ottomata: Include elasticsearch role on all elastic* nodes [puppet] - 10https://gerrit.wikimedia.org/r/169497 [14:49:24] ^d: yeah - I just `find /srv/deploy/elasticsearch/plugins | grep jar | ls -lh` and eyeball it. that way I don't have to boot es to see if the plugins look right [14:49:29] its what I do when I deploy new plugin versions [14:49:42] <^d> ah true [14:50:11] (03CR) 10Ottomata: [C: 032] Include elasticsearch role on all elastic* nodes [puppet] - 10https://gerrit.wikimedia.org/r/169497 (owner: 10Ottomata) [14:51:19] !log rolling-restart of eqiad ms-fe* after https://gerrit.wikimedia.org/r/#/c/167310/ [14:51:24] Logged the message, Master [14:51:27] (03Abandoned) 10Hashar: Describe Math related packages in a class [puppet] - 10https://gerrit.wikimedia.org/r/115133 (https://bugzilla.wikimedia.org/61090) (owner: 10Hashar) [14:51:31] (03Abandoned) 10Hashar: Move math related packages to a puppet class [debs/wikimedia-task-appserver] - 10https://gerrit.wikimedia.org/r/115135 (https://bugzilla.wikimedia.org/61090) (owner: 10Hashar) [14:53:48] COOL [14:53:49] manybubbles: [14:53:59] ^d: hey! thanks for making ES send metrics to labs graphite [14:54:07] elasticsearch running on all new nodes [14:54:50] ottomata: no plugins on 1021 [14:55:12] ottomata: or 1022 [14:55:14] hm [14:55:21] (03Abandoned) 10Hashar: contint: reduce duplication with mediawiki::packages [puppet] - 10https://gerrit.wikimedia.org/r/138804 (owner: 10Hashar) [14:55:21] i thought you said es wouldn't start hten [14:55:31] it _shouldn't_ [14:55:36] I'm not sure what is up [14:55:41] i see plugins... [14:55:43] on 22 [14:56:06] and on 1021 [14:56:12] are they just not loaded? [14:56:34] they look loaded on 1021 [14:56:50] elastic1021 swift-repository 0.6 j [14:56:50] etc... [14:56:55] elastic1021 wikimedia-extra 0.0.1 j [14:56:56] ottomata: I'm an idiot. 
ignore [14:57:00] ha, k [14:57:01] :) [14:57:06] SOooOooOooooo! [14:57:07] cool! [14:57:09] <^d> YuviPanda: Yeah no problem. I need to take some time to compare what we've got in prod with ganglia. If they're close enough I'll probably just do the same in prod too. [14:57:10] what next? [14:57:18] (03Abandoned) 10Hashar: sanity test for refreshWikiversionsCDB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/105698 (owner: 10Hashar) [14:57:38] <^d> ottomata: Wait a few hours for allocation to finish pushing stuff around? [14:57:40] ^d: cool. [14:57:45] <^d> Maybe we could raise the throttle. [14:58:10] <^d> Oh, we should add them to pybal :) [15:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141028T1500). Please do the needful. [15:00:05] PROBLEM - puppet last run on ms-be3004 is CRITICAL: CRITICAL: puppet fail [15:00:06] ^d: we should ban a couple of the nodes so we can do the hard drive upgrade on them [15:00:11] I'll swat today [15:00:40] <^d> manybubbles: Yeah, was going to ask cmjohnson how many we wanted to do at a time and go ahead and ban and depool them [15:00:49] sounds good [15:01:13] andrewbogott: just a cherry pick, one for each branch. [15:01:13] aude: around for swat? [15:01:22] andrewbogott: a git fetch on mediawiki repo is taking forever for me atm... [15:01:34] * aude here [15:02:02] * aude assumes the train is in 3-4 hours? 
[15:02:14] but can do the submodule update now [15:02:18] (03CR) 10Manybubbles: [C: 032] Enable faster regex searching and fix pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169355 (owner: 10Manybubbles) [15:02:31] (03Merged) 10jenkins-bot: Enable faster regex searching and fix pool counter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169355 (owner: 10Manybubbles) [15:02:40] YuviPanda: hm, for me too it seems [15:03:02] andrewbogott: hmm, maybe we can ask manybubbles to make the bumps? [15:03:11] My fetch just finished, lemme try [15:03:22] andrewbogott: ok! [15:03:35] manybubbles: we're still trying to make submodule bumps for the next patch in SWAT [15:03:40] !log manybubbles Synchronized wmf-config/: SWAT cirrus config updates - (hopefully) faster regexes (duration: 00m 06s) [15:03:43] manybubbles: so feel free to go ahead with aude's in the meantime [15:03:44] ^d: hey could you unstuck zuul? [15:03:45] Logged the message, Master [15:04:13] do we not have selenium tests for search? [15:04:18] YuviPanda: you need me to merge the changes to the release branch of openstackmanager? [15:04:25] or maybe not ones that run on wikidata beta [15:04:43] aude: tons in cirrus but they aren't all running [15:04:45] manybubbles: yup, I've no merge rights in wmf branches [15:04:45] very few [15:04:47] oh [15:04:58] we broke search on test.wikidata because of silly typo [15:05:20] aude: ah. we can see about getting the selenium tests running there. [15:05:22] need to see about having some kind of smoke test... phpunit or somewhere [15:05:25] ok :) [15:08:40] why you no merge after +2? 
[15:08:57] i think zuul is stuck [15:09:14] i had to +2 and hit submit [15:10:47] manybubbles: yeah, zuul seems stuck for a while [15:11:08] and hashar's been prodded via email so :) [15:11:17] !log manybubbles Synchronized php-1.25wmf5/extensions/Wikidata/: SWAT update wikidata (duration: 00m 10s) [15:11:19] aude: ^^^^^ [15:11:23] I just submitted it myself [15:11:23] Logged the message, Master [15:11:29] big hammer [15:11:31] search works again! [15:11:46] aude: great! consider yourself SWATed [15:11:55] thanks [15:12:16] and the js / our widget still works [15:14:05] YuviPanda: did you ever get logged in to phab? / was it problems registering or logging in? [15:14:17] * AndyRussG waves at wikimedia-operations [15:14:23] chasemp: registering, I'm trying to register now [15:14:39] ldap or SUL? [15:16:27] chasemp: LDAP [15:16:41] Anyone: quick pointer to config of Varnish caches' handling of requests to Special:BannerRandom? [15:17:16] Should I restart zuul? [15:17:47] YuviPanda: screen shot for me if you would the process and failure. I can login via ldap now etc, but maybe there is some reg issue [15:18:23] manybubbles: can you just do the submodule bump? My internet has been trying to murder me today [15:18:23] RECOVERY - puppet last run on ms-be3004 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [15:18:30] two sources, both terrible today [15:18:38] sorry [15:18:40] YuviPanda: k [15:19:08] manybubbles: is zull still stuck? I can restart it. It'll break CI tests for pending patches though [15:19:15] s/zull/zuul/ [15:19:18] still stuck afaik [15:19:21] I think so [15:19:49] patches since like an hour not merged [15:19:50] i think zuul is unstuck now [15:20:00] ... [15:22:38] <^d> manybubbles: Do we want to raise allocation rates? Its going to take hours otherwise. 
[15:22:51] ^d: we can do that [15:23:40] (03PS1) 10Manybubbles: Raise queue capacity for regex searches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169504 [15:23:52] (03CR) 10Manybubbles: [C: 032] Raise queue capacity for regex searches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169504 (owner: 10Manybubbles) [15:23:59] (03CR) 10Chad: [C: 031] Raise queue capacity for regex searches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169504 (owner: 10Manybubbles) [15:24:01] (03Merged) 10jenkins-bot: Raise queue capacity for regex searches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169504 (owner: 10Manybubbles) [15:24:47] !log manybubbles Synchronized wmf-config/: SWAT cirrus regex queues too small? (duration: 00m 05s) [15:24:56] Logged the message, Master [15:26:53] YuviPanda, manybubbles, I'm lost re: what's happening with our swats. Still need submodule patches? Blocked by zuul? Everything going fine and I should just wait? [15:27:09] and jenkins is alive! [15:27:11] andrewbogott: yeah, manybubbles is making submodule patches [15:27:13] zuul* [15:27:13] andrewbogott: I'm building the submodule updates now. lots of downloading to do [15:28:29] ok, thanks [15:28:43] manybubbles: let me know if those docs work for you, and I'll presume something's broken with my checkout [15:30:55] <^d> manybubbles: Raised cluster.routing.allocation.cluster_concurrent_rebalance from 2 to 8. [15:31:09] ^d: k. we'll just have to watch it s abit [15:31:14] andrewbogott: they worked for me. [15:31:19] Reedy: greg-g https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=132584&oldid=132580 [15:31:29] i really don't care much which time [15:31:59] aude: we picked that time for you :) not sure why it wasn't done, talked with Reedy a few times about it [15:32:03] <^d> manybubbles: Well as long as node_concurrent_recoveries is still throttled at 3 I think we're ok. 
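For reference, a minimal sketch of the kind of request body that would carry the two throttles mentioned above to the Elasticsearch cluster settings endpoint (`PUT /_cluster/settings`). The setting names are the real ones; using `transient` rather than `persistent`, and the helper name, are assumptions, and nothing here shows how the change was actually applied.

```python
import json


def rebalance_settings_body(concurrent_rebalance: int, node_recoveries: int) -> str:
    """Build a JSON body for the cluster settings API (assumed transient)."""
    if concurrent_rebalance < 1 or node_recoveries < 1:
        raise ValueError("throttle values must be positive")
    body = {
        "transient": {
            # How many shard rebalances may run cluster-wide at once.
            "cluster.routing.allocation.cluster_concurrent_rebalance": concurrent_rebalance,
            # How many recoveries a single node may take part in at once.
            "cluster.routing.allocation.node_concurrent_recoveries": node_recoveries,
        }
    }
    return json.dumps(body, sort_keys=True)


# Values quoted in the conversation: rebalance raised to 8, recoveries left at 3.
payload = rebalance_settings_body(8, 3)
```

Keeping the per-node recovery limit low while raising the cluster-wide rebalance count is what lets more shards move in parallel without any single machine being overloaded.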
[15:32:03] manybubbles: ok, I'll wipe out my tree and start over, see if things make more sense [15:32:09] probably confusing [15:32:18] especially with daylight savings time etc [15:32:22] aude: it's just a change of time [15:32:34] aude: yeah. they are supposed to eb pegged to sf time [15:32:36] maybe [15:32:50] we can try early next week [15:32:55] ok - waiting on zuul again for the submodule updates for the SWAT [15:33:59] hmpf, my trackpad broke now... [15:34:46] <^d> ottomata: We still didn't pool them in pybal. [15:35:22] oh! [15:35:26] oook [15:35:27] shall I do that now? [15:35:29] ^d? [15:35:40] <^d> Sure, I think so [15:37:31] (03PS1) 10Manybubbles: Lower Cirrus regex timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169508 [15:38:42] hmm, think done? [15:40:10] Ugh. Sorry aude/greg-g [15:40:23] no big deal [15:41:55] <^d> ottomata: lgtm [15:44:43] ok, back with a stabler internet connection [15:44:45] did the OSM swat finish? [15:44:47] * YuviPanda apologizes to everyone for the spotty internet earlier [15:45:02] ah, looks like not [15:45:57] !log manybubbles Synchronized php-1.25wmf5/extensions/OpenStackManager/: SWAT update openstackmanager (duration: 00m 04s) [15:46:02] Logged the message, Master [15:46:07] YuviPanda: ^^^ [15:46:24] manybubbles: ah, cool. andrewbogott needs to run sync as well. regular sync doesn't touch wikitech... [15:46:39] ok, sync now? 
[15:47:08] andrewbogott: yeah, wmf4 got merged as well [15:47:44] !log manybubbles Synchronized php-1.25wmf4/extensions/OpenStackManager/: SWAT update openstackmanager (duration: 00m 04s) [15:47:46] YuviPanda: ^^^^ [15:47:47] !log running sync-common on virt1000 [15:47:49] Logged the message, Master [15:47:56] Logged the message, Master [15:47:58] prod is all synced up [15:48:45] cool, I'll test once andrewbogott's sync finishes [15:51:39] (03CR) 10Incola: [C: 031] Set $wgMFAnonymousEditing = true for Italian Wikipedia in November-December 2014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) (owner: 10Nemo bis) [15:52:37] !log restarting gmond on elasticsearch nodes so the new elasticsearch nodes pick up that elasticsearch is up and running.... [15:52:48] andrewbogott: did sync finish? [15:52:52] YuviPanda: Yes, just now. [15:52:56] And the main page still comes up :) [15:52:56] andrewbogott: cool [15:53:02] yeay! [15:53:02] Want to test a bit? I have a meeting in a few minutes. [15:53:42] andrewbogott: yeah, just tested. works fine! [15:53:48] cool [15:53:50] manybubbles: consider ourselves swatted, and thanks! [15:53:59] YuviPanda: wonderful! [15:55:14] thanks manybubbles [15:55:34] ottomata: ganglia doesn't seem to be working on new elastic nodes - not sending anything anywhere. is there something that must be done? [15:55:36] andrewbogott: np! [15:56:29] ^d and ottomata: we have a long way to go! https://gist.github.com/nik9000/f37a3d15bdb2a0eff3e8 [15:56:59] <^d> yeahhhh [15:57:32] if we ban a few of the nodes then the new nodes will suck up some of the slack. of course so will the old nodes [15:57:50] manybubbles: i see ganglia data on 1020 [15:57:55] ottomata: so that is 12 of the 15 new nodes, right? we'll add the others once we've got 17 and 18 out of the picture.
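"Banning" a node in Elasticsearch is usually done with an allocation exclude filter, so shards drain off it before maintenance. The sketch below builds such a body for `PUT /_cluster/settings`; the setting name is real, but the node names `elastic1017` and `elastic1018`, the `transient` scope and the helper name are assumptions made for illustration.

```python
import json


def ban_nodes_body(node_names) -> str:
    """Build a settings body that excludes the named nodes from shard allocation."""
    names = [n.strip() for n in node_names if n.strip()]
    if not names:
        raise ValueError("need at least one node name")
    body = {
        "transient": {
            # Comma-separated node names; shards are moved off matching nodes.
            "cluster.routing.allocation.exclude._name": ",".join(names)
        }
    }
    return json.dumps(body)


# Assumed names for the "17 and 18" mentioned in the conversation.
ban_payload = ban_nodes_body(["elastic1017", "elastic1018"])
```

Sending the same setting with an empty string later would lift the ban once the hardware work is finished.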
[15:58:18] ah yes, that is the plan [15:58:42] ottomata: nothing in the metrics though: http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Elasticsearch+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=2&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [15:59:01] should we just ban 17 and 18 now then and let them clear off so we can do the next three? [15:59:08] manybubbles: it is there [15:59:11] it is just low [15:59:17] oh [15:59:18] ... [15:59:26] <^d> And then just picked up some I think [15:59:57] i was just looking at graphs that had es_ metrics... [16:00:01] ? [16:00:09] ottomata: like this one: http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Elasticsearch+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=es_heap_committed&sh=2&z=small&hc=4&host_regex=&max_graphs=0&s=by+name [16:00:42] (03PS3) 10Filippo Giunchedi: import debian/ directory [debs/python-diamond] - 10https://gerrit.wikimedia.org/r/168599 [16:00:44] lke this [16:00:45] http://ganglia.wikimedia.org/latest/?c=Elasticsearch%20cluster%20eqiad&h=elastic1029.eqiad.wmnet&m=cpu_report&r=day&s=by%20name&hc=4&mc=2 [16:00:48] <^d> Day is boring [16:00:50] <^d> Too much nothing. [16:00:57] <^d> Set to hour or 2. [16:01:03] yeah [16:01:08] yeah that's why [16:01:15] oh man, I'm on day! [16:01:17] who does that? [16:01:39] thanks [16:01:39] (03CR) 10Filippo Giunchedi: import debian/ directory (031 comment) [debs/python-diamond] - 10https://gerrit.wikimedia.org/r/168599 (owner: 10Filippo Giunchedi) [16:01:46] making some lunch... [16:01:52] andrewbogott: whelp, that patch messed up something else, aka editing on non-hiera pages... [16:02:01] so wikitech editing is broken now [16:02:07] https://gerrit.wikimedia.org/r/#/c/169512/ needs to be deployed to fix that [16:03:04] YuviPanda: ok... [16:03:30] I will just merge that now [16:03:39] I should have caught that :/ [16:03:41] andrewbogott: yeah, it's merged. 
just needs to be deployed [16:03:44] legoktm: me too :| [16:03:55] <_joe_> andrewbogott: so can you deploy that? [16:04:05] yep [16:04:07] <_joe_> sync-common should do the trick I guess [16:04:53] <^d> manybubbles: I'm going to take the training wheels off and double concurrent from 8 -> 16 [16:05:14] ^d: just watch while you do it! [16:05:23] <^d> I'm not going anywhere :) [16:05:24] _joe_: it isn't fetched on tin yet is it? [16:05:44] <_joe_> andrewbogott: "merged" form me means "merged on tin" :) [16:05:50] oh, ok then [16:05:53] <_joe_> but I'm probably wrong [16:06:01] !log sync-common on virt1000 again [16:06:30] YuviPanda: ? [16:06:33] PROBLEM - puppet last run on db2002 is CRITICAL: CRITICAL: puppet fail [16:06:34] <_joe_> andrewbogott: I am probably wrong, yuvi doesn't have permissions for that I guess [16:06:43] yeah, I don't have permissions [16:06:49] YuviPanda: will you make a submodule commit too please? [16:06:53] andrewbogott: yeah, doing [16:08:04] andrewbogott: https://gerrit.wikimedia.org/r/169514 is branch cherry pick [16:08:06] andrewbogott: doing submodule [16:08:38] <^d> manybubbles: wheee https://phabricator.wikimedia.org/P45 [16:09:17] (PS1) Yuvipanda: Submodule bump for OSM [core] (wmf/1.25wmf4) - https://gerrit.wikimedia.org/r/169515 [16:09:20] andrewbogott: ^ submodule bump [16:10:07] andrewbogott: you should be able to sync when these merge, I'll prepare wmf5 now. wikitech is on wmf4 [16:12:28] andrewbogott: hmm, jenkins isn't letting the submodule bump merge... [16:12:34] yeah [16:14:11] YuviPanda: I suspect you didn't rebase with manybubbles's patch first [16:14:52] andrewbogott: yeah, I figured [16:15:18] <^d> manybubbles: Heh, old servers don't want big shards. [16:15:25] ? [16:15:30] <^d> They're giving all their en/fr/de/commons shards to 20-31 :p [16:15:45] ^d: thats somewhat random and somewhat observer bias [16:16:02] <^d> yep. 
[16:16:08] because those take the longest to transfer you're more likely to see them [16:16:17] <^d> but it was funny, hence the "heh" :) [16:21:06] andrewbogott: https://gerrit.wikimedia.org/r/169518 and https://gerrit.wikimedia.org/r/169520 for wmf5, not urgent but should merge after current sync [16:21:23] Still waiting for Jenkins to merge the one on wmf4 [16:21:44] andrewbogott: oh, hmm. [16:22:16] andrewbogott: hmm, it failed on something called vendor integration.. [16:23:13] andrewbogott: it -1d it and then merged it? [16:23:27] looks like [16:23:52] GO DOUSE YOURSELF IN SUGAR AND DIVE INTO AN ANT HILL, JENKINS [16:24:19] YuviPanda: having fun? [16:24:23] PROBLEM - ElasticSearch health check on logstash1002 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.137 [16:24:25] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [16:24:34] YuviPanda: I don't think it actually merged, I don't see it... [16:24:47] manybubbles: yeah, with https://gerrit.wikimedia.org/r/#/c/169515/ [16:24:58] <^d> YuviPanda: quipped. [16:25:16] YuviPanda: did I do the submodule update wrong? [16:25:35] RECOVERY - puppet last run on db2002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:25:36] manybubbles: no, that swat broke wikitech editing on non-hiera namespaces, this is the fix... [16:28:22] andrewbogott: hmm, I see it on my machine after fetch.. [16:29:45] In php-1.25wmf4 I did a fetch, rebase wmf/1.25wmf4, submodule update [16:31:00] andrewbogott: ah, hmm. right, the submodule bump itself seems off... [16:31:05] * YuviPanda tries.. [16:32:05] andrewbogott: I see the submodule update in mediawiki/core when I do a fetch... [16:32:40] Well, I defer to manybubbles, I may be doing something wrong [16:35:38] andrewbogott: looks like manybubbles left... [16:35:45] Reedy: around?
[16:35:51] Yeah [16:35:52] or has poor connection [16:36:55] Reedy: wikitech editing is broken, we're trying to push out https://gerrit.wikimedia.org/r/#/c/169515/ and andrewbogott is having some troubles... [16:36:55] help? [16:37:14] Where is it stuck with? [16:37:52] In php-1.25wmf4 I did a fetch, rebase wmf/1.25wmf4, submodule update [16:38:07] YuviPanda: I don't think it actually merged, I don't see it... [16:38:40] There was Niks change [16:38:43] but yours wasn't pulled in [16:39:07] YuviPanda: I just tried again and it worked... [16:39:15] aaf37fdb6f69e2c6f85c56e7357a84183766d9a0 / Ib12b2fba55fc6af3eb2982e38bab4729113cb6e0 [16:39:17] I just pulled [16:39:21] did 'git rebase' rather than 'git rebase origin/wmf/etc/' [16:39:26] well... [16:39:29] * YuviPanda is somewhat confused. [16:39:30] what the heck? [16:39:30] you can just git pull [16:39:37] it rebases by default (Chad set this a while ago) [16:40:29] !log andrew Synchronized php-1.25wmf4/extensions/OpenStackManager/OpenStackManagerHooks.php: (no message) (duration: 00m 01s) [16:40:53] ok, but that doesn't explain why an explicit rebase didn't work... [16:41:14] Anyway, now I want to sync-file php-1.25wmf4/extensions/OpenStackManager/OpenStackManagerHooks.php, yes? [16:41:48] bd808, anything change with logstash y'day? both ocg (/cc cscott ) and parsoid's events seem to be lost since ~22 hours y'day.. last event for parsoid is 2014-10-27T18:50:45.977Z .. similar for ocg. [16:42:06] !log andrew Synchronized php-1.25wmf4/extensions/OpenStackManager/OpenStackManagerHooks.php: (no message) (duration: 00m 01s) [16:42:08] (03CR) 10CSteipp: [C: 031] "I'm ok with this approach." [puppet] - 10https://gerrit.wikimedia.org/r/165779 (owner: 10Ori.livneh) [16:42:28] Reedy: My can't get sync-file to work, do you mind syncing that file as well? [16:42:33] (And telling me how you did it?) [16:43:04] subbu: We had some issues with the logstash collector on logstash1001. 
There may be problems on logstash1002 as well that weren't noticed. [16:43:13] subbu: I'll look around a bit. [16:43:13] parsoid gets to logstash1003 [16:43:15] andrewbogott: sorry? [16:43:23] the bot logged it [16:43:28] [16:40:29] !log andrew Synchronized php-1.25wmf4/extensions/OpenStackManager/OpenStackManagerHooks.php: (no message) (duration: 00m 01s) [16:43:45] Reedy: Yeah, but I got a whole bunch of Permission denied (publickey) [16:43:46] bd808, not sure if ocg goes to logstash1002 or 1003. [16:43:49] even though my key is forwarded. [16:44:15] * aude back in 1 hour or so for deploy stuff [16:44:19] subbu: I suppose we should figure that out (and document it somewhere) [16:44:25] !log reedy Synchronized php-1.25wmf4/extensions/OpenStackManager: (no message) (duration: 00m 14s) [16:44:32] Logged the message, Master [16:44:46] cscott, ^^^ reg. ocg. [16:45:08] ocg goes to logstash1002 [16:45:16] i thought parsoid went to logstash1003 [16:45:23] yes, parsoid goes to 1003. [16:45:29] lol [16:45:40] so there are some issues with all the logstash collectors :) [16:45:51] _joe_: andrewbogott I seem to be able to edit on wikitech... [16:45:54] Reedy: ? [16:46:05] andrewbogott: sync-dir WFM [16:46:21] ok… I'll try again with the other branch [16:46:36] subbu, cscott: The elasticsearch cluster behind logstash is still sad too. 1 unassigned and 3 recovering shards [16:47:17] last log stashed by ocg was 2014-10-27T18:50:46.104Z and parsoid's last log was 2014-10-27T18:50:45.977Z [16:47:59] at 18:34 manybubbles logged, "18:34 manybubbles: restarting elasticsearch servers to pick up new gc logging and to reset them into a "working" state so they can have their gc problem again and we can log it properly this time." [16:48:06] 18:37 manybubbles: note that this is a restart without waiting for the cluster to go green after each restart. I expect lots of whining from icinga. This will cause us to lose some updates but should otherwise be safe. 
[16:48:15] 18:57 manybubbles: completed restarting elasticsearch cluster. now it'll make a useful file on out of memory errors. raised the recovery throttling so it'll recover fast enough to cause oom errors [16:48:26] cscott: That is a different ES cluster (or should be) [16:48:30] <^d> That's all separate. [16:48:38] <^d> Although you guys got our improved heap logging. [16:48:39] *should be*. but it is suspiciously coincident in time. [16:49:04] <^d> ES_JAVA_OPTS="-XX:HeapDumpPath=/var/lib/elasticsearch " [16:49:04] <^d> ES_JAVA_OPTS="${ES_JAVA_OPTS} -XX:GCTimeLimit=70" [16:49:04] <^d> ES_JAVA_OPTS="${ES_JAVA_OPTS} -XX:GCHeapFreeLimit=10" [16:49:10] the only other server admin log is 18:47 logmsgbot: maxsem Synchronized php-1.25wmf4/extensions/GeoData: live hack to disable geosearch (duration: 00m 04s) [16:49:36] <^d> Maybe those GC settings are screwing you? [16:49:44] ^d: Did you guys move logging from /var/log/elasticsearch to somewhere else? I don't have any new logs on logstash1003 since September 30th [16:50:02] <^d> No, normal logs are still /var/log/elasticsearch [16:50:08] <^d> Only the heap log is /var/lib/elasticsearch [16:50:37] maybe /var/lib/elasticsearch isn't writable on logstash100x? [16:51:53] there are new files in /var/lib/elasticsearch/production-logstash-eqiad/nodes/0/indices which is good [16:52:02] but no logs which is concerning [16:55:10] The index recovery seems to be progressing so I don't want to randomly restart the nodes. Something is not right though. [16:55:21] !log andrew Synchronized php-1.25wmf5/extensions/OpenStackManager: (no message) (duration: 00m 02s) [16:55:27] Logged the message, Master [16:55:43] Reedy: I just ran sync-dir php-1.25wmf5/extensions/OpenStackManager and got the same whirlwind of refused keys. [16:55:48] bd808: you mean something not right other than the fact that logstash isn't actually stashing logs? ;) [16:55:51] I can check the keys myself, but -- that's the right command, yes? 
[16:55:56] Yup [16:56:52] !log No new logs in /var/log/elasticsearch for logstash100[123] since Sep 30 06:25 [16:56:58] Logged the message, Master [16:57:21] !log andrew Synchronized php-1.25wmf5/extensions/OpenStackManager: (no message) (duration: 00m 03s) [16:57:25] Logged the message, Master [16:57:32] Reedy: ok, my mistake, forwarded the wrong key [16:57:35] !log no logs for ocg/parsoid on logstash since 2014-10-27T18:50:46.104Z/2014-10-27T18:50:45.977Z (respectively) [16:57:40] Logged the message, Master [16:58:15] !log disk utilization on logstash100[123] greater than 80% [16:58:20] Logged the message, Master [16:59:55] !log restarted logstash on logstash1002 to try and get gelf input events into kibana again [17:00:01] Logged the message, Master [17:02:20] cscott: Look better now? [17:03:02] i see logs! [17:03:13] probably you need to restart logstash1003 to get parsoid's logs back [17:03:46] I see a ton of parsoid logs [17:03:50] andrewbogott: I'm going to go eat food, should we write up an incident report? [17:04:11] bd808: does parsoid have a dashboard link? [17:05:48] subbu: parsoid logs seem to have resumed [17:06:08] @ 2014-10-28 16:20 [17:06:18] i assume that's utc [17:06:39] (03PS2) 10Ottomata: Add centralauth parameters to wikimetrics role [puppet] - 10https://gerrit.wikimedia.org/r/169018 (owner: 10Mforns) [17:07:13] * YuviPanda goes to eat, will brb [17:07:35] (03PS2) 10Ottomata: Require 2 ACKs from kafka brokers per default [puppet] - 10https://gerrit.wikimedia.org/r/167553 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [17:08:10] ottomata: can we wait a bit with that change? [17:08:20] sure [17:08:30] There were some alerts over the night, and I want to make sure it's not related. [17:08:35] (03CR) 10Ottomata: [C: 032] Add centralauth parameters to wikimetrics role [puppet] - 10https://gerrit.wikimedia.org/r/169018 (owner: 10Mforns) [17:08:39] Cool. Thanks. [17:09:29] apergos: yt? 
[17:09:45] (03CR) 10Krinkle: "The problem is that "/srv/localhost" must not be inside the qunit-specific localhost, because then we have to illogically duplicate "/srv/" [puppet] - 10https://gerrit.wikimedia.org/r/168630 (owner: 10Krinkle) [17:09:48] cscott, ok, bd808 thanks. [17:17:27] YuviPanda: because edits were broken for half an hour? Hm… maybe. [17:23:46] (03PS6) 10Ori.livneh: Modify $wmgAddWikiNotify for use by notifyNewProjects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168701 (owner: 10John F. Lewis) [17:23:53] (03CR) 10Ori.livneh: [C: 032] Modify $wmgAddWikiNotify for use by notifyNewProjects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168701 (owner: 10John F. Lewis) [17:24:17] ori :D [17:24:17] (03Merged) 10jenkins-bot: Modify $wmgAddWikiNotify for use by notifyNewProjects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168701 (owner: 10John F. Lewis) [17:27:22] !log ori Synchronized wmf-config/CommonSettings.php: I56082795: Modify $wmgAddWikiNotify for use by notifyNewProjects (duration: 00m 05s) [17:27:31] Logged the message, Master [17:30:20] ^d: joining now [17:30:42] <^d> Yeah me too just grabbing my laptop charger [17:31:08] bblack, mark or paravoid, any way I can bribe you guys into reviewing https://gerrit.wikimedia.org/r/#/c/167453/ ? :P [17:32:01] not GET and HEAD? [17:32:49] large quantities of quality coffee beans! [17:32:55] we want to redirect only normal page views, they're GET [17:33:27] you'd think you'd want HEAD too so that HEAD matches GET though, right? [17:33:41] hmmm [17:33:56] what do people actually use HEAD for these days? :P [17:34:09] browsers use it all the time, to check if they need to refresh cache with a full GET [17:34:13] and/or other caches [17:34:15] to do a GET without a body :P [17:35:08] really HEAD shouldn't exist, it shouldn't been GET with a header like "Send-Content: false" or something and things would've been simpler. 
[17:35:17] 8should've been [17:35:20] from my observations, they mostly do GET with If-Modified-Since [17:35:47] ok let's put it differently [17:35:51] okay, let's make them behave the same way [17:35:55] do you have a very good specific reason why HEAD shouldn't match GET here? [17:37:52] chris said on the bug report [17:37:55] "So Max'es patch will probably work, although then the login experience on mobile isn't great" [17:38:00] I'm not sure I understand why [17:38:25] but worth replying to before we merge? [17:38:47] maybe he has something mind I can't think of [17:38:48] mark++ :) [17:38:57] (03PS2) 10MaxSem: Perform mobile redirect only for GET and HEAD requests [puppet] - 10https://gerrit.wikimedia.org/r/167453 (https://bugzilla.wikimedia.org/72186) [17:41:04] whether it will fully fix the particular bug or not, it still shouldn't redirect POST. it mostly works because people post mostly to index.php which doesn't redirect, but we'd better fix the corner cases when POST goes to pretty URLs [17:43:16] yeah true [17:43:47] I don't even think a 302 to a POST is valid HTTP [17:44:04] (03CR) 10Mark Bergsma: [C: 031] Perform mobile redirect only for GET and HEAD requests [puppet] - 10https://gerrit.wikimedia.org/r/167453 (https://bugzilla.wikimedia.org/72186) (owner: 10MaxSem) [17:47:20] (03PS4) 10Ori.livneh: Remove hardcoding from notifyNewProjects [puppet] - 10https://gerrit.wikimedia.org/r/168702 (owner: 10John F. Lewis) [17:47:29] (03CR) 10Ori.livneh: [C: 032 V: 032] Remove hardcoding from notifyNewProjects [puppet] - 10https://gerrit.wikimedia.org/r/168702 (owner: 10John F. 
Lewis) [17:50:18] (03PS3) 10Faidon Liambotis: Perform mobile redirect only for GET and HEAD requests [puppet] - 10https://gerrit.wikimedia.org/r/167453 (https://bugzilla.wikimedia.org/72186) (owner: 10MaxSem) [17:50:22] (03PS1) 10Ori.livneh: update my (=ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/169537 [17:50:25] (03CR) 10Faidon Liambotis: [C: 032 V: 032] Perform mobile redirect only for GET and HEAD requests [puppet] - 10https://gerrit.wikimedia.org/r/167453 (https://bugzilla.wikimedia.org/72186) (owner: 10MaxSem) [17:50:41] (03PS2) 10Faidon Liambotis: geoip: fetch all MaxMind products that we pay for [puppet] - 10https://gerrit.wikimedia.org/r/169320 [17:51:06] (03CR) 10Faidon Liambotis: [C: 032 V: 032] geoip: fetch all MaxMind products that we pay for [puppet] - 10https://gerrit.wikimedia.org/r/169320 (owner: 10Faidon Liambotis) [17:51:14] (03CR) 10Ori.livneh: geoip: fetch all MaxMind products that we pay for (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/169320 (owner: 10Faidon Liambotis) [17:51:21] tsk tsk [17:51:28] too late :) [17:51:52] Is jenkins broken? [17:52:05] more than usual, you mean? [17:52:14] Just wondering why V+2'ing stuff [17:52:41] because they were rebases [17:52:57] and also because jenkins does nothing with varnish configs really [17:53:08] ah [17:53:16] and because it takes quite a while to run and I'm impatient? [17:53:36] I think if I sum up all the time I've spent waiting for jenkins to push "submit" it would account for hours [17:53:54] waiting would be ok if the value was clear [17:54:01] or if there was any point :) [17:54:06] when there is, I wait [17:54:51] (03CR) 10Ori.livneh: [C: 032] update my (=ori) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/169537 (owner: 10Ori.livneh) [17:55:03] thanks guys! [17:55:21] ori: puppetd() { sudo puppt-agent "${@}"; } [17:55:22] <_joe_> good morning paravoid [17:55:24] <_joe_> :) [17:55:25] you should probably kill that :) [17:55:36] why? 
[17:55:48] because it has a typo, so I doubt you use it [17:55:55] puppt [17:55:57] heheh. touche [17:56:00] <_joe_> eheh [17:56:30] repackage() { sudo dpkg-buildpackage -b -uc; } [17:56:40] dpkg-buildpackage doesn't need root and is in fact a bad idea to use root to build packages [17:57:00] it prompted me to sudo [17:57:02] anyone seen something like man2json ? [17:57:12] because i didn't have a fakeroot or whatever [17:57:14] yeah that [17:57:24] man2html feels too 2000 [17:57:33] you should point out the cool things in my dotfiles, damn it :P [17:57:48] HOSTCOLOR="$(tput setaf $(($(cksum<<<$HOSTNAME|cut -d' ' -f-1)%6+1)))" [17:57:48] hah [17:57:50] export PS1='\[$BRIGHT\]\[$BLACK\][\[$HOSTCOLOR\]${HOSTNAME}\[$GREY\]:\[$RESET\]\[$GREY\]\w\[$BRIGHT\ [17:57:50] ]\[$BLACK\]]\[$RESET\] $ ' [17:57:55] in fact I opened it so I can steal cool stuff [17:58:22] different prompt color for hosts, after a while you associate the colors with the hosts and it makes finding the right term session in multiple tabs easier [17:58:38] lol [17:58:48] <_joe_> :P [17:59:19] export PATH="${PATH}:${HOME}/.bin" [17:59:24] that's not a very great idea [17:59:37] in fact in DSA, where we use a sudo password [17:59:47] I always /usr/bin/sudo [17:59:48] DSA? [17:59:53] debian sysadmin team [18:00:05] Reedy, greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141028T1800). [18:00:20] someone compromised your account in a box, but not your key [18:01:07] it's appended to $PATH, not prepended [18:01:18] they'd have to remove /usr/bin/sudo too [18:01:22] hm true [18:01:34] yeah I guess that's fine [18:01:38] and recursively managed by puppet (of course) [18:01:47] oh it is? [18:01:53] ~/.binned too? 
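For readers following the HOSTCOLOR one-liner quoted above: the idea is to hash the hostname into one of six terminal colors so each host gets a stable prompt color. A rough Python sketch of the same idea follows. It is not a port of ori's dotfiles: `zlib.crc32` is used as a stand-in for `cksum` (so the colors it picks will generally differ from the shell version), and the function names are made up for illustration.

```python
import zlib


def host_color_index(hostname):
    """Map a hostname to a color index in the range 1..6.

    Assumption: zlib.crc32 stands in for the POSIX cksum used in the
    shell one-liner, so the index will not match the original prompt.
    """
    checksum = zlib.crc32(hostname.encode("utf-8"))
    return checksum % 6 + 1


def colorize(hostname):
    """Wrap the hostname in ANSI foreground color codes 31..36."""
    index = host_color_index(hostname)
    return f"\033[3{index}m{hostname}\033[0m"


if __name__ == "__main__":
    for name in ("tin", "palladium", "logstash1003"):
        print(colorize(name), host_color_index(name))
```

The `% 6 + 1` step mirrors the shell arithmetic: it skips color 0 (black) and color 7 (white), leaving the six that stay readable on most backgrounds.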
[18:02:16] well, .bin is, and it gets rsync --delete into .binned on login [18:03:04] (03PS1) 10Reedy: Non wikipedias to 1.25wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169541 [18:03:52] (because chasemp's admin module chmods the files to 0644, that's why .binned exists) [18:03:59] ah [18:04:38] (03CR) 10Reedy: [C: 032] Non wikipedias to 1.25wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169541 (owner: 10Reedy) [18:04:49] (03Merged) 10jenkins-bot: Non wikipedias to 1.25wmf5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169541 (owner: 10Reedy) [18:05:24] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.25wmf5 [18:05:31] Logged the message, Master [18:05:40] paravoid: the pybal stuff is kinda neat too: https://github.com/wikimedia/operations-puppet/blob/production/modules/admin/files/home/ori/.hosts/palladium [18:06:07] pybal query mw1189, pybal depool mw1189, pybal repool mw1189, etc. [18:06:08] oh neat [18:07:50] <_joe_> ori: that I already stole [18:07:52] <_joe_> :) [18:08:18] (03PS5) 10Reedy: Set $wgMFAnonymousEditing = true for Italian Wikipedia in November-December 2014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) (owner: 10Nemo bis) [18:08:36] (03CR) 10Reedy: [C: 032] Set $wgMFAnonymousEditing = true for Italian Wikipedia in November-December 2014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) (owner: 10Nemo bis) [18:08:43] (03Merged) 10jenkins-bot: Set $wgMFAnonymousEditing = true for Italian Wikipedia in November-December 2014 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168915 (https://bugzilla.wikimedia.org/72541) (owner: 10Nemo bis) [18:09:10] reedy@tin:/srv/mediawiki-staging$ touch wmf-config/InitialiseSettings.php [18:09:10] touch: cannot touch `wmf-config/InitialiseSettings.php': Permission denied [18:09:12] * Reedy blinks [18:09:53] Can someone please on 
tin: chmod g+w /srv/mediawiki-staging/wmf-config/InitialiseSettings.php [18:10:10] Reedy: Who owns it? root:root? :P [18:10:15] -rw-r--r-- 1 springle wikidev 474247 Oct 28 03:57 InitialiseSettings.php [18:10:15] Reedy: done [18:10:31] ori: Thanks [18:11:08] !log reedy Synchronized wmf-config/: Set = true for Italian Wikipedia in November-December 2014 (duration: 00m 14s) [18:11:13] Logged the message, Master [18:11:27] what = true? :P [18:11:43] * Reedy escapes MaxSem [18:11:47] $bash_variable [18:11:58] "$bash_variable" [18:12:15] springle: umask 0002 on tin :P [18:12:27] \\\\\\"\$bash_variable\\\\\\\" [18:14:20] also, why are you deploying changes not supported by PM Reedy? o_0 [18:14:56] I deploy a lot of stuff not supported by PMs [18:16:56] (03PS4) 10Reedy: Switch from SpecialCite to CiteThisPage on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158121 (https://bugzilla.wikimedia.org/71112) (owner: 10Jforrester) [18:17:11] Reedy: We doing it? :-) [18:17:27] James_F: Greg listed it on the changeset as today, it's on the calendar too [18:17:37] Reedy: Yay. Let's hope we don't break everything. [18:18:00] (03CR) 10Reedy: [C: 032] Switch from SpecialCite to CiteThisPage on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158121 (https://bugzilla.wikimedia.org/71112) (owner: 10Jforrester) [18:18:04] See what scap has to say [18:18:08] (03Merged) 10jenkins-bot: Switch from SpecialCite to CiteThisPage on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158121 (https://bugzilla.wikimedia.org/71112) (owner: 10Jforrester) [18:18:09] * James_F nods. [18:18:27] We'll need to move a bunch of MW: namespace pages. [18:18:39] !log reedy Started scap: Split Cite extension, scap to build l10n cache for CiteThisPage [18:18:46] Logged the message, Master [18:20:30] James_F: Do we have any idea of scale of message customisations that need relocating?
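The permission problem above (InitialiseSettings.php left as -rw-r--r--, fixed with chmod g+w, followed by the "umask 0002" suggestion) comes down to how the umask masks the mode of newly created files. A small Python sketch of that effect, assuming a POSIX system without default ACLs on the temp directory; the helper name and the file name are hypothetical:

```python
import os
import stat
import tempfile


def mode_of_new_file(umask):
    """Create a file under the given umask and return its permission bits."""
    previous = os.umask(umask)  # os.umask returns the old mask
    try:
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "example.php")  # hypothetical file name
            # Request rw for everyone; the umask removes bits from this.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o666)
            os.close(fd)
            return stat.S_IMODE(os.stat(path).st_mode)
    finally:
        os.umask(previous)  # always restore the caller's umask


if __name__ == "__main__":
    for mask in (0o022, 0o002):
        print(f"umask {mask:04o} -> mode {mode_of_new_file(mask):04o}")
```

On a typical POSIX system a umask of 0022 should leave the file without group write, which matches the listing in the log, while 0002 keeps group write so other members of the group can touch the file.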
[18:20:44] (03PS1) 10Ori.livneh: hhvm::packages: add graphviz and gv, for pprof --pdf/--svg &c. [puppet] - 10https://gerrit.wikimedia.org/r/169545 [18:20:46] I'm guessing it's not gonna be many [18:21:12] Reedy: It won't be huge, no. [18:21:17] * James_F had a list somewhere. [18:21:40] We should be able to moveBatch it [18:21:55] That'd be great. [18:22:12] I guess something like that is a good use of a staff account [18:22:24] Ha. [18:22:35] Do you want me to do it (can I?) or will you? [18:22:35] $user = $this->getOption( 'u', 'Move page script' ); [18:22:51] I can assign the blame to you if you want ;) [18:22:59] That probably makes sense. [18:23:03] Rather than you having to deal. [18:23:43] Looks like it's [18:23:45] from|to [18:24:02] * James_F nods. [18:24:59] l10n cache seems to be building fine [18:25:07] Good start. [18:28:38] back (sorry late) [18:28:44] PROBLEM - Apache HTTP on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.022 second response time [18:29:03] PROBLEM - HHVM rendering on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.011 second response time [18:29:07] !log restarted elasticsearch node on logstash1003 [18:29:13] Logged the message, Master [18:29:43] Reedy: https://gerrit.wikimedia.org/r/#/c/169344/ [18:29:58] aude: just scapping for CiteThisPage :) [18:30:10] ok [18:30:33] PROBLEM - HHVM processes on mw1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [18:31:18] hhvm processes paging? 
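As noted above, the batch move takes one "from|to" pair per line. A minimal Python sketch of producing such a list follows; the helper name is made up, and the single example pair is the Cite message move that shows up later in this log. Anything beyond the "from|to" layout is an assumption, not documented moveBatch behaviour.

```python
def format_move_batch(pairs):
    """Render (old_title, new_title) pairs as 'from|to' lines."""
    lines = []
    for old_title, new_title in pairs:
        # A pipe inside a title would make the line ambiguous, so refuse it.
        if "|" in old_title or "|" in new_title:
            raise ValueError(f"title contains '|': {old_title!r} -> {new_title!r}")
        lines.append(f"{old_title}|{new_title}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    moves = [
        ("MediaWiki:Cite text", "MediaWiki:Citethispage-content"),
    ]
    print(format_move_batch(moves), end="")
```

Keeping the list in a plain text file makes it easy to review the moves before anyone runs the script against a live wiki.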
[18:31:21] i do see js errors on wikidata [18:31:36] could be a gadget is broken :( [18:31:59] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 31 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 31, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 82, uinitializing_shards: 0, unumber_of_data_nodes: 3} [18:31:59] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 31 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 31, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 82, uinitializing_shards: 0, unumber_of_data_nodes: 3} [18:32:09] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 31 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 31, utimed_out: False, uactive_primary_shards: 41, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 82, uinitializing_shards: 0, unumber_of_data_nodes: 3} [18:32:11] uh oh [18:32:26] Krenair: ES is fine [18:32:36] logstash1003 ES was just restarted [18:32:48] ok [18:33:17] 1114 [18:33:19] grrrr [18:34:31] Reedy: https://integration.wikimedia.org/ci/view/Beta/job/beta-scap-eqiad/27486/console seems to be doing OK too. [18:34:39] Reedy: Which is a good sign. [18:34:49] Scap is [18:34:49] sync-common: 24% (ok: 55; fail: 0; left: 174) [18:35:01] * James_F nods. [18:36:18] (03CR) 10Krinkle: "Follows-up Id779cebe6750fa0856d6add6fb56c90f4ef3e514 in operations-puppet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168701 (owner: 10John F. 
Lewis) [18:36:25] (03CR) 10Dzahn: [C: 032] backups: minor lint [puppet] - 10https://gerrit.wikimedia.org/r/169333 (owner: 10Matanya) [18:36:42] thanks mutante [18:36:53] np,ty [18:39:30] hoo: so, wikidata has wmf5 now [18:39:39] cache epoch not bumped [18:39:45] why not? [18:39:48] and looks ok to me, so far [18:40:02] i'm a bit confused [18:40:08] Wasn't there that patch of Henning that needed it? [18:40:21] It'll be done when scap is done [18:40:23] I code reviewed it and it broke for me w/o purging [18:40:30] i thought so and i needed to purge stuff on test.wikidata [18:40:31] yeah [18:40:37] I hadn't started going through the backlog of mediawiki-config, due to stuff scheduled [18:40:42] suppose we can wait until after scap [18:40:55] just weird that stuff looks mostly fine [18:41:10] except js errors occasionally from authority control gadget [18:41:10] Stuff is working, but we dunno why :D [18:41:33] RESOLVED WORKSFORME [18:41:34] odd [18:41:45] :-) [18:41:45] js error does not appear with debug=true [18:42:02] Reedy: Everything now looks good in Beta Labs (except for moving MW messages). [18:42:03] bumping cache epoch is a bad thing to do if it can be avoided [18:42:09] sounds like race condition or the usual cache mess [18:42:24] "TypeError: toolbar is undefined" TypeError: toolbar is undefined [18:42:24] yeah [18:42:28] is that from the gadget? [18:42:38] i see something about editGroup [18:42:45] sync-common: 77% (ok: 177; fail: 0; left: 52) [18:42:52] Cannot read property 'editGroup' of undefined [18:42:54] Reedy: Yay for speed.
[18:43:03] aude: I see the bug [18:43:05] which is a property of toolbar [18:43:08] same thing [18:43:11] https://www.wikidata.org/wiki/Q40904#sitelinks-wikisource [18:43:12] aude: ^ [18:43:27] there are way too many edit links :P [18:43:33] omg [18:43:37] * Reedy adds wikidatawiki to readonly.dblist [18:43:38] that's what i saw on test.wikidata [18:43:50] cache epoch will fix it [18:43:50] (03PS1) 10Manybubbles: Add three more master node to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/169550 [18:43:52] and exactly what I had locally when testing [18:43:53] yep [18:44:40] :-) [18:44:40] could be with special:random that i am hitting stuff never viewed since we bumped cache last time [18:44:48] (03CR) 10Manybubbles: "We should be careful when we merge this as it unlocks split brain territory. I vote we merge it right before installing the selected node" [puppet] - 10https://gerrit.wikimedia.org/r/169550 (owner: 10Manybubbles) [18:44:50] since we have soooo many items [18:44:58] :) [18:45:12] aude: Totally possible... especially as we cache per user lang. [18:45:12] actually, c'mon aude [18:45:15] https://www.wikidata.org/wiki/Q60 looks odd [18:45:17] you know Special:Random sucks [18:45:18] :P [18:45:30] heh [18:47:31] scap-rebuild-cdbs: 0% (ok: 0; fail: 0; left: 229) [18:47:33] nearly there [18:47:50] this is actually quite a slow scap [18:47:57] considering it didn't have any code to push out either [18:48:37] well, i can save references on https://www.wikidata.org/wiki/Q72 [18:48:38] some servers tend to be under an insane load [18:48:44] which was reported to be not possible before [18:48:58] aude: OOM? :) [18:49:01] I mean former OOM [18:49:05] probably [18:49:10] * aude tries Q183 now :) [18:49:10] \o/ [18:49:12] scap-rebuild-cdbs: 38% (ok: 88; fail: 0; left: 141) [18:49:21] aude: I dare...
:D [18:49:26] + you [18:50:07] Allowed memory size of 314572800 bytes exhausted (tried to allocate 622406 bytes) [18:50:15] Wikibase/repo/includes/View/ClaimsView.php line 212: [18:50:25] (03PS2) 10Reedy: Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169344 (owner: 10Aude) [18:50:36] what did you try exactly? View the ~1mb revision? [18:50:43] (03CR) 10Reedy: [C: 032] Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169344 (owner: 10Aude) [18:50:55] Does it actually have the correct timestamp [18:51:06] (03Merged) 10jenkins-bot: Bump cache epoch for Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169344 (owner: 10Aude) [18:51:30] scap-rebuild-cdbs: 91% (ok: 209; fail: 0; left: 20) [18:51:33] ok, should be ok [18:52:32] !log reedy Finished scap: Split Cite extension, scap to build l10n cache for CiteThisPage (duration: 33m 52s) [18:52:39] Logged the message, Master [18:52:40] James_F: ^^ [18:52:45] Yay. [18:53:09] Reedy: Now we just need to move those messages. [18:53:19] !log reedy Synchronized wmf-config/: Bump cache epoch for Wikidata (duration: 00m 14s) [18:53:23] manybubbles: Do you want https://gerrit.wikimedia.org/r/#/c/169508/ deploying nowish? [18:53:25] Logged the message, Master [18:53:34] (03PS2) 10Reedy: Stop using lsearchd pool 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169294 (owner: 10Chad) [18:53:38] https://gist.github.com/filbertkm/8448ac0837c24b292fe8 works though and didn't before [18:53:38] (03CR) 10Reedy: [C: 032] Stop using lsearchd pool 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169294 (owner: 10Chad) [18:53:45] Reedy: Everything looks good from testing. 
[18:53:46] (03Merged) 10jenkins-bot: Stop using lsearchd pool 5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169294 (owner: 10Chad) [18:53:48] Reedy: if you want to sync stuff out be my guest [18:53:57] (03PS2) 10Reedy: Lower Cirrus regex timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169508 (owner: 10Manybubbles) [18:54:02] (03CR) 10Reedy: [C: 032] Lower Cirrus regex timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169508 (owner: 10Manybubbles) [18:54:05] aude: Hopefully the Lua Q30 stuff is gone as well now [18:54:10] (03Merged) 10jenkins-bot: Lower Cirrus regex timeouts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169508 (owner: 10Manybubbles) [18:54:11] i hope so! [18:54:15] and also the mess on fluorine it caused [18:54:39] ^d: we're making lots of logs [18:54:43] I wonder why udp2log goes so crazy about 85k entry traces :P [18:55:00] (03PS2) 10Reedy: Remove quota from the rest of the wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169258 (owner: 10Jackmcbarn) [18:55:08] (03CR) 10Reedy: [C: 032] Remove quota from the rest of the wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169258 (owner: 10Jackmcbarn) [18:55:18] (03Merged) 10jenkins-bot: Remove quota from the rest of the wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169258 (owner: 10Jackmcbarn) [18:56:00] (03PS3) 10Reedy: Raise account creation throttle for JNCF 2014 workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168817 (https://bugzilla.wikimedia.org/72518) (owner: 10Glaisher) [18:56:06] (03CR) 10Reedy: [C: 032] Raise account creation throttle for JNCF 2014 workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168817 (https://bugzilla.wikimedia.org/72518) (owner: 10Glaisher) [18:56:14] ^d: I have a fix for it [18:56:18] (03Merged) 10jenkins-bot: Raise account creation throttle for JNCF 2014 workshop [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168817 (https://bugzilla.wikimedia.org/72518) 
(owner: 10Glaisher) [18:56:38] (03PS2) 10Reedy: Add debug log group for CentralAuthUserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168936 (owner: 10Legoktm) [18:56:47] (03CR) 10Reedy: [C: 032] Add debug log group for CentralAuthUserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168936 (owner: 10Legoktm) [18:56:57] (03Merged) 10jenkins-bot: Add debug log group for CentralAuthUserMerge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168936 (owner: 10Legoktm) [18:57:13] (03PS2) 10Reedy: Only add the "oauthadmin" group on the central wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168922 (owner: 10Hoo man) [18:57:17] (03CR) 10Reedy: [C: 032] Only add the "oauthadmin" group on the central wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168922 (owner: 10Hoo man) [18:57:25] (03Merged) 10jenkins-bot: Only add the "oauthadmin" group on the central wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168922 (owner: 10Hoo man) [18:57:29] (03Abandoned) 10Faidon Liambotis: Add IPv6 GeoIP support to Varnish [puppet] - 10https://gerrit.wikimedia.org/r/30836 (owner: 10Faidon Liambotis) [18:58:12] (03PS2) 10Reedy: Allow all custom Meta-Wiki namespaces in Special:Book [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168699 (https://bugzilla.wikimedia.org/72493) (owner: 10Nemo bis) [18:58:16] (03CR) 10Reedy: [C: 032] Allow all custom Meta-Wiki namespaces in Special:Book [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168699 (https://bugzilla.wikimedia.org/72493) (owner: 10Nemo bis) [18:58:23] (03Merged) 10jenkins-bot: Allow all custom Meta-Wiki namespaces in Special:Book [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168699 (https://bugzilla.wikimedia.org/72493) (owner: 10Nemo bis) [18:58:36] manybubbles: https://gerrit.wikimedia.org/r/#/c/168424/ too? 
[18:58:51] Reedy: lets not [18:59:02] (03Abandoned) 10Manybubbles: Split cirrus's pool counters [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168424 (owner: 10Manybubbles) [18:59:17] thanks [18:59:19] (03PS3) 10Reedy: Add "templateeditor" user group to $wgRestrictionLevels on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168498 (https://bugzilla.wikimedia.org/72146) (owner: 10Calak) [18:59:23] (03CR) 10Reedy: [C: 032] Add "templateeditor" user group to $wgRestrictionLevels on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168498 (https://bugzilla.wikimedia.org/72146) (owner: 10Calak) [18:59:32] (03Merged) 10jenkins-bot: Add "templateeditor" user group to $wgRestrictionLevels on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168498 (https://bugzilla.wikimedia.org/72146) (owner: 10Calak) [18:59:39] Reedy: if you are looking for something that needs to be deployed it looks like https://gerrit.wikimedia.org/r/#/c/169554 is needed [18:59:43] (03PS3) 10Reedy: Add 'mergehistory' to transwiki group at itwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168322 (https://bugzilla.wikimedia.org/72422) (owner: 10Glaisher) [19:00:00] (03CR) 10Reedy: [C: 032] Add 'mergehistory' to transwiki group at itwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168322 (https://bugzilla.wikimedia.org/72422) (owner: 10Glaisher) [19:00:07] (03Merged) 10jenkins-bot: Add 'mergehistory' to transwiki group at itwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168322 (https://bugzilla.wikimedia.org/72422) (owner: 10Glaisher) [19:00:35] (03PS2) 10Reedy: Turning on $wgCopyUploadsFromSpecialUpload for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168908 (https://bugzilla.wikimedia.org/71897) (owner: 10Kaldari) [19:00:39] (03CR) 10Reedy: [C: 032] Turning on $wgCopyUploadsFromSpecialUpload for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168908 
(https://bugzilla.wikimedia.org/71897) (owner: 10Kaldari) [19:00:48] (03Merged) 10jenkins-bot: Turning on $wgCopyUploadsFromSpecialUpload for Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168908 (https://bugzilla.wikimedia.org/71897) (owner: 10Kaldari) [19:01:04] (03PS2) 10Reedy: Create "Abuse filter editor" user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168725 (https://bugzilla.wikimedia.org/72502) (owner: 10Calak) [19:01:20] (03CR) 10Reedy: [C: 032] Create "Abuse filter editor" user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168725 (https://bugzilla.wikimedia.org/72502) (owner: 10Calak) [19:01:27] (03Merged) 10jenkins-bot: Create "Abuse filter editor" user group on fawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168725 (https://bugzilla.wikimedia.org/72502) (owner: 10Calak) [19:01:52] (03PS3) 10Reedy: Remove obsolete mobile configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169321 (owner: 10Florianschmidtwelzow) [19:01:56] (03CR) 10Reedy: [C: 032] Remove obsolete mobile configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169321 (owner: 10Florianschmidtwelzow) [19:02:05] (03Merged) 10jenkins-bot: Remove obsolete mobile configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169321 (owner: 10Florianschmidtwelzow) [19:03:19] (03PS2) 10Reedy: Set wgCategoryCollation to 'uca-fr' on frwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168748 (https://bugzilla.wikimedia.org/72513) (owner: 10Glaisher) [19:03:31] (03CR) 10Reedy: [C: 032] "Script will be run a bit later today" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168748 (https://bugzilla.wikimedia.org/72513) (owner: 10Glaisher) [19:03:38] (03Merged) 10jenkins-bot: Set wgCategoryCollation to 'uca-fr' on frwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/168748 (https://bugzilla.wikimedia.org/72513) (owner: 10Glaisher) [19:04:11] (03PS9) 10Reedy: minor 
changes to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/129464 (owner: 10Ricordisamoa) [19:04:30] (03CR) 10Reedy: [C: 032] minor changes to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/129464 (owner: 10Ricordisamoa) [19:04:37] (03Merged) 10jenkins-bot: minor changes to InitialiseSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/129464 (owner: 10Ricordisamoa) [19:05:45] wmf-config/InitialiseSettings.php | 179 ++++++++++-------- [19:05:46] heh [19:05:51] "minor" [19:05:52] !log reedy Synchronized wmf-config/: All of the config changes! (duration: 00m 14s) [19:06:01] Logged the message, Master [19:06:47] !log Running mwscript updateCollation.php --wiki=frwikibooks --previous-collation=uppercase [19:06:54] Logged the message, Master [19:07:37] !log frwikibooks collation updated [19:07:43] Logged the message, Master [19:07:53] Reedy: thanks! I'll make a submoduel update for it [19:09:56] Reedy: Can we run the moveBatch command now? [19:10:31] * James_F is impatient^Whoping to get shouted at as little as possible. :-) [19:10:34] I guess so [19:10:56] Thanks. [19:11:38] !log reedy Synchronized php-1.25wmf5/extensions/CirrusSearch/: (no message) (duration: 00m 15s) [19:11:40] * Reedy formats list [19:11:45] Logged the message, Master [19:12:15] James_F: What username? [19:13:04] Reedy: thanks! [19:13:36] Reedy: "Jdforrester (WMF)" [19:13:38] Reedy: Use "Maintenance script" otherwise you're going to get hit by abusefilters that ignore bots [19:14:06] aha [19:15:11] reedy@tin:/srv/mediawiki-staging$ mwscript moveBatch.php --wiki=enwiki --u="Maintenance script" -r="Extension:CiteThisPage deployed" citethispage.txt [19:15:11] ERROR: n parameter given twice [19:15:12] wut [19:16:52] bblack: can you please enlighten me about pdns stats webserver? where does it sit, why it exists etc? 
i would like to replace that with pushes to graphite if that is applicable [19:16:54] oh, doesn't like the = [19:17:09] James_F: Was there any on enwiki? [19:17:16] Reedy: 2 "n" in "Maintenance" ?:P [19:17:24] They all say [19:17:24] FAILED: This action cannot be performed on this page. This page may have been deleted since your request was submitted. [19:17:39] matanya: what pdns stats webserver are you talking about? [19:17:58] bblack: in manifests/dns.pp [19:18:05] i would think we replaced pdns with gndns [19:18:07] Reedy: Yes. [19:18:13] gdnsd [19:18:15] Hmm, just ran it again with you... [19:18:17] toward line 127 [19:18:24] MediaWiki:Cite text --> MediaWiki:Citethispage-content [19:18:24] FAILED: '''The page could not be moved:''' a page of that name already exists, or the name you have chosen is not valid. [19:18:32] manybubbles: ^d no idea who would request this but https://www.wikidata.org/w/api.php?action=cirrus-config-dump&format=xmlfm doesn't work :/ [19:18:40] from the logs.... [19:19:00] idk if worth effort to fix [19:19:11] Reedy: Does it say which is failing? [19:19:19] matanya: that seems to only be defined for nescio, the recursor in esams, and not the others. [19:19:27] right [19:19:29] perhaps it was an experiment that never got off the ground? [19:19:39] paravoid: will know ? [19:19:41] Probably should've suppressed redirects too [19:20:00] Reedy: Aha. https://en.wikipedia.org/w/index.php?title=MediaWiki:Citethispage-content&action=history [19:20:01] (and no, gdnsd doesn't replace pdns in this case. We still use pdns as a recursor, whereas gdnsd does authoritative service) [19:20:10] Reedy: You moved it already there? [19:20:14] (pdns used to do both for us, separately) [19:20:22] (cur | prev) 19:16, 28 October 2014‎ Maintenance script (Talk | block)‎ m . . (7,489 bytes) (0)‎ . . 
(Maintenance script moved page MediaWiki:Cite text to MediaWiki:Citethispage-content: Extension:CiteThisPage deployed) (rollback 1 edit | undo) [19:20:34] James_F: I think it's mostly SNR for this script [19:20:40] hard to see what it changed :) [19:20:55] Reedy: So much for me being blamed/credited. ;-) [19:21:04] http://p.defau.lt/?NPcTzYsqshfU2AHpf6q_CQ [19:21:13] James_F: [19:13:39] Reedy: Use "Maintenance script" otherwise you're going to get hit by abusefilters that ignore bots [19:21:17] i see. so the background is: we want to remove "class firewall", which is used by the old webserver class, which is used in that pdns webserver setup [19:21:22] right Matanya? [19:21:26] Reedy: Aha. OK. [19:21:37] I'll just run it foreachwiki with --noredirects [19:21:48] * James_F nods. [19:22:52] my gut says you can just nuke it (dns::recursor::statistics), but let me look around and see if it's doing something useful first [19:24:34] matanya: or we can replace "firewall" with ferm rules in modules/webserver, then suggest to delete misc/firewall.pp before having to remove all of webserver [19:26:52] PROBLEM - ElasticSearch health check on elastic1030 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2032: active_shards: 6090: relocating_shards: 75: initializing_shards: 1: unassigned_shards: 2 [19:26:53] (03PS2) 10Ori.livneh: hhvm::packages: add graphviz and gv, for pprof --pdf/--svg &c. [puppet] - 10https://gerrit.wikimedia.org/r/169545 [19:26:59] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm::packages: add graphviz and gv, for pprof --pdf/--svg &c. 
[puppet] - 10https://gerrit.wikimedia.org/r/169545 (owner: 10Ori.livneh) [19:27:10] (03PS1) 10Ori.livneh: hhvm: fix ganglia memory reporter [puppet] - 10https://gerrit.wikimedia.org/r/169560 [19:27:20] (03PS2) 10Ori.livneh: hhvm: fix ganglia memory reporter [puppet] - 10https://gerrit.wikimedia.org/r/169560 [19:27:32] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm: fix ganglia memory reporter [puppet] - 10https://gerrit.wikimedia.org/r/169560 (owner: 10Ori.livneh) [19:27:36] (03PS1) 10Dzahn: webserver - replace firewall rules with ferm [puppet] - 10https://gerrit.wikimedia.org/r/169561 [19:27:50] matanya: ^ easy enough ? [19:30:19] (03PS2) 10Ori.livneh: Apt::Conf['no-recommends'] -> Package <| provider == 'apt' |> [puppet] - 10https://gerrit.wikimedia.org/r/167020 [19:30:22] thief mutante :) [19:30:30] heh [19:30:34] (03CR) 10Ori.livneh: "ping" [puppet] - 10https://gerrit.wikimedia.org/r/167020 (owner: 10Ori.livneh) [19:31:10] (03CR) 10Dzahn: "this is the only place i see the 'class firewall' being used" [puppet] - 10https://gerrit.wikimedia.org/r/169561 (owner: 10Dzahn) [19:31:19] matanya / mutante: afaics that dns::recursor::statistics isn't really doing anything. there's a lighttpd instance on the box, but it doesn't even have a site configured [19:31:34] bblack: better just nuke it ? [19:31:56] oh wait I found it! [19:31:58] http://nescio.esams.wikimedia.org/pdns/ [19:32:06] you just have to know the URL path I guess [19:32:27] James_F: just done ko* [19:32:30] but yeah that totally could/should be replaced by pushing stats into some other normal stats system we use [19:32:40] Reedy: 99% failure rate? 
[19:32:41] and I don't think anyone's going to miss it if you break it in the meantime [19:32:46] looks to be [19:33:05] I've tee'd it, will have a look when it's all done [19:33:15] will do, thanks bblack [19:33:24] !log Elasticsearch not recovering indices at all on logstash1003 and no logging output [19:33:28] yea, thanks for checking, bblack [19:33:29] Logged the message, Master [19:33:41] Reedy: Thank you so much. :-) [19:34:08] (03CR) 10Matanya: [C: 031] "though we should not include ferm rules in modules, just for the sake of getting rid of firewall.pp" [puppet] - 10https://gerrit.wikimedia.org/r/169561 (owner: 10Dzahn) [19:35:11] RECOVERY - ElasticSearch health check on elastic1030 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2033: active_shards: 6094: relocating_shards: 75: initializing_shards: 0: unassigned_shards: 0 [19:37:52] (03CR) 10Dzahn: [C: 032] "identical compilation - http://puppet-compiler.wmflabs.org/464/change/169340/html/" [puppet] - 10https://gerrit.wikimedia.org/r/169340 (owner: 10Matanya) [19:42:40] !log deleting unused labs projects: commons-dev, echo, farsi-wikitest [19:42:45] Logged the message, Master [19:45:09] James_F: done [19:46:21] Reedy: Thank you, you're great. [19:46:58] It would seem there was maybe 300 or so renames [19:47:03] s/rename/moves/ [19:50:50] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [19:52:58] ori: how do i push stats to graphite? and puppet class/doc on this ? 
[19:53:59] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 1970.65709053 [19:55:26] (03PS1) 10Dzahn: remove catrope from ops admin group [puppet] - 10https://gerrit.wikimedia.org/r/169566 [19:55:59] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136 [19:56:20] (03CR) 10Dzahn: [C: 031] "anti-access requests are auto-approved? :p" [puppet] - 10https://gerrit.wikimedia.org/r/169566 (owner: 10Dzahn) [19:57:21] mutante: manager approval for anti-access requests seems like a nice idea :p [19:57:46] :( [19:57:50] matanya: write to statsd on tungsten [19:58:12] matanya: the statsd protocol is text- and udp-based and simple to implement: https://github.com/b/statsd_spec [19:58:25] thanks ori [19:58:42] (03CR) 10John F. Lewis: [C: 031] remove catrope from ops admin group [puppet] - 10https://gerrit.wikimedia.org/r/169566 (owner: 10Dzahn) [19:58:55] JohnLewis: done [19:59:13] :p [19:59:52] legoktm: JohnLewis: i am getting errors when editing any wikidata namespace page with my work account since yesterday. any chance it is because of the permission change? [19:59:59] didn't check other namespaces [20:00:17] uhhh, it shouldn't be... [20:00:19] what does the error say? [20:00:23] Lydia_WMDE: a little more information would be helpful :D [20:00:30] matanya: simple python code looks like this: [20:00:35] matanya: import socket; sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM); addr = ('statsd.eqiad.wmnet', 8125); sock.sendto('matanya.irc.ctcp_ping:800|m', addr) [20:00:47] it's a Lydia_WMDE.... 
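(Editor's note) ori's one-liner above could be expanded into a small Python 3 sketch along these lines. The host, port and metric name are the ones quoted in the log; the helper names `format_metric` and `send_metric` are hypothetical, and the `|m` (meter) suffix is kept as written in the original message. Python 3 requires the payload to be bytes, which is the only real change from the one-liner.

```python
import socket

# Host and port as quoted in the log; only resolvable inside that network.
STATSD_ADDR = ("statsd.eqiad.wmnet", 8125)


def format_metric(name, value, metric_type="m"):
    """Build a statsd datagram payload of the form '<name>:<value>|<type>'."""
    return "{}:{}|{}".format(name, value, metric_type).encode("ascii")


def send_metric(name, value, metric_type="m", addr=STATSD_ADDR):
    """Send one metric over UDP. statsd sends no reply, so nothing is read back."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(format_metric(name, value, metric_type), addr)
    finally:
        sock.close()
```

Calling `send_metric('matanya.irc.ctcp_ping', 800)` would reproduce the datagram from the log. Because the transport is UDP, a lost packet or a missing listener goes unnoticed by the sender, which is usually acceptable for stats.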
[20:00:50] Request: POST http://www.wikidata.org/w/index.php?title=Wikidata:Contact_the_development_team&action=submit, from 10.64.32.104 via cp1053 cp1053 ([10.64.32.105]:3128), Varnish XID 2149438097 [20:00:50] Forwarded for: 198.73.209.5, 208.80.154.75, 10.64.32.104 [20:00:50] Error: 503, Service Unavailable at Mon, 27 Oct 2014 23:22:58 GMT [20:00:57] the one from yesterday [20:01:01] hey aude [20:01:14] yikes [20:01:26] multichill reported somethign like that afaik [20:02:06] Lydia_WMDE: unless someone introduced a 'no503' permission which you weren't granted - not a permissions thing :) more technical. [20:02:34] and I'll let the 5 people currently talking with some sort of ability to deal with it deal with it [20:02:37] ori: i would like to push pdns stats to graphite, would this puppet sinippet the right path ? [20:02:53] which puppet snippet? [20:03:29] Lydia_WMDE: still getting the errors? [20:04:14] aude: i can try again but i just got it this morning again [20:04:17] let me check [20:04:45] i am trying with my staff account [20:04:52] it works [20:05:03] whut [20:05:03] ok [20:05:05] now it worked [20:05:07] -.- [20:05:07] i know there were issues yesterday with one of the data centers (SF) [20:05:08] ok [20:05:10] thanks and sorry [20:05:14] maybe that was it [20:05:22] ori: *python, sorry [20:05:25] Given your location, that sounds quite likely :D [20:05:30] yeah [20:05:32] ok [20:06:04] matanya: i don't know much about pdns, but if it's python-based, yeah, that would work well [20:06:06] there seems to be an issue with authority control gadget though [20:06:10] looking into it [20:06:24] (was no issue on test.wikidata) [20:08:21] thanks aude [20:19:42] (03CR) 10Jforrester: "So presumably you're going to add him to services-roots and parsoid-roots and all the other root groups he still needs to be in?" 
[puppet] - 10https://gerrit.wikimedia.org/r/169566 (owner: 10Dzahn) [20:20:33] James_F: good point out actually :p [20:23:49] PROBLEM - ElasticSearch health check on logstash1003 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.136 [20:24:14] !log disabling puppet on logstash1003 and trying to run elasticserach by hand to learn more about why its borked. [20:24:22] Logged the message, Master [20:25:34] (03PS1) 10Matanya: firewall: remove, unused [puppet] - 10https://gerrit.wikimedia.org/r/169571 [20:25:39] (03CR) 10jenkins-bot: [V: 04-1] firewall: remove, unused [puppet] - 10https://gerrit.wikimedia.org/r/169571 (owner: 10Matanya) [20:26:39] (03CR) 10Dzahn: "yes, that, need to add him to other groups he still needs" [puppet] - 10https://gerrit.wikimedia.org/r/169566 (owner: 10Dzahn) [20:29:15] (03PS2) 10Matanya: firewall: remove, unused [puppet] - 10https://gerrit.wikimedia.org/r/169571 [20:29:29] !log removed /etc/elasticsearch/*.dpkg-dist fromg logstash machines - that was breaking logging for some reason. magic. [20:29:34] Logged the message, Master [20:29:44] oooh [20:29:50] (03CR) 10Matanya: "only after: https://gerrit.wikimedia.org/r/#/c/169561/" [puppet] - 10https://gerrit.wikimedia.org/r/169571 (owner: 10Matanya) [20:30:01] srsly [20:30:53] beta had elasticsearch.yml.dpkg-dist but not one for logging [20:30:56] * bd808 shrugs [20:31:06] yeah - worth filing a bug about [20:31:36] thanks manybubbles :) [20:32:55] waht: if (file.getFileName().toString().startsWith("logging.")) { [20:32:55] loadConfig(file, settingsBuilder); [20:32:55] } [20:33:15] doh. [20:34:37] robh: any eta for deleting the blog puppet files? did the external host finished with it ? [20:35:08] bd808: now to the next problem [20:35:35] yeah. I was hoping the log would be full of the problem :/ [20:35:54] seems not so much (at least yet) [20:36:34] I wonder if it freaked out about being out of date and now can't do anything because of disk space... 
(random speculation) [20:36:54] its disk space [20:37:10] elasticsearch won't allocate new shards if the disk is already 85% full [20:37:13] by default [20:37:16] jgage: you filled up my disks :( [20:37:35] spam spam spam spam [20:37:45] just nuke some files. [20:37:48] crap! [20:37:52] it won't recheck automatically (open bug) [20:38:04] over the last month the index size has gone from 7G to 13G to 31G per day [20:38:05] bd808, i will minimize input :( [20:38:34] so you'll want to restart the node or do something less drastic like disable and re-enable allocation [20:38:37] The nodejs and java folks are a little too verbose for our cluster size [20:38:44] can we just purge "old" stuff? [20:39:03] any old hadoop logs in logstash can be nuked [20:39:08] We can drop days at a time [20:39:21] dropping records is a bit more awkward [20:39:21] we don't need historical, we just need to be able to debug [20:39:22] jgage: what do you class as old? :) [20:39:27] well - here is a funny thing [20:39:41] reedy, how about anything >7 days [20:39:42] the /var/lib/elasticsearch directory is taking up a ton of space already..... [20:39:52] so why doesn't it clear up what it can't use? [20:40:03] let me try something [20:40:55] even the healthy nodes look like: /dev/sda1 433G 337G 74G 83% [20:41:07] I guess 83 < 85 [20:41:13] So they're gonna become an issue soon? [20:41:20] unstuch [20:41:21] (03PS2) 10Dzahn: remove catrope from ops, add to "-oid"-roots [puppet] - 10https://gerrit.wikimedia.org/r/169566 [20:41:22] yeah. like tomorrow :( [20:41:36] "roids" :p [20:41:46] what is the plan for scaling logstash? do we need to make hardware requests? it's only going to get more inputs over time.. [20:42:01] for example i'd like to see all of syslog in there [20:42:07] Is all the disk space allocated? [20:42:13] so it looks like it wasn't going going to allocate the shards because it didn't have enough free space. 
but it didn't have enough free space because it couldn't remove the old shards. which it couldn't do because..... I dunno. maybe because the shards weren't allocated. [20:42:22] jgage: The plan is ... no plan yet. :) This is a year old experiment that needs to be productionized. [20:42:23] Disk /dev/sda: 499.6 GB, 499558383616 bytes [20:42:24] Yup [20:42:31] bd808, ok [20:42:46] jgage: and writing an email about that to ops-l is on my list of stuff to do this week actually [20:43:02] cool :) [20:43:07] (03CR) 10John F. Lewis: [C: 031] "new groups look good and valid." [puppet] - 10https://gerrit.wikimedia.org/r/169566 (owner: 10Dzahn) [20:43:12] in the mean time i'll see what i can do to limit our logging [20:43:35] bd808: I deleted the node data on elastic1003 and I'm letting it reallocate. that should get you unstuck for now [20:43:43] But yeah, the 3 random misc servers that I acquired last December are reaching their limit [20:44:06] manybubbles: Cool. That's what I would have tried next too, so great minds think alike I guess. [20:44:21] thanks much for your help [20:44:40] (03PS3) 10Dzahn: remove catrope from ops, add to "-oid"-roots [puppet] - 10https://gerrit.wikimedia.org/r/169566 [20:44:52] bd808: I guess they only really need more disk space [20:44:53] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Logstash%2520cluster%2520eqiad&tab=m&vn=&hide-hf=false [20:45:05] cpu/network is fine [20:45:13] rm -rf fixes all things [20:45:17] obviously it's java, so it'd eat any more memory we could throw at it [20:45:21] (03CR) 10Dzahn: "RoanKattouw: so, parsoid,citoid,mathoid,gerrit and deployers" [puppet] - 10https://gerrit.wikimedia.org/r/169566 (owner: 10Dzahn) [20:46:03] (03CR) 10Dzahn: "John F. Lewis: thanks! 
i just added gerrit-roots" [puppet] - 10https://gerrit.wikimedia.org/r/169566 (owner: 10Dzahn) [20:46:07] It will be interesting to see what happens when I start using the redis on those nodes for MW log event shipping [20:46:11] https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Logstash%20cluster%20eqiad&h=logstash1003.eqiad.wmnet&r=week&z=default&jr=&js=&st=1414529135&v=451.940&m=disk_free&vl=GB&ti=Disk%20Space%20Available&z=large [20:46:33] I wonder if we should get icinga adjusted to warn at the magic 85% for elasticsearch nodes? [20:46:54] If logstash can keep up it shouldn't use much more ram, but if it falls behind then everything will fight for ram [20:47:17] guess that's a TIAS [20:47:30] Reedy: How do I find out if those physical boxes have space for more disks? [20:48:18] * Reedy tries to login to racktables [20:49:41] (03CR) 10Rush: [C: 031] "This looks good to me. I don't see Catrope commenting himself either here or in the RT issue. Should probably get his explicit 'yes' some" [puppet] - 10https://gerrit.wikimedia.org/r/169566 (owner: 10Dzahn) [20:49:41] They're all Dell PowerEdge R320 [20:50:38] bd808: They just seem to be single disk (no raid) [20:50:51] Up to eight 2.5” hot-plug SAS, SATA or SSD [20:50:51] Up to four 3.5" hot-plug SAS, SATA or SSD [20:50:55] I presume they'll be 3.5" drives [20:51:11] probably. I was just reading the same spec sheet [20:51:32] hang on [20:51:33] "HARD DRIVE, 500GB, EXPANDABLE SYSTEM, 7.2, 3.5, W-SU, E/C" [20:51:38] they're 3.5" [20:51:46] but dell site says that they've got 2 drives in [20:51:56] I think they were all misc spares that we latched on to [20:52:38] (03CR) 10Andrew Bogott: [C: 031] remove catrope from ops, add to "-oid"-roots [puppet] - 10https://gerrit.wikimedia.org/r/169566 (owner: 10Dzahn) [20:52:42] 500gb, a pittance! 
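(Editor's note) The "warn at the magic 85%" idea could be sketched roughly as below. This is only an illustration, not an existing icinga check: the 85% figure and the /var/lib/elasticsearch path come from the conversation, the function names are hypothetical, and the percentage is computed df-style from used and available space, which may not match exactly how Elasticsearch measures disk usage.

```python
import shutil

# Default threshold quoted by manybubbles in the log.
LOW_WATERMARK_PCT = 85.0


def used_percent(used, available):
    """Percent of space in use, computed from used and available amounts."""
    return 100.0 * used / (used + available)


def over_watermark(used, available, threshold=LOW_WATERMARK_PCT):
    """True when usage has reached the threshold."""
    return used_percent(used, available) >= threshold


def check_path(path="/var/lib/elasticsearch"):
    """Check the filesystem holding `path` (path taken from the log)."""
    usage = shutil.disk_usage(path)
    return over_watermark(usage.used, usage.free)
```

With the figures bd808 pasted for a healthy node (337G used, 74G available), `used_percent(337, 74)` comes out at roughly 82%, in line with the 83% that df reported and still under the threshold.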
[20:52:55] yeah, they were all bought with 2 x 500GB [20:53:06] (03CR) 10Catrope: [C: 031] remove catrope from ops, add to "-oid"-roots [puppet] - 10https://gerrit.wikimedia.org/r/169566 (owner: 10Dzahn) [20:53:08] They are just over 2 years old ;) [20:53:16] fair enough :) [20:53:45] * bd808 will get to writing that email [20:53:47] Unless they're hardware raided or something... [20:53:49] I wonder if they are mirrored raid now [20:54:19] They've got H310 raid cards [20:54:23] !log Zuul deadlocked again. Restarting Gearman plugin on Jenkins [20:54:28] Logged the message, Master [20:54:58] https://rt.wikimedia.org/Ticket/Display.html?id=3278 [20:55:33] Guess somone from ops can probably confirm that fairly easily [20:55:45] But they should have at least 2 drive slots free each [20:57:16] (03CR) 10Dzahn: [C: 032] remove catrope from ops, add to "-oid"-roots [puppet] - 10https://gerrit.wikimedia.org/r/169566 (owner: 10Dzahn) [21:00:04] spagewmf, ebernhardson: Respected human, time to deploy Flow (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141028T2100). Please do the needful. [21:02:35] !log Zuul back in action. 
[21:02:39] Logged the message, Master [21:12:16] (03PS2) 10Dzahn: webserver - replace firewall rules with ferm [puppet] - 10https://gerrit.wikimedia.org/r/169561 [21:14:06] (03CR) 10Dzahn: [C: 031] "yea, _after_ the dependency has been merged this should be cool, seems really only used in the old webserver class, and killing this is gr" [puppet] - 10https://gerrit.wikimedia.org/r/169571 (owner: 10Matanya) [21:20:10] PROBLEM - check_puppetrun on pay-lvs1001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:20:12] PROBLEM - check_puppetrun on boron is CRITICAL: CRITICAL: puppet fail [21:20:13] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: Puppet has 1 failures [21:20:14] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:20:15] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 1 failures [21:25:04] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:25:14] RECOVERY - check_puppetrun on pay-lvs1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [21:25:16] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: Puppet has 1 failures [21:25:17] RECOVERY - check_puppetrun on boron is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [21:25:18] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: Puppet has 1 failures [21:25:18] PROBLEM - check_puppetrun on barium is CRITICAL: CRITICAL: Puppet has 1 failures [21:25:20] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:25:37] new that we get those for fundraising boxen? 
[21:25:48] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 1 failures [21:25:48] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: Puppet has 1 failures [21:28:51] PROBLEM - ElasticSearch health check on logstash1001 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.138 [21:30:05] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:30:28] PROBLEM - check_puppetrun on tellurium is CRITICAL: CRITICAL: Puppet has 1 failures [21:30:28] PROBLEM - check_puppetrun on samarium is CRITICAL: CRITICAL: Puppet has 1 failures [21:30:29] PROBLEM - check_puppetrun on payments1001 is CRITICAL: CRITICAL: Puppet has 1 failures [21:30:30] PROBLEM - check_puppetrun on pay-lvs1002 is CRITICAL: CRITICAL: Puppet has 1 failures [21:30:31] RECOVERY - check_puppetrun on barium is OK: OK: Puppet is currently enabled, last run 291 seconds ago with 0 failures [21:30:32] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 1 failures [21:30:32] PROBLEM - check_puppetrun on payments1003 is CRITICAL: CRITICAL: Puppet has 1 failures [21:34:57] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 137 seconds ago with 0 failures [21:35:07] RECOVERY - check_puppetrun on tellurium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [21:35:08] RECOVERY - check_puppetrun on pay-lvs1002 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [21:35:18] RECOVERY - check_puppetrun on samarium is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [21:35:43] RECOVERY - check_puppetrun on payments1001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [21:35:44] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [21:35:44] RECOVERY - check_puppetrun on payments1003 is OK: OK: Puppet is currently 
enabled, last run 24 seconds ago with 0 failures [21:41:23] (03PS1) 10Dzahn: phab: add monitoring class, monitor TaskMaster [puppet] - 10https://gerrit.wikimedia.org/r/169585 [21:41:55] (03PS2) 10Dzahn: phab: add monitoring class, monitor TaskMaster [puppet] - 10https://gerrit.wikimedia.org/r/169585 [21:43:03] (03PS3) 10Dzahn: phab: add monitoring class, monitor TaskMaster [puppet] - 10https://gerrit.wikimedia.org/r/169585 [21:43:06] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=72%): [21:44:52] (03CR) 10Dzahn: "@iridium:~# /usr/lib/nagios/plugins/check_procs -w 10:40 -c 1:50 --ereg-argument-array 'PhabricatorTaskmasterDaemon'" [puppet] - 10https://gerrit.wikimedia.org/r/169585 (owner: 10Dzahn) [21:50:15] (03CR) 10Rush: phab: add monitoring class, monitor TaskMaster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/169585 (owner: 10Dzahn) [21:55:02] (03CR) 10Dzahn: "maybe somebody will say this doesn't belong into the LVS class. unless we would add a check for _each_ configured backend that is behind m" [puppet] - 10https://gerrit.wikimedia.org/r/169303 (owner: 10Dzahn) [21:55:44] (03PS2) 10Dzahn: add a phabricator check to LVS monitoring [puppet] - 10https://gerrit.wikimedia.org/r/169303 [21:56:15] (03PS2) 10Dzahn: phabricator - add monitoring [puppet] - 10https://gerrit.wikimedia.org/r/169265 [21:56:38] (03CR) 10Dzahn: phab: add monitoring class, monitor TaskMaster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/169585 (owner: 10Dzahn) [21:57:25] mutante: add that to your todo! :D [21:57:41] forwards that message to YuviPanda [21:58:01] fair enough [21:58:28] by Yuvi is all about shinken now [21:58:30] *but [21:59:01] i can only keep saying we want a real icinga in labs [21:59:10] by real i mean.. uses same puppet that happens in prod [21:59:25] since .. 
like before petan started making one in labs [22:00:14] or nothing that adds monitoring can be tested [22:00:32] and more $realm checks or separate roles and all the workarounds [22:02:00] (03PS4) 10Dzahn: phab: add monitoring class, monitor TaskMaster [puppet] - 10https://gerrit.wikimedia.org/r/169585 [22:02:53] (03CR) 10Dzahn: "chasemp: including from role::phabricator::main now" [puppet] - 10https://gerrit.wikimedia.org/r/169585 (owner: 10Dzahn) [22:06:17] (03CR) 10John F. Lewis: [C: 031] "Looks good. Regarding the other services - the more the better." [puppet] - 10https://gerrit.wikimedia.org/r/169585 (owner: 10Dzahn) [22:07:43] (03CR) 10Dzahn: "maybe there is a config setting that influences the number of Taskmaster processes?" [puppet] - 10https://gerrit.wikimedia.org/r/169585 (owner: 10Dzahn) [22:08:51] (03CR) 10John F. Lewis: ""You can set the number of taskmasters that phd start starts by the config key phd.start-taskmasters. If you have a task backlog, try incr" [puppet] - 10https://gerrit.wikimedia.org/r/169585 (owner: 10Dzahn) [22:08:53] (03CR) 10Rush: "There is and it's set at 10, but I'm not sure if there are spawned procs that would inflate it. 10 should be the floor tho." [puppet] - 10https://gerrit.wikimedia.org/r/169585 (owner: 10Dzahn) [22:10:52] mutante: maybe want to up the lower crit to probably 5 instead of 1? anything less than half the set value should probably be 'something is wrong' bell :) [22:17:01] (03PS1) 10Dzahn: add check for https://phabricator.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/169604 [22:18:40] (03CR) 10Dzahn: "why do we have checkcommands.cfg AND check_commands directory with individual files now? 
with nagios_common things have changed" [puppet] - 10https://gerrit.wikimedia.org/r/169604 (owner: 10Dzahn) [22:21:20] (03PS2) 10Dzahn: add check for https://phabricator.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/169604 [22:22:28] (03PS1) 10Andrew Bogott: Parameterize adminscripts class [puppet] - 10https://gerrit.wikimedia.org/r/169607 [22:22:30] (03PS1) 10Andrew Bogott: Minor changes for labs testing [puppet] - 10https://gerrit.wikimedia.org/r/169608 [22:23:16] (03CR) 10Dzahn: "i think it's way simpler and easier to read if i just make a custom command here and use that, instead of trying to have "yet another gene" [puppet] - 10https://gerrit.wikimedia.org/r/169604 (owner: 10Dzahn) [22:27:23] (03PS5) 10Dzahn: phab: add monitoring class, monitor TaskMaster [puppet] - 10https://gerrit.wikimedia.org/r/169585 [22:27:38] JohnLewis: chasemp: 10:30 then [22:28:04] i always see 20 or 21, like: ps aux | grep Taskmaster | wc -l [22:28:32] (03CR) 10QChris: "It seems the recent alerts were not related to ACKs changing." [puppet] - 10https://gerrit.wikimedia.org/r/167553 (https://bugzilla.wikimedia.org/69667) (owner: 10QChris) [22:28:34] (03CR) 10John F. Lewis: [C: 031] "New critical levels seems reasonable to me." [puppet] - 10https://gerrit.wikimedia.org/r/169585 (owner: 10Dzahn) [22:29:02] re the other procs, not sure if it is really just "the more the better", having too many monitoring checks can also be bad.. see swift [22:29:29] like "if taskmaster is already down the rest doesnt matter".. then i guess it doesnt matter [22:29:42] but i'm asking [22:29:45] Hm. I guess. 
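(Editor's note) For readers unfamiliar with the `-w 10:40 -c 1:50` arguments discussed here: a Nagios-style `min:max` range raises the alert when the value falls outside the range. The sketch below shows that logic for process counts. It is a simplified illustration with hypothetical function names, it handles only the plain `min:max` form, and it is not the check_procs implementation.

```python
def parse_range(spec):
    """Parse a plain 'min:max' threshold into a (low, high) tuple of ints."""
    low, high = spec.split(":")
    return int(low), int(high)


def classify(count, warn="10:40", crit="1:50"):
    """Return the check state for a process count.

    Defaults are the thresholds from the Gerrit comment in the log.
    Critical is evaluated first, then warning.
    """
    crit_low, crit_high = parse_range(crit)
    warn_low, warn_high = parse_range(warn)
    if not crit_low <= count <= crit_high:
        return "CRITICAL"
    if not warn_low <= count <= warn_high:
        return "WARNING"
    return "OK"
```

With these defaults, the 20 or 21 processes mutante reports are OK, 5 processes would only warn, and 0 would be critical. Raising the lower critical bound as JohnLewis suggests, for example `classify(3, crit="5:50")`, makes a half-dead pool critical instead.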
[22:29:53] let's let this ride and see how it goes
[22:30:04] if taskmaster is missing it's all moot so good first step
[22:30:11] ok
[22:30:48] (CR) Rush: [C: 1] phab: add monitoring class, monitor TaskMaster [puppet] - https://gerrit.wikimedia.org/r/169585 (owner: Dzahn)
[22:30:55] (CR) Rush: [C: 1] add check for https://phabricator.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/169604 (owner: Dzahn)
[22:31:03] I think your dependencies are reversed fyi
[22:31:12] :)
[22:31:17] PROBLEM - HHVM processes on mw1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm
[22:31:32] (CR) John F. Lewis: [C: 1] add check for https://phabricator.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/169604 (owner: Dzahn)
[22:32:18] (PS1) Reedy: Only enable Extension:Oversight on enwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/169611 (https://bugzilla.wikimedia.org/60373)
[22:32:27] (PS1) Reedy: Disable Extension:Oversight [mediawiki-config] - https://gerrit.wikimedia.org/r/169612 (https://bugzilla.wikimedia.org/60373)
[22:32:35] (CR) Andrew Bogott: [C: 2] Parameterize adminscripts class [puppet] - https://gerrit.wikimedia.org/r/169607 (owner: Andrew Bogott)
[22:36:15] i think they are in the right order, first need to add monitoring.pp, i bet i still have to rebase though
[22:36:18] thanks, going on with those
[22:36:43] (CR) Dzahn: [C: 2] phab: add monitoring class, monitor TaskMaster [puppet] - https://gerrit.wikimedia.org/r/169585 (owner: Dzahn)
[22:37:54] (PS3) Dzahn: add check for https://phabricator.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/169604
[22:38:18] (CR) Dzahn: [C: 2] add check for https://phabricator.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/169604 (owner: Dzahn)
[22:39:22] (CR) Dzahn: [C: -2] "replaced by Change-Id: Ie790fd2e3b607e9272" [puppet] - https://gerrit.wikimedia.org/r/169265 (owner: Dzahn)
[22:39:30] (Abandoned) Dzahn: phabricator - add monitoring [puppet] - https://gerrit.wikimedia.org/r/169265 (owner: Dzahn)
[22:45:32] YuviPanda: ping
[22:45:43] how do you define a check command now?
[22:45:59] there is still the old file, then there is the new directory with separate files. i added one
[22:46:06] i saw it being created. yet. icinga breaks
[22:46:13] because command is not defined
[22:46:25] duplication of commands is kind of confusing
[22:58:41] * aude will have something for swat in a few minutes
[22:58:45] * legoktm has a swat as well
[22:59:39] AND nagios_common::check_command {} as well ?
[22:59:44] legoktm: you should get that looked at
[23:00:00] lol
[23:00:04] RoanKattouw, ^d, marktraceur, MaxSem: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141028T2300).
[23:00:29] The only entry is a patch by legoktm
[23:00:33] "patch incoming"
[23:01:05] <^d> You volunteering Roan?
[23:01:11] <^d> I can do it if you're busy tho.
[23:01:16] (PS2) Legoktm: Update ExtensionDistributor config [mediawiki-config] - https://gerrit.wikimedia.org/r/167408
[23:01:16] my patch is hot off the presses
[23:01:24] it's ^ that one
[23:01:25] fixes a bit of ugliness for rtl languages
[23:01:59] ^d: I'm still trying to get Parsoid in labs to work again, so yeah if you could do it that would be great
[23:02:01] (CR) Legoktm: "Ignore what I said earlier, this only needed I2f53b23631aeeff91023ae8b44e2a4753c1f0ba3 to be deployed, which it is." [mediawiki-config] - https://gerrit.wikimedia.org/r/167408 (owner: Legoktm)
[23:02:07] <^d> RoanKattouw: np I got it
[23:02:11] (CR) Chad: [C: 2] Update ExtensionDistributor config [mediawiki-config] - https://gerrit.wikimedia.org/r/167408 (owner: Legoktm)
[23:02:20] (Merged) jenkins-bot: Update ExtensionDistributor config [mediawiki-config] - https://gerrit.wikimedia.org/r/167408 (owner: Legoktm)
[23:02:53] !log demon Synchronized wmf-config/CommonSettings.php: extension distributor stuffs (duration: 00m 05s)
[23:02:54] <^d> legoktm: ^
[23:02:59] Logged the message, Master
[23:03:21] (PS1) Dzahn: add phab check command to old checkcommands.cfg [puppet] - https://gerrit.wikimedia.org/r/169619
[23:03:55] <^d> aude: Drop the link(s) for merging as soon as you're ready
[23:03:58] ok
[23:04:03] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 1 process with command name hhvm
[23:04:07] waiting on jenkins
[23:04:09] ^d: thanks, the 1.24 part looks good, and I'll have to wait for the cache to expire on the extension list...
[23:04:16] <^d> Yeah
[23:04:22] (CR) Dzahn: [C: 2] "Error: Service check command.. not defined anywhere" [puppet] - https://gerrit.wikimedia.org/r/169619 (owner: Dzahn)
[23:04:36] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 67098 bytes in 2.187 second response time
[23:05:05] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time
[23:05:17] RECOVERY - Disk space on mw1114 is OK: DISK OK
[23:05:23] <_joe_> !log removed stale heap profile files from /run/hhvm on mw1114
[23:05:29] Logged the message, Master
[23:05:47] ^d: actually I don't want to wait a full hour to see if something goes wrong, so I'll just delete it from memcache and test it
[23:06:07] <^d> goforit :)
[23:07:12] works :D
[23:07:43] hmm, it's sorted in a different order now
[23:07:55] lowercase extensions are now below the Zero* ones
[23:11:10] ^d: https://gerrit.wikimedia.org/r/#/c/169621/
[23:11:16] shall put on the wiki
[23:12:35] !log demon Synchronized php-1.25wmf5/extensions/Wikidata: (no message) (duration: 00m 10s)
[23:12:37] <^d> aude: ^
[23:12:42] Logged the message, Master
[23:12:54] thanks
[23:13:03] <^d> yw
[23:13:10] verifying
[23:14:33] chasemp: JohnLewis https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=iridium
[23:14:56] mutante: lovely :)
[23:15:05] looks good :)
[23:16:30] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 9, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 102, initializing_shards: 2, number_of_data_nodes: 3
[23:16:41] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 9, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 102, initializing_shards: 2, number_of_data_nodes: 3
[23:16:50] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 9, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 102, initializing_shards: 2, number_of_data_nodes: 3
[23:20:09] <^d> aude, legoktm: Thank you for participating in swat. Sadly, we've got no prizes today.
[23:20:18] <^d> Please feel free to come back tomorrow and try again :D
[23:20:30] !wheeloffortune
[23:20:55] :)
[23:20:57] <^d> I need a Bob Barker microphone when I do swat.
[23:27:50] (PS1) Catrope: Fix Parsoid in beta [puppet] - https://gerrit.wikimedia.org/r/169622
[23:41:39] PROBLEM - puppet last run on amslvs4 is CRITICAL: CRITICAL: puppet fail
[23:53:05] (PS1) Ori.livneh: hhvm: make HHVM's working directory be /var/tmp/hhvm [puppet] - https://gerrit.wikimedia.org/r/169627
[23:53:57] (CR) Ori.livneh: [C: 2] hhvm: make HHVM's working directory be /var/tmp/hhvm [puppet] - https://gerrit.wikimedia.org/r/169627 (owner: Ori.livneh)
[23:54:53] James_F|Away: legoktm https://en.wikipedia.org/wiki/User_talk:Reedy#Maintenance_script
[23:54:54] lol
[23:55:25] bahaha
[23:56:24] I thought there was an account for it on enwp
[23:58:29] Reedy: also, my page move rewrite patch will fix that!
[23:58:37] too late!
[23:59:11] https://gerrit.wikimedia.org/r/#/q/Ic5026384b92a0d68d628397ffe1de6e5b6183f02,n,z
[23:59:49] RECOVERY - puppet last run on amslvs4 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures