[00:05:17] (CR) Dzahn: "check exists on neon but "check_ssh: Port number must be a positive integer - gerrit-new.wikimedia.org "" [puppet] - https://gerrit.wikimedia.org/r/299606 (owner: Chad) [00:12:29] (PS1) Dzahn: gerrit: fix ssh port monitoring [puppet] - https://gerrit.wikimedia.org/r/299695 [00:13:09] (PS2) Dzahn: gerrit: fix ssh port monitoring [puppet] - https://gerrit.wikimedia.org/r/299695 [00:13:11] PROBLEM - gerrit process on lead is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [00:13:31] (CR) Dzahn: [C: 2] gerrit: fix ssh port monitoring [puppet] - https://gerrit.wikimedia.org/r/299695 (owner: Dzahn) [00:15:43] (CR) jenkins-bot: [V: -1] gerrit: fix ssh port monitoring [puppet] - https://gerrit.wikimedia.org/r/299695 (owner: Dzahn) [00:16:48] (PS3) Dzahn: gerrit: fix ssh port monitoring [puppet] - https://gerrit.wikimedia.org/r/299695 [00:18:38] (PS4) Dzahn: gerrit: fix ssh port monitoring [puppet] - https://gerrit.wikimedia.org/r/299695 [00:18:44] (CR) Dzahn: [C: 2] gerrit: fix ssh port monitoring [puppet] - https://gerrit.wikimedia.org/r/299695 (owner: Dzahn) [00:33:56] mutante: Does that mean that the check on ssl above only needs the command name and no arg? [00:34:06] (ie: it's unused) [00:36:35] It seems https://gerrit-new.wikimedia.org/r/ is down [00:36:39] ostriches ^^ [00:39:45] I know. [00:39:48] I'm testing something [00:41:02] Oh [00:41:04] I get [00:41:05] Gerrit is currently down for maintenance. Please try again later. [00:41:16] Which is nice [00:41:28] Better than what it was [00:43:06] Yeah, I've been testing a new maintenance mode :) [00:43:31] Oh [00:43:32] :) [00:58:31] I'm going to upgrade the Elasticsearch cluster that stashbot uses. Stashbot will probably end up dying while this happens.
[01:05:13] (03CR) 10Dzahn: "works now, fixed with follow-up" [puppet] - 10https://gerrit.wikimedia.org/r/299606 (owner: 10Chad) [01:05:52] ostriches: it needs only 1 arg, the port, but not 2, the host name is already there [01:05:59] internally [01:06:03] works now [01:06:24] Yay :) [01:06:51] Looks like it worked. stashbot restarted itself and everything [01:08:11] (03CR) 10Dzahn: [C: 031] Add css to turn repo links into blue again [puppet] - 10https://gerrit.wikimedia.org/r/299596 (owner: 10Paladox) [01:10:44] RECOVERY - gerrit process on lead is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [01:22:04] (03PS1) 10BryanDavis: logstash: update logstash_optimize_index.sh for ES 2.x [puppet] - 10https://gerrit.wikimedia.org/r/299699 [01:22:12] (03CR) 10Dzahn: "compiler claims this fails, but the error seems unrelated ?!" [puppet] - 10https://gerrit.wikimedia.org/r/299678 (owner: 10Chad) [01:26:38] (03CR) 10Dzahn: [C: 031] admin: add marktraceur to statistics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/299115 (https://phabricator.wikimedia.org/T140132) (owner: 10Filippo Giunchedi) [01:29:42] (03PS6) 10Dzahn: Introduce wmde-analytics-users group [puppet] - 10https://gerrit.wikimedia.org/r/298928 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [01:29:53] (03CR) 10Dzahn: [C: 032] Introduce wmde-analytics-users group [puppet] - 10https://gerrit.wikimedia.org/r/298928 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [01:31:04] (03CR) 10Dzahn: "merging, analytics-ops already suggested the group name" [puppet] - 10https://gerrit.wikimedia.org/r/298928 (https://phabricator.wikimedia.org/T140342) (owner: 10Addshore) [01:32:46] 06Operations, 10Ops-Access-Requests, 06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: Requesting sudo access to analytics-wmde user on stat1002 for Addshore - https://phabricator.wikimedia.org/T140342#2461330 (10Dzahn) The new 
but empty group has been added now. Can we have formal mana... [01:39:32] (03PS1) 10Chad: Gerrit: Introduce comment of $maint_mode for the web UI [puppet] - 10https://gerrit.wikimedia.org/r/299701 [01:40:45] (03CR) 10jenkins-bot: [V: 04-1] Gerrit: Introduce comment of $maint_mode for the web UI [puppet] - 10https://gerrit.wikimedia.org/r/299701 (owner: 10Chad) [01:44:15] (03PS2) 10Chad: Gerrit: Introduce comment of $maint_mode for the web UI [puppet] - 10https://gerrit.wikimedia.org/r/299701 [01:45:35] (03CR) 10jenkins-bot: [V: 04-1] Gerrit: Introduce comment of $maint_mode for the web UI [puppet] - 10https://gerrit.wikimedia.org/r/299701 (owner: 10Chad) [01:46:32] (03PS3) 10Chad: Gerrit: Introduce comment of $maint_mode for the web UI [puppet] - 10https://gerrit.wikimedia.org/r/299701 [01:49:14] (03PS1) 10Dzahn: admin: create shell account for mpany [puppet] - 10https://gerrit.wikimedia.org/r/299702 (https://phabricator.wikimedia.org/T140399) [01:54:24] PROBLEM - puppet last run on mw2237 is CRITICAL: CRITICAL: puppet fail [01:54:49] (03PS4) 10Chad: Gerrit: Introduce comment of $maint_mode for the web UI [puppet] - 10https://gerrit.wikimedia.org/r/299701 [01:56:35] 06Operations, 10Ops-Access-Requests, 06Editing-Analysis: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2473810 (10Dzahn) p:05Triage>03Normal [02:04:54] Hi. Could someone run the script to fix Meatballs123 (https://phabricator.wikimedia.org/T119736#2465114)? 
bd808 maybe [02:06:43] (03PS5) 10Chad: Gerrit: Introduce comment of $maint_mode for the web UI [puppet] - 10https://gerrit.wikimedia.org/r/299701 [02:07:47] (03CR) 10Chad: [C: 031] Add css to turn repo links into blue again [puppet] - 10https://gerrit.wikimedia.org/r/299596 (owner: 10Paladox) [02:07:55] (03CR) 10jenkins-bot: [V: 04-1] Gerrit: Introduce comment of $maint_mode for the web UI [puppet] - 10https://gerrit.wikimedia.org/r/299701 (owner: 10Chad) [02:08:28] (03PS4) 10Dzahn: Add css to turn repo links into blue again [puppet] - 10https://gerrit.wikimedia.org/r/299596 (owner: 10Paladox) [02:08:41] (03CR) 10Dzahn: [C: 032] Add css to turn repo links into blue again [puppet] - 10https://gerrit.wikimedia.org/r/299596 (owner: 10Paladox) [02:10:28] (03CR) 10Chad: [C: 031] Add some colors to the site table on changes [puppet] - 10https://gerrit.wikimedia.org/r/299447 (owner: 10Paladox) [02:14:18] (03PS6) 10Chad: Gerrit: Introduce comment of $maint_mode for the web UI [puppet] - 10https://gerrit.wikimedia.org/r/299701 [02:14:54] JJMC89: should be fixed -- https://en.wikipedia.org/w/index.php?title=Special%3ACentralAuth&target=Meatballs123 [02:15:19] Thanks bd808! 
[02:15:53] there is code in the branch that will roll out this week to fix these automatically [02:17:57] (03PS7) 10Chad: Gerrit: Introduce comment of $maint_mode for the web UI [puppet] - 10https://gerrit.wikimedia.org/r/299701 [02:20:13] (03PS9) 10Dzahn: Add some colors to the site table on changes [puppet] - 10https://gerrit.wikimedia.org/r/299447 (owner: 10Paladox) [02:20:53] (03CR) 10Chad: [C: 031] "https://puppet-compiler.wmflabs.org/3375/ results in no changes made on disk :)" [puppet] - 10https://gerrit.wikimedia.org/r/299701 (owner: 10Chad) [02:21:25] RECOVERY - puppet last run on mw2237 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [02:21:43] (03CR) 10Dzahn: [C: 032] Add some colors to the site table on changes [puppet] - 10https://gerrit.wikimedia.org/r/299447 (owner: 10Paladox) [02:21:58] (03PS8) 10Dzahn: Gerrit: Introduce comment of $maint_mode for the web UI [puppet] - 10https://gerrit.wikimedia.org/r/299701 (owner: 10Chad) [02:24:06] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.10) (duration: 08m 37s) [02:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:29:58] (03CR) 10Dzahn: [C: 032] Gerrit: Introduce comment of $maint_mode for the web UI [puppet] - 10https://gerrit.wikimedia.org/r/299701 (owner: 10Chad) [02:30:08] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jul 19 02:30:08 UTC 2016 (duration 6m 2s) [02:30:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:30:22] (03CR) 10Chad: [C: 031] [gerrit] Use HEAD instead of master for branch [puppet] - 10https://gerrit.wikimedia.org/r/299689 (owner: 10Paladox) [02:32:49] (03PS3) 10Dzahn: [gerrit] Use HEAD instead of master for branch [puppet] - 10https://gerrit.wikimedia.org/r/299689 (owner: 10Paladox) [02:33:31] (03CR) 10Dzahn: [C: 032] [gerrit] Use HEAD instead of master for branch [puppet] - 10https://gerrit.wikimedia.org/r/299689 (owner: 10Paladox) 
[02:38:25] PROBLEM - puppet last run on cp3049 is CRITICAL: CRITICAL: puppet fail [02:45:06] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80656 MB (15% inode=99%) [02:56:56] RECOVERY - Disk space on elastic1017 is OK: DISK OK [03:02:26] RECOVERY - puppet last run on cp3049 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [03:58:36] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [04:00:26] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [04:12:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:14:45] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [04:19:25] PROBLEM - puppet last run on mw2188 is CRITICAL: CRITICAL: Puppet has 1 failures [04:45:16] RECOVERY - puppet last run on mw2188 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [04:49:16] PROBLEM - puppet last run on cp3018 is CRITICAL: CRITICAL: puppet fail [05:15:23] RECOVERY - puppet last run on cp3018 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [05:35:18] 06Operations, 06Commons, 10media-storage, 13Patch-For-Review, 07User-notice: Some fonts not anti-aliasing in SVG thumbnails after upgrade of scaling servers - https://phabricator.wikimedia.org/T139543#2473952 (10Joe) >>! In T139543#2472383, @kaldari wrote: > Yay! Looks great to me! Is there a way we can... [05:43:20] Krenair: could you (or someone) kick gerrit-wm again? 
[05:44:17] <_joe_> legoktm: will do [05:44:24] thank you [05:45:45] <_joe_> done [05:45:52] <_joe_> a new one should arrive shortly [05:46:00] <_joe_> here he is [05:46:52] thanks :) [06:18:19] (03PS4) 10Giuseppe Lavagetto: puppetmaster: add test site to palladium [puppet] - 10https://gerrit.wikimedia.org/r/299145 (https://phabricator.wikimedia.org/T98173) [06:21:08] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: add test site to palladium [puppet] - 10https://gerrit.wikimedia.org/r/299145 (https://phabricator.wikimedia.org/T98173) (owner: 10Giuseppe Lavagetto) [06:29:57] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:09] <_joe_> a few errors could be expected [06:30:17] PROBLEM - puppet last run on es2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:48] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:18] PROBLEM - puppet last run on mw1276 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:28] PROBLEM - puppet last run on mw2228 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:58] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:57] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: puppet fail [06:32:58] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: puppet fail [06:32:59] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: puppet fail [06:33:08] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: puppet fail [06:46:38] (03PS1) 10Giuseppe Lavagetto: puppetmaster::web_test: add NameVirtualHost directive [puppet] - 10https://gerrit.wikimedia.org/r/299711 [06:51:37] 06Operations, 10Ops-Access-Requests, 06WMDE-Analytics-Engineering, 13Patch-For-Review, 15User-Addshore: Requesting sudo access to analytics-wmde user on stat1002 for Addshore - https://phabricator.wikimedia.org/T140342#2474006 (10elukey) @Nuria Can you approve? I believe that Andrew already worked on the... 
[06:54:16] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2474008 (10madhuvishy) Pasting public gpg key here - if this isn't the best idea, happy to make a new one. ``` -----BEGIN PGP PUBLIC KEY BLOCK----- Ver... [07:05:05] RECOVERY - puppet last run on mw2088 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [07:08:13] (03PS8) 10Mobrovac: Parsoid: Move to service::node [puppet] - 10https://gerrit.wikimedia.org/r/298436 (https://phabricator.wikimedia.org/T90668) [07:10:11] (03CR) 10Mobrovac: [C: 031] Change-Prop: Ignore certain errors on page_delete and null_edit. [puppet] - 10https://gerrit.wikimedia.org/r/295680 (owner: 10Ppchelko) [07:13:07] RECOVERY - puppet last run on mw2201 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [07:22:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [07:24:08] (03CR) 10Mobrovac: "PCC - https://puppet-compiler.wmflabs.org/3377/ . It still unclear why all the admins are being removed from wtp." [puppet] - 10https://gerrit.wikimedia.org/r/298436 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [07:25:10] (03PS2) 10Giuseppe Lavagetto: Change-Prop: Ignore certain errors on page_delete and null_edit. [puppet] - 10https://gerrit.wikimedia.org/r/295680 (owner: 10Ppchelko) [07:25:26] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [07:25:33] <_joe_> uhm [07:25:39] <_joe_> checking ^^ [07:25:50] 06Operations, 10Parsoid, 06Services, 10service-runner, and 2 others: Replace custom server.js with service-runner - https://phabricator.wikimedia.org/T90668#2474053 (10mobrovac) @Joe, @ssastry and I will move the first couple of `wtp` nodes to the new version today @ 14:30 UTC. 
[07:26:32] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Change-Prop: Ignore certain errors on page_delete and null_edit. [puppet] - 10https://gerrit.wikimedia.org/r/295680 (owner: 10Ppchelko) [07:27:06] PROBLEM - puppet last run on mw2106 is CRITICAL: CRITICAL: Puppet has 3 failures [07:31:16] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 238 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [07:31:35] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:32:45] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [07:37:16] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 238 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [07:53:26] RECOVERY - puppet last run on mw2106 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:56:18] <_joe_> I restarted apache on palladium [07:56:25] <_joe_> that might cause soime puppet failure [07:58:07] PROBLEM - puppet last run on lvs4002 is CRITICAL: CRITICAL: Puppet has 1 failures [08:00:17] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: puppet fail [08:00:17] PROBLEM - puppet last run on snapshot1007 is CRITICAL: CRITICAL: puppet fail [08:00:36] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: puppet fail [08:12:37] (03PS2) 10Giuseppe Lavagetto: puppetmaster::web_test: add NameVirtualHost directive [puppet] - 10https://gerrit.wikimedia.org/r/299711 [08:12:50] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] puppetmaster::web_test: add NameVirtualHost directive [puppet] - 10https://gerrit.wikimedia.org/r/299711 (owner: 10Giuseppe Lavagetto) [08:20:24] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [08:20:40] (03PS1) 10Elukey: Remove analytics-deploy user/group since they will be created by 
scap:target. [puppet] - 10https://gerrit.wikimedia.org/r/299713 (https://phabricator.wikimedia.org/T129151) [08:21:27] (03CR) 10Elukey: [C: 032] Remove analytics-deploy user/group since they will be created by scap:target. [puppet] - 10https://gerrit.wikimedia.org/r/299713 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [08:22:13] PROBLEM - puppet last run on mw2096 is CRITICAL: CRITICAL: Puppet has 1 failures [08:22:53] RECOVERY - puppet last run on lvs4002 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [08:23:23] PROBLEM - puppet last run on mw2176 is CRITICAL: CRITICAL: Puppet has 1 failures [08:25:12] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [08:25:22] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [08:26:13] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [08:26:22] RECOVERY - puppet last run on mw2069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:26:23] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [08:26:32] RECOVERY - puppet last run on mw2076 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:26:33] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [08:26:43] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [08:26:43] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:26:44] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [08:26:52] RECOVERY - puppet last run on cp2013 is OK: 
OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [08:26:53] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:27:03] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [08:27:03] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [08:27:12] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [08:27:12] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:27:14] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [08:27:22] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [08:27:22] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [08:27:32] RECOVERY - puppet last run on aqs1002 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [08:27:32] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [08:27:32] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [08:27:33] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:27:33] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [08:27:33] RECOVERY - puppet last run on elastic1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:27:42] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures 
[08:27:43] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [08:27:43] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:27:53] RECOVERY - puppet last run on snapshot1007 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [08:27:54] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:28:02] RECOVERY - puppet last run on cp4010 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [08:28:02] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:28:03] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [08:28:12] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:28:13] RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [08:28:22] RECOVERY - puppet last run on mw2228 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:28:23] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:28:24] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:28:24] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:28:32] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:28:33] RECOVERY - puppet last run on mw2095 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:28:34] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, 
last run 1 minute ago with 0 failures [08:28:52] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:28:52] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [08:28:53] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [08:29:03] RECOVERY - puppet last run on mw2136 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:29:23] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:29:24] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:29:33] RECOVERY - puppet last run on mw2077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:34:31] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:35:31] RECOVERY - puppet last run on mw2121 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:35:56] (03PS9) 10Giuseppe Lavagetto: Parsoid: Move to service::node [puppet] - 10https://gerrit.wikimedia.org/r/298436 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [08:37:10] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [08:39:57] (03PS4) 10Addshore: Add more to stats:wmde config [puppet] - 10https://gerrit.wikimedia.org/r/298931 [08:45:51] (03CR) 10Giuseppe Lavagetto: "I fixed the admins disappearence." 
[puppet] - 10https://gerrit.wikimedia.org/r/298436 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [08:47:41] RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [08:48:39] 06Operations, 06Commons, 10media-storage: Install mscorefonts on scaling servers for SVG rendering - https://phabricator.wikimedia.org/T140141#2474333 (10fgiunchedi) >>! In T140141#2470645, @fgiunchedi wrote: > @kaldari indeed you are right `fonts-liberation` is already installed! > > I'll check with legal... [08:49:11] RECOVERY - puppet last run on mw2096 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:52:37] (03CR) 10Filippo Giunchedi: [C: 031] admin: create shell account for mpany [puppet] - 10https://gerrit.wikimedia.org/r/299702 (https://phabricator.wikimedia.org/T140399) (owner: 10Dzahn) [09:10:29] !log upgrade slapd to 2.4.41+dfsg-1+wmf1 on serpens - T130593 [09:10:30] T130593: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593 [09:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:11:32] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "After a thorough examination:" [puppet] - 10https://gerrit.wikimedia.org/r/298436 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [09:14:21] 06Operations, 06Labs, 10Labs-Infrastructure: investigate slapd memory leak - https://phabricator.wikimedia.org/T130593#2474575 (10fgiunchedi) freshly restarted slapd ```openldap 16378 1.8 1.2 466556 52048 ? Ssl 09:10 0:03 /usr/sbin/slapd -h ldap:/// ldaps:/// ldapi:/// -g openldap -u openldap -f... [09:22:00] 06Operations, 10ops-eqiad, 06DC-Ops: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2474624 (10jcrespo) Finally scheduled from Thursday 21 July 14-15 UTC. 
[09:22:37] 06Operations, 10ops-eqiad, 10DBA, 06DC-Ops: dbstore1002.mgmt.eqiad.wmnet: "No more sessions are available for this type of connection!" - https://phabricator.wikimedia.org/T119488#2474625 (10jcrespo) [09:25:25] (03CR) 10Filippo Giunchedi: [C: 031] Add ability to dual-serve a portion of Swift rewrite.py traffic to Thumbor [puppet] - 10https://gerrit.wikimedia.org/r/298431 (https://phabricator.wikimedia.org/T140072) (owner: 10Gilles) [09:26:48] (03PS1) 10Giuseppe Lavagetto: service::node: add 'entrypoint' and 'heartbeat_to' [puppet] - 10https://gerrit.wikimedia.org/r/299715 (https://phabricator.wikimedia.org/T90668) [09:26:50] (03PS1) 10Giuseppe Lavagetto: parsoid: add role based on service::node, apply to two hosts [puppet] - 10https://gerrit.wikimedia.org/r/299716 (https://phabricator.wikimedia.org/T90668) [09:26:52] (03PS1) 10Giuseppe Lavagetto: parsoid::testing: move to use the parsoid class [puppet] - 10https://gerrit.wikimedia.org/r/299717 (https://phabricator.wikimedia.org/T90668) [09:26:54] (03PS1) 10Giuseppe Lavagetto: parsoid: move to role::parsoid for all production nodes [puppet] - 10https://gerrit.wikimedia.org/r/299718 (https://phabricator.wikimedia.org/T90668) [09:27:31] <_joe_> mobrovac: ^^ [09:28:47] (03CR) 10Bmansurov: [C: 04-1] Wikidata description config cleanup (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) (owner: 10Jdlrobson) [09:30:47] (03CR) 10Mobrovac: [C: 031] service::node: add 'entrypoint' and 'heartbeat_to' [puppet] - 10https://gerrit.wikimedia.org/r/299715 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [09:34:08] (03CR) 10Mobrovac: "Should we have this patch not apply the role to wtp100[12] and then have a special one that does only that?" 
[puppet] - 10https://gerrit.wikimedia.org/r/299716 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [09:35:01] (03CR) 10Mobrovac: [C: 031] parsoid::testing: move to use the parsoid class [puppet] - 10https://gerrit.wikimedia.org/r/299717 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [09:36:42] (03CR) 10Mobrovac: [C: 031] parsoid: move to role::parsoid for all production nodes [puppet] - 10https://gerrit.wikimedia.org/r/299718 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [09:43:50] (03PS1) 10Elukey: Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) [09:46:07] (03PS1) 10Giuseppe Lavagetto: parsoid: add transition cleanup role [puppet] - 10https://gerrit.wikimedia.org/r/299720 (https://phabricator.wikimedia.org/T90668) [09:46:22] <_joe_> mobrovac: ^^ too [09:47:42] (03PS2) 10Elukey: Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) [09:49:15] (03CR) 10Mobrovac: [C: 031] parsoid: add transition cleanup role [puppet] - 10https://gerrit.wikimedia.org/r/299720 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [09:51:27] (03PS5) 10Jcrespo: Update haproxy default file, as it cannot be dynamic in jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) [09:52:34] (03CR) 10jenkins-bot: [V: 04-1] Update haproxy default file, as it cannot be dynamic in jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) (owner: 10Jcrespo) [09:52:59] (03PS2) 10Giuseppe Lavagetto: parsoid: add role based on service::node, apply to two hosts [puppet] - 10https://gerrit.wikimedia.org/r/299716 (https://phabricator.wikimedia.org/T90668) [09:53:01] (03PS2) 10Giuseppe Lavagetto: parsoid::testing: move to use the parsoid class [puppet] - 
10https://gerrit.wikimedia.org/r/299717 (https://phabricator.wikimedia.org/T90668) [09:53:03] (03PS2) 10Giuseppe Lavagetto: parsoid: move to role::parsoid for all production nodes [puppet] - 10https://gerrit.wikimedia.org/r/299718 (https://phabricator.wikimedia.org/T90668) [09:53:05] (03PS2) 10Giuseppe Lavagetto: service::node: add 'entrypoint' and 'heartbeat_to' [puppet] - 10https://gerrit.wikimedia.org/r/299715 (https://phabricator.wikimedia.org/T90668) [09:55:33] (03CR) 10Giuseppe Lavagetto: [C: 032] parsoid: add transition cleanup role [puppet] - 10https://gerrit.wikimedia.org/r/299720 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [09:57:47] (03PS6) 10Jcrespo: Update haproxy default file, as it cannot be dynamic in jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) [10:01:17] !log scb disabling puppet [10:01:18] !log testing haproxy start sequence on dbproxy1005 (unused proxy) [10:01:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:01:26] <_joe_> mobrovac: I already did it... [10:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:01:36] heh [10:01:37] kk _joe_ [10:01:50] <_joe_> on all affected servers, just in case [10:02:05] (03CR) 10Giuseppe Lavagetto: [C: 032] service::node: add 'entrypoint' and 'heartbeat_to' [puppet] - 10https://gerrit.wikimedia.org/r/299715 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [10:02:12] <_joe_> using wmfpuppet from salt is pretty nice [10:02:53] via Reedy: Bug 1154339 - JS microbenchmark with JQuery proxy is much slower when inner callback function uses tabs instead of spaces for indentation [10:03:20] "I have 234 and 50 miliseconds. The diff between these scripts is 3 tabs and 12 spaces." 
[10:04:18] (PS7) Jcrespo: Update haproxy default file, as it cannot be dynamic in jessie [puppet] - https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) [10:04:20] apparently one of the heuristics employed by firefox to gauge function complexity is length in characters [10:04:36] ori: nodejs does that as well [10:04:54] if the func has less than 600 chars, it optimises it heavily [10:05:05] comments count tohugh [10:05:08] *though [10:05:31] in this case the optimization was not [10:05:41] <_joe_> sorry, so tabs are worse than spaces? [10:05:55] <_joe_> I would have thought the contrary [10:06:09] the "optimization" made things slower [10:07:49] <_joe_> oh I see [10:07:50] <_joe_> lol [10:08:13] how do you get built debs from copper to carbon? [10:08:25] scp? :D [10:08:29] * mobrovac trolling [10:08:30] scp via laptop? [10:08:41] can't scp between copper and carbon [10:08:47] if you can access both from your laptop, then you should be able to do it [10:08:52] <_joe_> YuviPanda: rsync [10:09:09] ah [10:09:32] scp -3 [10:09:41] <_joe_> look at my bash history on carbon [10:09:47] -3 Copies between two remote hosts are transferred through the local host. Without this option the data is copied directly between [10:09:47] the two remote hosts. Note that this option disables the progress meter. [10:10:08] <_joe_> ori: yeah no need for that [10:10:18] <_joe_> copper has an rsync server [10:10:20] ori nice! [10:10:25] no progress meter? that's ludicrous! [10:10:26] :P [10:10:28] python -m SimpleHTTPServer if you're naughty [10:10:44] joe yup I see it [10:10:45] thanks [10:10:53] * YuviPanda learnt about '-y' to strace a couple weeks ago [10:11:00] rsync server is the correct solution, but correctness is overrated [10:11:32] nice echo, joe [10:11:36] <_joe_> :D [10:11:58] _joe_: has the service::node patch been applied to scb?
[10:12:09] <_joe_> mobrovac: nope [10:12:18] I finished packaging pykube, now to put it on carbon [10:12:18] <_joe_> mobrovac: I am doin sca2001 now [10:12:22] <_joe_> the scb2001 [10:12:33] _joe_: sca* will not be affected, no service::node there [10:12:38] <_joe_> oh right [10:12:40] <_joe_> sigh [10:12:46] <_joe_> ok let's go with scb2001 [10:12:55] speaking of which [10:13:02] * mobrovac checks zotero's mem consumption [10:16:51] (03PS8) 10Jcrespo: Update haproxy default file, as it cannot be dynamic in jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) [10:18:00] (03CR) 10jenkins-bot: [V: 04-1] Update haproxy default file, as it cannot be dynamic in jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) (owner: 10Jcrespo) [10:19:29] <_joe_> mobrovac: puppet has run on scb2001, everything seems ok [10:19:52] 06Operations, 13Patch-For-Review, 07Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2474695 (10jcrespo) [10:19:56] 06Operations, 10DBA, 13Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#2474692 (10jcrespo) 05stalled>03Open a:03jcrespo [10:20:18] _joe_: awesome [10:20:43] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2474699 (10jcrespo) [10:20:46] 06Operations, 10DBA, 13Patch-For-Review: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027#1972513 (10jcrespo) [10:21:02] <_joe_> mobrovac: I'd go with the rest [10:21:09] <_joe_> I didn't restart cp though [10:21:15] ok, no need [10:21:19] i'll do it soon anyway [10:22:25] _joe_ hmm, so I built a (pure python) package for jessie, trusty and precise, but reprepo won't let me import them into all? 
[10:22:33] (03PS9) 10Jcrespo: Update haproxy default file, as it cannot be dynamic in jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) [10:22:37] I got it into jessie-wikimedia, but importing the trusty version into trusty fails... [10:22:46] Already existing files can only be included again, if they are the same, but: [10:22:51] do I need to give them different names? [10:23:01] <_joe_> YuviPanda: what is the version of the trusty package> [10:23:25] <_joe_> e.g. add a "~trusty0" version to the changelog when you build for trusty [10:24:08] I see [10:24:29] * YuviPanda adds one more to 'list of his reasons to not like building debs' [10:24:32] thanks joe [10:24:36] they all had the same version [10:24:46] <_joe_> YuviPanda: I scripted it with dch [10:24:53] <_joe_> see my bash history on copper [10:24:58] <_joe_> (no echos this time) [10:25:04] :D [10:25:07] ok! [10:25:08] looking [10:26:24] dch --force-distribution -D trusty-wikimedia "release for trusty" -b -v 3.12.1+dfsg-1~wmf3+trusty0 [10:26:26] thanks joe [10:26:45] <_joe_> YuviPanda: you might need to define DEBFULLNAME and DEBEMAIL [10:26:53] <_joe_> in order to have a nice looking changelog [10:30:55] just did and scripted that too, joe [10:30:57] thanks! 
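Editor's sketch of the per-distribution versioning discussed above: reprepro refused the trusty import because every build carried the same version, and the fix was to stamp a distro-specific suffix with dch. The helper name `dch_command` and its parameters are hypothetical. The flags and the `+trusty0` suffix style are copied from the command quoted in the log. Applying the same pattern to precise and jessie is an assumption, not something shown in the log.

```python
import shlex


def dch_command(base_version, distro, message=None, build=0):
    """Build a dch argv that stamps a distro-specific version (hypothetical helper).

    The suffix style "+<distro><n>" follows the command quoted in the log;
    the "<distro>-wikimedia" distribution name is likewise taken from it.
    """
    version = f"{base_version}+{distro}{build}"
    if message is None:
        message = f"release for {distro}"
    return [
        "dch",
        "--force-distribution",      # accept a distribution name dch does not know
        "-D", f"{distro}-wikimedia",
        "-b",                        # allow a version dch would otherwise reject
        "-v", version,
        message,
    ]


if __name__ == "__main__":
    # Print one command per target distribution; nothing is executed here.
    for distro in ("precise", "trusty", "jessie"):
        print(shlex.join(dch_command("3.12.1+dfsg-1~wmf3", distro)))
```

As noted in the log, DEBFULLNAME and DEBEMAIL may need to be set in the environment before running the printed commands so the changelog entry carries a proper name and address.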
[10:31:26] <_joe_> YuviPanda: we should really add those 3 inches of automation on top of pbuilder [10:31:35] <_joe_> which already does what we need [10:31:49] yeah [10:37:06] (03PS1) 10Yuvipanda: Move pykube out to own debian package [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/299726 [10:37:32] !log sent SIGHUP to eventbus on kafka100[12] to reload schemas [10:37:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:40:24] (03PS2) 10Filippo Giunchedi: site: add node_exporter for prometheus machines [puppet] - 10https://gerrit.wikimedia.org/r/299558 (https://phabricator.wikimedia.org/T140646) [10:40:32] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] site: add node_exporter for prometheus machines [puppet] - 10https://gerrit.wikimedia.org/r/299558 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [10:49:25] (03PS1) 10Elukey: Add wmde_secrets and analyticsdeploy scap keyholder keys [labs/private] - 10https://gerrit.wikimedia.org/r/299731 [10:49:27] (03CR) 10Yuvipanda: [C: 032 V: 032] Move pykube out to own debian package [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/299726 (owner: 10Yuvipanda) [10:50:04] (03CR) 10Elukey: [C: 032] Add wmde_secrets and analyticsdeploy scap keyholder keys [labs/private] - 10https://gerrit.wikimedia.org/r/299731 (owner: 10Elukey) [10:50:18] (03CR) 10Jcrespo: [C: 031] "I am happy with this, and I intend to deploy it soon: https://puppet-compiler.wmflabs.org/3383/" [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) (owner: 10Jcrespo) [10:50:34] (03CR) 10Elukey: [V: 032] Add wmde_secrets and analyticsdeploy scap keyholder keys [labs/private] - 10https://gerrit.wikimedia.org/r/299731 (owner: 10Elukey) [10:51:55] _joe_, paravoid, after following your recommendations, I intend to merge https://gerrit.wikimedia.org/r/273958 [10:53:14] (03PS10) 10Jcrespo: Update haproxy default file, as it cannot be dynamic in 
jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) [10:56:08] <_joe_> jynus: that works? impressive [10:56:25] I actuall commented before that I doubted that worked [10:56:38] which is one of the reasons I used a convoluted hack [10:57:01] but if it works I will just commit it (I checked on an idle proxy) [10:57:51] I supposed the defaults were loaded and then the execution chain- it seems not [10:59:21] I am thinking of deprecating haproxy anyway in favour of an L7 proxy, but I need this working by thursday [11:00:37] (03CR) 10Elukey: "Puppet compiler https://puppet-compiler.wmflabs.org/3386/" [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [11:04:29] (03PS11) 10Jcrespo: Update haproxy default file, as it cannot be dynamic in jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) [11:06:22] (03CR) 10Jcrespo: [C: 032] Update haproxy default file, as it cannot be dynamic in jessie [puppet] - 10https://gerrit.wikimedia.org/r/273958 (https://phabricator.wikimedia.org/T125027) (owner: 10Jcrespo) [11:10:32] (03PS1) 10Jcrespo: Fixing typo on systemd haproxy unit (extra newline) [puppet] - 10https://gerrit.wikimedia.org/r/299736 (https://phabricator.wikimedia.org/T125027) [11:10:35] (03PS1) 10Filippo Giunchedi: site: use 'include' [puppet] - 10https://gerrit.wikimedia.org/r/299737 (https://phabricator.wikimedia.org/T140646) [11:10:55] (03PS2) 10Jcrespo: Fixing typo on systemd haproxy unit (extra newline) [puppet] - 10https://gerrit.wikimedia.org/r/299736 (https://phabricator.wikimedia.org/T125027) [11:11:21] (03CR) 10Jcrespo: [C: 032 V: 032] Fixing typo on systemd haproxy unit (extra newline) [puppet] - 10https://gerrit.wikimedia.org/r/299736 (https://phabricator.wikimedia.org/T125027) (owner: 10Jcrespo) [11:11:59] (03PS2) 10Filippo Giunchedi: site: use 'include' for role::prometheus::node_exporter [puppet] - 
10https://gerrit.wikimedia.org/r/299737 (https://phabricator.wikimedia.org/T140646) [11:20:28] (03PS1) 10Jcrespo: Regenerate haproxy defaults on reload, in addition to on start [puppet] - 10https://gerrit.wikimedia.org/r/299738 (https://phabricator.wikimedia.org/T125027) [11:22:02] (03CR) 10Jcrespo: [C: 032] Regenerate haproxy defaults on reload, in addition to on start [puppet] - 10https://gerrit.wikimedia.org/r/299738 (https://phabricator.wikimedia.org/T125027) (owner: 10Jcrespo) [11:32:16] (03PS1) 10Jcrespo: Revert "Regenerate haproxy defaults on reload, in addition to on start" [puppet] - 10https://gerrit.wikimedia.org/r/299740 [11:33:37] (03PS2) 10Jcrespo: Revert "Regenerate haproxy defaults on reload, in addition to on start" [puppet] - 10https://gerrit.wikimedia.org/r/299740 [11:36:35] (03PS1) 10Filippo Giunchedi: puppetmaster: show commit hash, not tree hash [puppet] - 10https://gerrit.wikimedia.org/r/299741 [11:37:02] PROBLEM - Apache HTTP on mw1297 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:38:52] RECOVERY - Apache HTTP on mw1297 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.053 second response time [11:41:38] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Publish "pending_tasks" count from Elastic search cluster to graphite - https://phabricator.wikimedia.org/T134240#2474808 (10Gehel) 05Open>03Resolved Metrics are published, added to grafana dashboard. Alerting... 
[11:42:00] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2474810 (10Gehel) [11:45:46] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, and 3 others: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2474811 (10Gehel) a:05mark>03RobH @mark confirmed during weekly ops meeting that we'll decommission those servers as they are old enough. [11:47:19] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2474813 (10Gehel) 05Open>03Resolved Closing this as the new elasticsearch servers are installed and serving t... [12:12:29] PROBLEM - puppet last run on mw2165 is CRITICAL: CRITICAL: puppet fail [12:13:24] (03PS1) 10ArielGlenn: add simple variable substitution to dumpscheduler command lists [dumps] - 10https://gerrit.wikimedia.org/r/299742 (https://phabricator.wikimedia.org/T126339) [12:13:53] (03CR) 10jenkins-bot: [V: 04-1] add simple variable substitution to dumpscheduler command lists [dumps] - 10https://gerrit.wikimedia.org/r/299742 (https://phabricator.wikimedia.org/T126339) (owner: 10ArielGlenn) [12:15:59] (03PS3) 10ArielGlenn: extend dumps cron job to run partial dumps as well [puppet] - 10https://gerrit.wikimedia.org/r/299527 (https://phabricator.wikimedia.org/T126339) [12:18:58] (03PS2) 10ArielGlenn: add simple variable substitution to dumpscheduler command lists [dumps] - 10https://gerrit.wikimedia.org/r/299742 (https://phabricator.wikimedia.org/T126339) [12:24:57] (03Abandoned) 10Addshore: Don't log dewiki_diffstats to logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288158 (https://phabricator.wikimedia.org/T134861) (owner: 10Addshore) [12:39:21] RECOVERY - puppet last run on mw2165 is OK: OK: Puppet is currently 
enabled, last run 3 seconds ago with 0 failures [12:39:29] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Application servers in constant crash - https://phabricator.wikimedia.org/T140223#2474884 (10elukey) p:05Unbreak!>03High [12:40:41] 06Operations, 10MediaWiki-General-or-Unknown, 07HHVM: Application servers in constant crash - https://phabricator.wikimedia.org/T140223#2457012 (10elukey) This work is pending T135483, but it might be good to keep it open for visibility. Lowered down the priority to High. [12:42:09] 06Operations, 10ops-eqiad, 10DBA: dbstore1002 disk errors - https://phabricator.wikimedia.org/T140337#2474893 (10elukey) p:05Triage>03High [12:43:48] 06Operations, 06Release-Engineering-Team, 10Traffic, 13Patch-For-Review, 05Security: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2474897 (10elukey) p:05Triage>03High [12:45:14] 06Operations, 10Deployment-Systems, 03Scap3: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2474899 (10elukey) p:05Triage>03Normal [12:46:45] 06Operations, 10Deployment-Systems, 03Scap3: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2328753 (10elukey) @Dereckson would you mind to double check that everything is working correctly at the moment running l1... [12:48:33] 06Operations, 06Commons: Please fix broken thumbnails - https://phabricator.wikimedia.org/T140536#2474904 (10elukey) p:05Triage>03Low [12:49:57] 06Operations, 06Commons: Please fix broken thumbnails - https://phabricator.wikimedia.org/T140536#2468277 (10elukey) I am very ignorant and I have no idea how Ops could help. Anybody can point me to documentation or just give me more info? 
[12:50:14] 06Operations, 10RESTBase, 06Services: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2474907 (10elukey) p:05Triage>03High [12:53:15] 06Operations, 06Commons: Please fix broken thumbnails - https://phabricator.wikimedia.org/T140536#2468277 (10Joe) This is not just matter of fixing re-rendering these, as it seems to be a problem with magemagick and this specific image, that makes convert create those black stripes. [12:53:38] 06Operations, 10RESTBase, 06Services: RESTBase shutting down spontaneously - https://phabricator.wikimedia.org/T136957#2353511 (10elukey) @Eevans, @GWicke: How should we proceed? I don't see any official owner of this task and it seems really important. Ops can definitely help but probably the efforts should... [12:55:03] (03PS2) 10Gehel: logstash: update logstash_optimize_index.sh for ES 2.x [puppet] - 10https://gerrit.wikimedia.org/r/299699 (owner: 10BryanDavis) [12:55:25] elukey: hi. For l10nupdate, may I proceed right now, or should I plan that at the end of a deployment window? [12:57:09] (03CR) 10Gehel: [C: 032] "Looks good, simple enough and will fix the current logspam. (well not spam, this is a real issue)." [puppet] - 10https://gerrit.wikimedia.org/r/299699 (owner: 10BryanDavis) [13:01:26] (03PS3) 10Jcrespo: Revert "Regenerate haproxy defaults on reload, in addition to on start" [puppet] - 10https://gerrit.wikimedia.org/r/299740 [13:02:55] 06Operations, 06Commons: Please fix broken thumbnails - https://phabricator.wikimedia.org/T140536#2468277 (10Peachey88) I believe if the thumbnails are deleted (from swift) they should regenerate on the next visit [13:03:04] 06Operations, 06Labs, 10netops: Intermittent bandwidth issue to labs proxy (eqiad) from Comcast in Portland OR - https://phabricator.wikimedia.org/T136671#2474946 (10faidon) 05Open>03Resolved a:03faidon I'm resolving this, as this was primarily a task for the intermittent bandwidth issue. @yuvipanda, f... 
[13:03:57] (03PS4) 10Jcrespo: Revert "Regenerate haproxy defaults on reload, in addition to on start" [puppet] - 10https://gerrit.wikimedia.org/r/299740 [13:04:00] 06Operations, 06Commons: Please fix broken thumbnails - https://phabricator.wikimedia.org/T140536#2474949 (10Peachey88) @DaBPunkt Do you know if the original thumbs before the first delete exhibited the black lines? [13:07:54] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2474953 (10elukey) @madhuvishy for the pwstore part I think that you should follow up with @MoritzMuehlenhoff when he'll be back (I think July 25th). No... [13:08:15] (03CR) 10Jcrespo: [C: 032] Revert "Regenerate haproxy defaults on reload, in addition to on start" [puppet] - 10https://gerrit.wikimedia.org/r/299740 (owner: 10Jcrespo) [13:14:13] 06Operations, 10Ops-Access-Requests, 06Labs, 13Patch-For-Review: madhuvishy is moving to operations on 7/18/16 - https://phabricator.wikimedia.org/T140422#2474988 (10Gehel) Looking at the recent history of the pw repository, at least @Dzahn and @ArielGlenn have recent commits, so they should be able to add... [13:16:25] (03PS2) 10Gehel: Delete maps-team hiera [puppet] - 10https://gerrit.wikimedia.org/r/299270 (owner: 10MaxSem) [13:16:45] 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016), 13Patch-For-Review: Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2474997 (10MarcoAurelio) @Racso is the owner of BOTzilla. Is that bot still using insecure HTTP? 
[13:17:28] (03PS1) 10Jcrespo: Repool db1001 as m1 secondary (passive) host [puppet] - 10https://gerrit.wikimedia.org/r/299746 (https://phabricator.wikimedia.org/T125027) [13:18:04] (03CR) 10Gehel: [C: 032] Delete maps-team hiera [puppet] - 10https://gerrit.wikimedia.org/r/299270 (owner: 10MaxSem) [13:19:50] (03PS1) 10Jcrespo: Install jessie by default on all dbproxies [puppet] - 10https://gerrit.wikimedia.org/r/299747 (https://phabricator.wikimedia.org/T125027) [13:20:04] (03PS2) 10Jcrespo: Repool db1001 as m1 secondary (passive) host [puppet] - 10https://gerrit.wikimedia.org/r/299746 (https://phabricator.wikimedia.org/T125027) [13:20:32] 06Operations, 10ops-eqiad, 10DBA: dbstore1002 disk errors - https://phabricator.wikimedia.org/T140337#2475008 (10elukey) a:03Cmjohnson [13:20:51] 06Operations, 06Commons: Please fix broken thumbnails - https://phabricator.wikimedia.org/T140536#2475009 (10Joe) I verified we have the same issue with the older imagescaler, but I have no time to dig further; this seems like an imagemagick bug/artifact. [13:21:50] 06Operations, 06Commons: Please fix broken thumbnails - https://phabricator.wikimedia.org/T140536#2475011 (10Joe) @Peachey88 I have tried both to purge all thumbnails, and to generate new sizes, always with the same result. [13:24:57] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [13:26:08] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [13:30:20] elukey: can you confirm and resolve https://phabricator.wikimedia.org/T138609 ? 
[13:30:49] PROBLEM - puppet last run on cp3017 is CRITICAL: CRITICAL: puppet fail [13:30:53] (03PS1) 10Rush: tools: set cgred application to trusty only [puppet] - 10https://gerrit.wikimedia.org/r/299748 (https://phabricator.wikimedia.org/T140696) [13:31:23] paravoid: I am going to update it, the experiment that we tried to run didn't give the expected results. I would wait a bit to leave my team to take a final decision on bulk loading, and then I'll come back [13:31:28] would it be fine? [13:31:56] we are trying to bulk load data and it is not easy with AQS/Cassandra atm [13:31:59] :( [13:32:02] sure [13:33:26] elukey: I ack your comment for T136258 / l10nupdate. Should I run it now or plan it at a regular deployment window? [13:33:26] T136258: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258 [13:34:20] Dereckson: Hi! I am very ignorant about l10update, so if you don't mind during a regular deployment window would be fine [13:35:38] Okay, fine. [13:36:08] 06Operations, 06Commons: Please fix broken thumbnails - https://phabricator.wikimedia.org/T140536#2468277 (10jcrespo) `convert` Version: ImageMagick 6.8.9-9 Q16 x86_64 2016-06-01 on my local host (jessie) resized the image with no problems; we need to check the exact version and options used. [13:36:38] 06Operations, 06Commons: Please fix broken thumbnails - https://phabricator.wikimedia.org/T140536#2475044 (10jcrespo) [13:37:44] 06Operations, 10Deployment-Systems, 03Scap3: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2475046 (10Dereckson) Test run added to this evening SWAT. [13:37:54] Dereckson: thanks! 
[13:38:15] (03CR) 10Jcrespo: [C: 032] Repool db1001 as m1 secondary (passive) host [puppet] - 10https://gerrit.wikimedia.org/r/299746 (https://phabricator.wikimedia.org/T125027) (owner: 10Jcrespo) [13:38:19] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:38:20] (03PS2) 10Jcrespo: Install jessie by default on all dbproxies [puppet] - 10https://gerrit.wikimedia.org/r/299747 (https://phabricator.wikimedia.org/T125027) [13:39:42] (03CR) 10Jcrespo: [C: 032] Install jessie by default on all dbproxies [puppet] - 10https://gerrit.wikimedia.org/r/299747 (https://phabricator.wikimedia.org/T125027) (owner: 10Jcrespo) [13:39:48] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: puppet fail [13:41:49] (03CR) 10Yuvipanda: [C: 04-1] tools: set cgred application to trusty only (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299748 (https://phabricator.wikimedia.org/T140696) (owner: 10Rush) [13:44:08] !log reloading dbproxy1001 to repool db1001 as pasive backend [13:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:46:20] (03PS2) 10Rush: tools: set cgred application to trusty only [puppet] - 10https://gerrit.wikimedia.org/r/299748 (https://phabricator.wikimedia.org/T140696) [13:48:23] !log restarting dbproxy1005 for kernel upgrade [13:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:50:00] Hey elukey! I see your on ops duty! Any chance you could help me out with sshing to tin? [13:53:43] addshore: sure [13:53:49] (03CR) 10Yuvipanda: [C: 031] tools: set cgred application to trusty only [puppet] - 10https://gerrit.wikimedia.org/r/299748 (https://phabricator.wikimedia.org/T140696) (owner: 10Rush) [13:54:30] After getting added to deployers I don't appear to be able to ssh to any of the boxes that should give me access to, sshing to other places in the cluster still works just fine! 
[13:56:00] (03PS1) 10Nikerabbit: Compact Language Links: To beta in ruwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299751 [13:56:58] RECOVERY - puppet last run on cp3017 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [13:56:59] PROBLEM - puppet last run on ms-be2008 is CRITICAL: CRITICAL: Puppet has 1 failures [13:57:30] for absolutely no reasons, dbproxy1005 started after power restart with its first backend down [13:59:23] (03PS3) 10Filippo Giunchedi: site: use 'include' for role::prometheus::node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/299737 (https://phabricator.wikimedia.org/T140646) [13:59:31] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] site: use 'include' for role::prometheus::node_exporter [puppet] - 10https://gerrit.wikimedia.org/r/299737 (https://phabricator.wikimedia.org/T140646) (owner: 10Filippo Giunchedi) [13:59:41] (03PS3) 10Rush: tools: set cgred application to trusty only [puppet] - 10https://gerrit.wikimedia.org/r/299748 (https://phabricator.wikimedia.org/T140696) [14:01:10] addshore: mmm I don't find you among deployers, can you link me the task/code-review? 
[14:01:20] hmm, yup [14:01:52] https://phabricator.wikimedia.org/T140276 & https://gerrit.wikimedia.org/r/#/c/299032/ [14:01:57] (03PS1) 10Giuseppe Lavagetto: puppetmaster: declare NameVirtualHost where expected [puppet] - 10https://gerrit.wikimedia.org/r/299752 [14:03:25] (03CR) 10jenkins-bot: [V: 04-1] puppetmaster: declare NameVirtualHost where expected [puppet] - 10https://gerrit.wikimedia.org/r/299752 (owner: 10Giuseppe Lavagetto) [14:04:42] <_joe_> addshore: I am not sure why, but seems that change has been reverted somehow [14:04:45] <_joe_> let me dig [14:05:04] (03PS2) 10Filippo Giunchedi: puppetmaster: show commit hash, not tree hash [puppet] - 10https://gerrit.wikimedia.org/r/299741 [14:05:11] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] puppetmaster: show commit hash, not tree hash [puppet] - 10https://gerrit.wikimedia.org/r/299741 (owner: 10Filippo Giunchedi) [14:05:37] ahh _joe_ https://github.com/wikimedia/operations-puppet/commit/58bac88260f98ab212023f1921eff3b81bba9113#diff-9771baee4b4339971721eab7e35e721b [14:06:08] yes [14:06:11] I was just pasting it [14:06:18] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [14:06:26] addshore: I am going add you back [14:06:34] thanks! :) [14:06:49] (03CR) 10Gehel: "notes: runUpdate.sh has been updated to take new log4j configuration location into account. 
Old log configuration should be cleaned at som" [puppet] - 10https://gerrit.wikimedia.org/r/298880 (https://phabricator.wikimedia.org/T139434) (owner: 10Smalyshev) [14:06:52] <_joe_> I guess a git merge fail [14:07:01] <_joe_> s/merge/rebase/ [14:08:31] (03PS1) 10Yuvipanda: tools: Add check for all nodes in Ready condition [puppet] - 10https://gerrit.wikimedia.org/r/299753 (https://phabricator.wikimedia.org/T140248) [14:10:57] !log reboot and reimage dbproxy1003 to jessie T125027 T138460 [14:10:58] T125027: upgrade dbproxy1001/1002 to jessie - https://phabricator.wikimedia.org/T125027 [14:10:58] T138460: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460 [14:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:11:10] (03PS1) 10Elukey: Add addshore back to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/299755 (https://phabricator.wikimedia.org/T140276) [14:12:04] (03CR) 10Andrew Bogott: [C: 032] labs dnsrecursor: tidy up paths [puppet] - 10https://gerrit.wikimedia.org/r/299499 (owner: 10Alex Monk) [14:12:53] (03CR) 10Andrew Bogott: [C: 032] labs dnsrecursor metaldns: use hiera's labs_tld instead of assuming its value [puppet] - 10https://gerrit.wikimedia.org/r/299500 (owner: 10Alex Monk) [14:13:11] I think that zfilipin was removed too [14:13:18] (03PS2) 10Andrew Bogott: labs dnsrecursor: tidy up paths [puppet] - 10https://gerrit.wikimedia.org/r/299499 (owner: 10Alex Monk) [14:13:26] (03PS2) 10Andrew Bogott: labs dnsrecursor metaldns: use hiera's labs_tld instead of assuming its value [puppet] - 10https://gerrit.wikimedia.org/r/299500 (owner: 10Alex Monk) [14:13:40] elukey: yup, it looks like it! [14:14:48] 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016), 13Patch-For-Review: Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2475190 (10Racso) Hello. I'm the owner of BOTzilla. 
As far as I know, the bot is currently inactive. Is it actually d... [14:17:56] (03PS2) 10Elukey: Add addshore and zfilipin back to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/299755 [14:18:12] gehel, addshore: mind to review --^ ? [14:18:17] _joe_: subbu is here, should we start? [14:18:20] (03PS2) 10Yuvipanda: tools: Add check for all nodes in Ready condition [puppet] - 10https://gerrit.wikimedia.org/r/299753 (https://phabricator.wikimedia.org/T140248) [14:18:37] mobrovac, saw my mail about worker restarts? [14:18:49] yup subbu [14:18:51] k [14:18:53] <_joe_> subbu: yes, but the new code is deployed to a different directory [14:18:56] (03CR) 10Addshore: [C: 031] Add addshore and zfilipin back to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/299755 (owner: 10Elukey) [14:19:00] Looks good to me! [14:19:02] (03CR) 10Gehel: [C: 031] Add addshore and zfilipin back to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/299755 (owner: 10Elukey) [14:19:04] didn't reply because i figured you'd read it when you come here [14:19:05] :) [14:19:07] _joe_, ah, ok. :) [14:19:14] _joe_: euh, no it is not [14:19:22] * mobrovac checking [14:19:35] (03PS3) 10Elukey: Add addshore and zfilipin back to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/299755 [14:19:56] _joe_: subbu: the dir is the same - /srv/deployment/parsoid/deploy [14:20:07] <_joe_> mobrovac: uhm so what did I get wrong in your changes? 
[14:20:18] _joe_: (i know that the parsoid::production role would have left you thinking otherwise) [14:20:43] <_joe_> anyways, first precondition is depooling wtp1001-2 [14:21:19] RECOVERY - puppet last run on ms-be2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:21:22] i will create a branch on tin with the new deploy repo code, so that we can pull that remote branch directly onto wtp100[12] instead of using trebuchet [14:21:31] <_joe_> mobrovac: ok [14:21:35] <_joe_> makes sense [14:21:46] (03PS3) 10Yuvipanda: tools: Add check for all nodes in Ready condition [puppet] - 10https://gerrit.wikimedia.org/r/299753 (https://phabricator.wikimedia.org/T140248) [14:22:04] !log oblivian@palladium conftool action : set/weight=0; selector: cluster=parsoid,name=wtp100[12].* [14:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:22:13] (03PS2) 10Ema: cache_upload VTC tests [puppet] - 10https://gerrit.wikimedia.org/r/299543 (https://phabricator.wikimedia.org/T128188) [14:22:33] _joe_, mobrovac ok. got it. 
[14:22:42] (03PS2) 10Andrew Bogott: labs dnsrecursor metaldns: Resolve PTR records too [puppet] - 10https://gerrit.wikimedia.org/r/299501 (https://phabricator.wikimedia.org/T139438) (owner: 10Alex Monk) [14:22:52] (03CR) 10Elukey: [C: 032] Add addshore and zfilipin back to the deployment group [puppet] - 10https://gerrit.wikimedia.org/r/299755 (owner: 10Elukey) [14:23:06] !log oblivian@palladium conftool action : set/pooled=no:weight=15; selector: cluster=parsoid,name=wtp100[12].* [14:23:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:23:23] <_joe_> ok wtp1001-2 are depooled [14:23:31] <_joe_> we can now torture those at will [14:23:42] <_joe_> I would propose we first deploy the code then puppet [14:24:02] (03CR) 10Andrew Bogott: [C: 032] labs dnsrecursor metaldns: Resolve PTR records too [puppet] - 10https://gerrit.wikimedia.org/r/299501 (https://phabricator.wikimedia.org/T139438) (owner: 10Alex Monk) [14:25:13] (03CR) 10Giuseppe Lavagetto: "https://puppet-compiler.wmflabs.org/3390/ DTRT" [puppet] - 10https://gerrit.wikimedia.org/r/299716 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [14:25:33] <_joe_> mobrovac, subbu what do you think? [14:25:48] _joe_: yup, makes sense since it'll have to be a manual pull anyhow [14:25:50] (03PS4) 10Rush: tools: set cgred application to trusty only [puppet] - 10https://gerrit.wikimedia.org/r/299748 (https://phabricator.wikimedia.org/T140696) [14:26:06] _joe_, i am not familiar with the subtleties here .. so will go with whatever you two recommend. [14:26:15] (03PS3) 10Andrew Bogott: labs dnsrecursor metaldns: Resolve PTR records too [puppet] - 10https://gerrit.wikimedia.org/r/299501 (https://phabricator.wikimedia.org/T139438) (owner: 10Alex Monk) [14:26:16] _joe_: i created the service-runner branch on tin, can you pull it onto wtp100[12] ? [14:26:24] <_joe_> mobrovac: ok will do [14:26:31] addshore: you should be able to ssh to tin now.. can you try? 
[14:26:44] mobrovac, this is a branch of the deploy repo? [14:26:55] elukey: works! many thanks! [14:27:06] super! Sorry for the trouble! [14:27:07] subbu: yes, but exists only on tin [14:27:16] got it. [14:27:22] <_joe_> mobrovac: oh, uhm [14:27:29] <_joe_> all parsoid files are owned by root [14:27:35] <_joe_> is that ok for service-runner? [14:27:42] <_joe_> well I guess we'll see :P [14:27:52] yup _joe_, that's why i asked you to do it (because of the uid) [14:28:04] _joe_: the owner of the files is relevant for the migration to scap3 only [14:28:51] <_joe_> mobrovac: I did a git remote update [14:28:59] <_joe_> but no branch service-runner is there [14:30:02] <_joe_> let me know when you added it [14:30:09] (03PS5) 10Rush: tools: set cgred application to trusty only [puppet] - 10https://gerrit.wikimedia.org/r/299748 (https://phabricator.wikimedia.org/T140696) [14:30:33] _joe_: perhaps add it manually ? I can see it on tin [14:30:35] (03PS2) 10Chad: Gerrit: Enable proper backups from new hosts [puppet] - 10https://gerrit.wikimedia.org/r/299705 [14:30:42] git branch lists service-runner [14:30:44] <_joe_> mobrovac: yeah let me check something [14:30:57] and the remote def on wtp1001 points to tin [14:31:06] so you should be able to see it [14:31:07] 06Operations, 10Incident-20150825-Redis, 10Monitoring: Alert when ES indexes are freezed for more than 30 minutes - https://phabricator.wikimedia.org/T110171#2475332 (10Gehel) [14:31:09] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 07Epic: EPIC: Cultivating the Elasticsearch garden (operational lessons from 1.7.1 upgrade) - https://phabricator.wikimedia.org/T109089#2475331 (10Gehel) [14:31:49] euh why is the deploy repo on wtp1001 on the rollback branch subbu? 
[14:31:59] <_joe_> mobrovac: that's me [14:32:11] <_joe_> so that I can rollback in case of need [14:32:12] kk [14:32:49] the list of branches is the same except for the service-runner one [14:32:49] (03PS3) 10ArielGlenn: add simple variable substitution to dumpscheduler command lists [dumps] - 10https://gerrit.wikimedia.org/r/299742 (https://phabricator.wikimedia.org/T126339) [14:32:54] (03CR) 10Andrew Bogott: "Works!" [puppet] - 10https://gerrit.wikimedia.org/r/299501 (https://phabricator.wikimedia.org/T139438) (owner: 10Alex Monk) [14:32:57] (03PS4) 10Yuvipanda: tools: Add check for all nodes in Ready condition [puppet] - 10https://gerrit.wikimedia.org/r/299753 (https://phabricator.wikimedia.org/T140248) [14:33:23] but, mobrovac _joe_ do you need the service-runner branch? [14:33:41] subbu: well, actually no! [14:33:44] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add check for all nodes in Ready condition [puppet] - 10https://gerrit.wikimedia.org/r/299753 (https://phabricator.wikimedia.org/T140248) (owner: 10Yuvipanda) [14:33:45] good point subbu [14:33:47] current deploy master should be at the service-runner hash and the submodule should point to the right code as well. 
[14:33:51] <_joe_> mobrovac: I am a bit baffled by this [14:34:01] (03PS6) 10Rush: tools: set cgred application to trusty only [puppet] - 10https://gerrit.wikimedia.org/r/299748 (https://phabricator.wikimedia.org/T140696) [14:34:08] (03CR) 10Rush: [C: 032 V: 032] tools: set cgred application to trusty only [puppet] - 10https://gerrit.wikimedia.org/r/299748 (https://phabricator.wikimedia.org/T140696) (owner: 10Rush) [14:34:24] _joe_: let me update tin's master, as no parsoid deploys will occur until we transition anyway [14:34:43] <_joe_> ok [14:34:50] <_joe_> this is kinda weird [14:34:56] yup [14:35:11] <_joe_> I also tried the usual --prune [14:35:35] ok _joe_, you should be able to update master on wtp1001 now [14:37:14] <_joe_> mobrovac: and well, it seems git remote update doesn't work at all [14:37:16] <_joe_> wtf? [14:37:27] euh [14:37:28] (03CR) 10ArielGlenn: [C: 032] add simple variable substitution to dumpscheduler command lists [dumps] - 10https://gerrit.wikimedia.org/r/299742 (https://phabricator.wikimedia.org/T126339) (owner: 10ArielGlenn) [14:37:41] _joe_: a simple git fetch ought to do it though [14:39:02] <_joe_> mobrovac: no any git update function is not working [14:39:12] <_joe_> let me understand what's up please [14:40:15] 06Operations, 10Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for Jksamra - https://phabricator.wikimedia.org/T140445#2475380 (10Jksamra) @Jgreen Thanks wikitech username: Jksamra shell name: jsamra [14:41:31] kk _joe_ [14:41:51] <_joe_> mobrovac: I am honestly puzzled [14:42:51] 06Operations, 10ops-eqiad, 10netops: cr1/cr2-eqiad: install new SCBs and linecards - https://phabricator.wikimedia.org/T140764#2475397 (10faidon) [14:43:15] 06Operations, 10MediaWiki-extensions-ExtensionDistributor: ExtensionDistributor gives error message "Unable to fetch extension list!" 
- https://phabricator.wikimedia.org/T140753#2475413 (10Reedy) [14:43:21] 06Operations, 10MediaWiki-extensions-ExtensionDistributor: ExtensionDistributor gives error message "Unable to fetch extension list!" - https://phabricator.wikimedia.org/T140753#2474962 (10Reedy) MW is getting "Received HTTP code 403 from proxy after CONNECT" from url-downloader.eqiad [14:45:03] 06Operations, 10DBA: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#1630902 (10Gehel) This seems to have been stalled for a long time. Should we close this? Or is there something we can still do at this point? My understanding is that the loss of revision is old enough that the in... [14:45:25] 06Operations, 10ops-eqiad, 10netops: Replace cr1/2-eqiad PSUs/fantrays with high-capacity ones - https://phabricator.wikimedia.org/T140765#2475419 (10faidon) [14:45:45] 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016), 13Patch-For-Review: Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2475434 (10BBlack) The final patch to block insecure HTTP is going out a few hours from now, so if the bot is inactiv... [14:46:32] <_joe_> ok so something is wrong with the parsoid deploy repo [14:47:20] <_joe_> subbu: when did you last deploy it [14:47:37] https://www.mediawiki.org/wiki/Parsoid/Deployments .. july 11 [14:48:32] something is wrong with the repo on wtp1001, you mean, right. not with the repo itself. [14:48:46] <_joe_> yeah, well [14:48:50] <_joe_> let's see what's wrong [14:48:52] k [14:49:23] _joe_, the git hash on that deployments page refers to parsoid submodule btw .. just in case you were using that info. 
[14:49:25] <_joe_> I guess something that has to do with the root user on that machine [14:49:36] 06Operations, 10ops-eqiad, 10netops: cr1/cr2-eqiad: install new SCBs and linecards - https://phabricator.wikimedia.org/T140764#2475487 (10faidon) [14:49:36] <_joe_> subbu: it's not that [14:49:39] k [14:49:42] 06Operations, 10DBA: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#2475499 (10jcrespo) > Should we close this? Why, this is on the backlog? > Is there anything else we should do? Maybe help solving this instead of blindly closing it? [14:49:50] <_joe_> the root user on wtp1001 can't update that repo correctly [14:50:08] 06Operations, 10MediaWiki-extensions-ExtensionDistributor: ExtensionDistributor gives error message "Unable to fetch extension list!" - https://phabricator.wikimedia.org/T140753#2475503 (10Reedy) ``` if ( $extList === false ) { $extList = $this->fetchRepositoryList(); $wgMemc->set( $key, $extList, 3600... [14:50:25] 06Operations, 10ops-eqiad, 10netops: Upgrade cr1/cr2-eqiad JunOS - https://phabricator.wikimedia.org/T140770#2475516 (10faidon) [14:51:19] 06Operations, 10DBA: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#2475544 (10matmarex) The content can potentially be recovered from old dumps. We should also check for any revisions that are not connected to existing pages, perhaps they're still there. [14:51:27] _joe_, mobrovac i don't know how trebuchet deployment works .. and i am throwing this out fwiw .. are there restrictions on git pulls from individual nodes .. or is this specific to wtp1001? [14:52:00] <_joe_> subbu: specific to the root user somehow [14:52:21] ok .. so, then on all nodes probably. [14:53:49] <_joe_> fatal: reference is not a tree: 36075c7fc242ad2fd7bba05661606722ebda49aa [14:53:57] <_joe_> for the submodule, wtf? 
[14:54:09] <_joe_> ok, I am going to re-clone the repo on this machine [14:54:10] <_joe_> sorry [14:54:18] <_joe_> first let me try on wtp1002 [14:54:37] oh yes, viva submodules [14:55:28] <_joe_> mobrovac: I think that is your error, not mine, though [14:55:36] mine? [14:55:48] <_joe_> well, I cannot find that reference [14:55:52] <_joe_> in the submodule [14:56:29] _joe_, mobrovac normally .. here are the 2 commands we run on tin for deploys .. git pull; git submodule update --init [14:56:34] in case that is useful. [14:56:51] rights seem ok on tin [14:56:57] all is owned by trebuchet:wikidev [14:57:16] hm except .git/packed-refs [14:57:16] but that shouldn't matter for the remote [14:57:44] mobrovac@tin:/srv/deployment/parsoid/deploy/src$ git status [14:57:46] HEAD detached at 36075c7 [14:57:53] so the submodule is ok on tin [14:58:35] <_joe_> mobrovac: well I am not sure it's ok [14:58:37] <_joe_> but still [14:59:03] bblack (or other dns people): Is there any more-efficient notation to express the range of delegation in https://gerrit.wikimedia.org/r/#/c/299513/2/templates/155.80.208.in-addr.arpa ? [15:00:04] anomie, ostriches, thcipriani, hashar, and twentyafterfour: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160719T1500). [15:00:04] matt_flaschen and bd808: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:11] (03CR) 10Andrew Bogott: [C: 031] "This needs a ton of testing (if it hasn't been done already) but I like it." [puppet] - 10https://gerrit.wikimedia.org/r/299503 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [15:00:43] andrewbogott: yes [15:01:03] andrewbogott: https://tools.ietf.org/html/rfc2317 [15:01:12] (03CR) 10Andrew Bogott: [C: 04-1] "Good in theory. I'm going to ask around in case there's some less run-on notation to delegate DNS for a block of IPs." 
[dns] - 10https://gerrit.wikimedia.org/r/299513 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [15:01:33] bblack: thanks, reading [15:01:43] <_joe_> mobrovac: I am going to re-clone the repo on wtp1001, I am really not sure why this doesn't work [15:01:48] Present [15:02:10] (03CR) 10Andrew Bogott: "Brandon suggests https://tools.ietf.org/html/rfc2317 for the dns notation" [dns] - 10https://gerrit.wikimedia.org/r/299513 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [15:02:59] FWIW, IIRC, trebuchet updates .gitmodules to point to tin pre-fetch. That object does exist on tin at .git/modules/src/objects/36/075c7fc242ad2fd7bba05661606722ebda49aa, so that's...weird. [15:03:08] I can SWAT today [15:03:18] (03CR) 10Alex Monk: "With the existing hacks, we may be able to set those up in designate explicitly, which would override the instance-$instance.$project.wmfl" [dns] - 10https://gerrit.wikimedia.org/r/299513 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [15:03:22] <_joe_> thcipriani: yeah it is weird [15:04:09] andrewbogott: basically you want the NS delegation to use "128/25 1H NS ..." 
[15:04:54] andrewbogott: and then also put in all the CNAMEs as "128 1H CNAME 128.128/25" [15:05:05] so it's still a bunch of lines, but it's the better way to do it [15:05:28] (03CR) 10Andrew Bogott: "' basically you want the NS delegation to use "128/25 1H NS ..."" [dns] - 10https://gerrit.wikimedia.org/r/299513 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [15:05:43] through ending at "255 1H CNAME 255.128/25" [15:06:11] (03PS1) 10Jcrespo: Set m3-master as an alias of dbproxy1003 [dns] - 10https://gerrit.wikimedia.org/r/299764 (https://phabricator.wikimedia.org/T138460) [15:06:20] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299704 (owner: 10Mattflaschen) [15:06:31] and then on your labs-nsX end, you define the zone's name as 128/25.155.80.208.in-addr.arpa [15:06:59] and within that have "normal" records like "128 1H IN PTR foo.example" [15:07:06] (03Merged) 10jenkins-bot: Remove Echo transition flags [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299704 (owner: 10Mattflaschen) [15:07:55] (03CR) 10Andrew Bogott: "'through ending at "255 1H CNAME 255.128/25"" [dns] - 10https://gerrit.wikimedia.org/r/299513 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [15:08:02] * andrewbogott pastes all that into gerrit [15:08:04] thanks bblack [15:08:56] matt_flaschen: change is live on mw1099, check please [15:10:52] hmm...now where is the manual refresh on the new kibana... [15:13:09] I think hitting, pause and then play works? [15:13:16] -, [15:15:38] ah, auto-refresh was off. blerg. moved-cheese. [15:15:39] (03PS3) 10Giuseppe Lavagetto: parsoid: add role based on service::node, apply to two hosts [puppet] - 10https://gerrit.wikimedia.org/r/299716 (https://phabricator.wikimedia.org/T90668) [15:16:27] <_joe_> mobrovac: ^^ ok to go? [15:16:29] thcipriani, flag change worked, and basic regression testing looks good. 
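The parent-zone side of the RFC 2317 layout described above (one NS delegation for "128/25", then a CNAME per address from 128 through 255) could be generated rather than typed by hand. What follows is only a rough sketch under stated assumptions: the function name `parent_zone_lines` is made up for illustration, the nameserver `ns.example.` is a placeholder because the real NS target is elided in the chat, and the 1H TTL is copied from the lines quoted above. It is not the content of the actual patch.

```python
import ipaddress


def parent_zone_lines(network, nameservers, ttl="1H"):
    """Build RFC 2317 style records for the parent /24 reverse zone.

    Hypothetical helper: returns the NS delegation line(s) for the
    classless sub-zone, followed by one CNAME per address in the range.
    """
    net = ipaddress.ip_network(network)
    if net.version != 4 or net.prefixlen <= 24:
        raise ValueError("expected an IPv4 network smaller than a /24")

    # Sub-zone label relative to the parent zone, e.g. "128/25".
    label = f"{int(net.network_address) & 0xFF}/{net.prefixlen}"

    lines = [f"{label} {ttl} NS {ns}" for ns in nameservers]
    for addr in net:  # iterates over every address, first to last
        last_octet = int(addr) & 0xFF
        lines.append(f"{last_octet} {ttl} CNAME {last_octet}.{label}")
    return lines


if __name__ == "__main__":
    # "ns.example." is a placeholder, not the real nameserver.
    for line in parent_zone_lines("208.80.155.128/25", ["ns.example."]):
        print(line)
```

The delegated zone on the labs side would then be named 128/25.155.80.208.in-addr.arpa and hold ordinary PTR records, as described in the messages above.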
[15:16:35] +1 to go to prod [15:16:41] matt_flaschen: ack, thank you [15:16:42] <_joe_> both servers are depooled anyways [15:16:51] _joe_: managed to update wtp100[12] ? [15:16:52] the deploy repo i mean [15:17:01] <_joe_> mobrovac: yeah but had to recreate it [15:17:14] <_joe_> worth a better investigation before we upgrade the next machines [15:18:15] !log replacing PEM0-3 cr2-eqiad [15:18:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:58] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:299704|Remove Echo transition flags]] PART I (duration: 00m 26s) [15:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:33] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:299704|Remove Echo transition flags]] PART II (duration: 00m 30s) [15:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:19:43] ^ matt_flaschen check prod please [15:19:52] (03CR) 10Giuseppe Lavagetto: [C: 032] parsoid: add role based on service::node, apply to two hosts [puppet] - 10https://gerrit.wikimedia.org/r/299716 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [15:19:56] (03CR) 10Alex Monk: "That'd be fine if we were handling naming on the labs end manually, but this needs to be done by script, so we'd have to add hacks for it " [dns] - 10https://gerrit.wikimedia.org/r/299513 (https://phabricator.wikimedia.org/T104521) (owner: 10Alex Monk) [15:20:49] bd808: ping for SWAT [15:24:36] bblack, so I don't really like the approach used in RFC 2317 [15:24:51] the labs end will be handling these requests dynamically [15:24:53] thcipriani, looks good. [15:25:14] matt_flaschen: great, thanks for checking! 
[15:25:48] We'd have to add a hack to change '155.80.208.in-addr.arpa' to '128/25.155.80.208.in-addr.arpa' if I understand correctly [15:26:08] 06Operations, 10Deployment-Systems, 03Scap3: Warning: rename(): Permission denied in /srv/mediawiki/wmf-config/CommonSettings.php on line 189 - https://phabricator.wikimedia.org/T136258#2475715 (10bd808) >>! In T136258#2475046, @Dereckson wrote: > Test run added to this evening SWAT. You should be able to j... [15:26:22] thcipriani: pong [15:26:48] "The advantage of this approach over the other proposed approaches for dealing with this problem is that there should be no need to modify any already-deployed software" - well, it's correct that this is not already-deployed... :/ [15:26:55] thcipriani: I'm not 100% sure how to reproduce that bug. I think you have to save an edit using the translation extension [15:27:48] bd808: ack, we'll give it a shot. Seems like your patch should fix the error log explosion at any rate. [15:28:14] (03PS1) 10Giuseppe Lavagetto: parsoid: add realserver ips to role::parsoid as well [puppet] - 10https://gerrit.wikimedia.org/r/299766 [15:28:32] 06Operations, 10hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2475719 (10RobH) [15:28:43] <_joe_> mobrovac: ^^ this will fix the lvs realserver issue [15:28:44] I just cherry-picked it from core. 
:) All credit to AaronSchulz [15:29:30] (03CR) 10Mobrovac: [C: 031] parsoid: add realserver ips to role::parsoid as well [puppet] - 10https://gerrit.wikimedia.org/r/299766 (owner: 10Giuseppe Lavagetto) [15:29:32] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] parsoid: add realserver ips to role::parsoid as well [puppet] - 10https://gerrit.wikimedia.org/r/299766 (owner: 10Giuseppe Lavagetto) [15:30:29] (03PS2) 10Jdlrobson: Wikidata description config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) [15:30:48] (03CR) 10Jdlrobson: Wikidata description config cleanup (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) (owner: 10Jdlrobson) [15:31:07] (03CR) 10jenkins-bot: [V: 04-1] Wikidata description config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) (owner: 10Jdlrobson) [15:32:07] (03PS3) 10Jdlrobson: Wikidata description config cleanup [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299615 (https://phabricator.wikimedia.org/T140600) [15:32:18] Krenair: the RFC for CIDR delegation came out in 1998. It's the normal way to delegate these things since long before anything related to OpenStack, so they should implement support for it, it shouldn't be a "hack"... [15:32:22] 06Operations, 10MediaWiki-extensions-ExtensionDistributor: ExtensionDistributor gives error message "Unable to fetch extension list!" - https://phabricator.wikimedia.org/T140753#2475726 (10Reedy) This is likely caused by T140658 [15:32:51] Krenair: also, this has nothing to do with whether it's handled dynamically. It's just about how delegation is done. The alternative in andrewbogott's original patch would require you defining separate zones for each IP address. 
[15:33:16] bblack, we're not talking about OpenStack, we're talking about our custom code that connects to OpenStack to set up these records [15:33:51] it's still the same records, only the name of the zone has changed over classful delegation [15:33:54] With this our script will need to know about each different range floating IPs can be set up in, and how to generate the proper in-addr.arpa zone name [15:34:01] unless you were actually preferring the one-zone-per-IP solution? [15:34:08] Instead of just IP.split('.').reverse().join('.') [15:34:54] Krenair: in any working solution, you still need the names to be in correct zones, it's just a question of what zones... [15:35:04] Krenair: again, did you actually want one IP per zone? [15:35:23] no [15:35:45] that's what andrewbogott's first patch implemented, on the prod authdns side. for it to work that way, you'd need one zonefile per IP [15:35:58] what patch is this? [15:36:04] We don't have zonefiles bblack [15:36:10] https://gerrit.wikimedia.org/r/#/c/299513/2/templates/155.80.208.in-addr.arpa [15:36:16] how are you serving DNS without zonefiles? [15:36:24] bd808: patch is live on mw1099, check please [15:36:26] well without some conception of zones [15:36:30] that's not andrew's patch [15:36:37] I guess they don't have to be files. but zones are things with meaning [15:36:42] https://gerrit.wikimedia.org/r/#/c/299503/1/modules/dnsrecursor/files/labs-ip-alias-dump.py [15:36:43] sorry, it's the one he linked me [15:37:32] what DNS server is actually serving the records? 
[15:37:42] 06Operations, 06Discovery, 06Maps, 10Maps-data, 07Epic: Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#2475734 (10Gehel) [15:37:44] 06Operations, 06Discovery, 06Maps, 10Tilerator, 10Traffic: Tilerator should purge Varnish cache - https://phabricator.wikimedia.org/T109776#2475733 (10Gehel) [15:38:25] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:39:13] Krenair: ^ ? [15:41:08] !log oblivian@palladium conftool action : set/pooled=yes; selector: name=wtp1001.*,cluster=parsoid,dc=eqiad [15:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:41:46] bblack, I thought it was labs-ns* but looking into it now, it appears it may be labs-recursor* [15:42:18] (03PS5) 10Andrew Bogott: Add diamond collector for rabbitmq stats [puppet] - 10https://gerrit.wikimedia.org/r/299193 [15:42:28] Krenair: so this script is ... generating zonefiles for that DNS server? [15:42:56] Krenair: I'm just trying to understand what's actually happening at the DNS layer on the labs end of things. What software is serving it and how it's configured. [15:43:11] there is no concept of operations/dns.git-like zonefiles that I am aware of [15:43:27] Krenair: if we're delegating, the nameservers that answer the delegated queries do have to implement the concept of Zones somehow.... 
[15:43:53] (and probably a recursor doesn't do that very well, it should be an authserver) [15:43:58] there might be elsewhere in pdns-recursor but not when it comes to these lua scripts [15:44:37] at the DNS protocol level, zones matter [15:44:53] otherwise it can't generate a correct NXDOMAIN response, for example [15:44:53] (03CR) 10Andrew Bogott: [C: 032] Add diamond collector for rabbitmq stats [puppet] - 10https://gerrit.wikimedia.org/r/299193 (owner: 10Andrew Bogott) [15:45:43] labs-ns* seems like the logical choice, since that's an authserver of some kind (for wmflabs.org) [15:45:47] 06Operations, 10MediaWiki-General-or-Unknown, 06Release-Engineering-Team, 10Traffic, and 2 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2475770 (10demon) [15:45:49] Forgetting the specific changes we were talking about for a moment, there definitely appears to be something wrong with ns vs. recursor [15:46:49] thcipriani: I think it worked ok. I was able to make a new page via Special:ContentTranslation [15:46:52] The lua recursor scripts live only in dnsrecursor and don't seem to apply on labs-ns0 [15:47:24] shouldn't they be run on ns* instead andrewbogott? [15:47:34] bd808: ack, FWIW, haven't seen any errors on mw1099, rolling to prod. [15:48:17] Can somebody with rights on simplewiki delete https://simple.wikipedia.org/wiki/Portland_International_Beerfest for me? That was my test. 
[15:48:39] Krenair: For reasons that I don't understand, the lua plugin system only works in the recursor [15:48:58] hm, I guess that's not great, if it means all of this will only work from within labs :( [15:49:07] right [15:49:20] it should be on an authserver that prod can reach, too [15:49:26] thcipriani: I built this basic dashboard for mw1099 -- https://logstash.wikimedia.org/app/kibana#/dashboard/mw1099 -- might be helpful [15:49:28] !log thcipriani@tin Synchronized php-1.28.0-wmf.10/extensions/ContentTranslation/includes/AbuseFilterCheck.php: SWAT: [[gerrit:299707|Avoid accessing private $filters field (T139657)]] (duration: 00m 26s) [15:49:29] T139657: Fatal error: Invalid static property access: AbuseFilter::filters in AbuseFilterCheck.php on line 121 - https://phabricator.wikimedia.org/T139657 [15:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:49:33] (well ideally, that the whole world can reach, like labs-ns[01]) [15:49:37] ^ bd808 check in prod please [15:49:59] is labs-ns0 actually running some kind of normal authdns software, or is the DNS there served by some openstack-ish software? [15:50:10] bd808: awesome :) /me adds to swat/deployers page [15:50:14] it's just pdns [15:50:23] ok, so you can work with that pretty easily [15:50:47] which connects to designate somehow [15:51:13] for the forward lookups, anyways [15:51:22] is this reverse data ultimately from designate as well? [15:51:27] I think maybe designate writes to the pdns DB in mysql? [15:51:39] the reverse data we're trying to add? no [15:52:25] thcipriani: no crash saving. lgtm [15:52:38] bd808: thanks :) [15:53:22] so couldn't the script just update the same mysql database, and modify records in the RFC2317 zones like 128/25.155.80.208.in-addr.arpa? 
[15:53:59] It'll still need to know about how we've set that up [15:54:21] But yes, we'll have to move it away from a lua script if those don't run on the ns* servers [15:54:28] well it needs to know the delegation boundaries, yes, but that's a small amount of metadata/config, listing the CIDR networks to place addresses in [15:54:54] e.g. your config file says: reverse_networks: 208.80.155.128/25, 208.80.154.0/26, ... [15:55:19] I think we'd manage it via the designate API instead of messing with pdns directly [15:55:21] there are standard libraries in every language to deal with ip networks/masks, it should be trivial to place incoming IPs into the right network for database updates [15:55:29] or that :) [15:55:54] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: puppet fail [15:56:19] in any case, for delegation to work right, the IPs do need to be placed in correct "zones", which are in agreement with the delegated zone names at the upstream server (prod authdns). [15:56:29] andrewbogott, I think the metaldns stuff should also work on the authoritative servers? [15:57:00] you can do that with RFC2317 for subnets, or you can delegate one zone per IP, those are the basic options. Or we can delegate a whole /24 without the RFC2317 syntax, but we don't appear to have one in this case. [15:57:01] would be better, yeah [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160719T1600). Please do the needful. [16:00:04] Smalyshev: A patch you scheduled for Puppet SWAT(Max 8 patches) is about to be deployed. Please be available during the process. 
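The "place incoming IPs into the right network" step discussed above (a short list of delegated CIDR networks in config, plus a standard IP library) might look roughly like this with Python's `ipaddress` module. This is a sketch under assumptions, not the labs script: `REVERSE_NETWORKS`, `rfc2317_zone` and `reverse_names` are hypothetical names, and the two networks are simply the ones given as an example in the chat.

```python
import ipaddress

# Hypothetical config value, taken from the example in the discussion.
REVERSE_NETWORKS = ["208.80.155.128/25", "208.80.154.0/26"]


def rfc2317_zone(network):
    """Return the classless reverse zone name for an IPv4 network below /24."""
    net = ipaddress.ip_network(network)
    if net.version != 4 or net.prefixlen <= 24:
        raise ValueError("expected an IPv4 network smaller than a /24")
    a, b, c, d = str(net.network_address).split(".")
    return f"{d}/{net.prefixlen}.{c}.{b}.{a}.in-addr.arpa"


def reverse_names(ip, networks=REVERSE_NETWORKS):
    """Map an address to (zone name, PTR owner name) inside its delegated zone."""
    addr = ipaddress.ip_address(ip)
    for network in networks:
        if addr in ipaddress.ip_network(network):
            zone = rfc2317_zone(network)
            last_octet = str(addr).split(".")[3]
            return zone, f"{last_octet}.{zone}"
    raise LookupError(f"{ip} is not in any configured reverse network")


if __name__ == "__main__":
    zone, owner = reverse_names("208.80.155.130")
    print(zone)   # zone the PTR record belongs in
    print(owner)  # owner name of the PTR record
```

For comparison, the plain reversed name that the simpler split/reverse/join approach would produce is available as `ipaddress.ip_address(ip).reverse_pointer`; the only extra knowledge this sketch needs is the list of delegated networks.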
[16:00:23] (03PS1) 10Andrew Bogott: Remove a typo space [puppet] - 10https://gerrit.wikimedia.org/r/299772 [16:00:24] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: puppet fail [16:01:21] !log swapping pem0-3 on cr1-eqiad [16:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:01:43] I'll puppet SWAT [16:02:45] (03CR) 10Andrew Bogott: [C: 032] Remove a typo space [puppet] - 10https://gerrit.wikimedia.org/r/299772 (owner: 10Andrew Bogott) [16:03:26] SMalyshev: I'm looking at https://gerrit.wikimedia.org/r/#/c/298880 I suppose the old config will be cleaned up manually? [16:03:48] !admin See bd808's comment above: Can somebody with rights on simplewiki delete https://simple.wikipedia.org/wiki/Portland_International_Beerfest for me? That was my test. [16:03:53] gehel: ^ as well since you were a reviewer [16:05:37] (03PS1) 10Mobrovac: Parsoid: Increase heap limit to 600 MB [puppet] - 10https://gerrit.wikimedia.org/r/299774 (https://phabricator.wikimedia.org/T90668) [16:06:00] 06Operations, 10hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2475848 (10RobH) [16:06:40] (03PS2) 10Giuseppe Lavagetto: Parsoid: Increase heap limit to 600 MB [puppet] - 10https://gerrit.wikimedia.org/r/299774 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [16:07:28] 06Operations, 10MediaWiki-General-or-Unknown, 06Release-Engineering-Team, 10Traffic, and 2 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2475862 (10BBlack) [16:07:51] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:08:50] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Parsoid: Increase heap limit to 600 MB [puppet] - 10https://gerrit.wikimedia.org/r/299774 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [16:09:21] PROBLEM - Juniper alarms on cr1-eqiad is 
CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 1 yellow alarms [16:10:06] ^^^ ignore that for now, it's just Chris doing a scheduled upgrade [16:12:00] (03PS1) 10Mobrovac: Parsoid: Lower the heartbeat timeout to 3 mins [puppet] - 10https://gerrit.wikimedia.org/r/299775 (https://phabricator.wikimedia.org/T90668) [16:12:40] (03CR) 10Paladox: [C: 031] Gerrit: Fix redirect to commit-msg, vary on $host [puppet] - 10https://gerrit.wikimedia.org/r/299706 (owner: 10Chad) [16:13:06] (03CR) 10Giuseppe Lavagetto: [C: 032] Parsoid: Lower the heartbeat timeout to 3 mins [puppet] - 10https://gerrit.wikimedia.org/r/299775 (https://phabricator.wikimedia.org/T90668) (owner: 10Mobrovac) [16:13:22] 06Operations, 10ops-eqiad, 10hardware-requests: eqiad: audit cisco servers for return/decom - https://phabricator.wikimedia.org/T140786#2475894 (10RobH) [16:13:31] 06Operations, 10ops-codfw, 10hardware-requests: eqiad: audit cisco servers for return/decom - https://phabricator.wikimedia.org/T140787#2475912 (10RobH) [16:14:02] 06Operations, 10ops-codfw, 10hardware-requests: codfw: audit cisco servers for return/decom - https://phabricator.wikimedia.org/T140787#2475929 (10RobH) [16:14:16] (03CR) 10Paladox: [C: 031] Gerrit: Enable proper backups from new hosts [puppet] - 10https://gerrit.wikimedia.org/r/299705 (owner: 10Chad) [16:15:14] bd808, matt_flaschen: I don't think you'll find many users with on-wiki privileges like that here [16:15:41] (03PS1) 10Yuvipanda: tools: Add limited sudo capabilities to toolschecker account [puppet] - 10https://gerrit.wikimedia.org/r/299777 [16:15:41] 06Operations, 10Gerrit, 13Patch-For-Review: Update gerrit sshkey in role::ci::slave::labs when upgrade to Jessie happens - https://phabricator.wikimedia.org/T131903#2475931 (10demon) I'm thinking we should copy the host key to the new machine as well so people don't get unexpected key mismatches. 
[16:16:06] 06Operations, 10hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2475933 (10RobH) [16:16:12] Krenair: where should I ask? Any idea? [16:16:17] (03PS2) 10Yuvipanda: tools: Add limited sudo capabilities to toolschecker account [puppet] - 10https://gerrit.wikimedia.org/r/299777 [16:16:24] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Add limited sudo capabilities to toolschecker account [puppet] - 10https://gerrit.wikimedia.org/r/299777 (owner: 10Yuvipanda) [16:16:36] bd808, #wikipedia-simple [16:16:48] thx [16:17:32] Krenair, not a huge number, but there are some. [16:18:03] !log oblivian@palladium conftool action : set/pooled=yes; selector: name=wtp1002.*,cluster=parsoid,dc=eqiad [16:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:19:29] godog: old config will go away since it was deployed into code dir, and scap uses new dir every time [16:20:03] SMalyshev: ok, I'll merge it then cc/ gehel [16:20:14] <_joe_> mobrovac, subbu superb job :) I rarely see such a complex transition go this smoothly [16:20:14] thanks! [16:20:41] indeed _joe_! 
[16:20:53] _joe_: wouldn't have been possible without your help, though [16:20:54] (03PS5) 10Filippo Giunchedi: Move updater logs config to /etc/wdqs [puppet] - 10https://gerrit.wikimedia.org/r/298880 (https://phabricator.wikimedia.org/T139434) (owner: 10Smalyshev) [16:20:56] much appreciated [16:21:01] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Move updater logs config to /etc/wdqs [puppet] - 10https://gerrit.wikimedia.org/r/298880 (https://phabricator.wikimedia.org/T139434) (owner: 10Smalyshev) [16:21:19] <_joe_> mobrovac: the global deploy is going to be *fun* [16:21:44] yeah [16:21:51] <_joe_> every server will need to be depooled, parsoid stopped, deploy, puppet run, verify, repool [16:22:05] <_joe_> so yeah, SO MUCH FUN [16:22:27] _joe_: hopefully we won't need to re-clone the deploy repo on each [16:22:35] <_joe_> I guess I'll prepare a script [16:22:41] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [16:22:53] <_joe_> mobrovac: hopefully, but I really have to understand what went wrong there :) [16:22:54] _joe_: but we might as well salt that beforehand and clone it into a different dir and then just do a mv [16:22:58] godog: thanks for the merge! [16:23:08] <_joe_> mobrovac: yeah something like that would do too [16:23:20] gehel: np, it is done SMalyshev ! [16:23:26] 06Operations: Debian repository supporting multiple package versions - https://phabricator.wikimedia.org/T115758#2475941 (10Eevans) >>! In T115758#2458271, @Eevans wrote: > [ ... ] > This would be really simple to do (I'm sure we could even find something that [[ https://git.autistici.org/ale/urepo | already doe... [16:23:36] <_joe_> mobrovac: I guess it's enough for me not to assume I can ignore how git-deploy works [16:23:42] _joe_, mobrovac thanks .. it is all arlo on the parsoid end. i am just acting managerial. 
[16:25:07] (03PS3) 10Giuseppe Lavagetto: parsoid::testing: move to use the parsoid class [puppet] - 10https://gerrit.wikimedia.org/r/299717 (https://phabricator.wikimedia.org/T90668) [16:26:48] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2475949 (10Gehel) [16:26:51] 07Blocked-on-Operations, 06Operations, 10Kartographer, 10Wikimedia-Extension-setup, and 3 others: Enable Interactive Maps (Kartographer) on Macedonian Wikipedia - https://phabricator.wikimedia.org/T139946#2475948 (10Gehel) [16:28:00] 07Blocked-on-Operations, 06Operations, 10Kartographer, 10Wikimedia-Extension-setup, and 3 others: Enable Interactive Maps (Kartographer) on Macedonian Wikipedia - https://phabricator.wikimedia.org/T139946#2447489 (10Gehel) Removing dependency on T133744. We are ok with enabling Kartographer on smaller wiki... [16:28:27] (03CR) 10jenkins-bot: [V: 04-1] parsoid::testing: move to use the parsoid class [puppet] - 10https://gerrit.wikimedia.org/r/299717 (https://phabricator.wikimedia.org/T90668) (owner: 10Giuseppe Lavagetto) [16:28:48] (03PS1) 10Yuvipanda: tools: Use a different tool for k8s webservice checks [puppet] - 10https://gerrit.wikimedia.org/r/299780 [16:28:48] <_joe_> 16:26:18 ./modules/toollabs/manifests/checker.pp:86 ERROR two-space soft tabs not used (2sp_soft_tabs) [16:28:51] <_joe_> 16:26:18 ./modules/toollabs/manifests/checker.pp:87 ERROR two-space soft tabs not used (2sp_soft_tabs) [16:28:57] <_joe_> grr [16:29:10] <_joe_> YuviPanda: ^^ [16:29:18] robh: while going through unassigned tasks, I found T109903. It is probably obvious to you what needs to happen, but it isnt to me... [16:29:18] T109903: add pdu redundancy checking to server/router/switch checks in icinga - https://phabricator.wikimedia.org/T109903 [16:29:38] <_joe_> YuviPanda: I'ma fix that now [16:29:41] robh: if you have a few minutes to chat, I'll add that to the task [16:30:08] joe uh, ok... [16:30:17] two-space? 
[16:30:22] <_joe_> yes [16:30:27] (CR) jenkins-bot: [V: -1] tools: Use a different tool for k8s webservice checks [puppet] - https://gerrit.wikimedia.org/r/299780 (owner: Yuvipanda) [16:31:31] (PS2) Yuvipanda: tools: Use a different tool for k8s webservice checks [puppet] - https://gerrit.wikimedia.org/r/299780 [16:31:33] (PS1) Yuvipanda: tools: Fix spacing [puppet] - https://gerrit.wikimedia.org/r/299782 [16:31:33] joe ok pushed a fix [16:31:58] <_joe_> geez gerrit is slow today [16:32:01] gehel: I have time to chat, but not sure how much I'll add to that particular task. So it basically is someone needs to go in and write (typically by editing another) check in icinga to poll for power supply redundancy [16:32:21] _joe_ https://gerrit.wikimedia.org/r/#/c/299782/ should fix it [16:32:39] its needed for two reasons, just to know if a power supply goes dead [16:32:40] <_joe_> let's wait for jenkins though [16:32:47] yeah [16:32:50] or if we lose power from an entire PDU/tower in a rack [16:33:21] the reason it was triggered (iirc) is we lost a tower (so the B side power) in a rack in ulsfo [16:33:21] (PS2) Chad: Gerrit: Go ahead and ensure lets_encrypt everywhere other than ytterbium [puppet] - https://gerrit.wikimedia.org/r/299678 [16:33:33] (CR) Giuseppe Lavagetto: [C: 2] tools: Fix spacing [puppet] - https://gerrit.wikimedia.org/r/299782 (owner: Yuvipanda) [16:33:46] and we only knew about it due to one of the devices attached had only single power supply.
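The check described above (alert when a power supply is dead or when a device has lost one power feed) could look roughly like the sketch below. It only covers the decision logic, using the standard Nagios/Icinga plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). How the per-PSU state is collected (SNMP, IPMI, or something else) is not specified in the discussion, so the input dict is an assumption, as is treating a single-PSU device as a WARNING.

```python
import sys
from typing import Dict, Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_psu_redundancy(psu_has_input: Dict[str, bool]) -> Tuple[int, str]:
    """Evaluate power redundancy for one device.

    psu_has_input maps a PSU name to True if that supply currently has input
    power. Collecting this data is out of scope here (hypothetical input).
    """
    if not psu_has_input:
        return UNKNOWN, "PSU UNKNOWN: no power supply data returned"

    dead = sorted(name for name, powered in psu_has_input.items() if not powered)
    if dead:
        return CRITICAL, "PSU CRITICAL: no input power on " + ", ".join(dead)

    if len(psu_has_input) < 2:
        # Assumption: a single-PSU device has no redundancy to lose, so flag it.
        return WARNING, "PSU WARNING: only one power supply present"

    return OK, "PSU OK: %d supplies with input power" % len(psu_has_input)


if __name__ == "__main__":
    # Made-up sample reading: the B-side feed is down.
    code, message = check_psu_redundancy({"PSU-A": True, "PSU-B": False})
    print(message)
    sys.exit(code)
```

Keeping the evaluation separate from data collection means the same logic could serve servers, switches and routers once a collector exists for each.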
[16:33:52] robh: there was a question on that task about whether the PDU we have in ulsfo are monitorable [16:34:01] (PS4) Giuseppe Lavagetto: parsoid::testing: move to use the parsoid class [puppet] - https://gerrit.wikimedia.org/r/299717 (https://phabricator.wikimedia.org/T90668) [16:34:02] oh, not really [16:34:11] we get nagios alerts from UnitedLayer [16:34:21] <_joe_> YuviPanda: please wait until I merged this change [16:34:22] but we have no way to snmp walk their pdu infrastructure [16:34:28] thanks joe [16:34:31] we snmp walk our own for info [16:34:46] (CR) Yuvipanda: [C: 2] tools: Use a different tool for k8s webservice checks [puppet] - https://gerrit.wikimedia.org/r/299780 (owner: Yuvipanda) [16:34:49] hence we need to monitor from the server/switch/router side power supply unputs [16:34:50] input [16:34:53] <_joe_> YuviPanda: wtf? [16:34:54] Ok, so we want to monitor that from the consumer side [16:35:04] its also a good idea to monitor from both so we know about dead power supplies as well [16:35:07] yep! [16:35:17] joe ? [16:35:25] robh: from both? what do you mean?
[16:35:26] <_joe_> this change as the one I just rebased [16:35:27] <_joe_> :P [16:35:29] <_joe_> lol [16:35:30] _joe_ bah, sorry, I didn't see your message in time :| [16:35:37] (PS5) Giuseppe Lavagetto: parsoid::testing: move to use the parsoid class [puppet] - https://gerrit.wikimedia.org/r/299717 (https://phabricator.wikimedia.org/T90668) [16:35:48] gehel: in our own sites, we should monitor both consumer side (servers/switches/routers) and PDU side [16:35:49] (CR) Giuseppe Lavagetto: [C: 2 V: 2] parsoid::testing: move to use the parsoid class [puppet] - https://gerrit.wikimedia.org/r/299717 (https://phabricator.wikimedia.org/T90668) (owner: Giuseppe Lavagetto) [16:35:51] since we provide our own PDUs and can snmp walk them for info [16:35:53] * YuviPanda stops merging things [16:35:58] and we do monitor our PDUs in our own sites [16:36:08] though not all the parts of them i'd like, we have basic monitoring [16:36:29] (i'd like us to also monitor the individual circuit banks in each PDU, but we dont right now) [16:37:00] robh: Ok, thanks a lot for all those info! I'll add some comment to the task [16:37:14] glad to assist =] [16:37:18] robh: not sure it is actually help moving it forward though... [16:37:37] well, it makes it more clear so if someone knows icinga checks they'll know what we need [16:38:24] gehel: part of clinic duty is the feeling that you are re-arranging deck chairs on the titanic. [16:38:28] ;D [16:38:41] robh: you were saying that we do receive alerts from UnitedLayer [16:38:52] RECOVERY - Start and verify pages via webservices on kubernetes on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 8.758 second response time [16:38:55] Yes, but not exactly reliably! [16:38:56] robh: the titanic before or after the iceberg? [16:39:08] Everyone seems happy, I'd say before. [16:39:30] robh: do we integrate them with our own icinga? Or do we just receive emails on some list?
[16:39:36] we just receive them into email [16:39:40] no integration [16:39:58] and even then its not exactly reliable. we didnt get an email when we lost one of the 3 breakers in that PDU [16:39:59] hey ops, ocg seems to be down [16:40:08] because they arent monitoring the individual fuses [16:40:12] is that a symptom of some other ops issue, or should i start digging into this? [16:40:12] (much like we arent either!) [16:40:33] Operations, ops-eqiad, netops: cr1/cr2-eqiad: install new SCBs and linecards - https://phabricator.wikimedia.org/T140764#2476006 (Cmjohnson) [16:40:36] Operations, ops-eqiad, netops: Replace cr1/2-eqiad PSUs/fantrays with high-capacity ones - https://phabricator.wikimedia.org/T140765#2476003 (Cmjohnson) Open>Resolved a:Cmjohnson Replaced all 4 PEM's in each mx-480. Turns out we already have high capacity fans in each. We now have 2 spa... [16:40:36] icinga doesn't seem to be complaining about it, so i'm guessing the ocg service is actually up but the PHP side config got b0rked somehow [16:40:43] cscott: im not aware of anything [16:40:57] elukey: ^ do you know anything about OCG? [16:40:59] nothing in deployments so no one should be messing with it [16:41:05] was there a mediawiki-config change recently? [16:41:23] Operations, ops-eqiad, fundraising-tech-ops: Rack and setup new fundraising queue servers - https://phabricator.wikimedia.org/T136882#2476007 (Cmjohnson) per Jeff's instruction...removed the cable from aluminum to frqueue1002. [16:41:54] cscott: last change 14 hours ago [16:42:23] gehel: nope [16:42:52] (PS2) Dzahn: Gerrit: Fix redirect to commit-msg, vary on $host [puppet] - https://gerrit.wikimedia.org/r/299706 (owner: Chad) [16:43:13] Operations, OCG-PDFRenderer: Export to PDF fails on enwiki - https://phabricator.wikimedia.org/T140789#2476009 (cscott) p:Triage>Unbreak!
[16:43:23] ouch I tried with wikipedia it and it seems indeed broken [16:44:04] cscott: let us know if / how we can help... [16:44:52] as you already said, the OCG health checks are all green [16:44:55] oops [16:45:01] so theoretically reads from a redis queue [16:45:17] and mw should push stuff in there afaik [16:45:29] Operations, OCG-PDFRenderer: Export to PDF fails on enwiki - https://phabricator.wikimedia.org/T140789#2476016 (cscott) mark was notified at about 10:58AM EDT. He pinged me on IRC, I started working on it 12:36pm EDT. Ops says no deployments recently, last change to mediawiki-config was "about 14 hours... [16:46:05] phab ticket is there ^ i'll post status as i figure things out. [16:46:19] like i said, i'm suspecting a PHP side config issue, i'll start looking at OCG logs [16:46:43] (PS1) Eevans: Enable instance restbase1013-c for bootstrap [puppet] - https://gerrit.wikimedia.org/r/299784 (https://phabricator.wikimedia.org/T134016) [16:46:50] cscott: Thanks! [16:46:53] I don't think I have access to logs on the production hhvm cluster though, if someone with the right bits in ops could look there for anything related to the collection extension, that would be helpful. [16:47:13] <_joe_> cscott: logstash ? [16:47:31] <_joe_> cscott: also, sorry but I cannot hop onto the ocg servers to get a look now [16:47:40] <_joe_> I am in the middle of the Great Parsoid Migration [16:47:53] Operations, hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2476018 (RobH) So in looking up old emails, I found that it is indeed 40 systems: Both ship sets shipped Oct 6th. 40 cartons. SO# 50808791 ss2 and ss3 Item - R250-2480805W = 40 servers. Weight... [16:48:00] _joe_: logstash collects the output of ocg, but from all appearances the ocg servers are fine. it's something php config side. i *think*.
[16:48:15] (PS1) Eevans: Enable instance restbase2003-c for bootstrap [puppet] - https://gerrit.wikimedia.org/r/299785 (https://phabricator.wikimedia.org/T134016) [16:48:22] <_joe_> cscott: they collect logs from hhvm as well [16:48:34] 15:19 logmsgbot: thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Remove Echo transition flags PART II (duration: 00m 30s) [16:48:34] 15:19 logmsgbot: thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: Remove Echo transition flags PART I (duration: 00m 26s) [16:48:50] that lines up right time-of-failure wise, i'm going to look at that closely as well. [16:49:02] <_joe_> cscott: seems like a good idea [16:49:03] i can see ocg.log on ocg1001. it looks like normal requests [16:49:04] cscott, I'm still around. [16:49:18] <_joe_> ok mutante can you help with this? [16:49:23] cscott, that just reverts back to the July 4th state, though. [16:49:24] also OK: ocg_job_status 323058 msg: ocg_render_job_queue 0 msg [16:50:00] (PS1) Mobrovac: Parsoid: testreduce: correct gitRepoPath [puppet] - https://gerrit.wikimedia.org/r/299786 (https://phabricator.wikimedia.org/T90668) [16:50:17] well, what can i do [16:50:25] cscott, basically we needed those transition flags until some maintenance scripts finished, due to change in how we sorted notifications and how bundling works. The maint scripts have finished, so we turned the flags off. [16:50:30] Operations, hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2476025 (RobH) The labsdb1001-1003 systems are still in active use. The current plan is to migrate off of them, onto new labsdb machines. Unfortunately, the timeline for this is in (5-6) mont... [16:50:47] cscott: do you need the log directly from an ocg server?
[16:51:04] it doesnt look like it has obvious errors [16:51:16] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Parsoid: testreduce: correct gitRepoPath [puppet] - https://gerrit.wikimedia.org/r/299786 (https://phabricator.wikimedia.org/T90668) (owner: Mobrovac) [16:51:18] mutante: i don't think so, at this point, like i said i'm concentrating on the php side right now. [16:51:32] that would support the php config ..yea [16:51:45] (PS3) Chad: Gerrit: Go ahead and ensure lets_encrypt everywhere other than ytterbium [puppet] - https://gerrit.wikimedia.org/r/299678 [16:52:00] beta is also failing, fwiw. [16:52:01] yeah I don't see anything on /var/log/ocg.log on ocg1001 [16:52:05] http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:Book&bookcmd=render_article&arttitle=English+language&oldid=98720&writer=rdf2latex [16:52:31] the php logs would be really interesting [16:52:42] i guess i'll hunt for them in logstash now [16:53:30] cscott, do you have access to production fluorine? [16:53:51] (PS3) Alex Monk: Delegate 208.80.155.128/25 (labs instances) PTR records to labs-ns* so they can be managed automatically [dns] - https://gerrit.wikimedia.org/r/299513 (https://phabricator.wikimedia.org/T104521) [16:53:53] That's where the PHP logs are. If so, you can also use that if logstash isn't working out for whatever reason. [16:54:12] "Request to http://ocg.svc.eqiad.wmnet:8000 resulted in error" [16:55:07] not much context in this message!
[16:55:48] (CR) Chad: "No change, and compilation failures from early magically disappeared :) https://puppet-compiler.wmflabs.org/3394/" [puppet] - https://gerrit.wikimedia.org/r/299678 (owner: Chad) [16:58:16] from ganglia, looks like we've been down almost 24 hours, so maybe i should be looking further back in the server admin log [16:58:57] curl http://ocg.svc.eqiad.wmnet:8000 from a random appserver works [16:59:07] or from ocg1001 itself [16:59:13] yep for me too [16:59:19] can someone who fully understands how ganglia deals with timezones pull out the exact UTC downtime from the network graph on https://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&m=cpu_report&c=PDF+servers+eqiad&h=ocg1001.eqiad.wmnet&tab=m&vn=&hide-hf=false&mc=2&z=small&metric_group=NOGROUPS ? [16:59:22] * gehel also confirms [16:59:25] i checked for ferm changes or something.. nope [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160719T1700). [17:01:10] mutante: yeah, seemed to work from tin as well, confirming what icinga is saying. [17:01:22] this bug should be an interesting one when we track it down [17:02:02] Monday 18:00 and the ganglia servers has UTC [17:02:13] well, date says so on uranium.. [17:02:36] or a little before that? [17:02:43] the graph i'm seeing *seems* to be in my local time zone (EDT) and says network traffic stopped around 13:00 monday EDT [17:02:56] Operations, Performance-Team, Thumbor: add thumbor to production infrastructure - https://phabricator.wikimedia.org/T139606#2476087 (fgiunchedi) wrt packaging, I've uploaded internally the thumbor-related packages for jessie, namely: ``` libthumbor_1.3.2-0+wmf1.dsc python-thumbor-community-core_0.4.... [17:03:18] godog: <3 [17:03:23] thank you [17:03:23] mutante: so i guess that's consistent. server admin log is in UTC, right?
[17:04:00] Operations, hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2476092 (RobH) @papaul confirms he has 18 on site (and will work via the sub-task to audit them). @Cmjohnson confirms he has 22 onsite (and will work via the sub-task to audit them). [17:04:11] 17:28 logmsgbot: demon@tin Synchronized wmf-config/InitialiseSettings.php: globally set $wgHTTPProxy (duration: 00m 26s) [17:04:15] i wonder if that did it? [17:04:18] ori: yw! gill.es put a ton of effort into those [17:05:29] cscott: yes, SAL also looks like UTC. ack [17:05:47] is chad still around? [17:06:15] i seem to recall that OCG uses the `urldownloader` service [17:06:59] Operations, Monitoring: add pdu redundancy checking to server/router/switch checks in icinga - https://phabricator.wikimedia.org/T109903#1562129 (Gehel) Some additional info as discussed with @RobH: * We don't have direct monitoring of PDU in ulsfo, we do get email alerts from UnitedLayer (those alerts... [17:07:20] cscott, looks like the PHP side uses Http::post to contact the service. [17:07:24] (CR) Cscott: "This might have broken OCG; see T140789." [mediawiki-config] - https://gerrit.wikimedia.org/r/299571 (https://phabricator.wikimedia.org/T140658) (owner: Chad) [17:07:32] (Abandoned) Alex Monk: dnsrecursor labsaliaser: Set up instance-$instance.$project.wmflabs.org domains for every instance with a public IP [puppet] - https://gerrit.wikimedia.org/r/299503 (https://phabricator.wikimedia.org/T104521) (owner: Alex Monk) [17:07:44] Operations, Discovery, Maps: Icinga is randomly loosing connectivity to maps1002 - https://phabricator.wikimedia.org/T138782#2476105 (Cmjohnson) Swapped the cable today. Let me know if gets better.
[17:07:50] Operations, MediaWiki-General-or-Unknown, Release-Engineering-Team, Traffic, and 4 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2471564 (cscott) This patch might have broken OCG; see T140789. [17:08:01] Operations, MediaWiki-extensions-ExtensionDistributor: ExtensionDistributor gives error message "Unable to fetch extension list!" - https://phabricator.wikimedia.org/T140753#2476111 (Legoktm) p:Normal>Unbreak! [17:08:03] Bleh. [17:08:16] Probably same problem as ExtensionDistributor. [17:09:07] cscott, which would indeed have been affected by the proxy change. [17:09:09] url-downloader seems to be failing for things that are internal which isn't fun. [17:09:17] hi [17:09:18] yeah [17:09:22] proxy can't reach internal stuff [17:09:25] cscott: the HTTP_PROXY thing makes sense [17:09:30] Operations, OCG-PDFRenderer: Export to PDF fails on enwiki - https://phabricator.wikimedia.org/T140789#2475986 (cscott) A number of independent confirmations that the OCG servers appear to be healthy and answering queries on port 8000. However, logstash reports "Request to http://ocg.svc.eqiad.wmnet:800... [17:09:34] Yeah, so forcing $wgHTTPProxy is bad. [17:09:34] ostriches: https://phabricator.wikimedia.org/T140658#2471564 [17:09:42] ED talks to Gerrit which is internal [17:09:45] When it's *internal* but not *local* (per MW's knowledge) [17:09:58] Operations, Discovery, Maps: Icinga is randomly loosing connectivity to maps1002 - https://phabricator.wikimedia.org/T138782#2476134 (Gehel) @Cmjohnson Thanks! I'll keep an eye on [[ https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=maps1002 | Icinga history for that host ]]. We'll see if... [17:10:01] I'll revert my config change for that. [17:10:02] yes, ED looks like the same issue [17:10:07] IIRC, we explicitly had to use urldownloader because its the only way to reach internal stuff.
[17:10:18] so making urldownloader break on internal stuff is... suboptimal. [17:10:45] PROBLEM - parsoid on ruthenium is CRITICAL: Connection refused [17:10:52] <_joe_> known ^^ [17:11:00] can we just quick-revert https://gerrit.wikimedia.org/r/299571 for now? [17:11:11] (PS1) Chad: Revert "Set $wgHTTPProxy globally instead of relying on getenv()" [mediawiki-config] - https://gerrit.wikimedia.org/r/299793 [17:11:19] oh, looks like ostriches already volunteered to do that. [17:11:20] I'm doing it now [17:11:23] I broke it [17:11:25] great. thanks! [17:11:49] i wish icinga had given more prompt notification of the problem. anyone have any ideas about that? [17:12:19] sounds like nobody noticed the problems with extension downloader for a while as well? [17:12:35] Yeah we probably don't have great logging in that part of MW for the failure. [17:12:41] i guess a custom icinga check could use url-downloader [17:13:08] Operations, Ops-Access-Requests, Editing-Analysis: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2476145 (HJiang-WMF) SSH public key: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDaary3YNmqf/fNSfBuhAkOQjOJhAKmwUjlKM5eQXlBMTyYdkFxdhEzBVKQeba2O8WSoazfx9fgmt...
[17:13:10] (PS2) Chad: Revert "Set $wgHTTPProxy globally instead of relying on getenv()" [mediawiki-config] - https://gerrit.wikimedia.org/r/299793 [17:13:15] (CR) Chad: [C: 2] Revert "Set $wgHTTPProxy globally instead of relying on getenv()" [mediawiki-config] - https://gerrit.wikimedia.org/r/299793 (owner: Chad) [17:13:58] (Merged) jenkins-bot: Revert "Set $wgHTTPProxy globally instead of relying on getenv()" [mediawiki-config] - https://gerrit.wikimedia.org/r/299793 (owner: Chad) [17:14:17] Operations, OCG-PDFRenderer: Export to PDF fails on enwiki - https://phabricator.wikimedia.org/T140789#2476159 (cscott) Seems that the proxy change caused the urldownloader service not to be able to reach internal hosts, which broken ExtensionDistributor as well (T140753). Doing a revert of T140658 (htt... [17:14:41] <_joe_> now please find out what's the issue there and fix the real issue [17:14:50] !log dbstore1002 swapping disk at slot 6 [17:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:28] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: Don't set everywhere, breaks internal to us but external to MW requests (eg gerrit, ocg, etc) (duration: 00m 25s) [17:15:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:15:43] cscott, legoktm: ED and OCG should be fixed. [17:15:58] urldownloader should ignore wgHttpProxy? [17:16:04] ostriches: confirmed. thanks! [17:16:09] * legoktm clears extdist caches [17:16:09] No, urldownloader doesn't know who's talking to it [17:16:25] Operations, OCG-PDFRenderer: Export to PDF fails on enwiki - https://phabricator.wikimedia.org/T140789#2476165 (cscott) Confirmed fixed at 1:15:43 PM EDT. [17:16:32] http://en.wikipedia.beta.wmflabs.org/w/index.php?title=Special:Book&bookcmd=rendering&return_to=English+language&collection_id=63eadd3017ae8a1085da4518baa7d65bb7ae79ab&writer=rdf2latex seems to be working now.
[17:16:35] In progress. [17:16:47] Operations, OCG-PDFRenderer: Export to PDF fails on enwiki - https://phabricator.wikimedia.org/T140789#2476169 (cscott) Open>Resolved [17:17:04] cscott: It's that we don't have ways to fine-tune "use a proxy" vs "don't use a proxy" except in calling code like ExtensionDist. [17:17:17] So each extension could end up with a "use the proxy" setting, depending on your use-case. [17:17:32] Rather, each *call* to a different http endpoint. [17:17:48] Could have a varying usages of proxies. [17:17:50] (PS1) Jcrespo: Changes on m3 grants to include unpuppetized users & dbproxy1003 [puppet] - https://gerrit.wikimedia.org/r/299796 (https://phabricator.wikimedia.org/T138460) [17:18:09] Hard to anticipate from core. It basically says "is this internal to the wiki yes/no. if no, use a proxy if we've got it" [17:18:38] And then you can explicitly say "use this proxy" or "don't use any proxy" from callers, but the callers then have to have that configurable somehow. [17:18:42] It's all rather ugly I think [17:18:53] ganglia confirms that the ocg servers are getting network traffic again [17:19:31] (PS2) Jcrespo: Changes on m3 grants to include unpuppetized users & dbproxy1003 [puppet] - https://gerrit.wikimedia.org/r/299796 (https://phabricator.wikimedia.org/T138460) [17:19:34] Operations, MediaWiki-extensions-ExtensionDistributor: ExtensionDistributor gives error message "Unable to fetch extension list!" - https://phabricator.wikimedia.org/T140753#2476193 (Paladox) It works now [17:19:41] Operations, MediaWiki-extensions-ExtensionDistributor: ExtensionDistributor gives error message "Unable to fetch extension list!" - https://phabricator.wikimedia.org/T140753#2476195 (Legoktm) Open>Resolved a:demon Fixed by @demon by reverting the changes to $wgHttpProxy. Caches might need ~30...
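The decision described above ("is this internal to the wiki yes/no. if no, use a proxy if we've got it", with callers able to force "use this proxy" or "don't use any proxy") can be illustrated with a minimal sketch. This is not MediaWiki's actual code; the function name, parameters and proxy URL are hypothetical and only model the behaviour as stated in the conversation.

```python
from typing import Optional, Set
from urllib.parse import urlsplit

_UNSET = object()  # sentinel: the caller expressed no preference


def choose_proxy(url: str,
                 local_hosts: Set[str],
                 default_proxy: Optional[str] = None,
                 override=_UNSET) -> Optional[str]:
    """Return the proxy to use for url, or None for a direct connection.

    override: a proxy URL ("use this proxy"), None ("don't use any proxy"),
    or left unset to fall back to the generic local/non-local rule.
    """
    if override is not _UNSET:
        return override
    if urlsplit(url).hostname in local_hosts:
        return None  # local to the wiki: connect directly
    return default_proxy  # everything else: proxy, if one is configured


if __name__ == "__main__":
    local = {"en.wikipedia.org"}              # illustrative "local" host list
    proxy = "http://proxy.example.org:8080"   # hypothetical proxy address
    service = "http://ocg.svc.eqiad.wmnet:8000"
    # Internal but not "local": the generic rule routes it through the proxy.
    print(choose_proxy(service, local, proxy))
    # The calling code has to opt out explicitly to reach it directly.
    print(choose_proxy(service, local, proxy, override=None))
```

The sketch shows why a globally forced proxy affects services that are internal to the network but not local from the wiki's point of view: only a per-call override distinguishes them.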
[17:19:45] PROBLEM - MegaRAID on dbstore1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [17:20:01] thanks ostriches [17:20:06] yw [17:20:20] i'll admit this is a part of OCG/the service architecture i don't fully understand. [17:20:45] The core patch fixes the issue with environment variables but I'm still not a fan of our "should I proxy" logic. [17:21:00] cscott: It's not actually ocg, it's the Collection extension or w/e it's called :) [17:21:08] That's the part I broke [17:21:11] ostriches: https://github.com/wikimedia/mediawiki/blob/master/includes/DefaultSettings.php#L7998 is that set properly? [17:21:59] (PS3) Dzahn: Gerrit: Fix redirect to commit-msg, vary on $host [puppet] - https://gerrit.wikimedia.org/r/299706 (owner: Chad) [17:22:01] ostriches: well, it's the part of the whole system architecture where user requests get forwarded over the internal network to the OCG servers via the Collection extension. [17:22:17] legoktm: Ahhhh, that *would* work probably. Maybe. [17:22:36] I'm curious what else calls isLocalURL() that could get funny [17:22:40] * Command-line scripts are not affected by this setting and will always use [17:22:40] * proxy if it is configured. [17:22:43] that part is weird [17:22:43] But as far as HttpFunctions is concerned that would work [17:22:49] Bleh. [17:22:50] Ew. [17:23:28] Wait what? How on earth would isLocalUrl() know if a command line can serve from a localhost? [17:23:41] (Or not, rather) [17:24:14] (CR) Dzahn: [C: 2] Gerrit: Fix redirect to commit-msg, vary on $host [puppet] - https://gerrit.wikimedia.org/r/299706 (owner: Chad) [17:24:27] Operations, ops-eqiad, DBA: dbstore1002 disk errors - https://phabricator.wikimedia.org/T140337#2476206 (Cmjohnson) Replaced the disk, it's rebuilding Enclosure Device ID: 32 Slot Number: 6 Drive's position: DiskGroup: 0, Span: 3, Arm: 0 Enclosure position: 1 Device Id: 6 WWN: 5000C50096BF9FDC Seque...
[17:24:40] legoktm: It also lacks any support for https, which is fun. [17:24:54] yay! [17:25:07] I'm going to find food now, I'll be back in a bit [17:26:15] (CR) Dzahn: "@paladox deployed on lead" [puppet] - https://gerrit.wikimedia.org/r/299706 (owner: Chad) [17:26:49] Operations, ops-eqiad, DBA: dbstore1002 disk errors - https://phabricator.wikimedia.org/T140337#2476215 (Cmjohnson) Return shipping information USPS: 9202 3946 5301 2421 2675 30 FEDEX: 9611918 2393026 70517551 [17:27:43] (PS4) Dzahn: Gerrit: Go ahead and ensure lets_encrypt everywhere other than ytterbium [puppet] - https://gerrit.wikimedia.org/r/299678 (owner: Chad) [17:27:45] (CR) Paladox: "@chad could you test please since I doint use git-review so I carnt test." [puppet] - https://gerrit.wikimedia.org/r/299706 (owner: Chad) [17:28:12] (CR) Catrope: [C: -2] "Not yet, no consensus, see phab task" [mediawiki-config] - https://gerrit.wikimedia.org/r/277529 (https://phabricator.wikimedia.org/T130009) (owner: Catrope) [17:28:14] (CR) Dzahn: [C: 2] "per compiler link (noop), we are going to use Letsencrypt for Gerrit" [puppet] - https://gerrit.wikimedia.org/r/299678 (owner: Chad) [17:28:25] Operations, Ops-Access-Requests, Editing-Analysis: Requesting access to research groups for Helen Jiang - https://phabricator.wikimedia.org/T140659#2476220 (Jdforrester-WMF) Management confirmation. [17:30:25] (PS3) Dzahn: Gerrit: Enable proper backups from new hosts [puppet] - https://gerrit.wikimedia.org/r/299705 (owner: Chad) [17:34:16] ostriches: fyi: https://phabricator.wikimedia.org/T140763#2475348 [17:34:18] re branch cut [17:35:51] (PS1) RobH: decom pc1001-1003 dns entries [dns] - https://gerrit.wikimedia.org/r/299800 [17:36:39] (CR) RobH: [C: 2] decom pc1001-1003 dns entries [dns] - https://gerrit.wikimedia.org/r/299800 (owner: RobH) [17:37:04] greg-g: ack.
[17:37:35] Operations, ops-eqiad, fundraising-tech-ops: decommission aluminium, replace it with frqueue1002 - https://phabricator.wikimedia.org/T140676#2472440 (Cmjohnson) aluminium cables have been removed. [17:38:14] Operations, ops-eqiad, media-storage: diagnose failed disks on ms-be1027 - https://phabricator.wikimedia.org/T140374#2476266 (Cmjohnson) Disk was sent to SF office not data center. Working on getting the disk sent or a new one sent from Dell. [17:40:20] (PS3) Jcrespo: Changes on m3 grants to include unpuppetized users & dbproxy1003 [puppet] - https://gerrit.wikimedia.org/r/299796 (https://phabricator.wikimedia.org/T138460) [17:40:38] (CR) Jcrespo: [C: 2 V: 2] Changes on m3 grants to include unpuppetized users & dbproxy1003 [puppet] - https://gerrit.wikimedia.org/r/299796 (https://phabricator.wikimedia.org/T138460) (owner: Jcrespo) [17:41:36] !log starting mobileapps deployment [17:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:42:04] (CR) Dzahn: "looks like this removes the current backup job from old server https://puppet-compiler.wmflabs.org/3395/ ?" [puppet] - https://gerrit.wikimedia.org/r/299705 (owner: Chad) [17:43:12] Operations, ops-eqiad, DC-Ops, Patch-For-Review: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#2476292 (RobH) Open>Resolved a:RobH I just removed all dns entires, killed the switch ports, & confirmed with @Cmjohnson that these systems have been wiped. [17:43:14] Operations, ops-eqiad, DC-Ops: What to do with decommissioned ciscos? - https://phabricator.wikimedia.org/T103374#2476295 (RobH) [17:43:29] !log mobileapps deployed aa9115a [17:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:18] (CR) Chad: "Typo in hiera, will amend."
[puppet] - https://gerrit.wikimedia.org/r/299705 (owner: Chad) [17:44:25] Operations, hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2087114 (RobH) [17:44:27] Operations, ops-eqiad, DC-Ops: What to do with decommissioned ciscos? - https://phabricator.wikimedia.org/T103374#2476310 (RobH) Open>Resolved All of the assorted tasks instead are linked off T128821. [17:44:36] Operations, ops-eqiad, DC-Ops: What to do with decommissioned ciscos? - https://phabricator.wikimedia.org/T103374#1388150 (jcrespo) labsdb1002 can be removed, too- it is essentially dead (although maybe the disks are ours?). [17:45:07] Operations, ops-eqiad, DC-Ops: What to do with decommissioned ciscos? - https://phabricator.wikimedia.org/T103374#2476331 (RobH) @jcrespo: Awesome! thanks for the info. [17:45:20] Operations, hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2087114 (RobH) >>! In T103374#2476328, @jcrespo wrote: > labsdb1002 can be removed, too- it is essentially dead (although maybe the disks are ours?). [17:45:33] so many tasks that do similar things [17:45:49] cleanup is channel spammy ;] [17:46:52] robh, the task is https://phabricator.wikimedia.org/T126946 [17:46:55] (CR) Dzahn: [C: -1] "yea, in the catalog there is no more "backup::set" after this change on ytterbium" [puppet] - https://gerrit.wikimedia.org/r/299705 (owner: Chad) [17:47:38] Operations, DBA, Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2476338 (RobH) [17:47:39] noted, thank you =] [17:47:41] Operations, hardware-requests: reclaim and return all cisco servers - https://phabricator.wikimedia.org/T128821#2476337 (RobH) [17:47:50] now linked as sub-task, so i'll get over to it and process [17:48:07] decommission all things!
[17:48:12] (CR) Dzahn: Gerrit: Enable proper backups from new hosts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/299705 (owner: Chad) [17:49:36] !log applying new grants to m3 dbs in preparation for db1043 failover/proxy implementation [17:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:49:43] (PS4) Chad: Gerrit: Enable proper backups from new hosts [puppet] - https://gerrit.wikimedia.org/r/299705 [17:50:36] amend conflict :) [17:50:38] (PS5) Dzahn: Gerrit: Enable proper backups from new hosts [puppet] - https://gerrit.wikimedia.org/r/299705 (owner: Chad) [17:51:01] i had hit enter before PS4 appeared [17:54:37] (PS6) Dzahn: Gerrit: Enable proper backups from new hosts [puppet] - https://gerrit.wikimedia.org/r/299705 (owner: Chad) [17:55:01] (CR) Dzahn: [C: 2] "yep, this looks good:, no change on ytterbium, proper backup on lead http://puppet-compiler.wmflabs.org/3396/" [puppet] - https://gerrit.wikimedia.org/r/299705 (owner: Chad) [17:56:21] !log cr1-eqiad: restart chassis-control immediately (should not be traffic affecting) [17:56:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:57:16] RECOVERY - Juniper alarms on cr1-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms [17:58:03] cool [17:58:30] Operations, ops-eqiad, netops: Replace cr1/2-eqiad PSUs/fantrays with high-capacity ones - https://phabricator.wikimedia.org/T140765#2476399 (faidon) Sidenote: on cr1-eqiad, we were bitten again by the same issue as T89999. Like before, I ran `restart chassis-control immediately` which resolved the i... [18:00:44] (PS2) Dzahn: admin: create shell account for mpany [puppet] - https://gerrit.wikimedia.org/r/299702 (https://phabricator.wikimedia.org/T140399) [18:01:32] (CR) Dzahn: [C: 2] "going ahead with this. we'll have to add the users before we add them to groups anyways. 
then we can add them all at once in one change th" [puppet] - 10https://gerrit.wikimedia.org/r/299702 (https://phabricator.wikimedia.org/T140399) (owner: 10Dzahn) [18:07:42] .win 82 [18:11:16] (03PS1) 10Jcrespo: Set db1048 as the primary master on the m3 proxy (not yet in use) [puppet] - 10https://gerrit.wikimedia.org/r/299805 (https://phabricator.wikimedia.org/T138460) [18:11:34] gerrit client for android, fwiw: https://f-droid.org/forums/topic/mgerrit/ | https://play.google.com/store/apps/details?id=com.jbirdvegas.mgerrit [18:14:02] 06Operations, 10MediaWiki-General-or-Unknown, 06Release-Engineering-Team, 10Traffic, and 4 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#2476469 (10demon) >>! In T140658#2476106, @cscott wrote: > This patch might have broken OCG; see T140789. Yep, th... [18:14:48] (03PS5) 10Dzahn: typos file: add 'mariabd' and 'eqad' [puppet] - 10https://gerrit.wikimedia.org/r/298033 [18:15:11] (03CR) 10Dzahn: [C: 032] typos file: add 'mariabd' and 'eqad' [puppet] - 10https://gerrit.wikimedia.org/r/298033 (owner: 10Dzahn) [18:17:55] 06Operations, 10Analytics, 10MediaWiki-extensions-CentralNotice, 10Traffic: Generate a list of junk CN cookies being sent by clients - https://phabricator.wikimedia.org/T132374#2476507 (10AndyRussG) @BBlack What are the advantages to using a regex rather than an explicit list? [[ https://www.mediawiki.org/... [18:19:09] (03PS1) 10Chad: add-ldap-user: Don't use sillyshell, it's silly (and doesn't exist anymore) [puppet] - 10https://gerrit.wikimedia.org/r/299812 (https://phabricator.wikimedia.org/T86668) [18:20:53] RECOVERY - parsoid on ruthenium is OK: HTTP OK: HTTP/1.1 200 OK - 1514 bytes in 0.382 second response time [18:22:44] ostriches: oh, I know I merged your patch arleady, but I'm not sure how it's going to interact with stuff like blog.wm.o which is used by the RSS extension and needs a proxy since it's external [18:24:37] legoktm: Yeah, it's a catch 22. 
[18:24:44] (03PS2) 10Dzahn: Gerrit: Don't install defaults file, package provides it [puppet] - 10https://gerrit.wikimedia.org/r/299163 (owner: 10Chad) [18:24:54] legoktm: T140658#2476469 kind of summarizes it [18:24:55] T140658: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658 [18:25:04] yeah, read that [18:31:46] (03PS2) 10Jcrespo: Set db1048 as the primary master on the m3 proxy (not yet in use) [puppet] - 10https://gerrit.wikimedia.org/r/299805 (https://phabricator.wikimedia.org/T138460) [18:32:22] (03CR) 10Jcrespo: [C: 032 V: 032] Set db1048 as the primary master on the m3 proxy (not yet in use) [puppet] - 10https://gerrit.wikimedia.org/r/299805 (https://phabricator.wikimedia.org/T138460) (owner: 10Jcrespo) [18:34:29] (03PS3) 10Dzahn: Gerrit: Don't install defaults file, package provides it [puppet] - 10https://gerrit.wikimedia.org/r/299163 (owner: 10Chad) [18:34:49] (03CR) 10Dzahn: [C: 032] "yep, file exists with identical content on both servers" [puppet] - 10https://gerrit.wikimedia.org/r/299163 (owner: 10Chad) [18:38:24] PROBLEM - haproxy failover on dbproxy1003 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [18:40:25] mutante: the wikis are not affected by https://httpoxy.org right ? [18:41:13] matanya: there was a fix for that [18:41:45] mutante: Ah, i see https://phabricator.wikimedia.org/T140658 [18:41:48] thanks [18:41:57] 06Operations, 10DBA: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#1630902 (10Mattflaschen-WMF) Do we have a list of known causes for this? Started a list in the description of this (Probably most of these were not caused by that bug, I just want to have a list). 
[18:42:32] 06Operations, 10DBA: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#2476598 (10Mattflaschen-WMF) [18:43:14] matanya: https://gerrit.wikimedia.org/r/#/c/299568/2/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb [18:43:25] yes [18:45:13] MatmaRex: https://phabricator.wikimedia.org/T140763#2476261 ? [18:45:18] (I'd like to get started) [18:45:38] jynus: FYI, ORES is deployed on wikidata on June 15th, ores_classification table is 389 MBs now [18:45:58] mmmmm [18:46:15] I will have a look at it later [18:46:17] by removing a super redundant index we might improve it a lot, but I don't how that's okay for you [18:46:29] removing indexes is easy [18:46:35] just set a task [18:46:38] https://github.com/wikimedia/mediawiki-extensions-ORES/blob/master/sql/ores_classification.sql#L21 [18:46:43] 06Operations, 10Analytics, 10MediaWiki-extensions-CentralNotice, 10Traffic: Generate a list of junk CN cookies being sent by clients - https://phabricator.wikimedia.org/T132374#2476603 (10BBlack) It would just be simpler, but we can do a list. The data we have in Cookies_to_remove is just from Ori's 20-mi... 
[18:47:03] sure [18:47:18] https://www.irccloud.com/pastebin/TZTns7Kl/ [18:47:25] jynus: here's the details ^ [18:47:55] you can run it on nlwiki, ptwiki, ruwiki, fawiki, trwiki, but they would be much smaller [18:50:35] Amir1, that is not how things work [18:51:13] you need to follow https://wikitech.wikimedia.org/wiki/Schema_changes [18:51:24] 06Operations, 10DBA, 06Revision-Scoring-As-A-Service, 07Schema-change: Remove oresc_rev index - https://phabricator.wikimedia.org/T140803#2476605 (10Ladsgroup) [18:51:38] thanks [18:51:42] definitely, I just made the phab card [18:52:34] 06Operations, 10DBA, 06Revision-Scoring-As-A-Service, 07Blocked-on-schema-change, 07Schema-change: Remove oresc_rev index - https://phabricator.wikimedia.org/T140803#2476618 (10jcrespo) [18:52:41] you also need to commit the change to tables.sql [18:52:51] and a migration script for update.php [18:53:14] (but that is mediawiki, I am only interested on the WMF side) [18:53:34] (03CR) 10Thcipriani: [C: 031] Move the Analytics Refinery role to scap3 [puppet] - 10https://gerrit.wikimedia.org/r/299719 (https://phabricator.wikimedia.org/T129151) (owner: 10Elukey) [18:54:28] jynus: it's an extension, I don't think that's necessary. 
Correct me If I'm wrong [18:54:36] if you are interested, the mediawiki side is on https://www.mediawiki.org/wiki/Development_policy#Database_patches [18:54:58] I do not know, that should be discussed with mediawiki devels [18:55:00] I tried to do some schema changes for core before [18:55:13] I am happy with the ticket for wmf [18:56:18] now, if you make wmf and mediawiki (no matter if it is an extension) drift, the DBA will get sad [18:58:26] ostriches: sorry, i was afk [18:58:43] ostriches: yeah, that looks good then, i didn't realize it's already merged [18:59:55] !log Disabling puppet on restbase1013.eqiad.wmnet : T134016 [18:59:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:00:04] ostriches: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160719T1900). [19:00:04] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:00:38] (03PS4) 10Dzahn: zuul: switch gerrit server to lead [puppet] - 10https://gerrit.wikimedia.org/r/299592 (https://phabricator.wikimedia.org/T125018) [19:00:56] jynus: I ask from Wikidata people how their DB changes was implemented and I use that [19:01:04] would it be okay? [19:01:11] sure [19:01:21] as I said, devel issue, not operations [19:01:35] MatmaRex: Ok awesome, branching now [19:01:37] choo choo [19:01:48] it is just that my first question is: where is this change merged? [19:02:13] (03CR) 10Dzahn: "@Hashar amended to simply switch to "lead" as in PS1" [puppet] - 10https://gerrit.wikimedia.org/r/299592 (https://phabricator.wikimedia.org/T125018) (owner: 10Dzahn) [19:02:40] I haven't made the patch yet. [19:02:47] I'll do [19:03:10] 👌 [19:03:12] Could I get someone to merge https://gerrit.wikimedia.org/r/#/c/299784/ for me? 
It's another Cassandra instance bootstrap, an ongoing part of T134016 [19:03:12] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:03:18] just let me check with wikidata people and then I make it happen [19:03:29] *very* routine at this point [19:03:39] (03CR) 10Chad: "Shouldn't it go as a part of I312d405c then?" [puppet] - 10https://gerrit.wikimedia.org/r/299592 (https://phabricator.wikimedia.org/T125018) (owner: 10Dzahn) [19:03:57] (03PS2) 10Jcrespo: Enable instance restbase1013-c for bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/299784 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [19:04:29] (03CR) 10Jcrespo: [C: 032 V: 032] Enable instance restbase1013-c for bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/299784 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [19:04:43] jynus: thank you so much! [19:05:20] you should have puppet run rights [19:05:35] (03CR) 10Dzahn: "yea, agree, merging that into one change" [puppet] - 10https://gerrit.wikimedia.org/r/299592 (https://phabricator.wikimedia.org/T125018) (owner: 10Dzahn) [19:05:48] jynus: :) [19:06:26] it was more like a question, that is right, isn't it? [19:06:53] jynus: I don't have +2, no. [19:06:56] no [19:07:03] oh! [19:07:04] I said puppet right [19:07:05] jynus: yes. [19:07:06] *run [19:07:10] i see what you mean [19:07:15] yeah, i got it from here [19:07:44] I was asking if you needed to do a manual run, and assumend you didn't need me for that [19:08:08] jynus: no, i have root on these machines and can run it [19:08:15] great [19:08:19] (03PS4) 10Dzahn: WIP: Gerrit: Swap lead to point at production data [puppet] - 10https://gerrit.wikimedia.org/r/298673 (owner: 10Chad) [19:08:19] jynus: but thanks! 
[19:09:08] (03Abandoned) 10Dzahn: zuul: switch gerrit server to lead [puppet] - 10https://gerrit.wikimedia.org/r/299592 (https://phabricator.wikimedia.org/T125018) (owner: 10Dzahn) [19:12:24] !log Starting bootstrap of restbase1013-c.eqiad.wmnet : T134016 [19:12:27] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:12:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:13:00] 06Operations, 10Analytics, 10MediaWiki-extensions-CentralNotice, 10Traffic: Generate a list of junk CN cookies being sent by clients - https://phabricator.wikimedia.org/T132374#2476731 (10AndyRussG) >>! In T132374#2476603, @BBlack wrote: > It would just be simpler, but we can do a list. Fantastic, thx!!... [19:14:01] (03PS1) 10Andrew Bogott: Change the case of the rabbitmq collector class [puppet] - 10https://gerrit.wikimedia.org/r/299822 [19:14:11] (03PS1) 10Cmjohnson: Removing dns entries for db1058 [dns] - 10https://gerrit.wikimedia.org/r/299823 [19:14:36] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for db1058 [dns] - 10https://gerrit.wikimedia.org/r/299823 (owner: 10Cmjohnson) [19:17:07] (03CR) 10Andrew Bogott: [C: 032] Change the case of the rabbitmq collector class [puppet] - 10https://gerrit.wikimedia.org/r/299822 (owner: 10Andrew Bogott) [19:17:51] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2476755 (10Cmjohnson) DNS Removed...@jcrespo I do see some entries in puppet manifests/role/coredb.pp: 'hosts' => { 'eqiad' => [ 'db1021', 'db1026', 'db1037', 'db1045', 'db1049',... 
[19:21:28] 06Operations, 10Analytics-Cluster, 06Analytics-Kanban, 10EventBus, and 2 others: Better monitoring for Zookeeper - https://phabricator.wikimedia.org/T137302#2476785 (10Nuria) 05Open>03Resolved [19:21:45] (03PS1) 10BryanDavis: logstash: Parse nginx access logs for wdqs [puppet] - 10https://gerrit.wikimedia.org/r/299825 [19:23:24] (03CR) 10BryanDavis: "We should probably test this somewhere. Is there a wdqs setup that logs to the beta cluster logstash?" [puppet] - 10https://gerrit.wikimedia.org/r/299825 (owner: 10BryanDavis) [19:23:36] (03CR) 10Chad: "Doesn't have to go in prior to the upgrade, mainly some cleanups to make life easier going forward." [debs/gerrit] - 10https://gerrit.wikimedia.org/r/299164 (owner: 10Chad) [19:25:11] 06Operations, 10ops-eqiad, 13Patch-For-Review: Decommission broken db1058 - https://phabricator.wikimedia.org/T134360#2476813 (10jcrespo) That is a deprecated script, and I am waiting for this week's failover to nuke it completely (coredb otherwise is not in use). 
[19:31:09] !log demon@tin Purged l10n cache for 1.28.0-wmf.6 [19:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:15] jynus: here's the patch: https://gerrit.wikimedia.org/r/#/c/299827/ [19:31:20] !log demon@tin Purged l10n cache for 1.28.0-wmf.7 [19:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:32] !log demon@tin Purged l10n cache for 1.28.0-wmf.7 [19:31:44] !log demon@tin Purged l10n cache for 1.28.0-wmf.8 [19:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:54] !log demon@tin Purged l10n cache for 1.28.0-wmf.9 [19:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:32:57] Amir1, looks good to me [19:33:22] I will contact you again after it got merged [19:33:24] link it and I will deploy the schema change ASAP (not today) [19:33:27] thank you [19:33:36] I am being strict [19:33:40] thanks [19:33:44] (03PS1) 10Chad: group0 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299829 [19:33:49] because look what happens if not: [19:34:08] https://phabricator.wikimedia.org/T132416 [19:34:09] 06Operations, 10Wikimedia-Apache-configuration, 07HHVM, 07Wikimedia-log-errors: Fix Apache proxy_fcgi error "Invalid argument: AH01075: Error dispatching request to" (Causing HTTP 503) - https://phabricator.wikimedia.org/T73487#2476874 (10hashar) Awesome! As a side note, the logstash syslog filter discard... 
[19:34:18] (not related to anything you do) [19:34:33] I saw the task before, I totally understand [19:34:34] hope you understand why it is important to keep consistency [19:34:43] definitely [19:35:28] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: Connection refused [19:36:31] (03PS1) 10Yuvipanda: tools: Fix webservice toolschecker check [puppet] - 10https://gerrit.wikimedia.org/r/299831 [19:37:07] (03CR) 10Chad: [C: 032] group0 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299829 (owner: 10Chad) [19:37:27] (03PS2) 10Yuvipanda: tools: Fix webservice toolschecker check [puppet] - 10https://gerrit.wikimedia.org/r/299831 [19:37:42] (03Merged) 10jenkins-bot: group0 to wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299829 (owner: 10Chad) [19:38:29] !log demon@tin Started scap: group0 to wmf.11 [19:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:10] I've got a 503 error on Wikidata [19:39:14] Is this fixed? [19:39:49] abian, url? [19:39:54] Okay, seems so [19:40:08] which url gave you the error? 
[19:40:16] https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Constraint_violations/P570&curid=15087958&diff=358294930&oldid=358041578 [19:41:21] (I am getting a bit lost on the new kibanad interface :-P) [19:44:05] I see the error now: Memcached error for key "{memcached-key}" on server "{memcached-server}": SERVER ERROR [19:46:10] abian, you seem to have been lucky by getting a rare error code [19:46:36] I will monitor memcache in case it happens again, but it doesn't seem very frequent [19:47:34] * abian throws confetti [19:47:42] !log Lowering compaction throughput from 60MB/s to 45MB/s on restbase1013-c.eqiad.wmnet : T134016 [19:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:48:01] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:48:43] it could be an actual software problem "item too big" to fit into a memcache value? [19:49:01] the page you mentioned was certainly quite large [19:49:06] I still have the HTML content of the error page, but I think that won't be useful for you [19:49:13] although the error was not repatable [19:49:25] no need, I can see the errors on the log [19:49:55] if it doesn't happen again, just do not worry [19:50:03] I will keep monitoring it, just in case [19:51:56] Great, thanks :) [19:52:06] thanks to you for reporting it! [19:53:17] does anybody know how to find the old kibana dashboards list? [19:53:35] e.g. 
I want to see the memcache one [19:53:58] oh, I found it [19:54:10] for some reason it was not showing me the home page [19:54:32] 06Operations, 06Services, 06Services-next, 07Security-General: Create UserData service to protect sensitive user-related information - https://phabricator.wikimedia.org/T140813#2476991 (10GWicke) [19:54:50] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Create UserData service to protect sensitive user-related information - https://phabricator.wikimedia.org/T140813#2477012 (10GWicke) [19:55:07] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Create UserData service to protect sensitive user-related information - https://phabricator.wikimedia.org/T140813#2476991 (10GWicke) [19:55:15] 06Operations, 10ops-codfw, 10ops-eqiad: ship 7 ex4200s from codfw to eqiad - https://phabricator.wikimedia.org/T140655#2477018 (10Jgreen) [19:55:47] 06Operations, 06Performance-Team, 06Services, 07Availability, and 3 others: Create restbase BagOStuff subclass (session storage) - https://phabricator.wikimedia.org/T137272#2363223 (10GWicke) [19:55:54] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Create UserData service to protect sensitive user-related information - https://phabricator.wikimedia.org/T140813#2476991 (10GWicke) [19:57:36] !log Reducing stream throughput on restbase1013-{a,b} to 20MB/s : T134016 [19:57:37] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [19:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:59:02] jynus: the new kibana does take a bit of getting used to. https://logstash.wikimedia.org/app/kibana#/dashboard/default should always get you back to the "start" [19:59:18] I am open-minded [19:59:35] I just feel a bit dumb sometimes [20:00:02] I'm still having a "who moved my cheese!" 
reaction to kibana4 [20:00:14] I was almost ready to jump back to grep! [20:01:58] bd808: read that book [20:02:55] I have. I have mixed feelings about it [20:03:03] bd808: yeah, same. [20:03:07] i was made to read it :/ [20:03:21] !log demon@tin Finished scap: group0 to wmf.11 (duration: 24m 52s) [20:03:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:46] (03CR) 10Chad: "Oh dur, you're right. Just delisting not hiding." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/298544 (https://phabricator.wikimedia.org/T116948) (owner: 10Awight) [20:04:53] urandom: sorry :/ That probably means that someone was trying to use it as a stick to beat your team [20:05:04] bd808: you are correct. [20:06:08] !log Disabling Puppet on restbase2003.codfw.wmnet : T134016 [20:06:09] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [20:06:12] I read it during a summer of "trying to think like an MBA" [20:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:06:41] bd808: yikes; why on Earth would you want to do that? :) [20:06:42] which was a useful exercise for understanding my sales team at the time [20:06:49] auh [20:07:33] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2477110 (10GWicke) [20:07:52] I find it easier to debate things when I can see the other side with some empathy [20:08:09] yeah, fair enough [20:08:11] and I had many thing to debate with sales ;) [20:09:34] Could I get someone to merge https://gerrit.wikimedia.org/r/#/c/299785/ for me? It's another Cassandra bootstrap, routine, safe, and the last one I will ask for today. 
:) [20:13:50] ACKNOWLEDGEMENT - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: Connection refused eevans Node is bootstrapping - The acknowledgement expires at: 2016-07-21 20:13:34. [20:17:58] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [20:19:37] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2477201 (10GWicke) [20:19:49] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5098654 keys - replication_delay is 0 [20:21:36] !log Lowering compaction throughput from 45MB/s to 35MB/s on restbase1013-c.eqiad.wmnet : T134016 [20:21:37] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [20:21:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:28:02] !log Throttling stream throughput to 20MB/s on all rack 'b' instances : T134016 [20:28:03] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [20:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:30:44] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825#2477289 (10GWicke) [20:31:00] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825#2477306 (10GWicke) p:05Triage>03Normal [20:44:55] !log Lowering compaction throughput from 35MB/s to 20MB/s on restbase1013-c.eqiad.wmnet : T134016 [20:44:56] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - 
https://phabricator.wikimedia.org/T134016 [20:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:45:42] !log Lowering compaction throughput to 20MB/s on restbase1013-{a,b}.eqiad.wmnet : T134016 [20:45:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:51] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [20:47:08] 06Operations, 06Reading-Infrastructure-Team, 06Services, 06Services-next, 07Security-General: Protect sensitive user-related information with a UserData / auth / session service - https://phabricator.wikimedia.org/T140813#2477370 (10GWicke) [20:48:51] (03PS1) 10Jcrespo: Correct ip for db1043: 10.64.16.32, not 10.64.16.33 [puppet] - 10https://gerrit.wikimedia.org/r/299856 (https://phabricator.wikimedia.org/T138460) [20:48:58] 06Operations, 10ops-codfw, 10ops-eqiad: ship 7 ex4200s from codfw to eqiad - https://phabricator.wikimedia.org/T140655#2477375 (10Papaul) I kept 2 EX4200 on site and shipped 7 EX4200 to Eqiad . Below are the S/N of what I shipped in Eqiad: BP0212064096 BP0211166169 BP0211500124 BP0212234923 BP0211166117 B... [20:50:15] 06Operations, 10ops-codfw, 10ops-eqiad: ship 7 ex4200s from codfw to eqiad - https://phabricator.wikimedia.org/T140655#2477378 (10Papaul) a:05Papaul>03Cmjohnson @Cmjohnson I am assigning this task to you once you received the switches you can resolve the task. Thanks. [20:50:48] (03PS2) 10Jcrespo: Correct ip for db1043: 10.64.16.32, not 10.64.16.33 [puppet] - 10https://gerrit.wikimedia.org/r/299856 (https://phabricator.wikimedia.org/T138460) [20:53:58] mutante: ping? [20:54:58] urandom: need a merge for restbase upgrade? [20:55:11] mutante: a bootstrap, yeah [20:55:17] https://gerrit.wikimedia.org/r/#/c/299785/ [20:55:40] mutante: the upgrades are done, if you can believe it :) [20:56:04] oh! cool! 
[20:56:31] (03PS2) 10Dzahn: Enable instance restbase2003-c for bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/299785 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:56:40] (03CR) 10Dzahn: [C: 032] "136.32.192.10.in-addr.arpa domain name pointer restbase2003-c.codfw.wmnet." [puppet] - 10https://gerrit.wikimedia.org/r/299785 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:57:03] (03CR) 10Jcrespo: [C: 032 V: 032] Correct ip for db1043: 10.64.16.32, not 10.64.16.33 [puppet] - 10https://gerrit.wikimedia.org/r/299856 (https://phabricator.wikimedia.org/T138460) (owner: 10Jcrespo) [20:57:26] (03PS3) 10Dzahn: Enable instance restbase2003-c for bootstrap [puppet] - 10https://gerrit.wikimedia.org/r/299785 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:58:26] (03CR) 10Dzahn: [V: 032] "leroy isn't checking .yaml" [puppet] - 10https://gerrit.wikimedia.org/r/299785 (https://phabricator.wikimedia.org/T134016) (owner: 10Eevans) [20:58:58] mutante: thanks! [20:59:11] jynus: do you want the IP change right now? [20:59:25] yes [20:59:40] ok, merged 2 changes [20:59:42] sorry, I was having serious conversations on another channel [20:59:43] on master [20:59:48] no problem at all [20:59:49] lol [20:59:54] of very important isssues [21:00:09] and got distracted [21:00:18] thank you, mutante [21:00:37] that will make an ongoing alert disappear [21:01:32] (03CR) 10Chad: "Aaron do you still need this? Chris I'm guess no?" 
[puppet] - 10https://gerrit.wikimedia.org/r/298832 (owner: 10Chad) [21:03:16] ok, that sould recover the alert [21:03:38] and we are now ready for the m3-phab failover [21:03:58] RECOVERY - haproxy failover on dbproxy1003 is OK: OK check_failover servers up 2 down 0 [21:07:21] 06Operations, 10Continuous-Integration-Infrastructure (phase-out-gallium), 13Patch-For-Review: Migrate CI services from gallium to contint1001 - https://phabricator.wikimedia.org/T137358#2477448 (10hashar) [21:07:24] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Port Zuul package 2.1.0-95-g66c8e52 from Precise to Jessie - https://phabricator.wikimedia.org/T137279#2477447 (10hashar) 05Open>03Resolved [21:07:35] 06Operations, 06Release-Engineering-Team, 10Continuous-Integration-Infrastructure (phase-out-gallium): Port Zuul package 2.1.0-95-g66c8e52 from Precise to Jessie - https://phabricator.wikimedia.org/T137279#2363409 (10hashar) Need to further bump it.. [21:08:07] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2477453 (10jcrespo) We are now ready to do this, unless some disaster happens. We just need to set m3-master as read only and merge this during the maintenance: https://g... [21:08:14] !log Bootstrapping restbase2003-c.codfw.wmnet : T134016 [21:08:15] T134016: RESTBase Cassandra cluster: Increase instance count to 3 - https://phabricator.wikimedia.org/T134016 [21:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:25] 06Operations, 10DBA, 10Phabricator, 13Patch-For-Review: Upgrade m3 (phabricator) db servers - https://phabricator.wikimedia.org/T138460#2477455 (10jcrespo) [21:12:30] 06Operations, 10Ops-Access-Requests: Requesting access to labtest root for bd808 - https://phabricator.wikimedia.org/T140830#2477476 (10Andrew) [21:13:09] bblack, any idea what's going on here with the host command? 
https://phabricator.wikimedia.org/T139438 [21:18:28] (03PS1) 10Gehel: WIP - configure new relevance forge servers [puppet] - 10https://gerrit.wikimedia.org/r/299865 (https://phabricator.wikimedia.org/T137256) [21:21:17] (03CR) 10EBernhardson: "trivial comment about LVS" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299865 (https://phabricator.wikimedia.org/T137256) (owner: 10Gehel) [21:22:07] ebernhardson: you're fast! [21:22:19] my phone beeps :P [21:23:08] (03CR) 10Gehel: WIP - configure new relevance forge servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/299865 (https://phabricator.wikimedia.org/T137256) (owner: 10Gehel) [21:23:21] !log temporarily lowered compaction throughput on all 1012 instances from 60mb/s to 20mb/s via `nodetool setcompactionthroughput 20` (T140825) [21:23:23] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [21:23:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:23:37] ebernhardson: the curse of our modern world... [21:23:50] Time to unplug and get some sleep... have fun! [21:28:34] (03PS1) 10Dzahn: planet: add phabricator and labs tools blog feeds [puppet] - 10https://gerrit.wikimedia.org/r/299867 [21:31:04] (03CR) 10Paladox: [C: 031] planet: add phabricator and labs tools blog feeds [puppet] - 10https://gerrit.wikimedia.org/r/299867 (owner: 10Dzahn) [21:31:38] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 27 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [21:32:10] (03PS2) 10Dzahn: planet: add phabricator and labs tools blog feeds [puppet] - 10https://gerrit.wikimedia.org/r/299867 [21:33:52] MaxSem, i added kartographer depl window in 25 min [21:34:13] (03PS3) 10Dzahn: planet: add phabricator and labs tools blog feeds [puppet] - 10https://gerrit.wikimedia.org/r/299867 [21:37:39] (03CR) 1020after4: [C: 031] "Awesome!" 
[puppet] - 10https://gerrit.wikimedia.org/r/299867 (owner: 10Dzahn) [21:37:44] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 12 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [21:38:44] PROBLEM - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is CRITICAL: Connection refused [21:41:47] (03PS1) 10MaxSem: Enable Kartographer on Meta, ca:, he: and mk: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299870 [21:41:55] yurik, ^ [21:43:06] (03CR) 10Rush: "My tools/labs blog is "beta" as an idea. Let me see if we want to continue it officially." [puppet] - 10https://gerrit.wikimedia.org/r/299867 (owner: 10Dzahn) [21:43:43] !log temporarily lowering compaction throughput on all eqiad restbase cassandra instances from 60mb/s to 20mb/s via `nodetool setcompactionthroughput 20` (T140825) [21:43:44] T140825: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825 [21:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:35] 06Operations, 06Discovery, 06Maps, 10Maps-data: Maps - enable Geoshapes on production - https://phabricator.wikimedia.org/T138525#2477618 (10Yurik) 05Open>03Resolved a:03Yurik [21:51:17] (03PS2) 10Yurik: Enable Kartographer on Meta, ca:, he: and mk: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299870 (https://phabricator.wikimedia.org/T139946) (owner: 10MaxSem) [21:52:13] (03PS3) 10Yurik: Enable Kartographer on Meta, ca:, he: and mk: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299870 (https://phabricator.wikimedia.org/T139946) (owner: 10MaxSem) [21:52:42] (03CR) 10Yurik: [C: 031] Enable Kartographer on Meta, ca:, he: and mk: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299870 (https://phabricator.wikimedia.org/T139946) (owner: 10MaxSem) [21:54:28] !log demon@tin Synchronized php-1.28.0-wmf.10/extensions/AbuseFilter: Backported fix for logspam (duration: 00m 38s) 
[21:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:55:55] (03PS4) 10Dzahn: planet: add phabricator and labs tools blog feeds [puppet] - 10https://gerrit.wikimedia.org/r/299867 [21:56:45] (03PS5) 10Dzahn: planet: add phabricator releng blog feed [puppet] - 10https://gerrit.wikimedia.org/r/299867 [22:00:05] maxsem and yurik: Respected human, time to deploy Kartographer deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160719T2200). Please do the needful. [22:00:55] (03PS5) 10Andrew Bogott: WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) [22:01:18] MaxSem, you want to do it, or should i? [22:01:40] (03CR) 10MaxSem: [C: 032] Enable Kartographer on Meta, ca:, he: and mk: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299870 (https://phabricator.wikimedia.org/T139946) (owner: 10MaxSem) [22:03:23] (03CR) 10jenkins-bot: [V: 04-1] WIP: Horizon tab for modifying instance puppet config [puppet] - 10https://gerrit.wikimedia.org/r/294342 (https://phabricator.wikimedia.org/T91990) (owner: 10Andrew Bogott) [22:03:50] (03Merged) 10jenkins-bot: Enable Kartographer on Meta, ca:, he: and mk: [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299870 (https://phabricator.wikimedia.org/T139946) (owner: 10MaxSem) [22:06:13] pulled on mw1099 [22:06:18] 06Operations, 06Discovery, 10Elasticsearch, 10hardware-requests, and 2 others: Decommission elastic1001-1016 - https://phabricator.wikimedia.org/T139758#2477668 (10debt) [22:06:47] Krenair: not sure... check all the recursors that "host" is using directly with dig? [22:07:08] bblack, can you make host show which recursors it's using? 
[22:07:11] (03PS1) 10Dzahn: install_server: let osmium use jessie-installer [puppet] - 10https://gerrit.wikimedia.org/r/299878 (https://phabricator.wikimedia.org/T132530) [22:07:16] Krenair: it would use /etc/resolv.conf [22:07:29] Krenair: my first guess, though, would be the second lookup is AAAA [22:07:46] host hostname [server] [22:07:46] it won't get any AAAA records, no IPv6 in labs [22:08:18] krenair@bastion-01:~$ grep nameserver /etc/resolv.conf [22:08:18] nameserver 208.80.155.118 [22:08:18] nameserver 208.80.154.20 [22:08:21] krenair@bastion-01:~$ [22:08:24] Krenair: yeah but NXDOMAIN doesn't mean "no AAAA records", it means "this hostname doesn't exist, for any kind of record". It's common for crappy DNS servers to emit NXDOMAIN when they just lack a record, though. [22:08:38] I would imagine the relevant authdns is doing that... [22:08:39] those IPs are labs-recursor0.wikimedia.org. and labs-recursor1.wikimedia.org. [22:09:21] bblack@neodymium:~$ dig @208.80.155.118 promethium.wikitextexp.eqiad.wmflabs AAAA | grep HEADER [22:09:24] ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 22194 [22:09:34] so whatever's serving the A-records, it needs to not return NXDOMAIN for the AAAA [22:09:46] do we know for sure it's looking for AAAAs? [22:10:08] I would assume so, since host usually shows both, e.g. [22:10:08] bblack@neodymium:~$ host cp1045.eqiad.wmnet [22:10:09] cp1045.eqiad.wmnet has address 10.64.32.97 [22:10:09] cp1045.eqiad.wmnet has IPv6 address 2620:0:861:103:10:64:32:97 [22:10:28] in any case, AAAA->NXDOMAIN is a problem. it will confuse caches and clients. 
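The distinction bblack draws above (NXDOMAIN means the name does not exist for any record type, which is different from "the name exists but has no AAAA") is visible in the DNS header alone. The following is a minimal Python sketch of that reading, not anything used on the hosts in this log: `classify_response` and `make_header` are hypothetical names, the label "NODATA" is the RFC 2308 term for a NOERROR reply with an empty answer section, and referrals (which also have an empty answer section) are ignored for simplicity.

```python
import struct

# RCODE values from the DNS header (RFC 1035); only the two relevant here.
RCODE_NOERROR = 0
RCODE_NXDOMAIN = 3


def classify_response(message: bytes) -> str:
    """Classify a raw DNS response from its 12-byte header alone."""
    if len(message) < 12:
        raise ValueError("DNS message shorter than its 12-byte header")
    # Header layout: ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT (all 16-bit).
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!6H", message[:12])
    rcode = flags & 0x000F  # the RCODE is the low four bits of the flags word
    if rcode == RCODE_NXDOMAIN:
        return "NXDOMAIN"  # name does not exist, for any record type
    if rcode == RCODE_NOERROR and ancount == 0:
        return "NODATA"  # name exists, but not with the requested type
    if rcode == RCODE_NOERROR:
        return "ANSWER"
    return f"RCODE {rcode}"


def make_header(rcode: int, ancount: int, msg_id: int = 22194) -> bytes:
    """Hypothetical helper: build a response header for demonstration only."""
    flags = 0x8000 | 0x0100 | 0x0080 | rcode  # QR, RD, RA bits plus the RCODE
    return struct.pack("!6H", msg_id, flags, 1, ancount, 0, 0)
```

Under these assumptions, the reply seen from 208.80.155.118 would classify as "NXDOMAIN", whereas a well-behaved server answering an AAAA query for an A-only name would classify as "NODATA".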
[22:10:32] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/299870/ (duration: 00m 29s) [22:10:33] This is information it should show in verbose mode, but that'd be too helpful [22:10:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:10:48] far too helpful [22:10:56] * Krenair grumbles [22:11:22] is the "wmflabs." domain served directly from the recursors, or from labs-nsX? [22:11:38] ACKNOWLEDGEMENT - cassandra-c CQL 10.192.32.136:9042 on restbase2003 is CRITICAL: Connection refused eevans Node is bootstrapping - The acknowledgement expires at: 2016-07-21 22:11:18. [22:11:40] yurik, appears to work [22:12:04] MaxSem, cool, lets enable it and work on the posting [22:12:04] recurors primarily, they forward to labs-ns [22:12:05] nevermind, I looked, it appears "wmflabs." (as opposed to wmflabs.org) is just hacked into the labs recursors, not the labs authdns [22:12:56] this particular example is one that's added by a script on the recurors though, labs-ns* won't have it [22:13:21] does labs-ns have wmflabs in general (as opposed to wmflabs.org again) [22:13:23] (this one is unique in that it points to a physical host in the labs-support network, nova doesn't know about it) [22:13:33] yes [22:13:35] ok [22:13:58] well, something is doing something wrong, it's hard to say what layer there [22:14:06] either pdns-recursor, or the script feeding it data, or both [22:14:14] I think we can make the script handle this properly [22:14:28] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 26 probes of 238 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:14:53] It might be a case of picking up AAAA record requests for domains we know A records for in this case, and returning 0, {} [22:15:47] (03PS1) 10Dzahn: install_server: let osmium use mw-raid1 partman [puppet] - 10https://gerrit.wikimedia.org/r/299879 
(https://phabricator.wikimedia.org/T132530) [22:15:49] if you want an example of what it's supposed to look like on the dig/protocol side: [22:15:52] dig @ns0.wikimedia.org geoiplookup.wikimedia.org AAAA [22:16:20] Ah you query for AAAA but it returns A? [22:16:30] as well as SOA [22:16:45] that answers with no records in the "answer" section (because that host has no AAAA). It does incidentally drop the A record in the "additional" section just to be helpful, but that's optional. The important thing is the response code is NOERROR rather than NXDOMAIN in the header section. [22:16:58] the SOA in the authority is required when there's no answer record [22:17:20] (the TTL of that SOA record tells the query-er how long they can cache non-existence) [22:17:47] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 22 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:18:35] (03CR) 10Greg Grossmeier: [C: 031] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/299867 (owner: 10Dzahn) [22:18:40] most true authdns servers will get these basic things right on their own [22:18:54] (03PS2) 10Dzahn: install_server: let osmium use jessie-installer [puppet] - 10https://gerrit.wikimedia.org/r/299878 (https://phabricator.wikimedia.org/T132530) [22:19:01] but yeah when stuffing records into a recursor without full authserver logic, things can be tricky. still there, should be a way to make it behave correctly. 
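The behaviour described above (answer NOERROR with no records when the name is known but lacks the requested type, and NXDOMAIN only when the name is unknown) can be sketched as a lookup over a small record table. This is an illustrative Python sketch only, not the script that feeds the labs recursors: `RECORDS`, `lookup`, and the address 192.0.2.10 (a documentation-range placeholder) are assumptions, and the SOA in the authority section that bblack mentions is not modelled.

```python
NOERROR, NXDOMAIN = 0, 3  # DNS response codes (RFC 1035)

# Hypothetical record table: fully qualified name -> {record type -> values}.
# The address is a placeholder from the documentation range, not the real one.
RECORDS = {
    "promethium.wikitextexp.eqiad.wmflabs.": {"A": ["192.0.2.10"]},
}


def lookup(qname, qtype, records=RECORDS):
    """Return (rcode, answers) for a query against the local table."""
    name = qname.lower()
    if not name.endswith("."):
        name += "."  # normalise to a fully qualified name
    rrsets = records.get(name)
    if rrsets is None:
        # The name is unknown for every record type: NXDOMAIN is correct.
        return NXDOMAIN, []
    # The name is known: always NOERROR, with an empty answer list when
    # there is no record of the requested type (for example AAAA).
    return NOERROR, list(rrsets.get(qtype.upper(), []))
```

With this table, an AAAA query for the promethium name returns `(NOERROR, [])` rather than NXDOMAIN, which is the "returning 0, {}" case Krenair describes; a full authoritative server would additionally include the SOA so that clients know how long to cache the empty result.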
[22:19:47] (03PS2) 10BBlack: insecure post: 100% failure, loophole closed [puppet] - 10https://gerrit.wikimedia.org/r/299532 (https://phabricator.wikimedia.org/T136674) [22:19:48] It appears our authdns would not normally send the A in response to an AAAA query, but it does NOERROR status & the SOA record [22:19:49] (03CR) 10Ori.livneh: [C: 031] nutcracker: default verbosity to 4 [puppet] - 10https://gerrit.wikimedia.org/r/299146 (https://phabricator.wikimedia.org/T136078) (owner: 10Filippo Giunchedi) [22:20:28] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 10 probes of 238 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [22:23:30] Krenair: yeah the A in the additional is completely optional. it's just an optimization thing (because the querying client is likely to do back-to-back queries of A+AAAA, so our server sends the opposite address variant in the Additional section always to be helpful) [22:23:39] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 13 probes of 245 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [22:24:03] (03CR) 10BBlack: [C: 032] insecure post: 100% failure, loophole closed [puppet] - 10https://gerrit.wikimedia.org/r/299532 (https://phabricator.wikimedia.org/T136674) (owner: 10BBlack) [22:26:54] 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016), 13Patch-For-Review: Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2477796 (10BBlack) 05Open>03Resolved [22:26:56] 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, and 2 others: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#2477797 (10BBlack) [22:27:04] 06Operations, 10Traffic, 07HTTPS, 07Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2477800 (10BBlack) [22:27:09] 06Operations, 10Traffic, 07HTTPS, 05MW-1.27-release-notes, and 2 others: Insecure POST traffic - 
https://phabricator.wikimedia.org/T105794#1490448 (10BBlack) 05Open>03Resolved [22:29:50] 06Operations, 10Traffic, 07HTTPS, 07Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2477809 (10BBlack) [22:33:50] (03PS1) 10Dzahn: osmium: rsync home dirs to hafnium for migration [puppet] - 10https://gerrit.wikimedia.org/r/299889 (https://phabricator.wikimedia.org/T132530) [22:39:16] (03CR) 10Ori.livneh: puppetmaster: generate prometheus targets from ganglia (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/299539 (https://phabricator.wikimedia.org/T126785) (owner: 10Filippo Giunchedi) [22:39:20] (03PS17) 10Thcipriani: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [22:45:10] (03PS18) 10Thcipriani: Logstash_checker script for canary deploys [puppet] - 10https://gerrit.wikimedia.org/r/292505 (https://phabricator.wikimedia.org/T110068) (owner: 10GWicke) [22:49:39] (03PS1) 10Ori.livneh: rcstream: log X-Forwarded-Proto [puppet] - 10https://gerrit.wikimedia.org/r/299892 (https://phabricator.wikimedia.org/T140128) [22:50:19] (03PS3) 10Dzahn: install_server: let osmium use jessie-installer [puppet] - 10https://gerrit.wikimedia.org/r/299878 (https://phabricator.wikimedia.org/T132530) [22:50:32] (03CR) 10Dzahn: [C: 032] install_server: let osmium use jessie-installer [puppet] - 10https://gerrit.wikimedia.org/r/299878 (https://phabricator.wikimedia.org/T132530) (owner: 10Dzahn) [22:50:41] (03CR) 10Ori.livneh: [C: 032 V: 032] rcstream: log X-Forwarded-Proto [puppet] - 10https://gerrit.wikimedia.org/r/299892 (https://phabricator.wikimedia.org/T140128) (owner: 10Ori.livneh) [23:00:04] RoanKattouw, ostriches, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160719T2300). 
[23:00:04] Dereckson, James_F, and Pchelolo: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:45] Heya. [23:00:58] Hi. I can SWAT this evening. [23:02:20] Pchelolo: there? [23:02:29] Dereckson: here [23:03:17] Pchelolo: https://gerrit.wikimedia.org/r/#/c/295494/ is already in wmf10? [23:03:36] Doesn't look like it [23:03:47] Dereckson: Protip: "Included in" drop down on gerrit. [23:04:07] oh, nice new feature [23:04:17] Ancient feature ;-) [23:05:32] Dereckson: so all is ok with it? the backport gerrit is https://gerrit.wikimedia.org/r/#/c/299778/ [23:06:32] Pchelolo: if it needs the new YAML schema to work, you also need to cherry-pick https://gerrit.wikimedia.org/r/#/c/295494/ [23:06:46] (03PS1) 10Ori.livneh: Fix-up for I39d2d7db576: move log_format directive to top level [puppet] - 10https://gerrit.wikimedia.org/r/299898 [23:07:05] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [23:07:19] ostriches: AaronSchulz https://logstash.wikimedia.org/app/kibana#/dashboard/mediawiki-errors looks awesome now (after that abusefilter patch, I reckon) [23:07:20] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for I39d2d7db576: move log_format directive to top level [puppet] - 10https://gerrit.wikimedia.org/r/299898 (owner: 10Ori.livneh) [23:07:34] Dereckson: as far as I know the yaml is deployed separately from the MW branch, so it's ok that it's not in wmf10 [23:08:04] Dereckson: mobrovac have verified that the schemas are deployed this morning [23:08:39] ack [23:09:00] (03PS4) 10Dzahn: install_server: let osmium use jessie-installer [puppet] - 10https://gerrit.wikimedia.org/r/299878 (https://phabricator.wikimedia.org/T132530) [23:09:04] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. 
[23:10:09] (03PS4) 10Dereckson: dblists: Switch VisualEditor to a negative rather than positive list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296929 (owner: 10Jforrester) [23:10:40] (03CR) 10Dereckson: [C: 032] dblists: Switch VisualEditor to a negative rather than positive list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296929 (owner: 10Jforrester) [23:10:50] 06Operations, 10Cassandra, 10RESTBase-Cassandra, 06Services: Throttle compaction throughput limit in line with instance count - https://phabricator.wikimedia.org/T140825#2477922 (10GWicke) Effect on iowait: {F4289757} disk write throughput: {F4289759} GC time: {F4289762} Read latency has only shown m... [23:12:10] (03Merged) 10jenkins-bot: dblists: Switch VisualEditor to a negative rather than positive list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296929 (owner: 10Jforrester) [23:14:14] bblack, I can get pdns to return no records (NOERROR status), and the host command is happy... but the SOA is required right? [23:14:20] (03PS2) 10Dzahn: install_server: let osmium use mw-raid1 partman [puppet] - 10https://gerrit.wikimedia.org/r/299879 (https://phabricator.wikimedia.org/T132530) [23:14:43] James_F: 5bde14e1e dblists: Switch VisualEditor to a negative rather than positive list on mw1099 [23:15:01] Up. [23:15:17] greg-g: Better. I wish I knew why Math was complaining about Restbase and Cassandra though [23:15:37] (03PS3) 10Dzahn: install_server: let osmium use mw-raid1 partman [puppet] - 10https://gerrit.wikimedia.org/r/299879 (https://phabricator.wikimedia.org/T132530) [23:16:04] (03CR) 10Dzahn: [C: 032] install_server: let osmium use mw-raid1 partman [puppet] - 10https://gerrit.wikimedia.org/r/299879 (https://phabricator.wikimedia.org/T132530) (owner: 10Dzahn) [23:16:11] Mostly "Expected width > 0." 
[23:16:14] (03PS3) 10Dereckson: dblists: Delete no-longer-used visualeditor-default.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296930 (owner: 10Jforrester) [23:16:19] Couple of cassandra complaints though [23:16:24] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296930 (owner: 10Jforrester) [23:16:36] Dereckson: Which test server did you use? 1099? [23:16:40] James_F: yes [23:17:10] We switched to 1099, so 1017 can be used for other test procedures [23:17:12] Dereckson: Looks OK, will test a little more. [23:17:14] * James_F nods. [23:17:23] (03Merged) 10jenkins-bot: dblists: Delete no-longer-used visualeditor-default.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296930 (owner: 10Jforrester) [23:18:03] Dereckson: Yup, LGTM. [23:18:19] Dereckson: Will need to sync in a complicated way to avoid fatals, BTW. [23:19:07] new dblists first, then IS/CS, then remove the old dblists? [23:19:29] Or just scap :) [23:19:57] (and for IS/CS, that would have needed CS first to get the dblist names) [23:20:03] yes, just scap is probably better [23:20:18] wait, where'd deploy markers go in logstash? [23:20:27] James_F: 296930 on mw1099 [23:21:00] Dereckson: It's an unreferenced config file. It's either fatalling like mad or it's fine. [23:21:01] Krenair: ideally, but you can probably live without it, I guess? [23:21:23] Dereckson: hm... is the EventBus change already up? I don't see new events for some reason, hold on [23:21:47] Krenair: I doubt you'll get the SOA result from hacking records into a recursor, that's usually pretty limited (it's meant to do small one-off things) [23:21:51] Pchelolo: Zuul has merged it, but I didn't deployed that yet. [23:22:04] yeah [23:22:11] Dereckson: lemme check something one more time [23:22:17] Pchelolo: sure [23:22:17] 2 minnutes [23:22:17] Dereckson: (It looks fine.) 
[23:22:51] 296767 needs a manual merge [23:23:02] greg-g: :(( kiban4 doesn't support them [23:23:42] greg-g: the next best thing is https://logstash.wikimedia.org/app/kibana#/dashboard/Deployments [23:24:52] (03CR) 10Dereckson: [C: 04-1] "Should be deployed after main TitleBlacklist switch to extension registration." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296767 (owner: 10Dereckson) [23:25:21] James_F: actually, 296767 needs 296770 [23:25:26] bd808: muther.... [23:25:28] Dereckson: I'm seeing some logs I shouldn't see from mediawiki.org related to this change.. Could we rollback https://gerrit.wikimedia.org/r/#/c/299778/ ? Everyone who can help in debugging it is asleep now, so I'd better be on the safe side [23:25:58] greg-g: tracked upstream at least -- https://github.com/elastic/kibana/issues/2706 [23:26:04] Pchelolo: yes, we can, but you ack'ed I *didn't* deployed it? [23:26:56] Dereckson: ye, but it went to group0 with a regular train, everything looked good, but now I started digging deeper and something is suspicious [23:27:07] * Dereckson nods. [23:27:18] Dereckson: sorry for that.. [23:27:22] not a problem [23:27:25] bd808: before I looked at all of the "+1" comments I was about to subscribe, but I don't need that kind of spam [23:28:43] (03PS20) 10MaxSem: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [23:29:35] Ugh. 
[23:29:37] https://phabricator.wikimedia.org/T140848 [23:29:49] (03PS1) 10Alex Monk: labs dnsrecursor metaldns: Don't return NXDOMAIN when we don't have a record of the right type but do recognise the domain [puppet] - 10https://gerrit.wikimedia.org/r/299903 (https://phabricator.wikimedia.org/T139438) [23:31:01] (03CR) 10Alex Monk: "(tested on labs-dnsrecursor-test.openstack.eqiad.wmflabs)" [puppet] - 10https://gerrit.wikimedia.org/r/299903 (https://phabricator.wikimedia.org/T139438) (owner: 10Alex Monk) [23:31:02] Dereckson: Ah, damn, I vaguely remembered (wrongly it turns out) that 296770 had already been done, sorry. [23:31:07] * James_F rebases. [23:31:13] Pchelolo: commit reverted in https://gerrit.wikimedia.org/r/299904, Zuul is merging that [23:31:44] Dereckson: thank you. Sorry for the mess [23:31:56] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [23:32:15] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [23:32:47] James_F: or we can add 296770 to SWAT, it's ready [23:33:30] bd808: seriously, is not deploy events such a core thing to know in these graphs? sorry, I just needed to rant. [23:34:05] (03Restored) 10Chad: Disable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283243 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [23:34:32] Dereckson: Happy to do so if you are. [23:34:41] greg-g: there are many WTFs in the kibana4 rewrite. I'm sure there are a lot of "oh shiny" additions too. 
On the plus side it seems like elastic.co is actually working on the product now [23:34:59] ok, added to the table [23:35:11] (03PS2) 10Dereckson: Use extension registration for TitleBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296770 [23:35:13] bd808: as long as annotations don't become an open core add-on ;) [23:35:13] (03PS2) 10Chad: Disable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283243 (https://phabricator.wikimedia.org/T132200) (owner: 10Bartosz Dziewoński) [23:35:26] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296770 (owner: 10Dereckson) [23:35:45] greg-g: now you are going to make me rant >_< [23:36:09] sorry :) :) [23:36:13] (03Merged) 10jenkins-bot: Use extension registration for TitleBlacklist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296770 (owner: 10Dereckson) [23:36:15] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:36:35] 296770 on mw1099 [23:37:18] (03PS3) 10Bartosz Dziewoński: Disable $wgAbuseFilterProfile for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/283243 (https://phabricator.wikimedia.org/T132200) [23:37:38] * James_F checks. [23:38:30] looks good to me in logs (fatalmonitor, Kibana view on mw1099) [23:38:34] Dereckson: LGTM too. [23:39:32] (03PS1) 10Chad: Remove SiteConfiguration::isLocalVHost() from test class [mediawiki-config] - 10https://gerrit.wikimedia.org/r/299906 [23:39:36] (03PS3) 10Dereckson: Cleanup: Move never-altered WikiLoveDefault into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292623 (owner: 10Jforrester) [23:40:15] (03CR) 10Dereckson: [C: 031] "Nice idea." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/292623 (owner: 10Jforrester) [23:40:24] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292623 (owner: 10Jforrester) [23:41:33] too_complex_to_determinize_exception: Determinizing automaton with 26734 states and 39282 transitions would result in more than 10000 states. [23:41:35] lol ya think? [23:41:38] (03Merged) 10jenkins-bot: Cleanup: Move never-altered WikiLoveDefault into CommonSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292623 (owner: 10Jforrester) [23:42:06] James_F: 292623 (WikiLove no-op) on mw1099 [23:42:44] (03PS21) 10MaxSem: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [23:42:52] * James_F checks. [23:43:18] Dereckson: Yeah, I think that's good. [23:43:20] mwrepl still says 1 for $wgDefaultUserOptions['wikilove-enabled'] [23:43:26] Hmm. [23:44:01] (03CR) 10jenkins-bot: [V: 04-1] Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [23:44:11] (03PS3) 10Dereckson: Cleanup: Note a couple of items that are varied in Labs only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292624 (owner: 10Jforrester) [23:44:35] (03CR) 10Dereckson: [C: 032] "SWAT, no-op" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292624 (owner: 10Jforrester) [23:44:40] Dereckson: Going to do https://gerrit.wikimedia.org/r/#/c/296767/ too then? 
[23:45:23] (03PS2) 10Dzahn: osmium: rsync home dirs to hafnium for migration [puppet] - 10https://gerrit.wikimedia.org/r/299889 (https://phabricator.wikimedia.org/T132530) [23:45:28] (03Merged) 10jenkins-bot: Cleanup: Note a couple of items that are varied in Labs only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/292624 (owner: 10Jforrester) [23:46:02] no-op IS comments change on mw1099 [23:46:06] (03PS3) 10Dzahn: osmium: rsync home dirs to hafnium for migration [puppet] - 10https://gerrit.wikimedia.org/r/299889 (https://phabricator.wikimedia.org/T132530) [23:47:20] James_F: arg, 296767 wasn't useful for a full scap process, the goal was to allow IS/CS/IS without losing the variable [23:47:55] (03PS2) 10Dereckson: Get rid of $wmgTitleBlacklistUsernameSources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296767 [23:48:55] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296767 (owner: 10Dereckson) [23:49:32] (03Merged) 10jenkins-bot: Get rid of $wmgTitleBlacklistUsernameSources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/296767 (owner: 10Dereckson) [23:49:57] James_F: 296767 on mw1099 [23:50:40] mwrepl gives correct value for $wgTitleBlacklistUsernameSources [23:50:43] Dereckson: Looks OK. [23:50:52] Dereckson: Thank you! 
[23:51:34] (03PS22) 10MaxSem: Script to do the initial data load from OSM for Maps project [puppet] - 10https://gerrit.wikimedia.org/r/293105 (owner: 10Gehel) [23:53:22] James_F: something unrelated in the mw1099 logs [23:53:51] (03CR) 10Dzahn: [C: 032] osmium: rsync home dirs to hafnium for migration [puppet] - 10https://gerrit.wikimedia.org/r/299889 (https://phabricator.wikimedia.org/T132530) (owner: 10Dzahn) [23:54:08] 07Blocked-on-Operations, 06Operations, 10Kartographer, 10Wikimedia-Extension-setup, and 4 others: Enable Interactive Maps (Kartographer) on Macedonian Wikipedia - https://phabricator.wikimedia.org/T139946#2478078 (10MaxSem) 05stalled>03Resolved [23:54:32] Looks good to me, ready for scap. [23:58:49] !log dereckson@tin Started scap: wmf-config/ upgrade: Gerrit changes 296770, 296767, 296929, 296930, 292623, 292624 [23:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master