[00:00:31] (Merged) jenkins-bot: Remove extraneous namespace shorcut alias on ru.wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/278206 (owner: Dereckson)
[00:01:01] (CR) Alex Monk: [C: 2] "zomg admin-granted bot rights?" [mediawiki-config] - https://gerrit.wikimedia.org/r/276410 (https://phabricator.wikimedia.org/T129087) (owner: Pmlineditor)
[00:01:21] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/278206/ (duration: 00m 25s)
[00:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:01:44] (Merged) jenkins-bot: Added filemover and flood user group to bnwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/276410 (https://phabricator.wikimedia.org/T129087) (owner: Pmlineditor)
[00:02:38] Krenair: you're stingy with the comment :D
[00:02:49] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/276410/ (duration: 00m 26s)
[00:02:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:03:12] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:03:32] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[00:05:32] all good Dereckson?
[00:06:43] Yes, works.
[00:06:55] (CR) Alex Monk: [C: 2] Enable extension WikiLove on bnwikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/276934 (https://phabricator.wikimedia.org/T129728) (owner: Pmlineditor)
[00:06:56] Their translation of "group-flood" is "Bot users"
[00:07:26] (Merged) jenkins-bot: Enable extension WikiLove on bnwikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/276934 (https://phabricator.wikimedia.org/T129728) (owner: Pmlineditor)
[00:07:51] Krenair: did you add the wikilove table?
[00:08:11] as you were typing that question, yes
[00:08:14] ok :)
[00:08:14] krenair@tin:/srv/mediawiki-staging (master)$ mwscript extensions/WikimediaMaintenance/createExtensionTables.php bnwiki wikilove
[00:08:14] Creating wikilove tables...done!
[00:08:15] krenair@tin:/srv/mediawiki-staging (master)$
[00:09:11] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/276934/ (duration: 00m 26s)
[00:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:09:24] Testing.
[00:09:24] Dereckson, ^
[00:09:58] !log reboot elastic1021.eqiad.wmnet for kernel upgrade
[00:09:58] Still waiting JS caching update.
[00:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:14] (previous Wikilove deployment, it were okay after 2-3 minutes)
[00:10:50] Works.
[00:12:30] https://en.wikipedia.org/wiki/User_talk:Krenair#for_creating_wikilove_tables
[00:13:42] :)
[00:13:55] :)
[00:15:10] legoktm, matt_flaschen: hey
[00:15:14] found an echo issue
[00:15:57] just got a mail from Echo containing
[00:16:27] ah... might not be echo
[00:16:34] that's a valid image
[00:16:37] but gmail shows it as broken
[00:16:55] Images are disabled by default in gmail, aren't they?
[00:19:14] Krenair, this is a known bug.
[00:19:18] In Gmail :)
[00:19:21] But yeah
[00:19:21] heh, ok
[00:19:40] You commented on it too. :)
[00:19:41] sorry for bothering you, at first I assumed the link was broken
[00:19:42] https://phabricator.wikimedia.org/T127794
[00:19:44] lol
[00:20:36] wasn't even that long ago
[00:20:42] (PS2) Dzahn: puppet-lint: fix or disable remaining alignment warns [puppet] - https://gerrit.wikimedia.org/r/278195
[00:21:23] yea, i mean.. dont we also block svg in phab uploads
[00:21:29] as opposed to png
[00:21:41] because it can be malicious
[00:22:29] we do have to filter SVG stuff in MW, IIRC
[00:23:12] might be why google also doesnt want to support it in the proxy
[00:23:48] (PS3) Dzahn: puppet-lint: fix or disable remaining alignment warns [puppet] - https://gerrit.wikimedia.org/r/278195
[00:24:06] https://phabricator.wikimedia.org/T130177 <-- VE team still wants to give green light for this kind of requests?
[00:24:39] They evaluate the quantum of content vs. discussion pages to validate or recommend to wait for Flow instead?
[00:26:41] (CR) Dzahn: [C: 2] "almost all are just the special comments to disable linting for just a couple specific lines, consider them FIXME's" [puppet] - https://gerrit.wikimedia.org/r/278195 (owner: Dzahn)
[00:34:53] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail
[00:35:21] PROBLEM - puppet last run on mira is CRITICAL: CRITICAL: puppet fail
[00:36:52] uhm yes.. already looking
[00:38:20] !log reboot elastic1022.eqiad.wmnet for kernel upgrade
[00:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:38:26] no error that i can see,what's up icinga-wm
[00:38:31] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[00:38:36] duh
[00:39:20] lol @ motd of mira
[00:41:19] Operations, Commons, MediaWiki-Uploading, Multimedia: Special:UploadStash thumbnails failing to generate with 500 & 503 - https://phabricator.wikimedia.org/T130204#2133093 (matmarex) >>! In T130204#2132939, @Tgr wrote: > Could be the same issue as T90599? I don't think so. I'm pretty sure this was...
[00:41:38] (PS1) Dereckson: Revert "Remove extraneous namespace shorcut alias on ru.wikibooks" [mediawiki-config] - https://gerrit.wikimedia.org/r/278213
[00:41:40] (PS1) Dzahn: puppet-lint: remove exception for alignment check [puppet] - https://gerrit.wikimedia.org/r/278214 (https://phabricator.wikimedia.org/T93645)
[00:42:21] RECOVERY - puppet last run on mira is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[00:42:54] (PS2) Dzahn: puppet-lint: remove exception for alignment check [puppet] - https://gerrit.wikimedia.org/r/278214 (https://phabricator.wikimedia.org/T93645)
[00:42:58] (CR) jenkins-bot: [V: -1] puppet-lint: remove exception for alignment check [puppet] - https://gerrit.wikimedia.org/r/278214 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[00:43:38] Krenair: by the way, all the shorcuts for ru.wikibooks were correct, I tested "MOD" then latin "M", change had a correct cyrillic "М".
[00:43:47] great
[00:44:25] (CR) jenkins-bot: [V: -1] puppet-lint: remove exception for alignment check [puppet] - https://gerrit.wikimedia.org/r/278214 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[00:46:23] Operations, Puppet, Patch-For-Review: Import vs autoload: the puppet parser is a bad joke that stopped being funny years ago. - https://phabricator.wikimedia.org/T119042#2133121 (Dzahn) microsites roles moved https://gerrit.wikimedia.org/r/#/c/275034/
[00:51:33] (PS1) Dzahn: puppet-lint: last 3 files with alignment issue, globally [puppet] - https://gerrit.wikimedia.org/r/278219
[00:52:50] (CR) Dzahn: [C: 2] "ok, that should be it and then jenkins can vote on it and you will never see me do these again.. ever" [puppet] - https://gerrit.wikimedia.org/r/278219 (owner: Dzahn)
[00:54:02] (PS3) Dzahn: puppet-lint: remove exception for alignment check [puppet] - https://gerrit.wikimedia.org/r/278214 (https://phabricator.wikimedia.org/T93645)
[00:54:10] Operations, Need-volunteer, Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2133124 (Dzahn) disable linting for the few remaining special cases for indentation / arrows: https://gerrit.wikimedia.org/r/#/c/278195/ https://gerrit.wikimedia.org/r/#/...
[00:55:47] (CR) Dzahn: [C: 2] "operations-puppet-puppetlint-strict SUCCESS in 40s !" [puppet] - https://gerrit.wikimedia.org/r/278214 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[00:58:22] (PS5) Dzahn: ores: Move role classes to module role [puppet] - https://gerrit.wikimedia.org/r/270102 (owner: Tim Landscheidt)
[01:09:01] (CR) Dzahn: [C: 2] "ores-staging-01, ores-lb-02, ores-redis-01, ores-worker-03, ores-worker-04, ores-web-01, ores-web-02, ores-worker-01, ores-worker-02 - i c" [puppet] - https://gerrit.wikimedia.org/r/270102 (owner: Tim Landscheidt)
[01:13:26] (CR) Dzahn: "ran puppet on every single instance, confirmed no-op. ores-web-02 has an unrelated issue "Cannot allocate memory"" [puppet] - https://gerrit.wikimedia.org/r/270102 (owner: Tim Landscheidt)
[01:13:42] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[01:15:31] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[01:31:46] !log maxsem@tin Synchronized php-1.27.0-wmf.17/extensions/CirrusSearch/: Emergency fix https://gerrit.wikimedia.org/r/#/c/278224/ (duration: 00m 36s)
[01:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:31:52] dcausse, ^
[01:32:05] MaxSem: thanks, testing
[01:50:30] (PS1) Dzahn: gerrit: avoid defining class inside class [puppet] - https://gerrit.wikimedia.org/r/278225
[01:51:31] (PS2) Dzahn: gerrit: avoid defining class inside class [puppet] - https://gerrit.wikimedia.org/r/278225
[01:52:27] (CR) jenkins-bot: [V: -1] gerrit: avoid defining class inside class [puppet] - https://gerrit.wikimedia.org/r/278225 (owner: Dzahn)
[01:56:08] (PS3) Dzahn: gerrit: avoid defining class inside class [puppet] - https://gerrit.wikimedia.org/r/278225
[02:07:34] (PS1) Dzahn: gerrit: move role classes to module, split in 2 files [puppet] - https://gerrit.wikimedia.org/r/278228
[02:10:56] (PS2) Dzahn: gerrit: move role classes to module, split in 2 files [puppet] - https://gerrit.wikimedia.org/r/278228
[02:17:58] (PS1) Dzahn: ganglia: fix "defined type defined inside a class" [puppet] - https://gerrit.wikimedia.org/r/278229
[02:26:13] (PS1) Dzahn: mediawiki/refreshlinks: move cronjob define out of class [puppet] - https://gerrit.wikimedia.org/r/278230
[02:29:20] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.17) (duration: 09m 58s)
[02:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:37] (PS1) Dzahn: mediawiki/updatequerypages: move defines out of class [puppet] - https://gerrit.wikimedia.org/r/278231
[02:37:28] (PS2) Dzahn: mediawiki/updatequerypages: move defines out of class [puppet] - https://gerrit.wikimedia.org/r/278231
[02:38:00] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Mar 18 02:38:00 UTC 2016 (duration 8m 40s)
[02:38:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:41:25] (PS1) Dzahn: base/syslogs: fix "defined typed defined inside a class" [puppet] - https://gerrit.wikimedia.org/r/278232
[02:47:43] (PS1) Dzahn: mha: let lint ignore nested classes/defines [puppet] - https://gerrit.wikimedia.org/r/278233
[03:44:42] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 56.67% of data above the critical threshold [5000000.0]
[03:58:33] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[04:00:22] (PS1) KartikMistry: WIP: cxserver: Read config from cxserver/deploy [puppet] - https://gerrit.wikimedia.org/r/278235
[05:29:32] PROBLEM - Check size of conntrack table on mw1166 is CRITICAL: CRITICAL: nf_conntrack is 91 % full
[05:41:43] PROBLEM - Check size of conntrack table on mw1166 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
[05:55:32] PROBLEM - Check size of conntrack table on mw1166 is CRITICAL: CRITICAL: nf_conntrack is 91 % full
[06:02:51] RECOVERY - cassandra-b CQL 10.64.32.203:9042 on restbase1012 is OK: TCP OK - 0.001 second response time on port 9042
[06:04:17] <_joe_> uhm this is no good
[06:04:54] <_joe_> AaronSchulz: I think we should revert back to persistent redis connections
[06:06:06] (CR) Aaron Schulz: "To see if it lowers the long-standing spam of redis connection errors any." [mediawiki-config] - https://gerrit.wikimedia.org/r/278090 (owner: Aaron Schulz)
[06:07:23] _joe_: I was wondering if they were broken since the last hhvm upgrade (since it's experimental) and the flood of exceptions immediately stopped afterwards. You can try to flip it on again and revert if something bad happens.
[06:09:58] of course there must me lots of TIME_WAIT socket spam now
[06:10:06] <_joe_> AaronSchulz: exactly :)
[06:10:08] s/me/be
[06:10:37] that's why I turned them on for runners to begin with ;)
[06:12:12] <_joe_> I mean if that doesn't work, I'll rollback and just enlarge the conntrack table a bit on the jobrunners
[06:13:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [5000000.0]
[06:14:44] ok
[06:18:11] RECOVERY - Check size of conntrack table on mw1166 is OK: OK: nf_conntrack is 67 % full
[06:20:33] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[06:20:34] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[06:22:22] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:23:22] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[06:28:28] * AaronSchulz heads out
[06:29:39] !log restarting elasticsearch server elastic1023.eqiad.wmnet
[06:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:31:41] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:12] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:21] PROBLEM - puppet last run on ms-be2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:01] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.036 second response time on port 9042
[06:33:02] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[06:33:42] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:42:49] (CR) Jcrespo: "Merging without an actual review? WTF?" [puppet/mariadb] - https://gerrit.wikimedia.org/r/278055 (https://phabricator.wikimedia.org/T127991) (owner: Ottomata)
[06:44:31] PROBLEM - Check size of conntrack table on mw1166 is CRITICAL: CRITICAL: nf_conntrack is 94 % full
[06:44:32] PROBLEM - Check size of conntrack table on mw1161 is CRITICAL: CRITICAL: nf_conntrack is 93 % full
[06:44:42] PROBLEM - Check size of conntrack table on mw1167 is CRITICAL: CRITICAL: nf_conntrack is 92 % full
[06:44:43] PROBLEM - Check size of conntrack table on mw1169 is CRITICAL: CRITICAL: nf_conntrack is 91 % full
[06:44:52] PROBLEM - Check size of conntrack table on mw1165 is CRITICAL: CRITICAL: nf_conntrack is 91 % full
[06:46:04] (CR) Jcrespo: "And more merges that break production without review." [puppet] - https://gerrit.wikimedia.org/r/278056 (https://phabricator.wikimedia.org/T127991) (owner: Ottomata)
[06:46:39] (CR) Jcrespo: "More production breakages." [puppet] - https://gerrit.wikimedia.org/r/278065 (https://phabricator.wikimedia.org/T127991) (owner: Ottomata)
[06:46:51] PROBLEM - Check size of conntrack table on mw1162 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
[06:48:02] RECOVERY - Check size of conntrack table on mw1166 is OK: OK: nf_conntrack is 69 % full
[06:48:02] RECOVERY - Check size of conntrack table on mw1161 is OK: OK: nf_conntrack is 67 % full
[06:48:12] RECOVERY - Check size of conntrack table on mw1167 is OK: OK: nf_conntrack is 65 % full
[06:48:21] RECOVERY - Check size of conntrack table on mw1169 is OK: OK: nf_conntrack is 54 % full
[06:48:21] RECOVERY - Check size of conntrack table on mw1165 is OK: OK: nf_conntrack is 59 % full
[06:48:32] RECOVERY - Check size of conntrack table on mw1162 is OK: OK: nf_conntrack is 52 % full
[06:50:41] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:12] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:56:42] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:57:02] PROBLEM - Check size of conntrack table on mw1169 is CRITICAL: CRITICAL: nf_conntrack is 92 % full
[06:58:21] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:42] RECOVERY - puppet last run on ms-be2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:24] (Abandoned) KartikMistry: WIP: CX: Use ordered_yaml instead of ordered_json [puppet] - https://gerrit.wikimedia.org/r/263550 (owner: KartikMistry)
[07:00:42] RECOVERY - Check size of conntrack table on mw1169 is OK: OK: nf_conntrack is 38 % full
[07:12:23] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago
[07:21:22] PROBLEM - MariaDB Slave Lag: s1 on dbstore2002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 616.32 seconds
[07:23:48] mmm
[07:24:41] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:25:33] it is swapping like crazy
[07:30:11] RECOVERY - MariaDB Slave Lag: s1 on dbstore2002 is OK: OK slave_sql_lag Replication lag: 280.24 seconds
[07:30:41] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0]
[07:31:43] Operations, Wikimedia-Stream: redis not up after reboot on rcs machines - https://phabricator.wikimedia.org/T130147#2133265 (Joe) a:Joe
[07:32:12] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 55.17% of data above the critical threshold [5000000.0]
[07:38:45] (PS1) Jcrespo: Reduce memory usage by TokuDB and InnoDB main buffers [puppet] - https://gerrit.wikimedia.org/r/278238 (https://phabricator.wikimedia.org/T107282)
[07:39:38] (CR) Jcrespo: [C: 2] Reduce memory usage by TokuDB and InnoDB main buffers [puppet] - https://gerrit.wikimedia.org/r/278238 (https://phabricator.wikimedia.org/T107282) (owner: Jcrespo)
[07:41:53] Operations, Analytics, Datasets-General-or-Unknown, Traffic: http://dumps.wikimedia.org should redirect to https:// - https://phabricator.wikimedia.org/T128587#2133313 (elukey) @ArielGlenn Hello! Really curious about the document that Daniel pointed out above.. Is it impossible to serve dumps only...
[07:41:59] !log restarting dbstore2002 to apply new mysql config
[07:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:42:42] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[07:44:52] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[07:54:12] PROBLEM - Check size of conntrack table on mw1168 is CRITICAL: CRITICAL: nf_conntrack is 90 % full
[07:56:01] RECOVERY - Check size of conntrack table on mw1168 is OK: OK: nf_conntrack is 38 % full
[08:06:07] !log rebooting mira for kernel update
[08:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:11:55] !log rearmed keyholder on mira
[08:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:14:24] Operations, Performance-Team: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2133351 (ori) >>! In T129963#2126064, @elukey wrote: > ori: I would be super interested in working on it, but before starting it would be great to discuss what are the key metrics to...
[08:15:22] PROBLEM - cassandra CQL 10.192.32.125:9042 on restbase2004 is CRITICAL: Connection refused
[08:15:42] PROBLEM - cassandra service on restbase2004 is CRITICAL: CRITICAL - Expecting active but unit cassandra is failed
[08:19:52] <_joe_> uhm, checking cassandra
[08:20:06] Operations, Analytics, Datasets-General-or-Unknown, Traffic: http://dumps.wikimedia.org should redirect to https:// - https://phabricator.wikimedia.org/T128587#2079753 (Peachey88) Appears it was previously on a separate certificate https://wikitech.wikimedia.org/w/index.php?title=Httpsless_domains...
[08:22:43] RECOVERY - cassandra service on restbase2004 is OK: OK - cassandra is active
[08:22:44] <_joe_> !log started cassandra on restbase2004
[08:22:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:23:12] PROBLEM - puppet last run on mw2152 is CRITICAL: CRITICAL: puppet fail
[08:24:11] RECOVERY - cassandra CQL 10.192.32.125:9042 on restbase2004 is OK: TCP OK - 0.041 second response time on port 9042
[08:42:19] (PS1) Alexandros Kosiaris: stdlib: import deep_merge function [puppet] - https://gerrit.wikimedia.org/r/278241
[08:50:02] RECOVERY - puppet last run on mw2152 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[08:50:26] <_joe_> akosiaris: uhm I think I wrote a better version of it
[08:51:06] (PS1) Alexandros Kosiaris: ores: Collapse the redis configs into one stanza [puppet] - https://gerrit.wikimedia.org/r/278242 (https://phabricator.wikimedia.org/T124200)
[08:51:08] (PS1) Alexandros Kosiaris: ores: define slaveof as a parameter [puppet] - https://gerrit.wikimedia.org/r/278243 (https://phabricator.wikimedia.org/T124200)
[08:51:58] _joe_: it's in our puppet repo? couldn't find something
[08:52:15] akosiaris: problem with the bus. I'm going to be late for 10am. I'll ping you as soon as I am back...
[08:52:24] gehel: ok
[08:52:26] (CR) jenkins-bot: [V: -1] ores: Collapse the redis configs into one stanza [puppet] - https://gerrit.wikimedia.org/r/278242 (https://phabricator.wikimedia.org/T124200) (owner: Alexandros Kosiaris)
[08:52:33] <_joe_> akosiaris: it was in a change, where later we didn't use it
[08:52:38] (CR) jenkins-bot: [V: -1] ores: define slaveof as a parameter [puppet] - https://gerrit.wikimedia.org/r/278243 (https://phabricator.wikimedia.org/T124200) (owner: Alexandros Kosiaris)
[08:52:40] <_joe_> so it's in some old refs :P
[08:53:03] <_joe_> akosiaris: re: redis, have you seen what I did with the other redis clusters?
[08:53:16] I 've seen the multidc one
[08:53:16] <_joe_> we should do the same multidc replica for ores too
[08:53:18] (PS2) Alexandros Kosiaris: stdlib: import deep_merge function [puppet] - https://gerrit.wikimedia.org/r/278241
[08:53:22] quite ingenious
[08:53:44] <_joe_> akosiaris: I did the same for the jobqueue redises here https://gerrit.wikimedia.org/r/#/c/276980/
[08:54:03] I was aiming for just eqiad for now
[08:54:14] but yes, all in all we will need it in codfw too
[08:54:32] <_joe_> it should be pretty easy to convert anyways
[08:54:33] so if I can apply it without too much hassle...
[08:55:31] (Abandoned) Elukey: Remove rdb1001 from the Redis Job Queues for maintenance. [mediawiki-config] - https://gerrit.wikimedia.org/r/276452 (https://phabricator.wikimedia.org/T123675) (owner: Elukey)
[08:56:03] damned puppet lint...
[08:57:21] and of course puppet-lints suggestion is all wrong...
[09:01:00] !log rolling reboot of mw1001 to mw1016 for kernel upgrade
[09:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:01:23] (PS2) Alexandros Kosiaris: ores: Collapse the redis configs into one stanza [puppet] - https://gerrit.wikimedia.org/r/278242 (https://phabricator.wikimedia.org/T124200)
[09:01:25] (PS2) Alexandros Kosiaris: ores: define slaveof as a parameter [puppet] - https://gerrit.wikimedia.org/r/278243 (https://phabricator.wikimedia.org/T124200)
[09:03:08] (PS1) Elukey: Remove rdb1005 from the Job Queue pool for maintenance. [mediawiki-config] - https://gerrit.wikimedia.org/r/278244 (https://phabricator.wikimedia.org/T123675)
[09:03:54] ---^ _joe_
[09:07:21] (PS1) Elukey: Remove rdb1005 from the Job Runner configs for maintenance. [puppet] - https://gerrit.wikimedia.org/r/278245 (https://phabricator.wikimedia.org/T123675)
[09:08:11] !log Issuing nodetool scrub -s -- local_group_wikipedia_T_parsoid_html data on restbase2004.eqiad.wmnet : T130254
[09:08:12] T130254: Investigate recent OOM events on restbase2004 - https://phabricator.wikimedia.org/T130254
[09:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:09:04] (Abandoned) Elukey: Remove rdb1005 from the Job Runner configs for maintenance. [puppet] - https://gerrit.wikimedia.org/r/278245 (https://phabricator.wikimedia.org/T123675) (owner: Elukey)
[09:10:46] (PS1) Elukey: Remove rdb1005 from the Job Runners config for maintenance. [puppet] - https://gerrit.wikimedia.org/r/278246 (https://phabricator.wikimedia.org/T123675)
[09:13:28] <_joe_> elukey: I'd start with 1005
[09:15:18] _joe_ yep I should have removed 1005 in https://gerrit.wikimedia.org/r/#/c/278246/ and https://gerrit.wikimedia.org/r/#/c/278244/1
[09:17:56] (CR) Giuseppe Lavagetto: [C: 1] Remove rdb1005 from the Job Runners config for maintenance. [puppet] - https://gerrit.wikimedia.org/r/278246 (https://phabricator.wikimedia.org/T123675) (owner: Elukey)
[09:18:07] akosiaris: I'm back
[09:18:15] w
[09:18:16] wb
[09:18:19] * gehel is ready to learn new and great things...
[09:19:15] !log rolling reboot of swift frontend servers in codfw for kernel upgrade
[09:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:20:16] (CR) Giuseppe Lavagetto: [C: 1] Remove rdb1005 from the Job Queue pool for maintenance. [mediawiki-config] - https://gerrit.wikimedia.org/r/278244 (https://phabricator.wikimedia.org/T123675) (owner: Elukey)
[09:21:12] (CR) Elukey: [C: 2] Remove rdb1005 from the Job Queue pool for maintenance. [mediawiki-config] - https://gerrit.wikimedia.org/r/278244 (https://phabricator.wikimedia.org/T123675) (owner: Elukey)
[09:24:02] !log elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: Remove rdb1005 from the Redis Job Queues for maintenance (duration: 01m 07s)
[09:24:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:26:04] (PS4) Gehel: Enabling HTTPS access to elasticsearch via LVS. [puppet] - https://gerrit.wikimedia.org/r/277956 (https://phabricator.wikimedia.org/T124444)
[09:27:20] (CR) Alexandros Kosiaris: [C: 1] Enabling HTTPS access to elasticsearch via LVS. [puppet] - https://gerrit.wikimedia.org/r/277956 (https://phabricator.wikimedia.org/T124444) (owner: Gehel)
[09:27:32] gehel: I think we are finally ready to merge that
[09:28:19] akosiaris: just a few questions to make sure I understand what I'm doing...
[09:29:14] I suppose they involve conftool ?
[09:29:20] in services.yaml, the "weight:" is the default weight of each backend? So we could send out more traffic to a more powerfull node by overriding it in the nodes folder? Correct?
[09:29:26] yes
[09:29:37] er
[09:29:43] not exactly
[09:30:07] <_joe_> weight is a dynamic state, not strictly config
[09:30:08] you override it by using the following command
[09:30:13] <_joe_> so it stays in conftool
[09:31:05] confctl --tags dc=eqiad,cluster=sca,service=elasticsearch-ssl --action set/weight=20 elastic1001.eqiad.wmnet
[09:31:13] ran on palladium as root
[09:31:21] <_joe_> cluster=elasticsearch
[09:31:24] <_joe_> or
[09:31:34] er yes
[09:31:39] why only dynamic?
[09:31:52] (CR) Elukey: [C: 2] Remove rdb1005 from the Job Runners config for maintenance. [puppet] - https://gerrit.wikimedia.org/r/278246 (https://phabricator.wikimedia.org/T123675) (owner: Elukey)
[09:32:06] <_joe_> because else we'd still use config files to sync around
[09:32:17] It looks to me like something I would like to be traced in version control (but what do I know).
[09:32:19] <_joe_> I am unconvinced we need conftool-data
[09:32:42] !log removed rdb1005 from the Job Runners config for maintenance
[09:32:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:32:46] Operations, DBA, Patch-For-Review: Puppetize pt-heartbeat on MariaDB10 masters and its corresponding checks on icinga - https://phabricator.wikimedia.org/T114752#2133418 (jcrespo) check dbstore2001, it seems to have issues with pt-heartbeat.
[09:32:51] <_joe_> gehel: why?
[09:32:59] <_joe_> I mean what's the benefit exactly?
[09:33:00] ok, we could go full service discovery and have each node announce itself to the cluster
[09:33:37] random question: can this weight be used to gracefully depool a node and avoid rejecting inflight queries when we restart es nodes?
[09:33:38] <_joe_> gehel: that would make sense if we had an elastic enviromnent, in our situation we should just deduplicate the inventory we have
[09:33:44] <_joe_> dcausse: yes
[09:33:56] <_joe_> dcausse: in fact, that's called "draining"
[09:33:58] if the weight of a node is a decision I take, I want it to be versionned. If it is something intrinsic to a node (weight ~ # of CPU for example) than I don't want it to be versionned
[09:34:07] _joe_: ok, thanks
[09:34:11] <_joe_> gehel: ^^ dcausse just answered you
[09:34:21] <_joe_> (why it's dynamic)
[09:34:38] make sense...
[09:34:52] next question, what is the "pooled" attribute?
[09:35:12] pooled = traffic goes to backend appserver, depooled = no traffic
[09:35:13] <_joe_> gehel: that decides if the node is included in the ipvs config by pybal
[09:35:49] <_joe_> gehel: so - pooled = yes means if the node is up, it will be in the destinations pool for the specific ipvs service
[09:35:52] so that can be used to have "hot standbies" ?
[09:36:17] <_joe_> pooled = no means it will be checked for health and accounted for in all calculations for depool threshold
[09:36:32] <_joe_> pooled = inactive means the load balancer will not consider the node at all
[09:36:49] actually it is meant to be used for code deployment. pool=>no, deploy code, check all is good, pool=> yes
[09:36:58] we are not there yet though
[09:37:00] <_joe_> akosiaris: yes
[09:37:01] !log forcing puppet agent and restarting jobchron on all the Job Runners and VideoScalers as rdb1005 has been removed from the configs.
[09:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:37:14] <_joe_> akosiaris: well with scap3 it should be possible
[09:37:34] <_joe_> anyways, let me make conftool suck marginally less :)
[09:37:34] yeah, I said we are no there yet but we are on the way to get there :-)
[09:38:38] akosiaris, _joe_: the conf for elasticsearch services has "pooled: no", which means that backends are not included in ipvs service?
[09:39:01] * gehel is puzzled ...
[09:39:08] that's the default
[09:39:16] it only means you have to pool the hosts yourself
[09:39:34] <_joe_> gehel: that means that when you add a new server it is depooled by default
[09:39:35] but we can change that to yes if you find no reason for the default to be no
[09:39:43] so if I deploy a new node, it will not be included by default. Make sense
[09:39:49] yup
[09:39:50] <_joe_> I strongly advise against that akosiaris
[09:39:57] well, it depends on the service I think
[09:40:07] * gehel has no opinion yet
[09:40:15] there might be cases where it makes sense to have the default to yes
[09:40:23] that being said, I 've up to now always defaulted to no
[09:41:30] last question (I think) why do we split services in different config files? Just readability?
[09:42:27] that one I 've never had to think about up to now, I 'll let the designer (_joe_) reply
[09:42:31] brb
[09:45:47] <_joe_> gehel: yes
[09:46:03] <_joe_> but I'm completely open to suggestions on how to make it better
[09:46:29] !log rolling-reboot ms-be1* for kernel updates
[09:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:48:57] _joe_: I'm just making sure I'm not assuming something wrong. It seems like a very reasonnable reason to me...
[09:50:10] back
[09:50:26] Ok, so ready for deployment ...
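(Editor's aside, not part of the channel log.) The three `pooled` values and the per-node weight discussed above can be modelled roughly as below. This is a minimal sketch, not conftool or pybal code: the `Backend` class, the helper names, the default weight of 10, and the idea that draining is done by setting the weight to 0 are all assumptions made for illustration. Only the meaning of `yes`, `no` and `inactive`, and the "depooled by default" behaviour, come from the conversation.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical model of one backend node as seen by the load balancer."""
    host: str
    pooled: str = "no"    # new nodes start depooled, as described in the channel
    weight: int = 10      # assumed default; the real value is not given in the log
    healthy: bool = True


def receives_traffic(b: Backend) -> bool:
    # pooled = yes: if the node is up, it is in the destination pool.
    # Assumption: a weight of 0 is how a node gets drained.
    return b.pooled == "yes" and b.healthy and b.weight > 0


def is_health_checked(b: Backend) -> bool:
    # pooled = no: still checked for health and counted for the depool threshold.
    # pooled = inactive: the load balancer does not consider the node at all.
    return b.pooled in ("yes", "no")


def traffic_shares(backends):
    """Return each serving host's fraction of traffic, proportional to weight."""
    serving = [b for b in backends if receives_traffic(b)]
    total = sum(b.weight for b in serving)
    return {b.host: b.weight / total for b in serving} if total else {}
```

With two pooled nodes of weight 20 and 10, the first would get two thirds of the traffic in this model; a node left at the default `pooled = "no"` gets none but is still health-checked.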
[09:55:24] so, I should 1) check which LVS is the active one (and update that one last), 2) deactivate puppet on impacted LVS servers to make sure they are not updated ahead of time 3) merge the change 4) apply it on inactive LVS 5) check logs 6) apply on active LVS [09:55:35] Am I missing something obvious? [09:57:09] er, not exactly [09:57:35] akosiaris: of course not ;-) [09:57:42] so skip 2. and change 4 with "restart pybal on inactive LVS" and 6 with "restart pybal on active LVS" [09:58:15] so, 2 will not hurt if you do it, but it is not needed. puppet WILL NOT restart pybal (and that's on purpose) [09:58:25] akosiaris: puppet agent is not sufficient to activate the change? [09:58:29] rgr [09:58:35] <_joe_> nope [09:58:53] <_joe_> by choice, as it is a really critical system [09:59:13] no cause we need a human to time the pybal restarts and assess the situation [09:59:17] and that reminds me [09:59:50] add a 5.5) wait 5 mins for BGP to converge [10:00:39] of course BGP will converge faster than that but it is like that 'sync ; sync ; sync' command [10:01:21] give it enough time and don't just restart pybal on both hosts together or very close to each other [10:02:05] * akosiaris waits anxiously to see if gehel will win his t-shirt today [10:02:10] akosiaris: ok. So lvs for eqiad / elasticsearch are 1003, 1006, 1009 and 1012, for codfw, 2003 and2006. Correct? [10:02:17] yup [10:02:26] * gehel has enough clean t-shirts at the moment :P [10:02:52] not that kind of t-shirt... 
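The corrected procedure above (restart pybal on the inactive LVS hosts first, check logs, give BGP time to converge, restart the active host last) can be sketched as an ordered plan. This only builds a list of steps and runs nothing; `restart_plan` and its parameters are hypothetical names, and the 300 second pause is the "wait 5 mins" from the conversation.

```python
def restart_plan(lvs_hosts, active, settle_seconds=300):
    """Return the ordered steps for rolling out a pybal config change.

    lvs_hosts: all LVS hosts carrying the service.
    active:    the host currently serving traffic; it is restarted last.
    """
    if active not in lvs_hosts:
        raise ValueError(f"{active} is not in the list of LVS hosts")

    steps = []
    # Inactive hosts first, one at a time, each followed by a log check.
    for host in sorted(h for h in lvs_hosts if h != active):
        steps.append(("restart pybal", host))
        steps.append(("check logs", host))
    # Never restart both members of a pair close together.
    steps.append(("wait", f"{settle_seconds}s for BGP to converge"))
    steps.append(("restart pybal", active))
    steps.append(("check logs", active))
    return steps
```

For example, `restart_plan(["lvs1003", "lvs1006"], active="lvs1003")` puts the lvs1006 restart first and the lvs1003 restart after the wait.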
[10:03:01] <_joe_> gehel: modifying lvs is the real first "line-of-fire" assignment [10:03:16] * gehel is not feeling the pressure at all :P [10:03:17] you will find out what kind of t-shirt we refer to eventually [10:03:18] I mean if seach doesn't work, it is not a big deal, just users not being able to reach the data they want to read [10:03:19] <_joe_> you know that if you fuck up, you might cause the whole site to be down :) [10:03:42] ahahaha [10:03:48] how do I check which LVS is active? I understood in the docs that I should be able to ssh search.svc.eqiad.wmnet... [10:03:49] <_joe_> jynus: if he does some mistake here, it's also "mediawiki" that goes down [10:03:53] I am sure gehel feels much more relaxed now [10:03:56] true [10:04:19] I was being sarcastic. Search down ~= site down [10:05:25] Seems I need to log as root on LVS? At least my SSH key seems to be refused. [10:06:02] I do see a gehel account there [10:06:28] my mistake, I get a connection refused. Am I using the wrong bastion ? [10:06:30] !log rolling reboot of swift backend servers in codfw for kernel upgrade [10:06:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:07:01] gehel: which lvs are you trying to connect to ? [10:07:06] full hostname please [10:07:57] I understood from the docs that by trying to SSH into the service, I would end up in the active LVS (my networking-fu is at its limit here...) [10:08:11] er, no... [10:08:48] just ssh into lvs1003.wikimedia.org and lvs1006.wikimedia.org and lvs1009.eqiad.wmnet and lvs1012.eqiad.wmnet [10:08:55] notice how the hostnames change btw [10:09:10] that's because lvs1001-lvs1006 are legacy and mean to be replaced [10:09:18] that's just mean!! [10:09:48] I know, which is why I just saved you 5 minutes ;-) [10:10:32] akosiaris: thanks! [10:10:57] * gehel is ready to do some real damage [10:11:13] how do I check which one is active? [10:12:07] it's always the lowest numbered one that is not dead for now. 
It's governed by BGP config [10:12:16] lvs1009 has a lot of logging activity [10:12:58] seems that a lot of the backend checks are failing for lvs1009 [10:13:06] hmmm too many down on lvs1009 indeed [10:13:59] we should probably have a look into this before adding my change... [10:15:02] PROBLEM - DPKG on mw1152 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:15:19] gehel https://phabricator.wikimedia.org/T104458 [10:15:24] I 'd say known [10:16:06] or not ... [10:16:24] not sure how to read that task... is there something still to do there ? [10:16:40] there is one ethernet interface at least offline on lvs1009 [10:18:25] akosiaris: I'm reading, but not understanding all of it... [10:18:39] ah, ok known it seems [10:18:40] lvs1009 eth1 asw-c xe-8/0/28 + move to ?? (needs uplink module in 3, 4, or 6) [10:18:46] but for some reason stuck since Dec ? [10:19:36] 6Operations, 10ops-eqiad, 6DC-Ops, 10Traffic, and 2 others: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1417800 (10akosiaris) I suppose the above table means there is still some actions to be done, hence the task not being closed yet ? [10:19:52] ah, the sfp problems ... [10:20:00] akosiaris: but no link on eth3, not eth1 [10:20:18] yes but eth1 does not have LLDP info either [10:20:19] gehel@lvs1009:~$ sudo ethtool eth3 | grep Link [10:20:19] Link detected: no [10:20:24] which is worrying as well [10:20:32] sudo lldpctl [10:20:50] but obviously the migration is not done yet [10:21:09] I 'd say ignore it for now, just go on with the changes [10:21:19] lvs1009 and lvs1012 are not serving traffic anyway [10:21:24] not will they in any case [10:21:33] lvs1003 is the primary and lvs1006 the backup [10:22:14] how did you check that lvs1003 is the primary? Yesterday, you told me that in principle it is the lowest one... 
[10:22:21] RECOVERY - DPKG on mw1152 is OK: All packages OK [10:22:29] lvs1003 is the lowest one [10:22:34] lvs1003 and lvs1006 are a group [10:22:41] lvs1001 and lvs1004 another group [10:22:50] and lvs1002 and lvs1005 another [10:22:55] it was the "in principle" part that I wanted to check [10:23:20] I understood that there was exception to the rule... so wanted to check [10:23:34] ah, no, there is no exception to the rule [10:23:48] unless lvs1003 has crashed... [10:24:04] yes in which case, lvs1006 will take over [10:24:13] (or pybal is restarted btw) [10:24:27] so, when lvs1003 crashes or pybal is restarted [10:24:27] ok, let me rebase my change one more time and I'll merge it [10:24:50] pybal will stop advertising (for a short while in the restart case) the routes [10:25:01] the junipers then will just route the traffic to lvs1006 [10:25:13] (03PS5) 10Gehel: Enabling HTTPS access to elasticsearch via LVS. [puppet] - 10https://gerrit.wikimedia.org/r/277956 (https://phabricator.wikimedia.org/T124444) [10:25:41] when lvs1003 is back up (or pybal has been started fully in the restart case) it will take over once more and traffic will flow back to it since it's preferred [10:25:57] make sense... [10:26:38] (03CR) 10Gehel: [C: 032] Enabling HTTPS access to elasticsearch via LVS. 
[puppet] - 10https://gerrit.wikimedia.org/r/277956 (https://phabricator.wikimedia.org/T124444) (owner: 10Gehel) [10:26:55] btw the failover happens on the routers BGP configuration, NOT pybal [10:27:08] so, NOT on the lvs level [10:27:19] there is nothing on lvs1003 that tells it that it is the primary [10:27:27] !log activating elasticsearch HTTPS on LVS for eqiad - https://gerrit.wikimedia.org/r/#/c/277956/ [10:27:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:28:27] change merged, running puppet --noop [10:29:31] that looks reasonnably like what I would expect, applying puppet change [10:31:29] restarting pybal on lvs1006 [10:32:13] (03PS1) 10Jcrespo: Configure for the first time db1074-db1078 [puppet] - 10https://gerrit.wikimedia.org/r/278255 (https://phabricator.wikimedia.org/T130351) [10:33:18] !log gehel: restarting pybal on lvs1006 [10:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:33:43] gehel: so now you need to activate the nodes [10:33:51] akosiaris: thanks for the log! [10:33:58] assuming everything is ok after the pybal restart [10:34:19] !log reboot ms-fe1003 for kernel upgrade [10:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:34:31] akosiaris: looking at the logs on 1009, I see Mar 18 10:32:11 lvs1006 pybal[4133]: [pybal] INFO: Created LVS service 'search-https_9243' which looks good. [10:35:24] I also see a lot of "New enabled server ..." but none about elasticsearch. Why is that? [10:35:25] I see ipvsadm is happy [10:35:45] [search-https_9243] INFO: New enabled server elastic1010.eqiad.wmnet, weight 30 ? [10:36:55] (03PS1) 10Ema: Skip backends, not directors in test VCL code [puppet] - 10https://gerrit.wikimedia.org/r/278256 (https://phabricator.wikimedia.org/T128188) [10:37:27] akosiaris: I'm blind ... 
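The failover rule explained above (within a group, traffic goes to the lowest numbered LVS that is still advertising its routes, and the choice is made by the routers' BGP configuration, not by pybal) can be sketched as follows. This is a toy model under that stated rule; `traffic_target` and the `advertising` set are hypothetical and stand in for what the routers see.

```python
def host_number(hostname):
    """Extract the numeric part of a name such as 'lvs1003.wikimedia.org'."""
    short = hostname.split(".")[0]
    return int("".join(ch for ch in short if ch.isdigit()))


def traffic_target(group, advertising):
    """Pick the host the routers would prefer.

    group:       LVS hosts forming one failover group.
    advertising: hosts currently announcing the service routes
                 (a crashed host, or one mid pybal restart, is absent).
    Returns None if no host in the group is advertising.
    """
    candidates = [h for h in group if h in advertising]
    if not candidates:
        return None
    return min(candidates, key=host_number)
```

When lvs1003 stops advertising, the model returns lvs1006; once lvs1003 is back it is preferred again, matching the behaviour described in the log.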
[10:38:16] er, so we had the obvious mistake in our config [10:38:20] sigh [10:38:25] sudo journalctl -u pybal -n 1000 | grep elastic [10:38:39] so normally we would need one more step [10:38:43] actually enabling all this [10:38:51] lemme submit a quick fix [10:39:39] (03PS2) 10Jcrespo: Configure for the first time db1074-db1078 [puppet] - 10https://gerrit.wikimedia.org/r/278255 (https://phabricator.wikimedia.org/T130351) [10:40:02] akosiaris: feel free to fix whatever you want! Thanks! [10:40:12] so, it's a one liner, submitting it right now [10:40:16] (03PS1) 10Alexandros Kosiaris: elasticsearch: Actually use the new elasticsearch-ssl service [puppet] - 10https://gerrit.wikimedia.org/r/278257 [10:40:19] ^ [10:40:34] we 've defined the new elasticsearch-ssl services but never told pybal to use it [10:40:52] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] elasticsearch: Actually use the new elasticsearch-ssl service [puppet] - 10https://gerrit.wikimedia.org/r/278257 (owner: 10Alexandros Kosiaris) [10:41:43] you did add that line in the "search" section, not the "search-https" section [10:41:50] what ? [10:42:10] shit! [10:42:33] akosiaris: It looked strange to me, but ... [10:42:50] thank god pybal will not depool all of them [10:42:52] (03PS3) 10Jcrespo: Configure for the first time db1074-db1078 [puppet] - 10https://gerrit.wikimedia.org/r/278255 (https://phabricator.wikimedia.org/T130351) [10:43:12] phew.. thanks [10:43:17] * akosiaris feels like an idiot [10:43:31] * gehel is happy to not be the one who made the typo! [10:44:02] akosiaris: you need to already be pretty smart to be in a position to make that mistake :P [10:44:20] thanks for sugar coating it [10:44:25] :P [10:44:31] just the truth... [10:45:03] (03PS1) 10Alexandros Kosiaris: Fix typo introduced in the previous commit [puppet] - 10https://gerrit.wikimedia.org/r/278258 [10:45:03] it's a good thing you spotted it before I restart pybal [10:45:14] if the code is worth writing, it is worth reviewing... 
[10:45:27] well, we would not be having a problem until we restarted pybal on lvs1003
[10:45:32] but it would be too late by then
[10:45:34] let me do the restart, at least you can blame the new guy
[10:45:41] lol
[10:45:53] so, lemme fix that mistake I made
[10:46:17] (CR) Alexandros Kosiaris: [C: 2 V: 2] "gehel spotted it. Thanks!" [puppet] - https://gerrit.wikimedia.org/r/278258 (owner: Alexandros Kosiaris)
[10:46:45] so now
[10:46:47] Running conftool-sync on /etc/conftool/data
[10:46:47] WARNING:conftool:Setting pooled to the default value no
[10:46:47] WARNING:conftool:Setting weight to the default value 10
[10:46:47] eqiad: Creating node elastic1025.eqiad.wmnet for cluster elasticsearch/elasticsearch-ssl
[10:46:50] and so on
[10:46:54] on palladium
[10:47:55] now if you do on palladium confctl --find --action get elastic1025.eqiad.wmnet
[10:47:55] you get
[10:47:55] {"elastic1025.eqiad.wmnet": {"pooled": "yes", "weight": 30}}
[10:47:55] {"elastic1025.eqiad.wmnet": {"pooled": "no", "weight": 10}}
[10:47:57] what is this conftool-sync? I missed that step...
[10:48:17] it is done automagically at the end of puppet-merge if there is an actual change in lvs config
[10:48:28] we had missed that step which is why you did not see it
[10:48:36] ok
[10:48:45] so, elasticsearch-ssl hosts depooled as per default
[10:48:58] !log dbstore2002 just crashed
[10:49:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:49:17] so I can run puppet again on all LVS and restart pybal on 1009 again?
[10:49:23] yes [10:49:28] s/1009/1006/ [10:49:50] yeah, 1009 and 1012 should be done as well at some point, just to keep the consistency [10:50:08] ok, running puppet --noop on lvs1003, 1006, 1009 and 1012 [10:50:40] looks reasonable, applying puppet [10:51:42] !log restarting pybal on lvs1006 [10:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:53:54] (03PS2) 10Ema: Skip backends, not directors in test VCL code [puppet] - 10https://gerrit.wikimedia.org/r/278256 (https://phabricator.wikimedia.org/T128188) [10:53:56] akosiaris: "sudo journalctl -u pybal -n 1000 | grep elastic" still returns nothing ... [10:54:28] sudo ipvsadm -L -t 10.2.2.30:9243 [10:54:28] Prot LocalAddress:Port Scheduler Flags [10:54:28] -> RemoteAddress:Port Forward Weight ActiveConn InActConn [10:54:28] TCP search.svc.eqiad.wmnet:9243 wrr [10:54:40] so now we are at the stage we should have been earlier [10:54:44] 6Operations, 10Monitoring, 10Pybal: pybal backends health check streamlb could depool server - https://phabricator.wikimedia.org/T130143#2133600 (10fgiunchedi) @joe indeed, I was confused by not all lvs in `high-traffic2` reporting the error, but that's due to T112781 and T104458 [10:54:59] that is if we had merge the correct change in the first place [10:55:03] so, there is one last step [10:55:14] it is to set to "pooled=yes" all the hosts [10:55:33] as root on palladium [10:55:45] (03PS4) 10Jcrespo: Configure for the first time db1074-db1078 [puppet] - 10https://gerrit.wikimedia.org/r/278255 (https://phabricator.wikimedia.org/T130351) [10:56:09] confctl --tags dc=eqiad,cluster=elasticsearch,service=elasticsearch-ssl --action set pooled=yes elastic1001.eqiad.wmnet [10:56:20] feel free to for loop it ;-) [10:56:22] <_joe_> confctl --tags dc=eqiad,cluster=elasticsearch,service=elasticsearch-ssl --action set/pooled=yes all [10:56:30] <_joe_> ghe^^ [10:56:33] <_joe_> err [10:56:36] 6Operations, 5Continuous-Integration-Scaling, 7HHVM: Provide a 
HHVM package for jessie-wikimedia matching version of trusty-wikimedia - https://phabricator.wikimedia.org/T125821#2133603 (10Paladox) @hashar Also we can now select the php flavours so we would need to update he description. [10:56:37] oh all works ? [10:56:41] <_joe_> gehel: ^^ mine is correct :) [10:56:47] I 've never managed to make that work [10:56:47] <_joe_> akosiaris: yes [10:56:52] <_joe_> wat? [10:57:02] somehow I 've always failed [10:57:14] <_joe_> means you're not sure of what you're doing [10:57:14] I 'd usually be under the pressure so I would fallback to the for loop [10:57:20] ahahahaha [10:57:22] rotfl [10:57:37] yeah, obviously I was not sure the "all" would work [10:57:51] <_joe_> no that was a pun [10:58:02] hmm I think I failed with "--find" [10:58:03] not sure [10:58:04] lemme see [10:58:19] <_joe_> you know that when you select a sizable part of a cluster conftool will ask you to type in "Yes, I am sure of what I am doing." [10:58:23] !log activating elasticsearch-ssl service on LVS / eqiad [10:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:59:02] _joe_: Am I allowed to copy/paste that confirmation? :P [10:59:09] 6Operations, 10Monitoring, 10Pybal: pybal backends health check streamlb could depool server - https://phabricator.wikimedia.org/T130143#2133605 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi [10:59:33] ERROR:conftool:Failure writing to the kvstore: Backend error: The request requires user authentication : Insufficient credentials [11:01:15] sudo -i [11:01:20] not sudo -s [11:01:47] or symlink .etcdrc to /root's from your home dir if you want sudo -s (this is what I have done) [11:02:01] of course ... 
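The two ways of pooling the new service discussed above (a per-host loop, or a single call with "all") can be sketched as a dry run that only builds the command strings. The tags and the `set/pooled=yes` action syntax are copied from the line _joe_ calls correct; using that same syntax in the per-host form is an assumption, and `pool_commands` is a hypothetical helper. Nothing here invokes confctl.

```python
TAGS = "dc=eqiad,cluster=elasticsearch,service=elasticsearch-ssl"


def pool_commands(hosts=None, tags=TAGS):
    """Build confctl command lines that set pooled=yes.

    hosts=None produces the single 'all' form; a list of hostnames
    produces one command per host (the "for loop" variant).
    """
    base = f"confctl --tags {tags} --action set/pooled=yes"
    if hosts is None:
        return [f"{base} all"]
    return [f"{base} {host}" for host in hosts]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in pool_commands(["elastic1001.eqiad.wmnet",
                              "elastic1002.eqiad.wmnet"]):
        print(cmd)
```

As noted in the conversation, the real tool needs etcd credentials (hence `sudo -i`) and asks for confirmation when a sizable part of a cluster is selected, so printing first and running by hand keeps a human in the loop.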
[11:02:48] ipvsadm looks good, I've seen activity in pybal logs on lvs1006
[11:03:15] confirmed
[11:03:22] now, test it ;-)
[11:03:23] Can I now restart pybal on 1009 and 1012=
[11:03:34] yeah, but not yet 1003
[11:03:37] ok
[11:04:11] !log restarting pybal on lvs1009
[11:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:04:44] still plenty of error in logs on lvs1009 (expected)
[11:04:53] !log restarting pybal on lvs1012
[11:04:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:05:45] !log rolling reboot of mw1161 to mw1169 for kernel upgrade
[11:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:07:15] akosiaris: Mar 18 11:05:16 lvs1012 pybal[5168]: Memory allocation problem
[11:07:20] should I be worried?
[11:12:39] <_joe_> gehel: that's the ipvsadm output, and yes, we should be moderately worried
[11:13:21] sudo ipvsadm -L -t 10.2.2.30:9244
[11:13:21] Memory allocation problem
[11:13:21] akosiaris@lvs1006:~$ sudo ipvsadm -L -t 10.2.2.30:9243
[11:13:21] Prot LocalAddress:Port Scheduler Flags
[11:13:21] -> RemoteAddress:Port Forward Weight ActiveConn InActConn
[11:13:22] TCP search.svc.eqiad.wmnet:9243 wrr
[11:13:23] -> elastic1001.eqiad.wmnet:9243 Route 10 0 0
[11:13:24] -> elastic1002.eqiad.wmnet:9243 Route 10 0 0
[11:13:28] notice the difference
[11:13:36] 9244 instead of 9243
[11:13:40] and the output error ...
[11:13:55] It's hugely worrying the first time you see it
[11:14:03] then you realize it's just crap...
[11:14:54] very bad input handling ... anyway, where did you see this gehel ?
[11:15:42] gehel@lvs1012:~$ sudo journalctl -u pybal -n 1000 | grep Memory [11:15:59] but I do not see the same on 1006 or 1009 [11:16:52] (03CR) 10Jcrespo: [C: 032] Configure for the first time db1074-db1078 [puppet] - 10https://gerrit.wikimedia.org/r/278255 (https://phabricator.wikimedia.org/T130351) (owner: 10Jcrespo) [11:17:23] I'm wondering what could be different on lvs1012 ... [11:19:01] akosiaris: I'll be back in 5', need some coffee... [11:19:12] hmmm only once.. probably checking if it existed ? [11:23:58] !log powercycled mw1163, hung on reboot and serial console stuck [11:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:26:43] akosiaris: so let's ignore it? [11:27:03] akosiaris: how do I test that change before restarting lvs1003? [11:32:35] (03PS1) 10Elukey: Revert "Remove rdb1005 from the Job Queue pool for maintenance." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278262 [11:33:34] ---^ _joe_ all good, rdb1005's keys have been restored and everything looks good. Re-adding it to the mw pool [11:35:22] (03CR) 10Gehel: [C: 031] "trivial enough..." [puppet] - 10https://gerrit.wikimedia.org/r/278232 (owner: 10Dzahn) [11:35:28] I think I have avoided the paging on the new dbs setup [11:36:10] but just in case (I cannot ack or disable checks that do not exist), they are db1074 up to db1078 [11:38:18] (03CR) 10Filippo Giunchedi: [WIP]: write cassandra instance yaml descriptors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277865 (owner: 10Eevans) [11:40:26] yeah let's ignore it [11:40:42] gehel: wget https://search.svc.eqiad.wmnet/something from a bastionhost ? [11:40:47] (03CR) 10Elukey: [C: 032] Revert "Remove rdb1005 from the Job Queue pool for maintenance." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/278262 (owner: 10Elukey) [11:40:56] er wget https://search.svc.eqiad.wmnet:9243/something [11:41:56] akosiaris: but that's going to work only *after* restarting pybal on lvs1003, no ? [11:42:29] akosiaris: is there any way to test before activation (appart from checking the logs, ipvsadmin, ...) [11:42:40] ah yeah, it's using the same IP.. damn [11:42:46] well, no then [11:43:08] hmm unless [11:43:24] !log elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: Add rdb1005 back to the Redis Job Queues after maintenance (duration: 01m 22s) [11:43:26] if it was using a different IP, that would be advertised to BGP only from the restarted hosts, so I could check it, right? [11:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:43:40] yes [11:43:47] but you can test it from lvs1006 itself [11:43:51] * gehel still thinks that BGP is half magic, half unicorns [11:44:17] 6Operations, 6Commons, 10MediaWiki-Page-deletion: API request failed (internal_api_error_MWException): [408e8b0f] Exception Caught: Could not acquire lock for 'Full_size_20150703094950ನಿಲ್ಲದ_ಬರವಣಿಗೆ.jpg.' - https://phabricator.wikimedia.org/T130359#2133692 (10Steinsplitter) [11:44:26] niah you can't either I think.. [11:44:41] akosiaris: does not seem to work... [11:44:47] it's firewalled off to $internal [11:45:22] it's a new service, so in this case, if it is broken, no big deal as long as I don't break anything else... [11:45:40] I think it's safe to restart pybal on lvs1003 [11:45:50] got an error while doing sync-file to mw1167.eqiad.wmnet, never had this use case.. Can I force the sync on the host somehow? [11:45:58] ok, I'll do just that ... stand by please... 
[11:46:10] !log restarting pybal on lvs1003 [11:46:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:46:29] * gehel is crossing fingers [11:47:20] (03PS1) 10Jcrespo: Depool db1024 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278264 (https://phabricator.wikimedia.org/T130351) [11:50:23] gehel: wget https://search.svc.eqiad.wmnet:9243 works for a box with an internal IP [11:50:38] (03PS1) 10Mobrovac: Beta: RESTBase: Use mathoid from BetaCluster [puppet] - 10https://gerrit.wikimedia.org/r/278265 [11:50:43] akosiaris: no obvious error in the logs, but they seem much less verbose on lvs1003 than on other LVS nodes [11:51:07] so, I think we are OK [11:51:40] search definitely still works [11:51:48] akosiaris: I dont see any lines about "New enabled server" on lvs1003 (I saw a lot of them on others" [11:52:08] akosiaris: yes, I confirm, search works and new service seems OK [11:52:44] great.. you got your first LVS service working then [11:52:51] congrats [11:52:56] * gehel will not get his new T-shirt today... [11:52:59] (03PS1) 10Elukey: Revert "Remove rdb1005 from the Job Runners config for maintenance." [puppet] - 10https://gerrit.wikimedia.org/r/278266 [11:53:08] oh it will happen, don't worry [11:53:09] akosiaris: thanks for the help and all the time you spent on me! [11:53:23] it was a pleasure ;) [11:53:54] akosiaris: you are a good teacher! I enjoyed learning all that! [11:54:17] I am ? wow, first time I hear that. I usually hear the opposite [11:54:20] * gehel is closing all consoles pointing to dangerous production systems [11:54:36] akosiaris: you might be better on IRC than IRL .. [11:54:37] "you are a bad teacher, you don't explain this enough, you don't have any patience" [11:54:46] and so on [11:55:08] (03PS2) 10Elukey: Revert "Remove rdb1005 from the Job Runners config for maintenance." 
[puppet] - 10https://gerrit.wikimedia.org/r/278266 [11:55:10] well, you were asking all the good questions. No "how do I do this?" stuff and the like [11:55:17] akosiaris: you took 2.5 hours to walk me through this fairly simple change, that's at least some amount of patience... [11:55:32] it's probably IRC ... [11:55:35] :P [11:55:40] Thanks! Now time to go get some food... [11:55:45] er, same here [11:56:42] (03CR) 10Elukey: [C: 032] Revert "Remove rdb1005 from the Job Runners config for maintenance." [puppet] - 10https://gerrit.wikimedia.org/r/278266 (owner: 10Elukey) [11:57:21] !log restarting elasticsearch server elastic1024.eqiad.wmnet [11:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:58:18] !log Added rdb1005 back to the jobrunners puppet config after maintenance. [11:58:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:00:19] !log Forcing puppet agent run on all the Jobrunners and videoscalers since rdb1005 is now back in service. Will also restart jobchron as well. 
[12:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:09:56] 6Operations, 10ops-codfw, 6DC-Ops: ms-be2010.codfw.wmnet: slot=0 dev=sda failed - https://phabricator.wikimedia.org/T129117#2133708 (10fgiunchedi) a:3Papaul [12:10:38] (03PS1) 10Jcrespo: Configure labsdb1008 for the first time [puppet] - 10https://gerrit.wikimedia.org/r/278268 (https://phabricator.wikimedia.org/T126946) [12:11:05] 6Operations, 10Deployment-Systems, 6Release-Engineering-Team: setup automatic deletion of old l10nupdate - https://phabricator.wikimedia.org/T130317#2133712 (10hashar) [12:12:04] (03PS2) 10Jcrespo: Configure labsdb1008 for the first time [puppet] - 10https://gerrit.wikimedia.org/r/278268 (https://phabricator.wikimedia.org/T126946) [12:12:18] (03CR) 10jenkins-bot: [V: 04-1] Configure labsdb1008 for the first time [puppet] - 10https://gerrit.wikimedia.org/r/278268 (https://phabricator.wikimedia.org/T126946) (owner: 10Jcrespo) [12:12:45] hey, I beated you, jenkins, do not complain now [12:15:49] !log finished ms-be1* rolling reboot [12:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:15:59] (03PS2) 10Elukey: Beta: RESTBase: Use mathoid from BetaCluster [puppet] - 10https://gerrit.wikimedia.org/r/278265 (owner: 10Mobrovac) [12:17:28] (03CR) 10Elukey: [C: 032] Beta: RESTBase: Use mathoid from BetaCluster [puppet] - 10https://gerrit.wikimedia.org/r/278265 (owner: 10Mobrovac) [12:18:12] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: stat1001 access + sudo rights for nuria and mforns - https://phabricator.wikimedia.org/T130226#2133716 (10mforns) Thanks @Dzahn ! 
[12:18:14] (03CR) 10Hashar: [C: 031] "Puppet compile for ytterbium.wikimedia.org shows that it is a noop https://puppet-compiler.wmflabs.org/2095/ :)" [puppet] - 10https://gerrit.wikimedia.org/r/278225 (owner: 10Dzahn) [12:20:58] (03CR) 10Jcrespo: [C: 032] Configure labsdb1008 for the first time [puppet] - 10https://gerrit.wikimedia.org/r/278268 (https://phabricator.wikimedia.org/T126946) (owner: 10Jcrespo) [12:21:05] (03PS3) 10Jcrespo: Configure labsdb1008 for the first time [puppet] - 10https://gerrit.wikimedia.org/r/278268 (https://phabricator.wikimedia.org/T126946) [12:30:09] 6Operations, 10DBA, 6Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2133756 (10jcrespo) a:5Cmjohnson>3jcrespo [12:35:40] !log finished ms-fe1* rolling reboot [12:35:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:40:48] same goes for labsdb1008 - I made sure it will not page, but csnnot be 100% sure (new install) [12:43:38] !log restarting elasticsearch server elastic1025.eqiad.wmnet [12:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:55:01] (03PS1) 10Ladsgroup: Flake8 for labstore and wdqs [puppet] - 10https://gerrit.wikimedia.org/r/278270 [12:56:49] PROBLEM - MariaDB Slave IO: s2 on db1076 is CRITICAL: CRITICAL slave_io_state could not connect [12:57:29] PROBLEM - MariaDB Slave Lag: s2 on db1076 is CRITICAL: CRITICAL slave_sql_lag could not connect [12:57:50] PROBLEM - MariaDB Slave SQL: s2 on db1076 is CRITICAL: CRITICAL slave_sql_state could not connect [12:58:39] PROBLEM - MariaDB Slave IO: s3 on db1075 is CRITICAL: CRITICAL slave_io_state could not connect [12:58:51] hmm? [12:59:09] PROBLEM - MariaDB Slave Lag: s3 on db1075 is CRITICAL: CRITICAL slave_sql_lag could not connect [12:59:13] jynus? 
[12:59:30] PROBLEM - MariaDB Slave SQL: s3 on db1075 is CRITICAL: CRITICAL slave_sql_state could not connect [12:59:36] PROBLEM - MariaDB Slave IO: s3 on db1078 is CRITICAL: CRITICAL slave_io_state could not connect [12:59:56] PROBLEM - MariaDB Slave Lag: s3 on db1078 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:00:02] PROBLEM - MariaDB Slave IO: s2 on db1074 is CRITICAL: CRITICAL slave_io_state could not connect [13:00:29] PROBLEM - MariaDB Slave SQL: s3 on db1078 is CRITICAL: CRITICAL slave_sql_state could not connect [13:00:36] PROBLEM - MariaDB Slave Lag: s2 on db1074 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:00:44] paravoid: taking a look [13:01:03] PROBLEM - MariaDB Slave SQL: s2 on db1074 is CRITICAL: CRITICAL slave_sql_state could not connect [13:01:34] PROBLEM - MariaDB Slave IO: s3 on db1077 is CRITICAL: CRITICAL slave_io_state could not connect [13:03:06] PROBLEM - MariaDB Slave Lag: s3 on db1077 is CRITICAL: CRITICAL slave_sql_lag could not connect [13:03:10] paravoid: those are the new ones [13:03:28] PROBLEM - MariaDB Slave SQL: s3 on db1077 is CRITICAL: CRITICAL slave_sql_state could not connect [13:03:31] just racked yesterday, no user impact [13:03:37] I'll put them in downtime on icinga [13:03:54] https://phabricator.wikimedia.org/T130351 [13:04:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] Introducing changeprop role and puppet module (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/275772 (https://phabricator.wikimedia.org/T128463) (owner: 10Mobrovac) [13:06:00] (03PS1) 10Ladsgroup: First flake8 pass on LDAP [puppet] - 10https://gerrit.wikimedia.org/r/278271 [13:06:11] !log restarting elasticsearch server elastic1026.eqiad.wmnet [13:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:06:54] 6Operations, 7Performance: nf_conntrack: table full errors on Eqiad Job Runners - https://phabricator.wikimedia.org/T130364#2133821 (10elukey) [13:07:20] PROBLEM - Check size of 
conntrack table on mw1169 is CRITICAL: CRITICAL: nf_conntrack is 99 % full [13:07:29] yeah just in time [13:07:38] I opened https://phabricator.wikimedia.org/T130364 for the same issue :D [13:07:51] PROBLEM - Check size of conntrack table on mw1164 is CRITICAL: CRITICAL: nf_conntrack is 99 % full [13:07:51] PROBLEM - Check size of conntrack table on mw1168 is CRITICAL: CRITICAL: nf_conntrack is 99 % full [13:08:20] PROBLEM - Check size of conntrack table on mw1162 is CRITICAL: CRITICAL: nf_conntrack is 99 % full [13:08:30] PROBLEM - Check size of conntrack table on mw1161 is CRITICAL: CRITICAL: nf_conntrack is 99 % full [13:08:56] akosiaris: I almost forgot, I now need to also enable LVS on codfw. I'll try to do that on my own this time... [13:09:00] PROBLEM - Check size of conntrack table on mw1163 is CRITICAL: CRITICAL: nf_conntrack is 99 % full [13:09:40] tons of connections in TIME_WAIT [13:10:02] 6Operations, 7Performance: nf_conntrack: table full errors on Eqiad Job Runners - https://phabricator.wikimedia.org/T130364#2133837 (10elukey) ``` elukey@mw1163:~$ sudo netstat -tunap | awk '{print $6}' | sort | uniq -c 14 - 1 1073/python 1 1152/hhvm 5 1205/rsyslogd 8 1964/ntpd... [13:10:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] Assign changeprop service to scb cluster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/275891 (https://phabricator.wikimedia.org/T128463) (owner: 10Mobrovac) [13:10:51] PROBLEM - Check size of conntrack table on mw1165 is CRITICAL: CRITICAL: nf_conntrack is 99 % full [13:10:55] moritzm: ---^ shall we disable temporarly ferm? 
[13:11:01] RECOVERY - Check size of conntrack table on mw1169 is OK: OK: nf_conntrack is 0 % full [13:11:40] RECOVERY - Check size of conntrack table on mw1168 is OK: OK: nf_conntrack is 0 % full [13:11:53] doing that [13:12:05] if iptables is loaded the conntrack is used IIRC unless the notrack is set in the RAW table [13:12:10] RECOVERY - Check size of conntrack table on mw1161 is OK: OK: nf_conntrack is 68 % full [13:12:10] PROBLEM - Check size of conntrack table on mw1167 is CRITICAL: CRITICAL: nf_conntrack is 98 % full [13:13:31] RECOVERY - Check size of conntrack table on mw1164 is OK: OK: nf_conntrack is 0 % full [13:13:51] RECOVERY - Check size of conntrack table on mw1162 is OK: OK: nf_conntrack is 0 % full [13:14:01] RECOVERY - Check size of conntrack table on mw1167 is OK: OK: nf_conntrack is 0 % full [13:14:31] RECOVERY - Check size of conntrack table on mw1163 is OK: OK: nf_conntrack is 0 % full [13:14:40] RECOVERY - Check size of conntrack table on mw1165 is OK: OK: nf_conntrack is 0 % full [13:15:00] volans: yep we are disabling ferm on those hosts [13:15:43] an now no errors in mediawiki-errors [13:16:25] what is the size of the conntrack on those hosts? [13:16:33] 256k connections [13:16:45] (sorry at a tech conference... :-P ) [13:19:34] 6Operations, 7Performance: nf_conntrack: table full errors on Eqiad Job Runners - https://phabricator.wikimedia.org/T130364#2133841 (10elukey) ``` elukey@mw1161:~$ sudo netstat -tuap | grep TIME_WAIT | awk '{print $5}' | sort | uniq -c 8 2620:0:861:101:10::6379 8 2620:0:861:103:10::6379 688 db1... [13:21:25] 6Operations, 7Performance: nf_conntrack: table full errors on Eqiad Job Runners - https://phabricator.wikimedia.org/T130364#2133842 (10elukey) Could it be related to https://gerrit.wikimedia.org/r/#/c/276904/1/wmf-config/jobqueue-eqiad.php ? 
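The "tons of connections in TIME_WAIT" observation above comes from counting connection states in netstat output. A minimal sketch of that counting step, assuming the usual `netstat -tan` column layout with the state in the sixth field, might look like this; `count_tcp_states` is a hypothetical helper and the addresses in the usage example are made up.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connections per state from `netstat -tan` style text.

    Lines that do not start with 'tcp' (headers, UDP sockets) or that
    lack a state column are ignored.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            counts[fields[5]] += 1
    return counts


if __name__ == "__main__":
    sample = """\
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 10.0.0.1:45000 10.0.0.2:6379 TIME_WAIT
tcp 0 0 10.0.0.1:45001 10.0.0.2:6379 TIME_WAIT
tcp 0 0 10.0.0.1:22 10.0.0.3:51000 ESTABLISHED
"""
    for state, n in count_tcp_states(sample).most_common():
        print(state, n)
```

A large TIME_WAIT count relative to ESTABLISHED would be consistent with the conntrack table filling up as described in T130364.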
[13:22:29] !log restarting pybal on lvs2006.codfw.wmnet [13:22:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:22:41] !log enabling all nodes for service search.svc.codfw.wmnet:9243 (elastic-https) on codfw [13:22:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:58] opened a phab task for the issue pointed out by moritzm https://phabricator.wikimedia.org/T130364 (cc ori, _joe_) [13:24:10] !log restarting pybal on lvs2003.codfw.wmnet [13:24:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:24:15] * gehel crossing fingers again... [13:24:33] ah snap not here, buuuu I am confusing channels [13:24:36] sorry [13:26:41] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2133854 (10Gehel) LVS has been configured and activated for eqiad and codfw. Elsaticsearch is available through HTTPS via the usual service... [13:39:29] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 3 others: Enable metric collection on nginx for elasticsearch - https://phabricator.wikimedia.org/T130365#2133864 (10Gehel) [13:41:34] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 3 others: Should we have a specific check for SSL certificate expiration on elasticsearch - https://phabricator.wikimedia.org/T130366#2133879 (10Gehel) [13:43:41] !log restarting elasticsearch server elastic1027.eqiad.wmnet [13:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:53:28] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: stat1001 access + sudo rights for nuria and mforns - https://phabricator.wikimedia.org/T130226#2130333 (10Ottomata) Hm, this is a different group that `statistics-web-users`? Hm. I'm fine with this, although I think we could accomplish this by includi... 
[13:55:12] (03PS1) 10Gehel: Adding a `$ensure` parameter to nginx::status_site [puppet/nginx] - 10https://gerrit.wikimedia.org/r/278276 [13:56:11] PROBLEM - puppet last run on mw2167 is CRITICAL: CRITICAL: puppet fail [13:56:18] (03PS2) 10Gehel: Adding a `$ensure` parameter to nginx::status_site [puppet/nginx] - 10https://gerrit.wikimedia.org/r/278276 (https://phabricator.wikimedia.org/T130365) [13:59:27] (03PS1) 10Gehel: Adding metric collection to nginx in the context of elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/278278 (https://phabricator.wikimedia.org/T130365) [14:02:38] !log restarted eventlog1001.eqiad.wmnet and eventlog2001.codfw.wmnet for kernel upgrade [14:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:36] paravoid- I mentioned it on the backlog (as you used to do with network), and I cannot ack or disable checks that are asynchronously added to our monitoring [14:05:05] I cannot conditionally produce checks, because the pooling state is not on the infrastructure [14:06:36] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 4 others: Look into encrypting Elasticsearch traffic - https://phabricator.wikimedia.org/T124444#2133957 (10demon) >>! In T124444#2133854, @Gehel wrote: > LVS has been configured and activated for eqiad and codfw. Elsaticsearch is avail... [14:08:14] 6Operations, 10CirrusSearch, 6Discovery, 3Discovery-Search-Sprint, and 3 others: Should we have a specific check for SSL certificate expiration on elasticsearch - https://phabricator.wikimedia.org/T130366#2133879 (10Gehel) Icinga check command `ssl-cert-check` is defined in `modules/nagios_common/files/che... [14:08:26] As I have now added SSL to elasticsearch, does it make sense to add an icinga check for the validity of the SSL certs? [14:09:13] Those certs are Puppet certs, so we might already have a check on them (I did not find it, but that does not mean much). 
 [14:24:01] RECOVERY - puppet last run on mw2167 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [14:26:38] jynus: as a workaround, the only way I found is just to run puppet manually on the host and then go "quickly" to icinga to downtime it, not optimal of course... [14:26:51] nor practical :) [14:28:15] it doesn't work [14:28:32] puppet on the host doesn't create the check [14:28:46] puppet on neon does [14:29:14] which takes a long time to execute, and adds double indetermination to the equation [14:29:20] yes, I mean icinga host [14:29:38] or you need to "monitor" icinga until they appear, very impractical [14:29:40] the real fix is to not alert if the server is depooled [14:30:01] but for that we need orchestration on the infrastructure side [14:30:18] which we will, eventually :-/ [14:30:48] we could add an automatic downtime of ~1h for new checks maybe [14:31:10] I'm thinking also for other services, not mysql only [14:32:38] not sure how easy that would be [14:35:58] 6Operations, 13Patch-For-Review: No postinst, preinst, etc for linux-image-3.19.0-2-amd64 - https://phabricator.wikimedia.org/T122284#2134011 (10MoritzMuehlenhoff) 5Open>3Resolved This is fixed in the 4.4 package, I won't fix that in the 3.19 kernel any more (the kernel meta package solves that anyway) [14:36:50] !log restarting elasticsearch server elastic1028.eqiad.wmnet [14:36:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:11] (03PS1) 10Filippo Giunchedi: cassandra: bootstrap restbase1013-a [puppet] - 10https://gerrit.wikimedia.org/r/278285 (https://phabricator.wikimedia.org/T125842) [14:45:51] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: bootstrap restbase1013-a [puppet] - 10https://gerrit.wikimedia.org/r/278285 (https://phabricator.wikimedia.org/T125842) (owner: 10Filippo Giunchedi) [14:47:42] 6Operations, 10RESTBase-Cassandra: restbase1007.eqiad.wmnet CPU temperature? 
- https://phabricator.wikimedia.org/T130370#2134035 (10Eevans) [14:48:57] 6Operations, 10ops-eqiad, 10RESTBase-Cassandra: restbase1007.eqiad.wmnet CPU temperature? - https://phabricator.wikimedia.org/T130370#2134050 (10fgiunchedi) [14:50:56] (03PS1) 10Elukey: Enable persistent connections between Job Queues and Job Runners. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278286 (https://phabricator.wikimedia.org/T130364) [14:52:37] 6Operations, 10RESTBase, 10hardware-requests, 13Patch-For-Review: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#2134061 (10Cmjohnson) @fgiunchedi Regarding the remaining 2 restbases...I will not have enough ssds to add the last 2 (can only do 1). A... [14:56:18] 6Operations, 10RESTBase, 10hardware-requests, 13Patch-For-Review: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842#2134067 (10fgiunchedi) @Cmjohnson yup, we'll be decomissioning restbase1003 and restbase1004 early next week once restbase1013 is fully... [14:57:49] (03CR) 10Alexandros Kosiaris: "minor inline comment. Otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277423 (owner: 10Thcipriani) [14:57:55] (03CR) 10Alexandros Kosiaris: [C: 04-1] Pass deploy user from service::node [puppet] - 10https://gerrit.wikimedia.org/r/277423 (owner: 10Thcipriani) [14:59:25] (03PS3) 10Ema: Skip backends, not directors in test VCL code [puppet] - 10https://gerrit.wikimedia.org/r/278256 (https://phabricator.wikimedia.org/T128188) [14:59:34] (03CR) 10Ema: [C: 032 V: 032] Skip backends, not directors in test VCL code [puppet] - 10https://gerrit.wikimedia.org/r/278256 (https://phabricator.wikimedia.org/T128188) (owner: 10Ema) [15:00:17] (03CR) 10Giuseppe Lavagetto: [C: 032] "This is needed because the jobrunners are filling up their conntrack tables. 
Also, it should help performance, although it seemed problema" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278286 (https://phabricator.wikimedia.org/T130364) (owner: 10Elukey) [15:00:36] <_joe_> elukey: let's deploy this, we're kind of in an emergency [15:00:41] (03Merged) 10jenkins-bot: Enable persistent connections between Job Queues and Job Runners. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278286 (https://phabricator.wikimedia.org/T130364) (owner: 10Elukey) [15:00:54] <_joe_> and I am unsure this was really a problem when aaron disabled this [15:01:08] _joe_ ack, deploying [15:01:13] <_joe_> but let's keep an eye on it [15:01:18] <_joe_> both of us :) [15:02:13] !log bootstrap restbase1013-a [15:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:32] !log elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: Re-enabled persistence between Job Queues and Job Runners. (duration: 00m 30s) [15:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:44] PROBLEM - Restbase root url on restbase1013 is CRITICAL: Connection refused [15:05:45] PROBLEM - HHVM rendering on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:05:55] (03PS5) 10Thcipriani: Pass deploy user from service::node [puppet] - 10https://gerrit.wikimedia.org/r/277423 [15:06:14] PROBLEM - cassandra-a CQL 10.64.32.205:9042 on restbase1013 is CRITICAL: Connection refused [15:06:17] <_joe_> elukey: results on a random jobrunner don't seem that good tbh [15:06:45] PROBLEM - Apache HTTP on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:06:51] _joe_ do we need to restart the job runners? 
 [15:07:07] <_joe_> elukey: nope, this is the mediawiki config [15:07:11] (03PS2) 10Milimetric: [WIP] Re-organize analytics dumps to their own page [puppet] - 10https://gerrit.wikimedia.org/r/269696 [15:07:12] <_joe_> that's handled by hhvm [15:07:45] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.80, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [15:08:06] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.32.205:9042 on restbase1013 is CRITICAL: Connection refused eevans Node is bootstrapping. - The acknowledgement expires at: 2016-03-19 15:07:44. [15:09:11] _joe_ mmmm ok so theoretically when the last round of TIME_WAIT sockets finishes we should see fewer sockets used, no? [15:09:19] ACKNOWLEDGEMENT - restbase endpoints health on restbase1013 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.32.80, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) eevans Node is bootstrapping. [15:09:35] <_joe_> elukey: yes [15:10:17] ACKNOWLEDGEMENT - Restbase root url on restbase1013 is CRITICAL: Connection refused eevans Node is bootstrapping. [15:13:25] the other fix would be to page on service loss, not on host issues [15:15:53] 6Operations: Replace role::kafka::*::config classes with puppet functions. 
- https://phabricator.wikimedia.org/T130371#2134079 (10Ottomata) [15:17:30] 6Operations, 13Patch-For-Review, 7Performance: nf_conntrack: table full errors on Eqiad Job Runners - https://phabricator.wikimedia.org/T130364#2134092 (10elukey) TIME_WAITs dropped but a lot of errors in https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors [15:19:35] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [15:19:43] 6Operations, 13Patch-For-Review, 7Performance: nf_conntrack: table full errors on Eqiad Job Runners - https://phabricator.wikimedia.org/T130364#2134094 (10Joe) Yes, I'd say it's pretty clear we're seeing an issue in how redis persistent connections are handled by HHVM 3.12 [15:19:53] <_joe_> elukey: let's rollback [15:20:14] _joe_ sure [15:20:44] (03PS1) 10Elukey: Revert "Enable persistent connections between Job Queues and Job Runners." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278287 [15:21:57] (03CR) 10Elukey: [C: 032] Revert "Enable persistent connections between Job Queues and Job Runners." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278287 (owner: 10Elukey) [15:22:22] <_joe_> so, next course of action would be checking if an older hhvm works better [15:22:29] <_joe_> but not today [15:23:03] (03PS1) 10Ottomata: Increase number of map tasks for camus webrequest to 72 [puppet] - 10https://gerrit.wikimedia.org/r/278288 (https://phabricator.wikimedia.org/T127351) [15:23:21] !log elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: REVERT - Re-enabled persistence between Job Queues and Job Runners. 
(duration: 00m 19s) [15:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:41] 6Operations, 13Patch-For-Review, 7Performance: nf_conntrack: table full errors on Eqiad Job Runners - https://phabricator.wikimedia.org/T130364#2134112 (10elukey) Reverted with https://gerrit.wikimedia.org/r/#/c/278287/ [15:31:28] (03CR) 10Ottomata: [C: 031] Port varnishreqstats and varnishstatsd to new VSL API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277790 (https://phabricator.wikimedia.org/T128788) (owner: 10Ema) [15:36:02] (03PS1) 10Muehlenhoff: Bump connection tracking table size on job runners [puppet] - 10https://gerrit.wikimedia.org/r/278290 [15:37:06] (03CR) 10Alexandros Kosiaris: [C: 032] Pass deploy user from service::node [puppet] - 10https://gerrit.wikimedia.org/r/277423 (owner: 10Thcipriani) [15:38:16] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:44:21] (03CR) 10Joal: [C: 031] "LGTM !" [puppet] - 10https://gerrit.wikimedia.org/r/278288 (https://phabricator.wikimedia.org/T127351) (owner: 10Ottomata) [15:46:08] (03CR) 10Ottomata: "Will merge this on Monday after we increase partitions for webrequest_text and webrequest_upload Kafka topics" [puppet] - 10https://gerrit.wikimedia.org/r/278288 (https://phabricator.wikimedia.org/T127351) (owner: 10Ottomata) [15:47:04] 6Operations, 10DBA: Investigate/decom db2001-db2007 - https://phabricator.wikimedia.org/T125827#2134154 (10jcrespo) a:5jcrespo>3None [15:48:31] (03PS2) 10Muehlenhoff: Bump connection tracking table size on job runners [puppet] - 10https://gerrit.wikimedia.org/r/278290 [15:49:02] 6Operations, 10DBA, 6Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2134158 (10jcrespo) List of tables to reimport: {P2792} [15:50:16] (03CR) 10Muehlenhoff: [C: 032 V: 032] Bump connection tracking table size on job runners [puppet] - 10https://gerrit.wikimedia.org/r/278290 (owner: 10Muehlenhoff) 
[15:52:05] 6Operations, 15User-mobrovac: Replace role::kafka::*::config classes with puppet functions. - https://phabricator.wikimedia.org/T130371#2134162 (10mobrovac) [15:54:08] (03PS3) 10Filippo Giunchedi: diamond: send labs instance metrics via graphite/carbon [puppet] - 10https://gerrit.wikimedia.org/r/268360 (https://phabricator.wikimedia.org/T121861) [15:55:19] (03CR) 10Filippo Giunchedi: [C: 04-1] "ready to be merged on monday" [puppet] - 10https://gerrit.wikimedia.org/r/268360 (https://phabricator.wikimedia.org/T121861) (owner: 10Filippo Giunchedi) [15:58:02] 6Operations, 13Patch-For-Review, 7Performance: nf_conntrack: table full errors on Eqiad Job Runners - https://phabricator.wikimedia.org/T130364#2134183 (10elukey) Increased the nf_conntrack with https://gerrit.wikimedia.org/r/#/c/278290/1 [16:03:07] (03CR) 10Gilles: [C: 031] Bump jobqueue "connectTimeout" to 300ms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278090 (owner: 10Aaron Schulz) [16:09:01] 6Operations, 13Patch-For-Review, 7Performance: nf_conntrack: table full errors on Eqiad Job Runners - https://phabricator.wikimedia.org/T130364#2134202 (10MoritzMuehlenhoff) I've have merged a puppet change to bump the connection table on the job runners to 512k (it's only effective with the next reboot, but... 
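Note on the conntrack numbers in T130364: a rough Python sketch of the arithmetic behind the "nf_conntrack is N % full" alerts and the bump from 256k to 512k entries. It assumes "256k" means 262144 entries, reads the standard Linux `/proc/sys/net/netfilter/nf_conntrack_count` and `nf_conntrack_max` files when they exist, and uses hypothetical warning and critical thresholds (80 and 90), since the actual Icinga thresholds are not stated in the log.

```python
from pathlib import Path

# Standard Linux sysctl files; present only when the nf_conntrack module is loaded.
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_usage(count: int, maximum: int) -> float:
    """Return how full the connection tracking table is, as a percentage."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return 100.0 * count / maximum


def classify(percent: float, warn: float = 80.0, crit: float = 90.0) -> str:
    """Map a fill percentage to a status; the thresholds are assumptions."""
    if percent >= crit:
        return "CRITICAL"
    if percent >= warn:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    try:
        count = int(COUNT_PATH.read_text())
        maximum = int(MAX_PATH.read_text())
    except (FileNotFoundError, PermissionError):
        # Fall back to illustrative numbers: a nearly full 256k table.
        count, maximum = 259523, 262144
    percent = conntrack_usage(count, maximum)
    print(f"{classify(percent)}: nf_conntrack is {percent:.0f} % full")
```

With the same number of tracked connections, doubling the table to 524288 entries halves the fill percentage, which buys headroom but does not reduce the number of short-lived connections.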
[16:12:25] (03PS4) 10Ottomata: eventlogging: Remove server-side udp to kafka forwarder [puppet] - 10https://gerrit.wikimedia.org/r/276615 (https://phabricator.wikimedia.org/T129402) (owner: 10Madhuvishy) [16:12:54] (03PS5) 10Filippo Giunchedi: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) [16:14:12] (03CR) 10Ottomata: [C: 032 V: 032] "No server side events since March 11" [puppet] - 10https://gerrit.wikimedia.org/r/276615 (https://phabricator.wikimedia.org/T129402) (owner: 10Madhuvishy) [16:14:14] (03PS1) 10Tim Landscheidt: Fix misleading/not supported variable references [puppet] - 10https://gerrit.wikimedia.org/r/278295 [16:16:27] 6Operations, 10ops-codfw, 6DC-Ops: ms-be2010.codfw.wmnet: slot=0 dev=sda failed - https://phabricator.wikimedia.org/T129117#2095571 (10Papaul) Drive replacement complete. [16:16:44] 6Operations, 10ops-codfw, 6DC-Ops: ms-be2010.codfw.wmnet: slot=0 dev=sda failed - https://phabricator.wikimedia.org/T129117#2134260 (10Papaul) a:5Papaul>3fgiunchedi [16:18:34] 6Operations, 10Analytics, 10Analytics-Cluster, 10hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#2134264 (10Ottomata) 5Open>3declined The other analytics hardware requests are currently in pending approval will take up the most of the analytics remaind... [16:19:32] !log reboot ms-be2010 to pick up new disk ordering [16:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:25:36] 6Operations, 10ops-codfw, 6DC-Ops: ms-be2010.codfw.wmnet: slot=0 dev=sda failed - https://phabricator.wikimedia.org/T129117#2134281 (10fgiunchedi) 5Open>3Resolved disk rebuilding [16:26:09] (03CR) 10Ottomata: [C: 031] "+1 for the 3 files I am familiar with. Someone else should +1 for the others." 
 [puppet] - 10https://gerrit.wikimedia.org/r/278295 (owner: 10Tim Landscheidt) [16:26:45] (03CR) 10Andrew Bogott: "this looks great to me, I'm running the puppet compiler on the openstack bits." [puppet] - 10https://gerrit.wikimedia.org/r/278295 (owner: 10Tim Landscheidt) [16:27:28] (03CR) 10Andrew Bogott: [C: 031] "Puppet compiler approves!" [puppet] - 10https://gerrit.wikimedia.org/r/278295 (owner: 10Tim Landscheidt) [16:27:40] (03PS2) 10Andrew Bogott: Fix misleading/not supported variable references [puppet] - 10https://gerrit.wikimedia.org/r/278295 (owner: 10Tim Landscheidt) [16:28:36] (03CR) 10Ottomata: [C: 031] "One more thing!" (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/277984 (https://phabricator.wikimedia.org/T129838) (owner: 10Elukey) [16:29:35] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 859025.00 seconds [16:29:44] PROBLEM - MariaDB Slave Lag: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 859331.00 seconds [16:29:55] PROBLEM - MariaDB Slave Lag: s4 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 859414.00 seconds [16:30:05] PROBLEM - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 859321.00 seconds [16:30:25] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 859225.00 seconds [16:30:35] that is me, fixing it (it was broken, but the alert didn't show it) [16:30:46] !log bumped connection tracking table size on mw1161-mw1169 to 524288 to cope with currently elevated connections on those (T130364) [16:30:47] T130364: nf_conntrack: table full errors on Eqiad Job Runners - https://phabricator.wikimedia.org/T130364 [16:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:12] (03CR) 10Andrew Bogott: [C: 032] "compiler-approved" [puppet] - 10https://gerrit.wikimedia.org/r/278295 (owner: 10Tim Landscheidt) 
[16:32:56] (03CR) 10Andrew Bogott: "Thanks, Tim!" [puppet] - 10https://gerrit.wikimedia.org/r/278295 (owner: 10Tim Landscheidt) [16:33:07] 6Operations: reinstall bast4001 with jessie - https://phabricator.wikimedia.org/T123674#2134330 (10Dzahn) [16:37:38] !log restarted hhvm on mw1205 [16:37:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:38:15] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 70128 bytes in 0.181 second response time [16:39:05] RECOVERY - Apache HTTP on mw1205 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 615 bytes in 0.253 second response time [16:41:20] (03PS4) 10Tim Landscheidt: Tools: Unpuppetize host_aliases [puppet] - 10https://gerrit.wikimedia.org/r/241582 (https://phabricator.wikimedia.org/T109485) [16:42:46] ACKNOWLEDGEMENT - MariaDB Slave Lag: s3 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858997.00 seconds Jcrespo recovering after replication lag [16:42:47] ACKNOWLEDGEMENT - MariaDB Slave Lag: s4 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858896.00 seconds Jcrespo recovering after replication lag [16:42:47] ACKNOWLEDGEMENT - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 857745.00 seconds Jcrespo recovering after replication lag [16:42:47] ACKNOWLEDGEMENT - MariaDB Slave Lag: s6 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858640.00 seconds Jcrespo recovering after replication lag [16:42:47] ACKNOWLEDGEMENT - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 858731.00 seconds Jcrespo recovering after replication lag [16:44:15] (03PS5) 10Elukey: Add automatic failover to Hadoop Namenodes. 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/277984 (https://phabricator.wikimedia.org/T129838) [16:45:05] (03PS1) 10Ottomata: python-etcd no longer needed for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/278304 [16:45:47] (03PS2) 10Ottomata: python-etcd no longer needed for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/278304 [16:47:14] (03CR) 10Ottomata: [C: 032] python-etcd no longer needed for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/278304 (owner: 10Ottomata) [16:53:50] !log starting enwiki import to labs from dbstore1002 (expect lag and consistency problems during the hot import) [16:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:57:26] PROBLEM - High lag on wdqs1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1800.0] [16:59:52] 6Operations, 10ops-codfw, 6DC-Ops: mw2066 to mw2074 don't reboot cleanly - https://phabricator.wikimedia.org/T130008#2134465 (10Papaul) 5Open>3Resolved IDRAC update complete on all those systems. [17:05:09] !log restarting elasticsearch server elastic1029.eqiad.wmnet [17:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:05:38] (03PS1) 10Ori.livneh: Allow finer-grained control over debug logging via XWD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278310 [17:05:51] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: stat1001 access + sudo rights for nuria and mforns - https://phabricator.wikimedia.org/T130226#2134546 (10Dzahn) Having separate groups for -users ,-admins and -roots is pretty standard across our admin module.. But feel free to upload a patch and chang... 
[17:06:27] (03PS2) 10Ori.livneh: Allow finer-grained control over debug logging via XWD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278310 [17:12:07] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: stat1001 access + sudo rights for nuria and mforns - https://phabricator.wikimedia.org/T130226#2134563 (10Dzahn) The existing group was called statistics-web-users and didn't need root. The new group in the same role is for the same thing but with addit... [17:13:33] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: stat1001 access + sudo rights for nuria and mforns - https://phabricator.wikimedia.org/T130226#2134566 (10Ottomata) I think using `statistics-admins` here could be enough. We’d just have to include this group on stat1001, which I think is totally fine. [17:16:23] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: stat1001 access + sudo rights for nuria and mforns - https://phabricator.wikimedia.org/T130226#2134572 (10Dzahn) Ok, could you please rename the ticket and upload a patch? 
[17:19:03] (03PS1) 10Dzahn: admin: remove unused group statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278311 (https://phabricator.wikimedia.org/T130226) [17:19:43] (03PS2) 10Dzahn: admin: remove unused group statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278311 (https://phabricator.wikimedia.org/T130226) [17:20:12] (03CR) 10Dzahn: [C: 032] "https://phabricator.wikimedia.org/T130226#2133929" [puppet] - 10https://gerrit.wikimedia.org/r/278311 (https://phabricator.wikimedia.org/T130226) (owner: 10Dzahn) [17:26:12] (03PS4) 10Dzahn: gerrit: avoid defining class inside class [puppet] - 10https://gerrit.wikimedia.org/r/278225 [17:27:14] (03CR) 10Dzahn: [C: 032] "thanks for compiling, hashar :)" [puppet] - 10https://gerrit.wikimedia.org/r/278225 (owner: 10Dzahn) [17:28:48] (03PS3) 10Ottomata: admin: remove unused group statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278311 (https://phabricator.wikimedia.org/T130226) (owner: 10Dzahn) [17:28:56] (03CR) 10Ottomata: [C: 032 V: 032] admin: remove unused group statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278311 (https://phabricator.wikimedia.org/T130226) (owner: 10Dzahn) [17:29:40] (03CR) 10Dzahn: "confirmed noop on ytterbium" [puppet] - 10https://gerrit.wikimedia.org/r/278225 (owner: 10Dzahn) [17:30:21] (03PS1) 10BryanDavis: logstash: Remove obsolete role::logstash::stashbot class [puppet] - 10https://gerrit.wikimedia.org/r/278314 [17:30:23] (03PS1) 10BryanDavis: logstash: Make truncated MediaWiki json easier to find [puppet] - 10https://gerrit.wikimedia.org/r/278315 [17:30:25] (03PS1) 10Ottomata: Include statistics-admins on stat1001 (role statistics::web), include nuria in that group [puppet] - 10https://gerrit.wikimedia.org/r/278316 (https://phabricator.wikimedia.org/T130226) [17:30:58] !log restarting elasticsearch server elastic1030.eqiad.wmnet [17:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:34:49] 
(03PS3) 10Dzahn: gerrit: move role classes to module, split in 2 files [puppet] - 10https://gerrit.wikimedia.org/r/278228 [17:35:06] (03CR) 10Dzahn: [C: 032] "noop on ytterbium and antimony http://puppet-compiler.wmflabs.org/2099/" [puppet] - 10https://gerrit.wikimedia.org/r/278228 (owner: 10Dzahn) [17:35:10] oh, sorry, you already did with debug [17:35:17] wrong channel :p [17:41:04] RECOVERY - High lag on wdqs1001 is OK: OK: Less than 30.00% above the threshold [600.0] [17:41:56] (03PS1) 10Ori.livneh: Add apache::mod::security [puppet] - 10https://gerrit.wikimedia.org/r/278318 [17:41:58] (03PS1) 10Ori.livneh: Reduce the number of jobrunner procs on mw11* hosts [puppet] - 10https://gerrit.wikimedia.org/r/278319 (https://phabricator.wikimedia.org/T130364) [17:42:01] (03CR) 10Dzahn: [C: 031] Include statistics-admins on stat1001 (role statistics::web), include nuria in that group [puppet] - 10https://gerrit.wikimedia.org/r/278316 (https://phabricator.wikimedia.org/T130226) (owner: 10Ottomata) [17:42:18] (03PS12) 10Elukey: Varnish 4 API porting. [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [17:42:24] (03PS9) 10EBernhardson: Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 (https://phabricator.wikimedia.org/T129427) [17:42:36] (03PS13) 10Elukey: Varnish 4 API porting. 
[software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/276439 (https://phabricator.wikimedia.org/T124278) [17:42:55] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:43:48] (03CR) 10Dzahn: [C: 031] Adding metric collection to nginx in the context of elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/278278 (https://phabricator.wikimedia.org/T130365) (owner: 10Gehel) [17:43:50] (03PS2) 10Ori.livneh: Reduce the number of jobrunner procs on mw11* hosts [puppet] - 10https://gerrit.wikimedia.org/r/278319 (https://phabricator.wikimedia.org/T130364) [17:45:12] (03Abandoned) 10Dzahn: admin: add mforns, nuria to statistics-web-roots [puppet] - 10https://gerrit.wikimedia.org/r/278044 (https://phabricator.wikimedia.org/T130226) (owner: 10Dzahn) [17:45:27] (03PS2) 10Ori.livneh: Add apache::mod::security (not used anywhere) [puppet] - 10https://gerrit.wikimedia.org/r/278318 [17:45:46] (03CR) 10Ori.livneh: [C: 032 V: 032] Reduce the number of jobrunner procs on mw11* hosts [puppet] - 10https://gerrit.wikimedia.org/r/278319 (https://phabricator.wikimedia.org/T130364) (owner: 10Ori.livneh) [17:45:57] (03PS1) 10Odder: Update favicon for Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278320 (https://phabricator.wikimedia.org/T70728) [17:46:21] (03CR) 10jenkins-bot: [V: 04-1] Update favicon for Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278320 (https://phabricator.wikimedia.org/T70728) (owner: 10Odder) [17:50:26] Hi folks, sending out next week's issue of Tech News to the translators soon, and I'd like confirmation regarding the new dates of the Dallas centre test, if possible. April 19 and 21? April 18 and 20? I've seen both in different places. 
[17:51:07] JohanJ: the week of April 18th [17:51:53] (03PS2) 10Dzahn: base/syslogs: fix "defined typed defined inside a class" [puppet] - 10https://gerrit.wikimedia.org/r/278232 [17:52:02] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/2100/" [puppet] - 10https://gerrit.wikimedia.org/r/278232 (owner: 10Dzahn) [17:52:18] (03PS1) 10Odder: Update favicon for Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278321 (https://phabricator.wikimedia.org/T70728) [17:52:23] greg-g: no specific dates decided? I'm mainly concerned about the dates when the wikis will be read-only for a while. [17:52:40] But if week of April 18 is what's been decided and nothing more specific, that's good to know. (: [17:53:33] (03Abandoned) 10Odder: Update favicon for Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278320 (https://phabricator.wikimedia.org/T70728) (owner: 10Odder) [17:53:45] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: puppet fail [17:54:05] PROBLEM - torrus.wikimedia.org UI on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Torrus Top: Wikimedia not found on https://torrus.wikimedia.org:443/torrus - 614 bytes in 0.060 second response time [17:54:07] JohanJ: that's all I know from the email mark sent [17:54:31] i'll look at torrus [17:55:53] !log on bohrium: /etc/apache2/sites-enabled/.links2 ; was causing puppet to refresh apache2 on each run [17:55:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:56:14] (03PS1) 10Dzahn: statistics: remove deleted group from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/278322 [17:56:56] (03PS2) 10Dzahn: statistics: remove deleted group from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/278322 [17:57:05] (03PS3) 10Dzahn: statistics: remove deleted group from hieradata [puppet] - 10https://gerrit.wikimedia.org/r/278322 [17:57:17] (03CR) 10Dzahn: [C: 032 V: 032] statistics: remove deleted group from 
hieradata [puppet] - 10https://gerrit.wikimedia.org/r/278322 (owner: 10Dzahn) [17:57:46] greg-g: https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter&diff=370439&oldid=369290 <-- paravoid updated the schedule on Wikitech with the dates of April 18 and 20, but having seen 19 and 21 suggested elsewhere, I wanted to double-check. [17:57:55] ^ that one will fix stat1001 [17:58:44] JohanJ: I'd go by what faidon (p.aravoid) says on that page [17:59:02] !log netmon1001: failed torrus service - recovery steps as outlined on wikitech [[Torrus]] [17:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:14] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [17:59:34] RECOVERY - torrus.wikimedia.org UI on netmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 2492 bytes in 0.075 second response time [18:01:03] (03PS2) 10Dzahn: ganglia: fix "defined type defined inside a class" [puppet] - 10https://gerrit.wikimedia.org/r/278229 [18:01:29] (03CR) 10Gehel: Create a portal-master dir in beta to serve master branch of portals (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/276397 (https://phabricator.wikimedia.org/T129427) (owner: 10EBernhardson) [18:06:34] (03PS5) 10Smalyshev: Add caching headers for nginx [puppet] - 10https://gerrit.wikimedia.org/r/274864 (https://phabricator.wikimedia.org/T126730) [18:08:01] !log restarting elasticsearch server elastic1031.eqiad.wmnet [18:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:08:20] \o/ last one [18:08:28] oh noes [18:08:41] MaxSem: And why noes? [18:09:05] * gehel does not understand some of the subtleties of the English language ... [18:09:10] cuz suspiciously good to be true! \m/ [18:09:35] MaxSem: Hey, you did not hire any sysadmin greenie! 
[18:10:30] (03PS10) 10Gehel: Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 (https://phabricator.wikimedia.org/T129427) (owner: 10EBernhardson) [18:12:57] (03CR) 10Gehel: [C: 032] Create a portal-master dir in beta to serve master branch of portals [puppet] - 10https://gerrit.wikimedia.org/r/276397 (https://phabricator.wikimedia.org/T129427) (owner: 10EBernhardson) [18:23:51] A commit merged today in MediaWiki core will be for wmf18 or wmf19? [18:23:53] (03CR) 10Filippo Giunchedi: Adding metric collection to nginx in the context of elasticsearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/278278 (https://phabricator.wikimedia.org/T130365) (owner: 10Gehel) [18:26:28] Dereckson: should be for wmf18. https://www.mediawiki.org/wiki/MediaWiki_1.27/Roadmap [18:30:30] 6Operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#2134817 (10Ottomata) [18:33:04] o/ sabya [18:33:13] o/ halfak [18:33:26] so, sabya and I are having some puppet trouble. We're looking for some help. [18:33:47] (03PS1) 10Ori.livneh: Adjust port range of ferm rule for redis on app servers [puppet] - 10https://gerrit.wikimedia.org/r/278326 (https://phabricator.wikimedia.org/T130364) [18:33:48] (03PS1) 10Ori.livneh: Enable reuse of sockets in TIME_WAIT state on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/278327 (https://phabricator.wikimedia.org/T130364) [18:33:50] sabya, can you give us a good pointer to your new puppet work? 
[18:34:14] https://phabricator.wikimedia.org/T106638#2134687 [18:34:39] halfak: i wanted to let you know the location of the ores role classes has changed, only the location though, not the name of them and not the content [18:34:42] (03CR) 10Ori.livneh: [C: 032 V: 032] Adjust port range of ferm rule for redis on app servers [puppet] - 10https://gerrit.wikimedia.org/r/278326 (https://phabricator.wikimedia.org/T130364) (owner: 10Ori.livneh) [18:34:51] halfak: let me see if i can help with the puppet issue [18:34:56] https://gerrit.wikimedia.org/r/#/c/277824/ [18:34:59] mutante, oh! So, how will this work for instances that we already have running [18:35:31] halfak: it changed nothing for them at all. i ran puppet manually on every single instance. no-op [18:35:47] halfak: but while i was doing that.. i found an issue with one instance running out of memory and made a ticket [18:36:14] (03CR) 10jenkins-bot: [V: 04-1] Enable reuse of sockets in TIME_WAIT state on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/278327 (https://phabricator.wikimedia.org/T130364) (owner: 10Ori.livneh) [18:36:31] sabya: why did you remove the 'require ores::base' line? [18:36:33] the instances don't see any difference because the names of the classes stay the same and are automatically loaded [18:37:04] gotcha mutante. Could you link me to the memory issue ticket? [18:37:09] I added in a later version. [18:37:15] they are now in the proper autoloader layout, module/role/manifests/ores/ [18:37:27] also you have one file for each role, instead of one large file [18:37:59] halfak: https://phabricator.wikimedia.org/T130338 [18:38:04] ori ^ but not pushed for review yet [18:38:10] it is in a local commit. [18:38:23] sabya: but why did you remove it in the first place, in https://gerrit.wikimedia.org/r/#/c/277824/ ? [18:38:45] i was getting class redeclared error. 
[18:38:48] if you require the base class, that will ensure ORES is available by the time puppet tries to configure / manage the service [18:38:59] halfak: and this fyi how the location changed and how it's one file for each role https://gerrit.wikimedia.org/r/#/c/270102/ [18:39:08] sabya: ok, so do this: [18:39:28] mutante, any thoughts on this out of memory issue? [18:40:04] MatmaRex: so the branch cut is done the day of the deployment for group0? [18:40:05] sabya: change 'require ores::base' to 'include ores::base', and add a require => Git::Clone['ores-wm-config'] to the base::service_unit resource def [18:40:35] halfak: if it needs more memory, it can probably have more because there are quota settings per project [18:40:50] (03CR) 10Gergő Tisza: "There is tooling support for XWD:1 (a Firefox plugin and a Chrome plugin) but not for XWD:log, IMO that should be fixed first." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278310 (owner: 10Ori.livneh) [18:40:52] Dereckson: usually, yes [18:40:58] yes [18:41:15] halfak: but i'd also ask for advice from andrew bogott [18:41:22] mutante, gotcha. [18:41:38] Weird that we didn't have the same issue with ores-web-01 [18:41:41] ori: thanks. let me try. [18:42:18] (03CR) 10Ori.livneh: "Gergő, both the Chrome and Firefox extensions have been updated to support multiple parameters in the X-Wikimedia-Debug header. We support" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278310 (owner: 10Ori.livneh) [18:42:30] halfak: yea, so we identified all instances that use anything ores related using the "watroles" tool.. and none of them have this issue, only web-02 [18:42:39] (03PS1) 10Mobrovac: kafka_config: Utility function for formatting configuration variables [puppet/kafka] - 10https://gerrit.wikimedia.org/r/278329 (https://phabricator.wikimedia.org/T130371) [18:42:45] tgr: see the updated https://wikitech.wikimedia.org/wiki/X-Wikimedia-Debug [18:43:21] mutante, we currently have less memory free on 01 than 02. 
Would it be crazy if I retried puppet and attributed this to "too many forked uwsgi"? [18:43:22] halfak: is it much work to create ores-web-04 ? could be worth it to just replace this instance and see if it comes back or not [18:43:35] mutante, shouldn't be much work, no [18:43:54] maybe try that lazy approach first, throw away the instance and replace with a fresh one [18:44:08] (03PS2) 10Ori.livneh: Enable reuse of sockets in TIME_WAIT state on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/278327 (https://phabricator.wikimedia.org/T130364) [18:44:15] (03PS1) 10Eevans: make logstash messages separable by cluster [puppet] - 10https://gerrit.wikimedia.org/r/278330 (https://phabricator.wikimedia.org/T130393) [18:44:40] halfak: oh, that wouldnt be crazy no. [18:45:07] I'll try that and then look into lowering the fork count and adding a third web node. [18:45:28] sounds good [18:45:37] (03CR) 10jenkins-bot: [V: 04-1] Enable reuse of sockets in TIME_WAIT state on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/278327 (https://phabricator.wikimedia.org/T130364) (owner: 10Ori.livneh) [18:46:27] iiinteresting mobrovac guess you decided to make it a global func instead of one in kafka module? [18:46:28] (03CR) 10Mobrovac: "I also plan to create a kafka_cluster_name function, but that is highly wmf-specific, so I think it'd be better suited in ops/puppet/modul" [puppet/kafka] - 10https://gerrit.wikimedia.org/r/278329 (https://phabricator.wikimedia.org/T130371) (owner: 10Mobrovac) [18:46:29] hm [18:46:40] oh, this is in kafka [18:46:41] sorry [18:46:46] Hmmm. Looks like I'll change the web servers list here: https://wikitech.wikimedia.org/wiki/Hiera:Ores [18:46:47] hmmm [18:46:51] And workers per core [18:46:57] mobrovac: i think we shoudln't pass clusters [18:46:59] as a global [18:47:07] the kafka modulle doesn't know about more than one cluster [18:47:12] ottomata: global? 
it's in module/kafka/ [18:47:20] yeah sorry, missed that [18:47:24] just saw the path, didn't see the repo [18:47:26] :) [18:47:29] haha [18:47:35] but still, you have it taking clusters as an arg [18:47:44] (03PS3) 10Ori.livneh: Enable reuse of sockets in TIME_WAIT state on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/278327 (https://phabricator.wikimedia.org/T130364) [18:47:46] (03CR) 10MaxSem: Allow finer-grained control over debug logging via XWD (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278310 (owner: 10Ori.livneh) [18:48:00] (03PS3) 10Dzahn: ganglia: fix "defined type defined inside a class" [puppet] - 10https://gerrit.wikimedia.org/r/278229 [18:48:11] the kafka module doesn't know anything about a global clusters hash with multiple clusters in it [18:49:25] if its in the kafka module, i think it should take the per cluster brokers hash directly, instead of the global one...although it is easier to use this way in ops/puppet, since you don't have pull out the per cluster brokers yourself [18:49:27] yeah i thought about avoiding the extra step of getting the single cluster hash out [18:49:39] (03CR) 10Gehel: Adding metric collection to nginx in the context of elasticsearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/278278 (https://phabricator.wikimedia.org/T130365) (owner: 10Gehel) [18:49:41] yeah, i guess its ok, but if we do that, then this probably shouldn,'t go in the kafka module [18:49:51] i can remove it too [18:50:03] MaxSem: would 'XWikimediaDebug' work? [18:50:09] instead of XWD [18:50:18] prolly:) [18:50:18] mutante, FYI https://phabricator.wikimedia.org/T130394 Thanks for your help :) [18:50:21] mobrovac: hmmmmmm.....i'm not sure, it is easier to use this way...should we just put it in the ops/puppet repo? 
[18:50:27] (03PS2) 10Gehel: Adding metric collection to nginx in the context of elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/278278 (https://phabricator.wikimedia.org/T130365) [18:50:50] ottomata: i guess we can do that too [18:51:04] halfak: alright, yw. is the puppet issue you came for answered? [18:51:23] hmmm, you know, i think that might be nicer, because then we can enforce the /kafka/$cluster_name zk chroot as a wmf convention, rather than need to parameterize it somehow in this function [18:51:24] (03CR) 10Eevans: [C: 031] Filter StatusLogger messages from UDP appender [puppet] - 10https://gerrit.wikimedia.org/r/277265 (https://phabricator.wikimedia.org/T128787) (owner: 10Eevans) [18:51:34] sabya: feel free to ping me about gerrit/puppet [18:51:40] the kafka module doesn't know anything about 'cluster name', it just takes a zk chroot parameter [18:51:50] /kafka/$cluster_name is our convention [18:52:08] sure! thanks, mutante :) [18:52:26] gehel: I think the only reason the nginx module does not configure the diamond collector by default for all nginx instances is because the nginx puppetization is in a submodule and we were trying to not have it depend on operations/puppet-specific resources [18:52:31] but i think that is out the window now [18:52:34] so it might make sense to consolidate [18:52:57] (03PS1) 10Chad: Add basic linter config for arcanist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278333 [18:53:04] (03CR) 10Ori.livneh: [C: 032] Enable reuse of sockets in TIME_WAIT state on all app servers [puppet] - 10https://gerrit.wikimedia.org/r/278327 (https://phabricator.wikimedia.org/T130364) (owner: 10Ori.livneh) [18:53:28] (03CR) 10Dzahn: [C: 032] "compiler tested no-op, issue on uranium is unrelated and the same pre-change" [puppet] - 10https://gerrit.wikimedia.org/r/278229 (owner: 10Dzahn) [18:53:40] (03CR) 10Chad: [C: 032] Add basic linter config for arcanist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278333 (owner: 
10Chad) [18:53:47] (03PS4) 10Dzahn: ganglia: fix "defined type defined inside a class" [puppet] - 10https://gerrit.wikimedia.org/r/278229 [18:53:59] lost the merge race :) [18:54:48] (03Merged) 10jenkins-bot: Add basic linter config for arcanist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278333 (owner: 10Chad) [18:54:48] 6Operations, 13Patch-For-Review, 7Performance: nf_conntrack: table full errors on Eqiad Job Runners - https://phabricator.wikimedia.org/T130364#2134929 (10ori) 5Open>3Resolved a:3ori [18:55:47] (03PS1) 10Odder: Create a HiDPI logo for the Czech Wikipedia (cswiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278334 (https://phabricator.wikimedia.org/T130392) [18:56:35] (03PS3) 10Ori.livneh: Allow finer-grained control over debug logging via XWD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278310 [18:56:36] !log demon@tin Synchronized .arclint: no op really, co master sync (duration: 00m 39s) [18:56:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:58:55] (03PS2) 10Dzahn: mediawiki/refreshlinks: move cronjob define out of class [puppet] - 10https://gerrit.wikimedia.org/r/278230 [19:03:16] 6Operations, 10DBA: Investigate/decom db2001-db2007 - https://phabricator.wikimedia.org/T125827#1998016 (10Dzahn) Maybe edit site.pp so that the actually unused ones are removed from puppet but the one still used is still in it. Then there is less ambiguity and we can move forward with the decom by revoking pu... [19:03:55] (03CR) 10Ori.livneh: "> Are we sure requests with XWD header always go to one of the designated machines?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/278310 (owner: 10Ori.livneh) [19:04:11] (03CR) 10Ori.livneh: [C: 032] Allow finer-grained control over debug logging via XWD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278310 (owner: 10Ori.livneh) [19:04:30] 6Operations, 10DBA: Investigate/decom db2001-db2007 - https://phabricator.wikimedia.org/T125827#2135011 (10Dzahn) [19:04:41] (03Merged) 10jenkins-bot: Allow finer-grained control over debug logging via XWD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278310 (owner: 10Ori.livneh) [19:06:49] !log ori@tin Synchronized wmf-config/logging.php: Iabca8858e: Allow finer-grained control over debug logging via XWD (duration: 00m 32s) [19:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:11:43] (03PS1) 10Dzahn: remove db200[1-7] from DHCP [puppet] - 10https://gerrit.wikimedia.org/r/278338 (https://phabricator.wikimedia.org/T125827) [19:14:17] (03CR) 10Jcrespo: [C: 04-1] "Not yet." [puppet] - 10https://gerrit.wikimedia.org/r/278338 (https://phabricator.wikimedia.org/T125827) (owner: 10Dzahn) [19:14:18] (03PS2) 10ArielGlenn: add dataset nfs clients to hiera, make nfs service subscribe to exports [puppet] - 10https://gerrit.wikimedia.org/r/278043 (https://phabricator.wikimedia.org/T111586) [19:14:48] mutante: launched one instance with role ores::web. now if I run there puppet agent -tv, I get this: https://gist.github.com/sabyasachi/ee108b5d35850ca4fad3 [19:15:59] 6Operations, 10DBA, 13Patch-For-Review: Investigate/decom db2001-db2007 - https://phabricator.wikimedia.org/T125827#2135050 (10jcrespo) "the actually unused ones are removed from puppet" It was like that, until Moritz readded a bunch to get security updates. Please wait until I see the final destination of... 
[19:16:19] (03CR) 10jenkins-bot: [V: 04-1] add dataset nfs clients to hiera, make nfs service subscribe to exports [puppet] - 10https://gerrit.wikimedia.org/r/278043 (https://phabricator.wikimedia.org/T111586) (owner: 10ArielGlenn) [19:16:26] sabya: looking [19:18:02] sabya: so the class ores::redisproxy has a parameter, $server, that is not optional [19:18:19] sabya: but the role labs::ores::web just includes redisproxy without specifying the $server [19:18:30] mutante, I think that param comes from wikitech [19:18:38] https://wikitech.wikimedia.org/wiki/Hiera:Ores [19:18:45] "ores::redisproxy::server": ores-redis-01.ores.eqiad.wmflabs [19:19:07] ah! hmm. then maybe this means the hiera lookup fails [19:19:33] but this line: [19:19:39] 3 include ::ores::redisproxy [19:20:01] !log upgraded bohrium VM: vcpus 2 => 8, ram 4 => 8g [19:20:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:20:21] can that work with the include? [19:20:55] is there another instance with identical roles where this doesnt happen? [19:21:00] !log rebooting bohrium [19:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:21:04] mutante, seems like it has been working like that for a while? Unless there was a recent change. [19:21:26] I did just run puppet on ores-web-02 successfully [19:22:25] RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 518 bytes in 0.008 second response time [19:23:04] PROBLEM - Misc HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [19:23:26] ^piwik errors [19:23:51] https://github.com/wikimedia/operations-puppet/blob/production/modules/ores/manifests/web.pp is 5 months old since changed. 
[19:24:12] halfak: sabya: let me double check in wikitech about the instance config [19:24:21] so, 02 runs and 04 does not [19:24:25] i'll connect [19:29:16] (03PS3) 10ArielGlenn: add dataset nfs clients to hiera, make nfs service subscribe to exports [puppet] - 10https://gerrit.wikimedia.org/r/278043 (https://phabricator.wikimedia.org/T111586) [19:29:45] also, the labsores.pp is gone in recent code? [19:30:12] sabya: yes, please look in modules/role/manifests/ores/ [19:30:22] sabya: it's all the same content, just split up in one file per role [19:30:56] FWIW, I just confirmed that starting up a new ores-web node with the puppet role worked as expected. [19:31:10] It cloned the git repo into /srv/ores/config/ [19:31:14] eh, sorry, modules/role/manifests/labs/ores/ [19:31:19] flower.pp lb.pp precached.pp redis.pp staging.pp web.pp worker.pp [19:31:25] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:29] (03CR) 10jenkins-bot: [V: 04-1] add dataset nfs clients to hiera, make nfs service subscribe to exports [puppet] - 10https://gerrit.wikimedia.org/r/278043 (https://phabricator.wikimedia.org/T111586) (owner: 10ArielGlenn) [19:31:42] hopefully that's also easier to read and edit than that single large labsores before [19:32:19] halfak:which is your instance? [19:34:55] RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 518 bytes in 9.550 second response time [19:35:20] i'm adding myself to the project in wikitech etc.. 
[19:36:43] (03PS1) 10Ori.livneh: Enable opcache for Piwik [puppet] - 10https://gerrit.wikimedia.org/r/278341 [19:36:52] sabya: ok, i see the issue now [19:37:00] (03PS2) 10Ori.livneh: Enable opcache for Piwik [puppet] - 10https://gerrit.wikimedia.org/r/278341 [19:37:00] sabya: from your paste "on node sabya2.revscoring.eqiad.wmflabs" [19:37:11] (03CR) 10Ori.livneh: [C: 032 V: 032] Enable opcache for Piwik [puppet] - 10https://gerrit.wikimedia.org/r/278341 (owner: 10Ori.livneh) [19:37:28] sabya: that instance is not running in the ores project context, but in revscoring. and the hiera values are set per project, so $server does not get set [19:37:49] Oh no! That was my mistake. [19:38:02] Was hoping sabya could work in the `revscoring` project for experimentation [19:38:29] Wait... then again, we have `ores` roles available in the `revscoring` project. [19:38:41] Must have pre-dated our switch to `ores` hiera [19:38:44] 6Operations, 10ops-eqiad, 6DC-Ops: testing: r430 server / h800 controller / md1200 shelf - https://phabricator.wikimedia.org/T127490#2045136 (10chasemp) Verdict here is 'no go'? [19:38:45] maybe those roles dont use hiera? [19:38:58] Presumably, we can just copy the `hiera` from ores to revscoring, right? [19:39:07] we can easily create new projects [19:39:14] yea, i think so [19:39:32] just saying, it costs me 5 seconds to give you a new ores-test if you want to [19:39:36] https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ARevscoring&type=revision&diff=375830 [19:39:41] {{done}} [19:39:48] :P [19:40:10] Woops! I still messed it up :S [19:40:14] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:40:45] Actually... that'll probably work fine. [19:41:48] halfak: can you add me to ORES context as well? 
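Note on the diagnosis above (mandatory class parameter `$server` unset because Hiera values are defined per Labs project): the following is only a rough sketch of that behaviour, not the real Hiera or Puppet API. The `HIERA` dict, `lookup`, `copy_hiera` and `MissingParameter` are hypothetical names made up for illustration; the key name, the server value and the two project names are taken from the discussion.

```python
# Toy model of per-project Hiera data (assumption: a plain dict keyed by
# project name; real Hiera has backends and hierarchies not modelled here).
HIERA = {
    "ores": {"ores::redisproxy::server": "ores-redis-01.ores.eqiad.wmflabs"},
    "revscoring": {},  # no ORES keys defined for this project
}


class MissingParameter(Exception):
    """Raised when a mandatory class parameter has no value (hypothetical)."""


def lookup(project, key):
    # A class parameter without a default must be resolved from the data of
    # the project the instance belongs to; otherwise the run fails.
    try:
        return HIERA[project][key]
    except KeyError:
        raise MissingParameter(
            f"{key} is not set for project {project!r}"
        ) from None


def copy_hiera(src, dst):
    # Mirrors the "copy the hiera from ores to revscoring" fix.
    HIERA[dst].update(HIERA[src])
```

In this sketch an instance in `revscoring` fails the lookup until `copy_hiera("ores", "revscoring")` is called, which roughly matches why puppet started working after the Hiera page was copied.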
[19:41:54] RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 518 bytes in 9.635 second response time [19:42:07] 6Operations, 6Labs: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2135094 (10chasemp) [19:42:09] 6Operations, 6Labs, 10Labs-Infrastructure: Unable to connect both redundant labstores to the shelves in parallel - https://phabricator.wikimedia.org/T117453#2135093 (10chasemp) 5Open>3Invalid [19:46:03] sabya, {{done}} [19:46:18] (03PS1) 10Dereckson: New blog for English Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/278342 [19:46:21] thanks. puppet runs now [19:46:25] sabya, still, I think we should try to work from `revscoring` project [19:46:27] great! [19:47:39] yes, the instance is still in revscoring project. puppet seems to run happily now. [19:48:47] great. [19:48:53] hiera copy-pasta FTW [19:49:57] glad it works, cool [19:50:44] i would recommend keeping the ores instances in the ores project, but whatever works is fine [19:51:15] mutante, I feel weird about starting up a bunch of instances next to our productionish service [19:51:26] Hence the desire to test in `revscoring` [19:51:34] I'm interested in your thoughts on the subject [19:51:48] halfak: how about i make ores-staging [19:51:52] it's super quick [19:52:14] mutante, OK that works for me. [19:52:16] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:59] !log temporarily disabling puppet on krypton [19:53:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:54:00] * halfak kills some old stuff in the `revscoring` project [19:55:12] halfak: i created the project and made you both admins. would you do the copy/paste for hiera again? [19:55:31] mutante, will do. 
[19:55:36] sabya: if you set the project filter to "ores-staging" now, you are admin and can create your own instances [19:56:21] great [19:57:12] (03PS2) 10Eevans: [WIP]: write cassandra instance yaml descriptors [puppet] - 10https://gerrit.wikimedia.org/r/277865 [19:57:21] (03CR) 10Eevans: [WIP]: write cassandra instance yaml descriptors (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/277865 (owner: 10Eevans) [19:57:58] (03PS1) 10Papaul: Decom:Removed production DNS for db200[1-7] Bug:T125827 [dns] - 10https://gerrit.wikimedia.org/r/278344 (https://phabricator.wikimedia.org/T125827) [19:58:17] (03CR) 10jenkins-bot: [V: 04-1] [WIP]: write cassandra instance yaml descriptors [puppet] - 10https://gerrit.wikimedia.org/r/277865 (owner: 10Eevans) [20:00:17] (03PS3) 10Eevans: [WIP]: write cassandra instance yaml descriptors [puppet] - 10https://gerrit.wikimedia.org/r/277865 [20:02:55] RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 518 bytes in 9.885 second response time [20:05:34] sabya, no rush to switch over. [20:05:46] I think we should eventually deprecate the `revscoring` project though. [20:06:03] yes, ofcourse [20:06:08] We have your test instances and Amir1's test ORES extension instance -- which is now running on the beta cluster anyway. [20:07:26] I need to say we should keep the ores extension instance [20:07:44] Amir1, OK. No problems. [20:07:56] we can move it to somewhere else [20:08:03] I imagine that in a while (weeks or month) we can reconsider shutting down that instance. [20:08:11] How much trouble is it to move? [20:08:18] 6Operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2135197 (10Paladox) >>! In T124474#2027410, @mobrovac wrote: > I'm setting this task to stalled, as this is only relevant for edge cases where we execute `npm install` directly on the hosts (some testing hosts and CI). I don't... 
[20:08:21] not so much [20:08:25] data doesn't matter at all [20:08:34] we can re-setup [20:08:45] and we have import/export [20:10:54] (03PS4) 10ArielGlenn: add dataset nfs clients to hiera, make nfs service subscribe to exports [puppet] - 10https://gerrit.wikimedia.org/r/278043 (https://phabricator.wikimedia.org/T111586) [20:11:54] Amir1, no rush. I'll make a task for it. [20:11:58] Get to it whenever :) [20:12:21] sure [20:13:09] (03PS1) 10Dereckson: New blog for French, English, Arabic Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/278345 [20:13:11] (03PS5) 10ArielGlenn: add dataset nfs clients to hiera, make nfs service subscribe to exports [puppet] - 10https://gerrit.wikimedia.org/r/278043 (https://phabricator.wikimedia.org/T111586) [20:13:37] (03PS2) 10Dereckson: New blog for French, English, Arabic Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/278345 [20:14:43] (03CR) 10ArielGlenn: [C: 032] add dataset nfs clients to hiera, make nfs service subscribe to exports [puppet] - 10https://gerrit.wikimedia.org/r/278043 (https://phabricator.wikimedia.org/T111586) (owner: 10ArielGlenn) [20:22:45] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 1 failures [20:22:46] 6Operations, 6Discovery, 10Kartotherian, 10Maps, 3Discovery-Maps-Sprint: Maps hardware planning for FY16/17 - https://phabricator.wikimedia.org/T125126#2135274 (10Tfinc) @mark Yes, lets move forward with this. 
thanks [20:23:26] (03CR) 10Tim Landscheidt: [C: 031] "The change itself looks good to me, but I think the whole class should be deleted as apparently it is not used (and in the wrong file anyw" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/278233 (owner: 10Dzahn) [20:23:46] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:25:01] 6Operations, 10hardware-requests: new labstore hardware for eqiad - https://phabricator.wikimedia.org/T126089#2135278 (10chasemp) An overview of where I'm at with this to further T127508 In https://phabricator.wikimedia.org/T127508#2063406 I outlined that I believe our historical perspective on this has been... [20:25:14] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures [20:25:43] (03PS1) 10BryanDavis: Logging: add ApiAction kafka logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278347 (https://phabricator.wikimedia.org/T108618) [20:29:34] 6Operations, 10Ops-Access-Requests, 6Discovery, 10Maps: Requesting maps-admins access for Eric Evans - https://phabricator.wikimedia.org/T130412#2135290 (10Eevans) [20:30:45] RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 518 bytes in 0.010 second response time [20:30:50] (03PS1) 10ArielGlenn: fix embarrassing ^t typo in exports list for datasets in hiera [puppet] - 10https://gerrit.wikimedia.org/r/278348 [20:31:12] (03PS1) 10Eevans: add eevans to maps-admins group [puppet] - 10https://gerrit.wikimedia.org/r/278349 (https://phabricator.wikimedia.org/T130412) [20:31:54] (03PS3) 10Dzahn: New blog for French, English, Arabic Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/278345 (owner: 10Dereckson) [20:32:21] (03CR) 10ArielGlenn: [C: 032] fix embarrassing ^t typo in exports list for datasets in hiera [puppet] - 10https://gerrit.wikimedia.org/r/278348 (owner: 10ArielGlenn) [20:32:23] (03CR) 10Dzahn: [C: 032] New blog for French, English, 
Arabic Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/278345 (owner: 10Dereckson) [20:33:28] (03PS4) 10Dzahn: New blog for French, English, Arabic Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/278345 (owner: 10Dereckson) [20:34:45] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:24] (03PS3) 10Dzahn: mediawiki/refreshlinks: move cronjob define out of class [puppet] - 10https://gerrit.wikimedia.org/r/278230 [20:35:48] (03CR) 10Dzahn: [C: 032] "no-op , terbium and also the new terbium-equivalent http://puppet-compiler.wmflabs.org/2108/" [puppet] - 10https://gerrit.wikimedia.org/r/278230 (owner: 10Dzahn) [20:36:26] (03PS2) 10Dereckson: New blog for English Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/278342 [20:36:51] (03CR) 10Dereckson: "PS2: rebased against 278345" [puppet] - 10https://gerrit.wikimedia.org/r/278342 (owner: 10Dereckson) [20:36:52] 6Operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2135319 (10Krinkle) >>! In T124474#2135197, @Paladox wrote: > npm is running 2.x on node 0.10 but is running npm 1.x on node 4.3 so it is a down grade. We have tried npm 3.x by adding npm 3 to dependencies and it worked so no... 
[20:37:03] 6Operations, 6Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2135321 (10Krinkle) [20:37:26] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:37:54] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:15] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:35] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:40:05] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [20:40:28] (03PS1) 10Dereckson: New blog for English Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/278350 [20:40:48] (03PS3) 10Dzahn: New blog for English Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/278342 (owner: 10Dereckson) [20:40:54] (03PS2) 10Dereckson: New blog for English Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/278350 [20:41:14] 7Blocked-on-Operations, 6Operations, 10Datasets-General-or-Unknown: Snapshot hosts need to be manually added to dataset1001's exports - https://phabricator.wikimedia.org/T111586#2135327 (10ArielGlenn) 5Open>3Resolved Done. Closing. 
[20:41:17] 6Operations, 10Dumps-Generation, 7HHVM, 13Patch-For-Review: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#2135329 (10ArielGlenn) [20:41:37] (03CR) 10Dzahn: [C: 032] "thanks for the transparency disclaimer , approved :)" [puppet] - 10https://gerrit.wikimedia.org/r/278342 (owner: 10Dereckson) [20:44:49] (03PS3) 10Dzahn: New blog for English Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/278350 (owner: 10Dereckson) [20:45:51] (03CR) 10Dzahn: [C: 032] "yea, it would be nice if he could filter by tag, but still mostly all of it is on topic, except the occasional space shuttle article, afai" [puppet] - 10https://gerrit.wikimedia.org/r/278350 (owner: 10Dereckson) [20:47:33] (03PS3) 10Gehel: Adding a `$ensure` parameter to nginx::status_site [puppet/nginx] - 10https://gerrit.wikimedia.org/r/278276 (https://phabricator.wikimedia.org/T130365) [20:51:45] RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 518 bytes in 9.809 second response time [20:52:31] 6Operations, 10Traffic, 10Wikimedia-Fundraising, 7Blocked-on-Fundraising-Tech, 7HTTPS: links.email.donate.wikimedia.org should offer HTTPS - https://phabricator.wikimedia.org/T74514#2135383 (10Dzahn) https://links.email.donate.wikimedia.org/ links.email.donate.wikimedia.org uses an invalid security cert... [20:53:09] (03CR) 10Krinkle: [C: 032] Use 'include' instead of 'include_once' in missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277941 (owner: 10Krinkle) [20:53:19] !log reenabling puppet on krypton [20:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:53:40] 6Operations, 10Traffic, 10Wikimedia-Fundraising, 7Blocked-on-Fundraising-Tech, 7HTTPS: links.email.donate.wikimedia.org should offer HTTPS - https://phabricator.wikimedia.org/T74514#2135385 (10Dzahn) >>! 
In T74514#2054126, @CCogdill_WMF wrote: > It will be in use for another couple of weeks, until the Sw... [20:53:55] (03Merged) 10jenkins-bot: Use 'include' instead of 'include_once' in missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/277941 (owner: 10Krinkle) [20:54:42] 6Operations, 10Traffic, 10Wikimedia-Fundraising, 7Blocked-on-Fundraising-Tech, 7HTTPS: links.email.donate.wikimedia.org should offer HTTPS - https://phabricator.wikimedia.org/T74514#2135386 (10CCogdill_WMF) Apologies @Dzahn, you're totally right. We don't use the links.email.donate.wikimedia.org domain a... [20:55:35] 6Operations, 10Traffic, 10Wikimedia-Fundraising, 7Blocked-on-Fundraising-Tech, 7HTTPS: delete links.email.donate.wikimedia.org from DNS - https://phabricator.wikimedia.org/T130414#2135390 (10Dzahn) [20:56:26] 6Operations, 10Traffic, 10Wikimedia-Fundraising, 7Blocked-on-Fundraising-Tech, 7HTTPS: links.email.donate.wikimedia.org should offer HTTPS - https://phabricator.wikimedia.org/T74514#758165 (10Dzahn) @CCogdill_WMF thank you, that is good news. i was just going through that old wiki page [20:56:36] 6Operations, 10Traffic, 10Wikimedia-Fundraising, 7Blocked-on-Fundraising-Tech, 7HTTPS: delete links.email.donate.wikimedia.org from DNS - https://phabricator.wikimedia.org/T130414#2135421 (10CCogdill_WMF) My comment was incorrect, this domain is no longer in use. Feel free to delete the DNS record! [20:57:59] 6Operations, 10Traffic, 10Wikimedia-Fundraising, 7HTTPS: delete links.email.donate.wikimedia.org from DNS - https://phabricator.wikimedia.org/T130414#2135422 (10greg) [21:03:03] what's up with https://irc.wikimedia.org/ no redirect but also no error [21:03:06] loop? 
[21:03:30] used to be a redirect to docs on meta about irc
[21:05:55] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:06:14] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[21:07:24] (CR) BryanDavis: [C: 1] make logstash messages separable by cluster [puppet] - https://gerrit.wikimedia.org/r/278330 (https://phabricator.wikimedia.org/T130393) (owner: Eevans)
[21:08:46] sync-common on mw1017 is saying:
[21:08:51] cannot delete non-empty directory: php-1.27.0-wmf.11/cache/l10n
[21:08:51] cannot delete non-empty directory: php-1.27.0-wmf.11/cache/l10n
[21:08:51] cannot delete non-empty directory: php-1.27.0-wmf.11/cache
[21:08:51] cannot delete non-empty directory: php-1.27.0-wmf.11/cache
[21:08:51] cannot delete non-empty directory: php-1.27.0-wmf.11
[21:08:57] bd808: twentyafterfour: ostriches
[21:09:13] sort of normal
[21:09:19] First time I see it
[21:09:33] it means somebody didn't fully clean up on an l10n delete
[21:09:38] !log krinkle@tin Synchronized wmf-config/missing.php: (no message) (duration: 00m 25s)
[21:09:40] That ^
[21:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:10:06] I'll fix it
[21:10:11] ostriches: "we" should fix up a better script for that
[21:10:15] Yeah
[21:10:35] deleteMediaWiki should be smarter.
[21:10:51] I *think* scap-purge-l10n-cache does the right thing
[21:10:56] or did at some point
[21:11:05] RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 518 bytes in 9.798 second response time
[21:11:26] !log cleaned up stale /srv/mediawiki/php-1.27.0-wmf.{10,11} from the apaches.
[21:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:11:32] Krinkle: Should be all gone now ^
[21:12:08] I actually saw that yesterday afternoon but got distracted by other borken things and forgot to clean it up
[21:13:42] mutante, doesn't etherpad work fine over https?
[21:14:12] Krenair: i think so yea, i was also wondering about that comment
[21:14:50] https://etherpad.wikimedia.org/ works fine for me
[21:14:56] torrus works over HTTPS
[21:15:00] all of my bookmarked pads are https
[21:15:38] Operations, Traffic, Wikimedia-Fundraising, HTTPS: delete links.email.donate.wikimedia.org from DNS - https://phabricator.wikimedia.org/T130414#2135450 (Dzahn) How about this one: open.email.donate.wikimedia.org When i look at DNS i see these 2 lines, related to mkt41 links.email.donate 1H...
[21:16:05] Krenair: does it work at all?
[21:16:09] right now
[21:16:16] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:16:47] we should check if this is true or not:
[21:16:50] " not excluded in HTTPS Everywhere "
[21:16:54] Reedy: :)
[21:17:22] yes, thanks, i'm marking etherpad as resolved
[21:18:49] (PS7) 20after4: Add a deployment source & target class for phabricator [puppet] - https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363)
[21:20:27] (CR) jenkins-bot: [V: -1] Add a deployment source & target class for phabricator [puppet] - https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 20after4)
[21:21:10] (PS1) Dzahn: delete links.email.donate.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/278353 (https://phabricator.wikimedia.org/T130414)
[21:26:18] Operations, Services: Package npm 2.14 - https://phabricator.wikimedia.org/T124474#2135457 (Paladox) >>! In T124474#2135319, @Krinkle wrote: >>>!
In T124474#2135197, @Paladox wrote: >> npm is running 2.x on node 0.10 but is running npm 1.x on node 4.3 so it is a down grade. We have tried npm 3.x by adding...
[21:26:18] RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 518 bytes in 9.559 second response time
[21:26:18] Krinkle hi about using puppet i think https://github.com/wikimedia/operations-puppet/blob/4b78fffdfb/modules/contint/manifests/packages/javascript.pp was broken so it was removed from nodepool.
[21:26:20] Krinkle see https://github.com/wikimedia/integration-config/commit/cf7b841e354ebe60981f6e3e8c733d47bb410ed7
[21:26:20] Please
[21:26:22] (PS3) Dzahn: mediawiki/updatequerypages: move defines out of class [puppet] - https://gerrit.wikimedia.org/r/278231
[21:27:08] (CR) Dzahn: [C: 2] "this also fixes these odd resource names" [puppet] - https://gerrit.wikimedia.org/r/278231 (owner: Dzahn)
[21:28:26] PROBLEM - piwik.wikimedia.org on bohrium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:31:55] RECOVERY - piwik.wikimedia.org on bohrium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 518 bytes in 0.019 second response time
[21:34:08] what is the actual $::mediawiki::users::web set to
[21:34:41] i cant seem to find where it actually gets the value
[21:35:26] (CR) Dzahn: "confirmed no-op on terbium and mw2090" [puppet] - https://gerrit.wikimedia.org/r/278231 (owner: Dzahn)
[21:36:11] (PS2) Dzahn: delete links.email.donate.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/278353 (https://phabricator.wikimedia.org/T130414)
[21:38:08] (CR) Dzahn: mha: let lint ignore nested classes/defines (1 comment) [puppet] - https://gerrit.wikimedia.org/r/278233 (owner: Dzahn)
[21:39:48] (CR) Dzahn: mha: let lint ignore nested classes/defines (1 comment) [puppet] - https://gerrit.wikimedia.org/r/278233 (owner: Dzahn)
[21:40:13] (PS2) Dzahn: mha: let lint ignore nested classes/defines [puppet] -
https://gerrit.wikimedia.org/r/278233
[21:40:22] (CR) Dzahn: [C: 2] mha: let lint ignore nested classes/defines [puppet] - https://gerrit.wikimedia.org/r/278233 (owner: Dzahn)
[21:41:25] Operations, Traffic, Wikimedia-Fundraising, HTTPS, Patch-For-Review: delete links.email.donate.wikimedia.org from DNS - https://phabricator.wikimedia.org/T130414#2135502 (CCogdill_WMF) Would it be possible to get a full list of DNS entries you want to remove? I’m seeing A, MX, CNAME and TXT re...
[21:47:01] (PS8) 20after4: Add a deployment source & target class for phabricator [puppet] - https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363)
[21:49:34] Operations, Traffic, Wikimedia-Fundraising, HTTPS, Patch-For-Review: delete links.email.donate.wikimedia.org from DNS - https://phabricator.wikimedia.org/T130414#2135531 (Dzahn) So far i just wanted to deleted links.email.donate.wikimedia.org (because it was listed on https://wikitech.wikimedia...
[21:50:28] Operations, Traffic, Wikimedia-Fundraising, HTTPS, Patch-For-Review: delete links.email.donate.wikimedia.org (and all other email.donate.*?)
from DNS - https://phabricator.wikimedia.org/T130414#2135544 (10Dzahn) [21:51:01] 7Puppet, 10Scap3 (Scap3-Adoption-Phase1): move scap3 keyholder configuration to hiera to avoid proliferation of more*::deployment::source classes - https://phabricator.wikimedia.org/T130419#2135545 (10mmodell) [21:54:09] (03CR) 10Dzahn: [C: 031] Adding a `$ensure` parameter to nginx::status_site [puppet/nginx] - 10https://gerrit.wikimedia.org/r/278276 (https://phabricator.wikimedia.org/T130365) (owner: 10Gehel) [21:57:01] (03Abandoned) 10Legoktm: Output PHP version before running phpunit tests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/271203 (owner: 10Legoktm) [21:59:58] (03PS1) 10Dzahn: puppet-lint: remove exception for nested classes [puppet] - 10https://gerrit.wikimedia.org/r/278399 (https://phabricator.wikimedia.org/T93645) [22:01:00] (03PS2) 10Dzahn: puppet-lint: remove exception for nested classes [puppet] - 10https://gerrit.wikimedia.org/r/278399 (https://phabricator.wikimedia.org/T93645) [22:01:30] (03PS3) 10Dzahn: puppet-lint: remove exception for nested classes [puppet] - 10https://gerrit.wikimedia.org/r/278399 (https://phabricator.wikimedia.org/T93645) [22:02:08] (03CR) 10Dereckson: [C: 031] Create a HiDPI logo for the Czech Wikipedia (cswiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/278334 (https://phabricator.wikimedia.org/T130392) (owner: 10Odder) [22:02:10] (03CR) 10Dzahn: "the "operations-puppet-puppetlint-strict SUCCESS in 43s" on this is proof they are all gone :)" [puppet] - 10https://gerrit.wikimedia.org/r/278399 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [22:02:33] 6Operations, 10Traffic, 10Wikimedia-Fundraising, 7HTTPS, 13Patch-For-Review: delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS - https://phabricator.wikimedia.org/T130414#2135591 (10CCogdill_WMF) Great, thanks @Dzahn. We're double checking this list with IBM to be extra ca... 
[22:03:26] (CR) Dzahn: [C: 2] puppet-lint: remove exception for nested classes [puppet] - https://gerrit.wikimedia.org/r/278399 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[22:04:21] (PS1) ArielGlenn: dump url shorteners for wiki projects [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986)
[22:04:24] Operations, Need-volunteer, Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#2135593 (Dzahn) removed all nested classes and defines https://gerrit.wikimedia.org/r/#/c/278225/ https://gerrit.wikimedia.org/r/#/c/278230/ https://gerrit.wikimedia.org...
[22:04:59] (PS2) ArielGlenn: dump url shorteners for wiki projects [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986)
[22:06:41] (CR) ArielGlenn: "Do we want this to run daily? Maybe that's overkill." [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: ArielGlenn)
[22:09:20] (PS1) Eevans: enable instance 'b'; restbase1013-b [puppet] - https://gerrit.wikimedia.org/r/278402 (https://phabricator.wikimedia.org/T125842)
[22:10:18] apergos: \o/
[22:10:23] Operations, Traffic, Wikimedia-Fundraising, HTTPS, Patch-For-Review: delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS - https://phabricator.wikimedia.org/T130414#2135644 (Dzahn) @CCogdill_WMF Sounds great. thank you. They should be all. this is the full list i f...
[22:10:27] heh
[22:11:01] apergos: daily probably is overkill, we had said weekly before on the task
[22:11:07] ah right
[22:11:11] let me adjust the patch
[22:12:18] (PS1) Dzahn: puppet-lint: remove exception for old 2.6 puppet [puppet] - https://gerrit.wikimedia.org/r/278404
[22:12:20] (PS3) ArielGlenn: dump url shorteners for wiki projects [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986)
[22:12:39] (PS2) Dzahn: puppet-lint: remove exception for old 2.6 puppet [puppet] - https://gerrit.wikimedia.org/r/278404 (https://phabricator.wikimedia.org/T93645)
[22:13:01] (PS3) Dzahn: puppet-lint: remove exception for old 2.6 puppet [puppet] - https://gerrit.wikimedia.org/r/278404 (https://phabricator.wikimedia.org/T93645)
[22:14:12] (CR) Legoktm: "Yay! I suggested minor tweaks to the index description" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: ArielGlenn)
[22:14:14] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[22:14:17] (CR) Dzahn: "sample wiki = meta ?" [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: ArielGlenn)
[22:14:46] (CR) jenkins-bot: [V: -1] puppet-lint: remove exception for old 2.6 puppet [puppet] - https://gerrit.wikimedia.org/r/278404 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[22:15:33] (CR) Dzahn: [C: -1] "hah!
apparently this is NOT just for puppet 2.6 -> " ./modules/mysql/manifests/init.pp:17 WARNING class inheriting from params class (clas" [puppet] - https://gerrit.wikimedia.org/r/278404 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[22:16:57] (PS4) Dzahn: puppet-lint: remove exception for old 2.6 puppet [puppet] - https://gerrit.wikimedia.org/r/278404 (https://phabricator.wikimedia.org/T93645)
[22:17:31] (PS5) Dzahn: puppet-lint: remove exception for "class_inherits_from_params_class" [puppet] - https://gerrit.wikimedia.org/r/278404 (https://phabricator.wikimedia.org/T93645)
[22:19:05] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[22:19:50] (CR) jenkins-bot: [V: -1] puppet-lint: remove exception for "class_inherits_from_params_class" [puppet] - https://gerrit.wikimedia.org/r/278404 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[22:21:37] Operations, DBA, Patch-For-Review: Investigate/decom db2001-db2007 - https://phabricator.wikimedia.org/T125827#2135677 (Dzahn) Moritz said he is adding the updates because the servers are up. This might be a catch 22.
[22:26:04] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:26:16] (CR) Dzahn: "I'm not sure if that actually makes it harder. In the times of v4 IP space running out and RIPE asking for justification for new IP blocks" [puppet] - https://gerrit.wikimedia.org/r/277904 (owner: Paladox)
[22:27:21] (CR) Dzahn: "but "i'm not sure" actually means "i'm not sure" here, not just the polite version of "no" :)" [puppet] - https://gerrit.wikimedia.org/r/277904 (owner: Paladox)
[22:27:32] (CR) Paladox: "But if you use the name of domain everytime they change the ip it will always be blocked unless they change domain."
[puppet] - https://gerrit.wikimedia.org/r/277904 (owner: Paladox)
[22:27:47] (CR) Paladox: "Ok." [puppet] - https://gerrit.wikimedia.org/r/277904 (owner: Paladox)
[22:35:15] PROBLEM - puppet last run on db2063 is CRITICAL: CRITICAL: puppet fail
[22:43:12] Operations, Project-Admins, DevRel-March-2016: Operations-related subprojects/tags reorganization - https://phabricator.wikimedia.org/T119944#2135724 (RobH) S >>! In T119944#2129274, @Aklapper wrote: >>>! In T119944#2056247, @Aklapper wrote: >>>! In T119944#2044189, @faidon wrote: >>> * For #DC-Ops, ta...
[22:46:18] (PS1) Dereckson: Disable upload on ia.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/278411 (https://phabricator.wikimedia.org/T130425)
[22:49:16] (PS1) ArielGlenn: onallwikis: make dryrun option do something useful [dumps] (ariel) - https://gerrit.wikimedia.org/r/278412
[22:50:40] (CR) ArielGlenn: [C: 2 V: 2] onallwikis: make dryrun option do something useful [dumps] (ariel) - https://gerrit.wikimedia.org/r/278412 (owner: ArielGlenn)
[22:52:53] (PS4) ArielGlenn: dump url shorteners for wiki projects [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986)
[22:53:14] (PS1) Halfak: Adds arabic and polish languages files to ores role. [puppet] - https://gerrit.wikimedia.org/r/278413
[22:53:47] (PS5) ArielGlenn: dump url shorteners for wiki projects [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986)
[22:53:53] o/ mutante. Got a minute to look at https://gerrit.wikimedia.org/r/#/c/278413/ ?
[22:53:58] Should be an easy one :)
[22:54:22] (CR) ArielGlenn: "Meta is better, you're right."
[puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: ArielGlenn)
[22:55:00] (CR) ArielGlenn: dump url shorteners for wiki projects (1 comment) [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986) (owner: ArielGlenn)
[22:58:50] mutante: I think that's still an exception
[22:59:25] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[22:59:33] mutante: https://github.com/EFForg/https-everywhere/blob/master/src/chrome/content/rules/Wikimedia.xml#L10
[23:00:24] (CR) Ladsgroup: [C: 1] Adds arabic and polish languages files to ores role. [puppet] - https://gerrit.wikimedia.org/r/278413 (owner: Halfak)
[23:00:45] Operations, Traffic, Wikimedia-Fundraising, HTTPS, Patch-For-Review: delete links.email.donate.wikimedia.org (and all other email.donate.*?) from DNS - https://phabricator.wikimedia.org/T130414#2135390 (Reedy) https://github.com/EFForg/https-everywhere/blob/master/src/chrome/content/rules/Wiki...
[23:03:06] RECOVERY - puppet last run on db2063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:13:26] RECOVERY - Misc HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[23:14:02] Blocked-on-Operations, Datasets-Archiving, Dumps-Generation, Flow, Collaboration-Team-Current: Publish recurring Flow dumps at http://dumps.wikimedia.org/ - https://phabricator.wikimedia.org/T119511#2135793 (ArielGlenn) That looks better. I need to see if all the options necessary for dumps are...
[23:17:01] (CR) Tim Landscheidt: mha: let lint ignore nested classes/defines (1 comment) [puppet] - https://gerrit.wikimedia.org/r/278233 (owner: Dzahn)
[23:35:06] !log krinkle@tin Synchronized php-1.27.0-wmf.17/extensions/WikimediaEvents/modules/ext.wikimediaEvents.deprecate.js: (no message) (duration: 00m 35s)
[23:35:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:36:14] PROBLEM - Host oxygen is DOWN: PING CRITICAL - Packet loss = 100%
[23:42:42] (CR) Dereckson: [C: 1] Update favicon for Test Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/278321 (https://phabricator.wikimedia.org/T70728) (owner: Odder)
[23:48:41] (PS6) Dereckson: Enable Translation extension on AffCom (chapcomwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/275289 (https://phabricator.wikimedia.org/T66122) (owner: MarcoAurelio)
[23:50:28] (CR) Dereckson: "I removed this part of the commit, as tables should be created before deployment, and as a commit message is not a deploy howto. I recopy " [mediawiki-config] - https://gerrit.wikimedia.org/r/275289 (https://phabricator.wikimedia.org/T66122) (owner: MarcoAurelio)
[23:50:41] (CR) Dereckson: [C: 1] Enable Translation extension on AffCom (chapcomwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/275289 (https://phabricator.wikimedia.org/T66122) (owner: MarcoAurelio)
[23:58:03] (CR) Dereckson: [C: -1] "Please mark this patch as abandoned, as it has been superseded by I8f807023a9288c263f8841b5e3c5428b814abe7e." [mediawiki-config] - https://gerrit.wikimedia.org/r/273774 (owner: Kelson)
[23:59:27] (CR) Dereckson: [C: 1] Set $wgNamespacesWithSubpages to true for NS_TEMPLATE for ru.wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/276737 (https://phabricator.wikimedia.org/T124615) (owner: Pmlineditor)