[00:02:35] (PS15) Chad: Moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[00:06:53] (CR) Chad: "Wheee all of it works minus the symlink swap bit. Will fix that next" [mediawiki-config] - https://gerrit.wikimedia.org/r/367828 (owner: Chad)
[00:09:15] PROBLEM - HHVM jobrunner on mw1161 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time
[00:10:15] RECOVERY - HHVM jobrunner on mw1161 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[00:12:37] !log Remove 2FA from users Alan and Magicknight94 (retroactive log)
[00:12:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:13:02] (sorry)
[00:14:38] (PS10) GeoffreyT2000: Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264)
[00:29:51] (PS16) Chad: Moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[00:38:13] (PS17) Chad: Moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[00:41:52] (PS18) Chad: Moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[00:43:43] (PS19) Chad: Moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828
[00:44:01] (CR) Chad: "Only took 19 iterations on this but hey it works and it's way nicer than PHP :)" [mediawiki-config] - https://gerrit.wikimedia.org/r/367828 (owner: Chad)
[00:55:46] bd808: We should move the SAL to its own namespace on wikitech
[00:55:56] A) makes it easier to exclude from searches (often a false positive)
[00:56:16] B) Makes it easier to start archives by pagenames as opposed to having to move things every so often
[00:56:29] (so [[SAL]] could just redirect to current one)
[00:56:34] Or transclude it
[00:58:43] (CR) Chad: [C: 2] Moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828 (owner: Chad)
[01:01:18] (Merged) jenkins-bot: Moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828 (owner: Chad)
[01:01:34] (CR) jenkins-bot: Moving update wikiversions to scap [mediawiki-config] - https://gerrit.wikimedia.org/r/367828 (owner: Chad)
[01:15:57] !log demon@tin Synchronized multiversion/: rm old and busted updateWikiversions (duration: 01m 07s)
[01:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:17:09] !log demon@tin Synchronized scap/plugins/updatewikiversions.py: add new hotness updatewikiversions.py (duration: 00m 45s)
[01:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:37:31] (PS1) Chad: updatewikiversions.py: Retain the space between key and value [mediawiki-config] - https://gerrit.wikimedia.org/r/378828
[01:37:33] (CR) Chad: [C: 2] updatewikiversions.py: Retain the space between key and value [mediawiki-config] - https://gerrit.wikimedia.org/r/378828 (owner: Chad)
[01:40:10] (Merged) jenkins-bot: updatewikiversions.py: Retain the space between key and value [mediawiki-config] - https://gerrit.wikimedia.org/r/378828 (owner: Chad)
[01:40:20] (CR) jenkins-bot: updatewikiversions.py: Retain the space between key and value [mediawiki-config] - https://gerrit.wikimedia.org/r/378828 (owner: Chad)
[01:41:42] !log demon@tin Synchronized scap/plugins/updatewikiversions.py: Minor spacing nitpick, that's it for today (duration: 00m 45s)
[01:41:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:44:28] (PS1) Chad: updatewikiversions.py: also end file with newline, which json.dump() does not [mediawiki-config] - https://gerrit.wikimedia.org/r/378829
[01:44:30] (CR) Chad: [C: 2] updatewikiversions.py: also end file with newline, which json.dump() does not [mediawiki-config] - https://gerrit.wikimedia.org/r/378829 (owner: Chad)
[01:46:55] (Merged) jenkins-bot: updatewikiversions.py: also end file with newline, which json.dump() does not [mediawiki-config] - https://gerrit.wikimedia.org/r/378829 (owner: Chad)
[01:47:04] (CR) jenkins-bot: updatewikiversions.py: also end file with newline, which json.dump() does not [mediawiki-config] - https://gerrit.wikimedia.org/r/378829 (owner: Chad)
[01:52:27] !log demon@tin Synchronized scap/plugins/updatewikiversions.py: Minor newline nitpick. I promise that's it (duration: 00m 46s)
[01:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:21:50] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.18) (duration: 07m 04s)
[02:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:28:56] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Sep 19 02:28:56 UTC 2017 (duration 7m 6s)
[02:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:44:38] (PS1) KartikMistry: Remove wgContentTranslationEnableSuggestions [mediawiki-config] - https://gerrit.wikimedia.org/r/378833
[02:53:22] (Abandoned) Mattflaschen: Enable $wgStructuredChangeFiltersOnWatchlist on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/378263 (https://phabricator.wikimedia.org/T164234) (owner: Catrope)
[03:09:44] (PS4) Chad: Remove $stdlogo entirely [mediawiki-config] - https://gerrit.wikimedia.org/r/359037 (owner: Reedy)
[03:29:19] (CR) Smalyshev: "@ArielGlenn True, but I wasn't sure how to do it. Should I make some include file or modify dump_functions.sh? Where should I place that i" [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774) (owner: Smalyshev)
[03:54:04] (PS7) BryanDavis: Add a default Apache 2.0 license [puppet] - https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: Rush)
[04:07:30] (PS8) BryanDavis: Add a default Apache 2.0 license [puppet] - https://gerrit.wikimedia.org/r/183862 (https://phabricator.wikimedia.org/T67270) (owner: Rush)
[04:18:16] Operations, WMF-Legal, Wikimedia-General-or-Unknown, Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#3616863 (bd808) With the latest updates to https://gerrit.wikimedia.org/r/#/c/183862 : ``` $ grep -v wikimedia.org CONTRIBUTORS | wc -...
[05:57:22] (CR) ArielGlenn: "I'd make a separate include file for the categoriesrdf specific dump stuff, rather like the wikidata cron jobs have a file of common funct" [puppet] - https://gerrit.wikimedia.org/r/378355 (https://phabricator.wikimedia.org/T173774) (owner: Smalyshev)
[07:02:01] Operations, HHVM: Migration of mw* servers to stretch - https://phabricator.wikimedia.org/T174431#3616910 (MoritzMuehlenhoff)
[07:10:45] PROBLEM - Check HHVM threads for leakage on mw1259 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:13:05] PROBLEM - Check HHVM threads for leakage on mw1169 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:14:36] PROBLEM - Check HHVM threads for leakage on mw1260 is CRITICAL: CRITICAL: HHVM has more than double threads running or queued than apache has busy workers
[07:17:50] Operations, ops-eqiad, DBA: db1100 crashed - https://phabricator.wikimedia.org/T175973#3616923 (jcrespo) The error on the description was on the lifecycle log. It gave the same description that you googled.
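Note on the 01:37 and 01:44 updatewikiversions.py patches logged above: the commit subjects say the plugin should keep a space between key and value and end the file with a newline, "which json.dump() does not". A minimal Python sketch of that behaviour follows. The function name `dump_wikiversions`, the `indent`, `separators` and `sort_keys` arguments, and the sample wiki names are assumptions for illustration only; the actual plugin code is not shown in this log.

```python
import io
import json


def dump_wikiversions(versions):
    """Serialize a wiki -> version mapping (hypothetical helper).

    separators=(",", ": ") keeps one space after each colon and no
    trailing space after commas; the explicit write adds the final
    newline that json.dump() itself never emits.
    """
    buf = io.StringIO()
    json.dump(versions, buf, indent=4, separators=(",", ": "), sort_keys=True)
    buf.write("\n")  # json.dump() stops right after the closing brace
    return buf.getvalue()


# Sample data: wiki names are made up; the version string is the one
# mentioned in the 02:21:50 sync-l10n log entry.
example = {"enwiki": "php-1.30.0-wmf.18", "aawiki": "php-1.30.0-wmf.18"}
text = dump_wikiversions(example)
plain = json.dumps(example)  # for comparison: no trailing newline
```

Writing the newline explicitly keeps the file friendly to diff tools and editors that expect text files to end with one.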
[07:27:59] Operations, WMF-Legal, Wikimedia-General-or-Unknown, Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#3616926 (MoritzMuehlenhoff) >>! In T67270#3616863, @bd808 wrote: > A number of the 112 non-Foundation author attributions are actually...
[07:34:23] !log kartik@tin Started deploy [cxserver/deploy@03c1eb1]: Update cxserver to 569b7b7
[07:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:27] !log kartik@tin Finished deploy [cxserver/deploy@03c1eb1]: Update cxserver to 569b7b7 (duration: 01m 04s)
[07:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:52] !log kartik@tin (no justification provided)
[07:36:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:28] Some unknown errors while deploying cxserver.
[07:38:51] akosiaris: can you look at, https://pastebin.com/YvGgZNhW
[07:50:16] !log installing bind9 regression update on trusty servers
[07:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:50:55] (PS3) Muehlenhoff: Remove salt grains used for trebuchet [puppet] - https://gerrit.wikimedia.org/r/378696
[07:52:54] OK. Looks like hit by https://phabricator.wikimedia.org/T176184
[07:56:17] Operations, Parsoid, Scap, Release-Engineering-Team (Backlog): Check 'depool' failed while deploying - https://phabricator.wikimedia.org/T176184#3616952 (KartikMistry) Also: https://phabricator.wikimedia.org/P6022 - blocking cxserver deployment.
[07:58:06] !log kartik@tin Started deploy [cxserver/deploy@03c1eb1]: Update cxserver to 569b7b7
[07:58:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:43] !log kartik@tin Finished deploy [cxserver/deploy@03c1eb1]: Update cxserver to 569b7b7 (duration: 00m 37s)
[07:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:00] !log mobrovac@tin Started deploy [cxserver/deploy@03c1eb1]: (no justification provided)
[08:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:13] !log mobrovac@tin Finished deploy [cxserver/deploy@03c1eb1]: (no justification provided) (duration: 00m 13s)
[08:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:45] (PS3) ArielGlenn: Move nfs and directory setup for dumpsdata hosts into dumps module [puppet] - https://gerrit.wikimedia.org/r/378701 (https://phabricator.wikimedia.org/T175606)
[08:23:15] RECOVERY - Check HHVM threads for leakage on mw1169 is OK: OK
[08:27:35] (PS1) Giuseppe Lavagetto: scap::conftool: fix home directory [puppet] - https://gerrit.wikimedia.org/r/378847
[08:28:29] (PS2) Giuseppe Lavagetto: scap::conftool: fix home directory [puppet] - https://gerrit.wikimedia.org/r/378847 (https://phabricator.wikimedia.org/T176184)
[08:28:51] Operations, Parsoid, Scap, Patch-For-Review, Release-Engineering-Team (Backlog): Check 'depool' failed while deploying - https://phabricator.wikimedia.org/T176184#3616972 (Joe) a:Joe
[08:29:52] Operations, Parsoid, Scap, Patch-For-Review, Release-Engineering-Team (Backlog): Check 'depool' failed while deploying - https://phabricator.wikimedia.org/T176184#3616450 (Joe) This was caused by https://gerrit.wikimedia.org/r/#/c/365891/, yet another case of a labs-specific fix breaking prod...
[08:33:17] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/7916/scb1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/378847 (https://phabricator.wikimedia.org/T176184) (owner: Giuseppe Lavagetto)
[08:38:14] (CR) Gehel: [C: 2] Bump highlighter version to 5.3.2.1 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/377983 (https://phabricator.wikimedia.org/T173231) (owner: DCausse)
[08:38:17] (CR) Gehel: [V: 2 C: 2] Bump highlighter version to 5.3.2.1 [software/elasticsearch/plugins] - https://gerrit.wikimedia.org/r/377983 (https://phabricator.wikimedia.org/T173231) (owner: DCausse)
[08:39:24] !log mobrovac@tin Started deploy [cxserver/deploy@03c1eb1]: (no justification provided)
[08:39:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:16] !log mobrovac@tin Finished deploy [cxserver/deploy@03c1eb1]: (no justification provided) (duration: 00m 52s)
[08:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:14] (CR) Muehlenhoff: Fix parsing of necessary restarts in query_restart (2 comments) [debs/debdeploy] - https://gerrit.wikimedia.org/r/376688 (owner: Muehlenhoff)
[08:42:24] (PS5) Muehlenhoff: Fix parsing of necessary restarts in query_restart [debs/debdeploy] - https://gerrit.wikimedia.org/r/376688
[08:46:39] (PS1) DCausse: Switch elasticsearch active cluster to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/378850
[08:47:06] (PS6) Muehlenhoff: Fix parsing of necessary restarts in query_restart [debs/debdeploy] - https://gerrit.wikimedia.org/r/376688
[08:47:08] (PS4) ArielGlenn: Move nfs and directory setup for dumpsdata hosts into dumps module [puppet] - https://gerrit.wikimedia.org/r/378701 (https://phabricator.wikimedia.org/T175606)
[08:47:51] (CR) Volans: [C: 1] "LGTM" [debs/debdeploy] - https://gerrit.wikimedia.org/r/376688 (owner: Muehlenhoff)
[08:48:14] (CR) jerkins-bot: [V: -1] Switch elasticsearch active cluster to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/378850 (owner: DCausse)
[08:48:27] (CR) ArielGlenn: [C: 2] Move nfs and directory setup for dumpsdata hosts into dumps module [puppet] - https://gerrit.wikimedia.org/r/378701 (https://phabricator.wikimedia.org/T175606) (owner: ArielGlenn)
[08:49:52] (PS2) DCausse: Switch elasticsearch active cluster to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/378850
[08:54:21] (CR) Muehlenhoff: [C: 2] Fix parsing of necessary restarts in query_restart [debs/debdeploy] - https://gerrit.wikimedia.org/r/376688 (owner: Muehlenhoff)
[09:05:45] (PS1) Muehlenhoff: Fix regression [debs/debdeploy] - https://gerrit.wikimedia.org/r/378852
[09:06:50] Operations, Parsoid, Scap, Release-Engineering-Team (Backlog), Services (watching): Check 'depool' failed while deploying - https://phabricator.wikimedia.org/T176184#3617004 (mobrovac) Open>Resolved Confirmed to have fixed deployments on SCB, resolving. Thank you @Joe for the quick fix!
[09:07:39] (CR) Gehel: "LGTM, we just need to wait for elasticsearch codfw to be ready." [mediawiki-config] - https://gerrit.wikimedia.org/r/378850 (owner: DCausse)
[09:19:15] (PS1) Elukey: contint::apache: disallow .htaccess usage [puppet] - https://gerrit.wikimedia.org/r/378853
[09:24:45] (PS2) Muehlenhoff: Fix regression [debs/debdeploy] - https://gerrit.wikimedia.org/r/378852
[09:25:37] (PS4) Mobrovac: [Logging config] Enable logging for updateBetaFeaturesUserCounts [mediawiki-config] - https://gerrit.wikimedia.org/r/378703 (owner: Ppchelko)
[09:25:40] (PS1) Elukey: smokeping::apache: disallow .htaccess usage [puppet] - https://gerrit.wikimedia.org/r/378854
[09:25:41] (PS1) Elukey: tendril::apache: disallow .htaccess usage [puppet] - https://gerrit.wikimedia.org/r/378855
[09:26:09] (CR) Volans: [C: 1] "LGTM! This was on me" [debs/debdeploy] - https://gerrit.wikimedia.org/r/378852 (owner: Muehlenhoff)
[09:29:06] (CR) Mobrovac: [C: 2] [Logging config] Enable logging for updateBetaFeaturesUserCounts [mediawiki-config] - https://gerrit.wikimedia.org/r/378703 (owner: Ppchelko)
[09:30:33] (CR) Muehlenhoff: [C: 2] Fix regression [debs/debdeploy] - https://gerrit.wikimedia.org/r/378852 (owner: Muehlenhoff)
[09:30:58] !log cp1062 - varnish-backend-restart, mbox lag
[09:31:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:31:30] (Merged) jenkins-bot: [Logging config] Enable logging for updateBetaFeaturesUserCounts [mediawiki-config] - https://gerrit.wikimedia.org/r/378703 (owner: Ppchelko)
[09:31:43] (CR) jenkins-bot: [Logging config] Enable logging for updateBetaFeaturesUserCounts [mediawiki-config] - https://gerrit.wikimedia.org/r/378703 (owner: Ppchelko)
[09:34:26] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Enable the updateBetaFeaturesUserCounts logging channel - T175637 (duration: 00m 50s)
[09:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:34:40] T175637: End of September milestone: Migrate first production use case - https://phabricator.wikimedia.org/T175637
[09:39:55] (PS1) Ppchelko: Revert "[Logging config] Enable logging for updateBetaFeaturesUserCounts" [mediawiki-config] - https://gerrit.wikimedia.org/r/378856
[09:40:49] !log powercycle analytics1062 - no ssh, console com2 frozen
[09:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:35] RECOVERY - Check Varnish expiry mailbox lag on cp1062 is OK: OK: expiry mailbox lag is 0
[09:42:56] !log installing libxml2 security updates on trusty (Debian already fixed)
[09:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:43:25] RECOVERY - Host analytics1062 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[09:45:20] (PS2) Ema: VCL: remove wikiScrape rate limiting [puppet] - https://gerrit.wikimedia.org/r/378732
[09:45:27] (CR) Ema: [V: 2 C: 2] VCL: remove wikiScrape rate limiting [puppet] - https://gerrit.wikimedia.org/r/378732 (owner: Ema)
[09:46:57] (PS1) Elukey: noc::apache: disallow .htaccess usage [puppet] - https://gerrit.wikimedia.org/r/378857
[09:49:35] (PS1) Elukey: librenms::apache: disallow .htaccess usage [puppet] - https://gerrit.wikimedia.org/r/378858
[09:50:52] (PS1) Jcrespo: mariadb: Setup db1101 on s2 to replace db1018 and db1036 [puppet] - https://gerrit.wikimedia.org/r/378859 (https://phabricator.wikimedia.org/T172679)
[09:51:57] Operations, HHVM: HHVM: Unknown exception - https://phabricator.wikimedia.org/T173705#3617147 (MoritzMuehlenhoff) Open>Resolved That's fixed with 2.0.14.
[09:52:52] (PS3) Elukey: site.pp: assign roles to mw1307-28 [puppet] - https://gerrit.wikimedia.org/r/377774 (https://phabricator.wikimedia.org/T165519)
[09:53:22] (PS2) Jcrespo: mariadb: Setup db1101 on s2 to replace db1018 and db1036 [puppet] - https://gerrit.wikimedia.org/r/378859 (https://phabricator.wikimedia.org/T172679)
[09:53:39] (CR) Elukey: [C: 2] site.pp: assign roles to mw1307-28 [puppet] - https://gerrit.wikimedia.org/r/377774 (https://phabricator.wikimedia.org/T165519) (owner: Elukey)
[09:53:49] new appservers coming
[09:54:58] (PS3) Jcrespo: mariadb: Setup db1101 on s2 to replace db1018 and db1036 [puppet] - https://gerrit.wikimedia.org/r/378859 (https://phabricator.wikimedia.org/T172679)
[09:56:06] \o/
[09:56:54] (PS1) Giuseppe Lavagetto: docker::baseimages: add stretch base image [puppet] - https://gerrit.wikimedia.org/r/378860
[10:04:11] (PS1) Muehlenhoff: Update debian/changelog [debs/debdeploy] - https://gerrit.wikimedia.org/r/378862
[10:04:46] (CR) Jcrespo: [C: 2] mariadb: Setup db1101 on s2 to replace db1018 and db1036 [puppet] - https://gerrit.wikimedia.org/r/378859 (https://phabricator.wikimedia.org/T172679) (owner: Jcrespo)
[10:07:48] (CR) Muehlenhoff: [C: 2] Update debian/changelog [debs/debdeploy] - https://gerrit.wikimedia.org/r/378862 (owner: Muehlenhoff)
[10:10:21] (CR) Muehlenhoff: [C: -1] "See comment for a typo, otherwise LGTM." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/378860 (owner: Giuseppe Lavagetto)
[10:11:31] (PS1) Elukey: hieradata::regex: temporary stop notifications for new mw hosts [puppet] - https://gerrit.wikimedia.org/r/378864 (https://phabricator.wikimedia.org/T165519)
[10:12:16] (CR) Giuseppe Lavagetto: docker::baseimages: add stretch base image (1 comment) [puppet] - https://gerrit.wikimedia.org/r/378860 (owner: Giuseppe Lavagetto)
[10:14:02] !log shutting down mysql on db1018
[10:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:16:01] (PS2) Giuseppe Lavagetto: docker::baseimages: add stretch base image [puppet] - https://gerrit.wikimedia.org/r/378860
[10:21:17] (CR) Filippo Giunchedi: [C: 1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/378854 (owner: Elukey)
[10:24:12] (CR) Muehlenhoff: [C: 1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/378860 (owner: Giuseppe Lavagetto)
[10:24:49] (PS3) Giuseppe Lavagetto: docker::baseimages: add stretch base image [puppet] - https://gerrit.wikimedia.org/r/378860
[10:29:08] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/7918/copper.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/378860 (owner: Giuseppe Lavagetto)
[10:29:59] (CR) Elukey: [C: 2] hieradata::regex: temporary stop notifications for new mw hosts [puppet] - https://gerrit.wikimedia.org/r/378864 (https://phabricator.wikimedia.org/T165519) (owner: Elukey)
[10:30:05] (PS2) Elukey: hieradata::regex: temporary stop notifications for new mw hosts [puppet] - https://gerrit.wikimedia.org/r/378864 (https://phabricator.wikimedia.org/T165519)
[10:32:14] _joe_ merge when you are ready (or I can do it)
[10:33:16] <_joe_> done
[10:36:10] PROBLEM - puppet last run on copper is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:37:03] (PS1) Giuseppe Lavagetto: docker::baseimages: fully qualify exec command [puppet] - https://gerrit.wikimedia.org/r/378866
[10:38:45] (CR) Giuseppe Lavagetto: [C: 2] docker::baseimages: fully qualify exec command [puppet] - https://gerrit.wikimedia.org/r/378866 (owner: Giuseppe Lavagetto)
[10:42:20] (PS1) Giuseppe Lavagetto: docker::baseimages: brown paper bag fix [puppet] - https://gerrit.wikimedia.org/r/378868
[10:44:24] (CR) Giuseppe Lavagetto: [C: 2] docker::baseimages: brown paper bag fix [puppet] - https://gerrit.wikimedia.org/r/378868 (owner: Giuseppe Lavagetto)
[10:46:20] RECOVERY - puppet last run on copper is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[10:49:30] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:49:31] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:49:31] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:49:31] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:50:11] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:50:20] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:50:20] PROBLEM - salt-minion processes on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:50:20] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:51:50] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds
[10:52:46] nrpe dead, restarting it
[10:53:20] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[10:53:21] RECOVERY - salt-minion processes on stat1005 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[10:53:21] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[10:53:34] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 19 minutes ago with 0 failures
[10:53:34] !log restarted nagios-nrpe-server on stat1005
[10:53:40] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[10:53:40] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[10:53:40] RECOVERY - DPKG on stat1005 is OK: All packages OK
[10:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:11] RECOVERY - Disk space on stat1005 is OK: DISK OK
[11:00:10] RECOVERY - Check HHVM threads for leakage on mw1260 is OK: OK
[11:11:10] RECOVERY - Check HHVM threads for leakage on mw1259 is OK: OK
[11:11:23] Operations, Office-IT: Create affcom-staff email account - https://phabricator.wikimedia.org/T176153#3617255 (Peachey88) Yes, that would be best done by Office-IT as a shared mailbox, @egalvezwmf you will need to submit this as a request in zendesk.
[11:20:29] !log kartik@tin Started deploy [cxserver/deploy@03c1eb1]: Update cxserver to 569b7b7
[11:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:21:50] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Tue 2017-09-19 11:21:45 UTC.
[11:23:11] !log kartik@tin Finished deploy [cxserver/deploy@03c1eb1]: Update cxserver to 569b7b7 (duration: 02m 42s)
[11:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:43] Operations, Parsoid, Scap, Release-Engineering-Team (Backlog), Services (watching): Check 'depool' failed while deploying - https://phabricator.wikimedia.org/T176184#3617282 (KartikMistry) Thanks @Joe and @mobrovac
[11:25:48] !log uploaded debdeploy 0.99.1 to apt.wikimedia.org (for trusty, jessie and stretch)
[11:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:56] Operations, Patch-For-Review: allow rsyncing between build host and install hosts - https://phabricator.wikimedia.org/T176178#3616206 (akosiaris) This is already implemented, albeit in the other direction. See https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/role/manif...
[11:31:29] (CR) Alexandros Kosiaris: [C: -2] "See comments in https://phabricator.wikimedia.org/T176178#3617296." [puppet] - https://gerrit.wikimedia.org/r/378810 (https://phabricator.wikimedia.org/T176178) (owner: Dzahn)
[11:32:09] (PS1) Giuseppe Lavagetto: docker::baseimages: drop exec for apt-key add [puppet] - https://gerrit.wikimedia.org/r/378872
[11:32:34] (CR) Giuseppe Lavagetto: [C: 2] docker::baseimages: drop exec for apt-key add [puppet] - https://gerrit.wikimedia.org/r/378872 (owner: Giuseppe Lavagetto)
[11:34:29] <_joe_> pfff
[11:34:39] <_joe_> what am I doing today
[11:34:40] (PS1) Giuseppe Lavagetto: docker::baseimages: another dependency fix [puppet] - https://gerrit.wikimedia.org/r/378873
[11:35:04] (CR) Giuseppe Lavagetto: [C: 2] docker::baseimages: another dependency fix [puppet] - https://gerrit.wikimedia.org/r/378873 (owner: Giuseppe Lavagetto)
[11:35:45] (PS1) Muehlenhoff: Extend library hint for libxml2 [puppet] - https://gerrit.wikimedia.org/r/378874
[11:36:18] (PS2) Muehlenhoff: Extend library hint for libxml2 [puppet] - https://gerrit.wikimedia.org/r/378874
[11:38:06] (PS1) Elukey: profile::kafka::broker: add the monitoring_enabled option [puppet] - https://gerrit.wikimedia.org/r/378876 (https://phabricator.wikimedia.org/T167992)
[11:41:07] (CR) Muehlenhoff: [C: 2] Extend library hint for libxml2 [puppet] - https://gerrit.wikimedia.org/r/378874 (owner: Muehlenhoff)
[11:43:59] (PS2) Elukey: profile::kafka::broker: add the monitoring_enabled option [puppet] - https://gerrit.wikimedia.org/r/378876 (https://phabricator.wikimedia.org/T167992)
[11:45:23] (PS1) Muehlenhoff: Extend Cumin alias to also apply to the canary [puppet] - https://gerrit.wikimedia.org/r/378878
[11:45:40] (CR) Elukey: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/7920/" [puppet] - https://gerrit.wikimedia.org/r/378876 (https://phabricator.wikimedia.org/T167992) (owner: Elukey)
[11:46:42] (PS2) Muehlenhoff: Extend Cumin alias to also apply to the canary [puppet] - https://gerrit.wikimedia.org/r/378878
[11:48:49] the last change should remove the alerts from kafka-jumbo
[11:49:05] (CR) Muehlenhoff: [C: 2] Extend Cumin alias to also apply to the canary [puppet] - https://gerrit.wikimedia.org/r/378878 (owner: Muehlenhoff)
[11:57:46] Operations, Analytics, Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3617323 (Jan_Dittrich) > You can use eventlogging and wikimediaevents code at this time , there are quite > a bit of examples of how to run ab tests on discovery's code. My concern is mainly with...
[11:57:50] Operations, Analytics, Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3617324 (Jan_Dittrich) > You can use eventlogging and wikimediaevents code at this time , there are quite > a bit of examples of how to run ab tests on discovery's code. My concern is mainly with...
[12:11:15] Operations, Epic, Goal, Services (doing), and 2 others: Services Q1 2017/18 goal: Begin migrating job queue processing to multi-DC enabled eventbus infrastructure. - https://phabricator.wikimedia.org/T169937#3617336 (Pchelolo)
[12:12:39] (PS1) Elukey: profile::kafka::broker: no-op refactoring to test ordering [puppet] - https://gerrit.wikimedia.org/r/378884
[12:13:23] (CR) Elukey: [C: 2] profile::kafka::broker: no-op refactoring to test ordering [puppet] - https://gerrit.wikimedia.org/r/378884 (owner: Elukey)
[12:15:55] (PS1) Filippo Giunchedi: base: fix remote_syslog hiera name [puppet] - https://gerrit.wikimedia.org/r/378885
[12:16:07] _joe_: ^
[12:40:25] (PS11) Zfilipin: Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264) (owner: GeoffreyT2000)
[12:49:33] (CR) Giuseppe Lavagetto: [C: 1] base: fix remote_syslog hiera name [puppet] - https://gerrit.wikimedia.org/r/378885 (owner: Filippo Giunchedi)
[12:55:06] !log rebooting kubestage1001
[12:55:18] (CR) Zfilipin: [C: 1] Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264) (owner: GeoffreyT2000)
[12:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:56:00] Operations, CirrusSearch, Discovery, Discovery-Search, and 6 others: Job queue is increasing non-stop - https://phabricator.wikimedia.org/T173710#3617363 (Joe) FWIW we're seeing another almost-incontrollable growth of jobs on commons and probably other wikis. I might decide to raise the concurren...
[12:56:25] jouncebot: next
[12:56:26] In 0 hour(s) and 3 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170919T1300)
[12:56:48] hashar: there is just one patch for swat, I can take care of it
[12:57:40] Operations, ops-codfw, fundraising-tech-ops, netops: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3617370 (Jgreen) >>! In T176175#3616476, @faidon wrote: > Ah! active/backup is safer indeed, but in my experience, I've seen many more iss...
[12:59:12] (PS1) Elukey: install_server: add partman recipe for new mw hosts [puppet] - https://gerrit.wikimedia.org/r/378902 (https://phabricator.wikimedia.org/T165519)
[12:59:19] zeljkof: roger :)
[13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: (Dis)respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170919T1300). Please do the needful.
[13:00:05] geoffreytrang: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break the wikis, you will be rewarded with a sticker.
[13:00:13] I can SWAT today!
[13:00:49] (CR) Elukey: [C: 2] install_server: add partman recipe for new mw hosts [puppet] - https://gerrit.wikimedia.org/r/378902 (https://phabricator.wikimedia.org/T165519) (owner: Elukey)
[13:00:52] hashar: hm, geoffreytrang is not around, should I wait?
[13:01:32] my partman recipe will not work the first time
[13:01:39] so I'll likely hack on install1002
[13:01:54] zeljkof: I guess you can just do it ?
[13:02:01] I usually run namespaceDupes.php before deployment
[13:02:24] then pull the patch on tin, and on terbium.eqiad.wmnet run scap pull then run namespaceDupes.php there
[13:03:16] hashar: ok, I'll deploy?
[13:03:23] yes [13:03:27] (without the question mark) [13:03:57] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264) (owner: 10GeoffreyT2000) [13:04:35] !log upgrading elasticsearch plugins on elastic2001 - T173231 [13:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:47] T173231: Wikidata Elastic search drops results with matches in different language label - https://phabricator.wikimedia.org/T173231 [13:05:41] (03Merged) 10jenkins-bot: Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264) (owner: 10GeoffreyT2000) [13:06:12] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3617381 (10Jgreen) >>! In T152562#3616435, @Ejegg wrote: > For CiviCRM code: > > PHP client library: https://github.com/Jimdo/prometheus_client_php > Metric types: http... [13:06:22] (03CR) 10jenkins-bot: Rename Wikisaurus namespace on Wiktionary to "Thesaurus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374063 (https://phabricator.wikimedia.org/T174264) (owner: 10GeoffreyT2000) [13:06:40] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/ac01f1587c972629d90f01e238c9a56d1b2f4b0ee3249a381ca0c6f23568234a/shm is not accessible: Permission denied [13:08:54] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3617382 (10elukey) @Cmjohnson I am trying to PXE boot on mw1319 but I am seeing the usual blank screen when switch ports are disabled. Whenever you have a minut... 
[13:09:41] RECOVERY - Disk space on copper is OK: DISK OK [13:12:00] (03CR) 10Filippo Giunchedi: [C: 032] base: fix remote_syslog hiera name [puppet] - 10https://gerrit.wikimedia.org/r/378885 (owner: 10Filippo Giunchedi) [13:12:05] (03PS2) 10Filippo Giunchedi: base: fix remote_syslog hiera name [puppet] - 10https://gerrit.wikimedia.org/r/378885 [13:12:34] zeljkof: sorry my computer crashes from time to time [13:12:48] hashar_: :D [13:13:47] hashar_: uh oh, sync-apaches: 95% (ok: 281; fail: 9; left: 13) [13:13:50] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/072095d1b4341a1b1fbbb69de66d7c2dac68acad41f7627932ca32ea2a2df91a/shm is not accessible: Permission denied [13:13:50] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [13:14:10] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [13:14:20] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [13:14:26] seems cp1055 [13:14:28] Cc ema [13:14:49] zeljkof: is that locked ? 
[13:14:56] there might be some network trouble going on [13:14:57] Request from [redacted] via cp3043 cp3043, Varnish XID 69534275 [13:14:58] Error: 503, Backend fetch failed at Tue, 19 Sep 2017 13:13:24 GMT [13:15:03] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:374063|Rename Wikisaurus namespace on Wiktionary to "Thesaurus" (T174264)]] (duration: 02m 56s) [13:15:04] no progress since 95% [13:15:09] oh, finished [13:15:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:17] T174264: Move namespace in the English Wiktionary: Wikisaurus → Thesaurus - https://phabricator.wikimedia.org/T174264 [13:15:18] sync-apaches: 100% (ok: 281; fail: 22; left: 0) [13:15:23] hashar_: ^ [13:15:24] ^^ I'm getting a lot of those lately [13:15:40] hashar_: I can paste the output to phab paste [13:15:50] RECOVERY - Disk space on copper is OK: DISK OK [13:15:54] it's " Host key verification failed." [13:16:09] No route to host [13:16:10] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [13:16:17] Connection timed out [13:16:28] those three errors [13:16:31] elukey: might be the new ones? 
^^^ [13:16:36] tobbycat: sadly it is a known issue :( https://phabricator.wikimedia.org/T175803 [13:16:52] volans: seems cp1055, mailbox spike and recovery [13:17:03] it should already be gone [13:17:04] !log upgrading elasticsearch plugins on elasticsearch codfw, including cold restart of the cluster - T173231 [13:17:09] elukey: I was referring to the fail: 22 for scap ;) [13:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:21] T173231: Wikidata Elastic search drops results with matches in different language label - https://phabricator.wikimedia.org/T173231 [13:17:30] volans: loals sorry, checking [13:17:40] elukey: okay, glad you're on it already [13:17:46] zeljkof: it is probably me adding the new hosts, sorry for that [13:17:55] zeljkof: can you give us a couple of hostnames if they are in the output? [13:18:13] volans, elukey: sure, there are 22 [13:18:19] pretty sure those are pooled = no and not inactive [13:18:22] since I am stupid [13:18:26] mw1313.eqiad.wmnet returned [255]: Host key verification failed.
[13:18:33] yeah [13:18:37] mw1311.eqiad.wmnet port 22: No route to host [13:18:38] it's me [13:18:40] fixing it now [13:18:46] mw1318.eqiad.wmnet port 22: Connection timed out [13:18:56] elukey: thanks, let me know when you are done [13:19:08] so I can repeat the scap sync-file [13:19:39] !log upgrading elasticsearch plugins on relforge, including cold restart of the cluster - T173231 [13:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:06] zeljkof: technically it's already ok because those are new hosts not yet installed, but better to err on the side of caution and repeat it anyway once fixed ;) [13:20:18] elukey: yup, that was cp1055 [13:20:44] volans: yes, I don't think anything will break if I repeat the deploy, and I will feel better getting all green :) [13:22:05] +1 [13:23:28] zeljkof: done [13:23:40] elukey: great, scap-ing again [13:24:20] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:24:41] * volans would have waited a couple of minutes to let confd update the dsh file... hopefully it was quick enough ;) [13:25:18] elukey, volans: still some failures (4 so far) [13:25:26] mw1313.eqiad.wmnet returned [255]: Host key verification failed. [13:25:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:25:33] zeljkof: ^^^ [13:25:50] volans: retry? 
:D [13:25:57] let's wait a couple of mins [13:26:07] 1313 is inactive [13:26:09] just checked [13:26:20] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:26:32] it's still deploying, probably waiting for 13 remaining [13:26:38] sync-apaches: 95% (ok: 281; fail: 4; left: 13) [13:26:40] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:374063|Rename Wikisaurus namespace on Wiktionary to "Thesaurus" (T174264)]] (duration: 02m 55s) [13:26:49] sync-apaches: 100% (ok: 281; fail: 17; left: 0) [13:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:55] T174264: Move namespace in the English Wiktionary: Wikisaurus → Thesaurus - https://phabricator.wikimedia.org/T174264 [13:27:19] volans, elukey: 17 failures, down from 22 :) [13:28:01] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:29:32] volans, elukey: should I wait some more, or deploy again? [13:30:00] PROBLEM - Disk space on copper is CRITICAL: DISK CRITICAL - /var/lib/docker/containers/296af4efe8f6f57432c8905b9d09a558eca6ba4214f40b1183f1d7a794976745/shm is not accessible: Permission denied [13:30:01] do you need list of errors? [13:30:08] zeljkof: checking the dsh now on tin [13:30:11] PROBLEM - PyBal backends health check on lvs2006 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2023.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2001.codfw.wmnet because of too many down! [13:30:20] PROBLEM - PyBal backends health check on lvs2003 is CRITICAL: PYBAL CRITICAL - search-https_9243 - Could not depool server elastic2005.codfw.wmnet because of too many down!: search_9200 - Could not depool server elastic2001.codfw.wmnet because of too many down! [13:30:35] gehel: ---^ [13:30:39] ^ pybal is me, forgot that check again... 
[13:30:50] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3617443 (10Jgreen) a:05Jgreen>03Papaul [13:30:50] ahh okok [13:30:53] elukey: thanks [13:31:00] zeljkof: should be ok TM [13:31:21] elukey: ok, should I try to deploy again? [13:31:29] yes please [13:31:36] deploying [13:32:25] !log zfilipin@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:374063|Rename Wikisaurus namespace on Wiktionary to "Thesaurus" (T174264)]] (duration: 00m 45s) [13:32:33] elukey: all green! :D [13:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:41] T174264: Move namespace in the English Wiktionary: Wikisaurus → Thesaurus - https://phabricator.wikimedia.org/T174264 [13:32:47] zeljkof: super! Thanks for the patience and sorry for the trouble [13:32:55] elukey: no problem, thank you for the help [13:33:20] RECOVERY - PyBal backends health check on lvs2006 is OK: PYBAL OK - All pools are healthy [13:33:21] RECOVERY - PyBal backends health check on lvs2003 is OK: PYBAL OK - All pools are healthy [13:35:21] !log EU SWAT finished [13:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:34] (03PS4) 10Muehlenhoff: Remove salt grains used for trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/378696 [13:37:43] (03CR) 10Muehlenhoff: [C: 032] Remove salt grains used for trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/378696 (owner: 10Muehlenhoff) [13:46:51] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100% [13:47:30] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0 [13:50:00] (03Abandoned) 10Ottomata: Use EventBus for recentchanges stream instead of RCStream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/369861 
(https://phabricator.wikimedia.org/T172356) (owner: 10Hashar) [13:50:30] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 37, down: 0, dormant: 0, excluded: 0, unused: 0 [13:52:00] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 86.95 ms [13:52:05] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10User-Elukey, 10User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3617503 (10Cmjohnson) @elukey done ge-6/0/9 up up mw1319 ge-6/0/10 up up mw1320 ge-6/0/11 up up mw1321 ge-6/0/12 up... [13:55:53] cmjohnson1: o/ - still not working [13:56:10] (pxeboot on mw1319) [13:56:36] (03PS1) 10Ottomata: Remove rcstream puppetization [puppet] - 10https://gerrit.wikimedia.org/r/378913 (https://phabricator.wikimedia.org/T172356) [13:56:49] (03PS12) 10Rush: WMCS: install Cumin for WMCS admins [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [13:57:48] (03CR) 10Ottomata: [C: 032] Remove rcstream puppetization [puppet] - 10https://gerrit.wikimedia.org/r/378913 (https://phabricator.wikimedia.org/T172356) (owner: 10Ottomata) [14:02:08] (03CR) 10Alexandros Kosiaris: "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/378913 (https://phabricator.wikimedia.org/T172356) (owner: 10Ottomata) [14:02:22] @elukey they're all installing now [14:04:45] (03PS1) 10Jcrespo: mariadb: Pool db1101 as new s2 host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378914 (https://phabricator.wikimedia.org/T172679) [14:09:13] (03PS1) 10Muehlenhoff: Remove trebuchet-trigger from deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/378915 [14:09:17] (03PS1) 10Jcrespo: dbtools: Add db1101 to s2 [software] - 10https://gerrit.wikimedia.org/r/378916 (https://phabricator.wikimedia.org/T172679) [14:11:30] cmjohnson1: ouch I wanted to test a recipe first [14:11:35] will check thanks ! 
[14:11:36] (03PS1) 10Muehlenhoff: Drop Hiera setting for trebuchet server in deployment-prep [puppet] - 10https://gerrit.wikimedia.org/r/378917 [14:11:54] elukey as soon as the ports were enabled they will hit the installer [14:12:09] ahhhh [14:12:29] I will just power them down from now on [14:12:32] (03PS2) 10Jcrespo: mariadb: Pool db1101 as new s2 host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378914 (https://phabricator.wikimedia.org/T172679) [14:12:55] (03CR) 10Alexandros Kosiaris: "the package will be manually purged I guess ?" [puppet] - 10https://gerrit.wikimedia.org/r/378915 (owner: 10Muehlenhoff) [14:13:49] cmjohnson1: testing mw1319, that's the only one that I'd need for now [14:14:18] (03CR) 10Muehlenhoff: "Yeah, I'll prune these manually. Wasn't really worth doing the ensure=>absent dance for such a tiny host group." [puppet] - 10https://gerrit.wikimedia.org/r/378915 (owner: 10Muehlenhoff) [14:15:00] 10Operations, 10DBA: decommission db1018 - https://phabricator.wikimedia.org/T176215#3617573 (10jcrespo) [14:15:25] elukey the others are off now [14:15:27] (03CR) 10Alexandros Kosiaris: [C: 031] Remove trebuchet-trigger from deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/378915 (owner: 10Muehlenhoff) [14:16:38] cmjohnson1: I had some issues PXE booting with mw1319, checking mw1320 now [14:20:01] https://phabricator.wikimedia.org/diffusion/ESPB/browse/master/maintenance/cleanup.php is outdated, right? [14:21:00] @elukey it's installing for me...you may need to give it a few mins [14:21:22] im out if you want to console in [14:21:34] super thanks [14:21:47] the partman recipe doesn't work so I'd need to work on it :P [14:22:00] 10Operations, 10ops-eqiad, 10DBA: db1100 crashed - https://phabricator.wikimedia.org/T175973#3617612 (10Cmjohnson) @jcrespo pulled power and reset....power on at your convenience. 
[14:25:01] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [14:27:02] (03PS1) 10RobH: changing dchan to divec [puppet] - 10https://gerrit.wikimedia.org/r/378919 (https://phabricator.wikimedia.org/T176142) [14:27:40] (03CR) 10RobH: [C: 032] changing dchan to divec [puppet] - 10https://gerrit.wikimedia.org/r/378919 (https://phabricator.wikimedia.org/T176142) (owner: 10RobH) [14:35:21] (03PS10) 10BBlack: VCL: stabilize backend storage patterns [puppet] - 10https://gerrit.wikimedia.org/r/376751 (https://phabricator.wikimedia.org/T145661) [14:36:27] (03PS1) 10Ema: 1.14.0: prometheus metrics, BGP MED, bugfixes [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/378920 (https://phabricator.wikimedia.org/T165764) [14:38:20] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3617676 (10Cmjohnson) I checked the BIOS Settings, everything is enabled, the standard boot order is correct 1, CDROM 2. FLOPPY 3. USB 4. HARD DRIVE 5. PCI SLOT 1 ETHERNET 10Gb I... [14:39:00] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:39:03] !log disabling puppet on all misc active mysql hosts [14:39:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:31] 10Operations, 10MediaWiki-Maintenance-scripts, 10MW-1.30-release-notes (WMF-deploy-2017-09-19 (1.30.0-wmf.19)): wikitech-static sync failing - https://phabricator.wikimedia.org/T176090#3617696 (10Andrew) @Reedy definitely no need to cherry-pick if this is getting pushed out today :) [14:40:13] (03CR) 10BBlack: [C: 032] VCL: stabilize backend storage patterns [puppet] - 10https://gerrit.wikimedia.org/r/376751 (https://phabricator.wikimedia.org/T145661) (owner: 10BBlack) [14:40:27] 10Operations, 10ops-eqiad, 10DBA: db1100 crashed - https://phabricator.wikimedia.org/T175973#3617700 (10jcrespo) Will do! Thanks. 
Please give me a heads up if any maintenance happens here, unless you tell me otherwise, I will put it back into production. We can put it down at any time later, but I do not wan... [14:41:20] (03CR) 10Ema: [C: 032] 1.14.0: prometheus metrics, BGP MED, bugfixes [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/378920 (https://phabricator.wikimedia.org/T165764) (owner: 10Ema) [14:42:23] (03PS2) 10Muehlenhoff: Remove trebuchet-trigger from deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/378915 [14:43:21] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/varnish/tests/text/15-x-next-is-cache-miss2pass.vtc] [14:43:32] mmh [14:44:41] (03CR) 10Muehlenhoff: [C: 032] Remove trebuchet-trigger from deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/378915 (owner: 10Muehlenhoff) [14:44:43] (03PS7) 10Filippo Giunchedi: rsyslog: add support to receive syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/369950 (https://phabricator.wikimedia.org/T136312) [14:44:44] (03PS1) 10Filippo Giunchedi: base: send syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) [14:45:04] ema: probably just puppetmaster race condition on new file creation [14:45:21] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:45:22] bblack: looks like, after a manual puppet run it went fine [14:45:33] ok [14:47:22] (03CR) 10Jcrespo: [C: 032] Add new m1 host db2078, enable firewall on all misc services [puppet] - 10https://gerrit.wikimedia.org/r/377460 (https://phabricator.wikimedia.org/T175685) (owner: 10Jcrespo) [14:47:30] (03PS8) 10Jcrespo: Add new m1 host db2078, enable firewall on all misc services [puppet] - 10https://gerrit.wikimedia.org/r/377460 (https://phabricator.wikimedia.org/T175685) [14:47:58] ^andrewbogott, 
akosiaris, this is going live [14:48:02] ok! [14:48:12] ok [14:50:11] (03PS8) 10Filippo Giunchedi: rsyslog: add support to receive syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/369950 (https://phabricator.wikimedia.org/T136312) [14:50:13] (03PS2) 10Filippo Giunchedi: base: send syslog over TLS [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) [14:50:22] as expected, m4 unaffected [14:51:05] m1 too [14:51:24] I will do all until the small glitch [14:52:31] jynus: lmk when m5 is done? [14:52:51] yes, I will log it [14:53:49] (03PS1) 10Muehlenhoff: Remove deployment::redis [puppet] - 10https://gerrit.wikimedia.org/r/378925 [14:53:52] (03PS1) 10Eevans: Upgrade Cassanra build to 3.11.0-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/378926 (https://phabricator.wikimedia.org/T160570) [14:54:11] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler02/7922/" [puppet] - 10https://gerrit.wikimedia.org/r/369950 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [14:54:37] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler03/7923/" [puppet] - 10https://gerrit.wikimedia.org/r/378922 (https://phabricator.wikimedia.org/T136312) (owner: 10Filippo Giunchedi) [14:55:00] (03CR) 10Eevans: [C: 031] "The dev environment has already been upgraded to -wmf5; This changeset allows Puppet to be re-enabled in the restbase dev." 
[puppet] - 10https://gerrit.wikimedia.org/r/378926 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [14:55:23] about to do m5, andrewbogott [14:55:31] ok, thanks [14:55:47] !log deploying firewall to m5 hosts [14:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:30] (03CR) 10Herron: [C: 032] Lists: Change zen.spamhaus.org DNSBL action from warn to drop [puppet] - 10https://gerrit.wikimedia.org/r/378697 (https://phabricator.wikimedia.org/T175878) (owner: 10Herron) [14:56:40] andrewbogott: it should be done now [14:57:16] jynus: ok, I'll run some tests... [14:58:05] it definitely switched off the network [14:59:38] https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=1505832973522&to=1505833127213&var-dc=eqiad%20prometheus%2Fops&var-server=db1009&var-port=9104 [14:59:52] will wait to do m2 [15:00:03] jynus: is m1 done ? [15:00:04] RoanKattouw and matt_flaschen: Dear deployers, time to do the RCFilters deployment deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170919T1500). [15:00:04] No patches in the queue for this window. Wheeee! [15:00:18] m1, yes, I commented above unaffected [15:00:21] jynus: yeah, seems like labcontrol1001 can't reach the db anymore [15:00:43] which ip has labcontrol? [15:00:59] can you restart the service? 
[15:01:03] I did [15:01:07] (03PS2) 10Herron: Lists: Change zen.spamhaus.org DNSBL action from warn to drop [puppet] - 10https://gerrit.wikimedia.org/r/378697 (https://phabricator.wikimedia.org/T175878) [15:01:21] 208.80.154.92 [15:01:27] jynus: ok, confirmed, noted on the etherpad [15:01:31] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [15:01:50] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0] [15:01:50] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [15:01:51] !log restart varnish-be on cp1055, 503 spike [15:02:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:46] andrewbogott: try now, I opened a whole manually [15:02:52] *hole [15:03:27] jynus: seems better [15:03:51] is there anything else that should be able to connect to that host? [15:04:04] I'm looking at instances failing atm and verifying they see recovery as well now [15:04:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0] [15:04:21] that's all trickle down issues tho [15:05:00] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnish] [15:06:00] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:06:10] jynus: labservices* also use m5 [15:06:13] andrewbogott: puppet is still broken everywhere afaict [15:06:13] tell me what you see failing for sure [15:06:24] andrewbogott: give me ips [15:06:34] labservices1001 ? [15:06:43] jynus: yes [15:06:44] 208.80.155.117 and 208.80.154.12 [15:06:44] Hey [15:07:21] those 2 done [15:07:22] andrewbogott: does nodepool use m5?
[15:07:31] I don't know [15:07:46] not as far as I know [15:08:05] 208.80.154.147 too, maybe? [15:08:10] striker? [15:08:26] I believe striker does use m5 jynus yes [15:08:44] <_joe_> bblack: ema we just had a 503 peak from text [15:08:48] <_joe_> afaict [15:09:03] yes, see above :) [15:09:08] or you mean new since that [15:09:44] no, it's still the cp1055 one [15:09:54] (03PS1) 10Herron: Revert "Lists: Change zen.spamhaus.org DNSBL action from warn to drop" [puppet] - 10https://gerrit.wikimedia.org/r/378928 [15:10:40] (03CR) 10Herron: [C: 032] Revert "Lists: Change zen.spamhaus.org DNSBL action from warn to drop" [puppet] - 10https://gerrit.wikimedia.org/r/378928 (owner: 10Herron) [15:10:52] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3617763 (10RobH) @bblack: So to confirm, if we disable the memory share, it won't pxe boot. If we enable it, it will pxe boot? [15:11:36] andrewbogott: puppet stalls on instances in Tools I am trying atm [15:12:19] it may be moving but slow due to master overload [15:12:37] I have whitelisted everthing I saw [15:12:37] jynus: I think that 10.64.20.13 and 10.64.20.25 might also need access. If they do that's a bug in openstack, but… let's try it. [15:12:41] (that's labnet1001 and 1002) [15:12:51] 10.x hosts have access by default [15:13:18] andrewbogott: so things are moving now for puppet [15:13:31] jynus: ok, nevermind then :) [15:13:32] chasemp: cool [15:13:34] let's give it a minute [15:13:40] so, what I will need [15:13:40] I am also seeing nodepool churning again [15:13:42] so maybe we're better. I'm still waiting for my test instance to settle down [15:13:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:13:53] and, there, I logged in. 
so that's all good [15:13:55] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:14:07] andrewbogott: let's do nothing for a minute to see things settle? [15:14:12] Is for you to go to db1009 and puppetize the iptables list into a network list or something [15:14:13] yep [15:14:15] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:14:42] there may be more addresses [15:14:55] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:15:00] andrewbogott: does nova-api not call into the DB directly? [15:15:01] RoanKattouw, we should also deploy the bullet one, I think. [15:15:08] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3617766 (10BBlack) I don't know, I hadn't tried re-enabling the memory sharing stuff. All I really know is the sequence of events last week was approximately: 1. It was PXE booting... [15:15:10] it uses nova-conductor as well as nova-compute? [15:15:14] chasemp: in theory all the nova services are supposed to use conductor [15:15:18] but I'm not sure if it really does [15:15:20] matt_flaschen: What bullet one? [15:15:23] * chasemp nods [15:15:49] 10Operations, 10ops-eqiad, 10DBA: db1100 crashed - https://phabricator.wikimedia.org/T175973#3617768 (10Cmjohnson) No, go ahead and put back into production. [15:16:21] RoanKattouw, fixing grouped watchlist to show filled/hollow bullets for unseen/seen. [15:16:52] https://gerrit.wikimedia.org/r/#/c/378832/ [15:16:54] Oh the 4-character patch [15:17:20] +2ed, could you add it to the wiki page for posterity? [15:17:26] RoanKattouw, yeah. [15:18:29] andrewbogott: are you taking a crack at puppetizing the fw rules we just did adhoc for m5? [15:18:32] RoanKattouw, can you also copy the watchlist pages while you're waiting for Jenkins?
https://phabricator.wikimedia.org/P6021 [15:18:39] The gadgets [15:18:52] anything else broken, that you can see? [15:19:00] chasemp: yep, will do that shortly [15:19:03] (03PS1) 10Herron: Lists: Change zen.spamhaus.org DNSBL action from warn to drop [puppet] - 10https://gerrit.wikimedia.org/r/378930 (https://phabricator.wikimedia.org/T175878) [15:19:05] jynus: so far seems ok [15:19:05] Oh yeah, that's right [15:19:18] ACKNOWLEDGEMENT - puppet last run on elastic1020 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 seconds ago with 1 failures. Failed resources (up to 3 shown): Service[elasticsearch] Gehel waiting for plugin update to re-enable the elasticsearch service [15:19:21] I saw that yesterday and planned to do it during/after the deploy but forgot [15:19:32] I will go with m2 now CC akosiaris [15:19:37] jynus: I haven't seen anything new broken but we are waiting for puppet to settle [15:19:39] ok [15:19:51] (03CR) 10Herron: [C: 032] Lists: Change zen.spamhaus.org DNSBL action from warn to drop [puppet] - 10https://gerrit.wikimedia.org/r/378930 (https://phabricator.wikimedia.org/T175878) (owner: 10Herron) [15:20:06] akosiaris: do not get worried, the problem here is not very well tracked production - labs holes [15:20:15] I got that [15:20:16] production - production should not create problems [15:20:28] except the temporary network loss [15:20:45] (03PS1) 10Muehlenhoff: Add git-fat to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/378931 [15:22:02] !log enabling firewall on db1020 (m2-master) [15:22:11] matt_flaschen: OK I did the ckb one, but now Jenkins is done [15:22:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:36] matt_flaschen: Are the GT/WM patches testable without the config changes?
[15:22:53] (03PS1) 10Ema: main.py: import etcd [debs/pybal] - 10https://gerrit.wikimedia.org/r/378933 [15:23:12] akosiaris: should be up now [15:23:27] ok, I've seen our fullstack test go through a whole cycle, so we're probably good [15:23:33] Ugh never mind Jenkins isn't done yet [15:24:01] jynus: confirmed [15:24:19] akosiaris: the firewall didn't complain [15:24:23] andrewbogott: I'm clushing out puppet runs for Tools to verify [15:24:26] RoanKattouw, yeah, enable the Beta for the first time on a fresh user. It will load RcFiltersBeta the new way. [15:24:42] jynus: yeah and a quick test on gerrit and otrs points out they work fine [15:25:04] (03PS3) 10Alexandros Kosiaris: Revert "Revert "Revert "sshd_config: Increase MaxAuthTries""" [puppet] - 10https://gerrit.wikimedia.org/r/377269 (https://phabricator.wikimedia.org/T172333) [15:25:04] and it is configured "inter 3s fall 3" [15:25:14] so it was down less than 9 seconds [15:25:17] a test rebase also worked fine.. great [15:25:26] otrs ok? [15:25:30] yes [15:25:32] (03CR) 10Mobrovac: Upgrade Cassanra build to 3.11.0-wmf5 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/378926 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [15:26:21] matt_flaschen: OK, all the wmf.18 patches are on mwdebug1002 [15:26:25] Let's split up the testing? [15:26:38] If you do the GuidedTour and WikimediaMessages ones, I can do the core ones? [15:26:56] RoanKattouw, yep [15:26:59] 10Operations, 10ops-codfw, 10fundraising-tech-ops, 10netops: connect second ethernet interface for fundraising codfw hosts - https://phabricator.wikimedia.org/T176175#3617810 (10Papaul) a:05Papaul>03Jgreen @Jgreen Complete [15:28:57] (03PS2) 10Ema: main.py: import etcd [debs/pybal] - 10https://gerrit.wikimedia.org/r/378933 [15:30:13] RoanKattouw, looks good. [15:30:35] jynus: ok etherpad updated [15:30:53] had a quick look at iegreview as well.. 
I don't have a username/password but it looks like it's working [15:31:05] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:48] OK cool [15:31:56] I'll run the scap then [15:32:06] (03CR) 10Muehlenhoff: [C: 031] "As mentioned on IRC, I didn't review the 1:1 differences to the Salt old version, but whether the control/work flow is sane and that's all" [puppet] - 10https://gerrit.wikimedia.org/r/377501 (https://phabricator.wikimedia.org/T148814) (owner: 10Volans) [15:32:44] akosiaris: thanks, we will have to do a proper cleanup of m2 at some point in the future [15:33:14] when we do the followup for stretch + 10.1 upgrade [15:33:35] !log catrope@tin Started scap: core, GuidedTour and WikimediaMessages patches for T176191, T167262 and T175765 [15:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:49] T175765: Move New Filters opt-out preference to its own section on the page - https://phabricator.wikimedia.org/T175765 [15:33:49] T167262: Server-launched guided tour can show on the wrong page if the user navigates away before full page load - https://phabricator.wikimedia.org/T167262 [15:33:49] T176191: [betalabs] Watchlist - empty markers displayed for seen/unseen changes - https://phabricator.wikimedia.org/T176191 [15:33:50] akosiaris: the firewall was important because that way new hosts get it on setup, so no more downtime for new hosts [15:36:39] 10Operations, 10OTRS: Upgrade OTRS to 5.0.23 - https://phabricator.wikimedia.org/T176221#3617848 (10akosiaris) [15:38:23] (03PS1) 10Herron: Remove mx2001 MX record from dns for OS upgrade [dns] - 10https://gerrit.wikimedia.org/r/378936 (https://phabricator.wikimedia.org/T175361) [15:38:58] (03CR) 10Ema: [C: 032] main.py: import etcd [debs/pybal] - 10https://gerrit.wikimedia.org/r/378933 (owner: 10Ema) [15:39:15] (03PS1) 10Ema: main.py: import etcd [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/378937 [15:41:40] jynus: yup, agreed. nice work btw!
[15:43:21] I would have loved to do it at the same time [15:43:29] (03CR) 10Ema: [C: 032] main.py: import etcd [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/378937 (owner: 10Ema) [15:43:51] but I am being more and more realistic with my reach these weeks [15:44:04] thanks for your help, akosiaris [15:44:12] (03CR) 10Ottomata: [C: 031] Add git-fat to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/378931 (owner: 10Muehlenhoff) [15:44:19] (03PS2) 10Eevans: Upgrade Cassandra build to 3.11.0-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/378926 (https://phabricator.wikimedia.org/T160570) [15:44:45] hey ops -- didn't we have issues with the core PHP release team prior to HHVM? my recollection is that there was a sense of incompetence there. i don't want to put that in writing, though, and maybe my memories are faulty. [15:45:29] (03CR) 10Herron: [C: 032] Remove mx2001 MX record from dns for OS upgrade [dns] - 10https://gerrit.wikimedia.org/r/378936 (https://phabricator.wikimedia.org/T175361) (owner: 10Herron) [15:46:29] Hmm weird [15:46:39] 15:44:08 Check 'Logstash Error rate for mw1276.eqiad.wmnet' failed: ERROR: 50% OVER_THRESHOLD (Avg. Error rate: Before: 0.03, After: 2.00, Threshold: 1.00) [15:46:45] Just that one backend though, the other canaries were fine [15:47:22] And scap proceeded [15:47:41] 10Operations, 10MediaWiki-Maintenance-scripts, 10MW-1.30-release-notes (WMF-deploy-2017-09-19 (1.30.0-wmf.19)): wikitech-static sync failing - https://phabricator.wikimedia.org/T176090#3617880 (10Reedy) The thing that was causing it to return null itself has been fixed. The `strval` was just an extra safe guard [15:48:25] 10Operations, 10DBA, 10Patch-For-Review: Firewall configurations for database hosts - https://phabricator.wikimedia.org/T104699#3617885 (10jcrespo) I think after the above patch, only the proxies are missing? 
[15:48:59] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3617887 (10Cmjohnson) @bblack @robh I went through and re-verified all the settings, this generation does not give an option of UEFI or Legacy like the new generations. The bios is v... [15:49:35] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:50:49] RoanKattouw: Yeah, that is known to happen. I think thcipriani changed it so it stops scap if 2 or more backends report high error rate. [15:51:00] Aha Ok [15:51:08] Yeah I am assuming it's just a fluke [15:51:13] I have fatalmonitor open just to be sure [15:51:41] yep, 2 or more backends is the new stop deploy magic [15:53:23] (03Abandoned) 10Thcipriani: CI: install docker-ce from download.docker.com [puppet] - 10https://gerrit.wikimedia.org/r/377492 (https://phabricator.wikimedia.org/T175293) (owner: 10Thcipriani) [15:54:08] !log catrope@tin Finished scap: core, GuidedTour and WikimediaMessages patches for T176191, T167262 and T175765 (duration: 20m 32s) [15:54:15] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [15:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:28] T175765: Move New Filters opt-out preference to its own section on the page - https://phabricator.wikimedia.org/T175765 [15:54:28] T167262: Server-launched guided tour can show on the wrong page if the user navigates away before full page load - https://phabricator.wikimedia.org/T167262 [15:54:29] T176191: [betalabs] Watchlist - empty markers displayed for seen/unseen changes - https://phabricator.wikimedia.org/T176191 [15:54:44] (03PS1) 10Ema: setup.py: bump version number (1.14) [debs/pybal] - 10https://gerrit.wikimedia.org/r/378938 [15:56:14] 10Operations, 10Analytics, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3617926 (10Nuria) @Jan_Dittrich : bucketing is available as part of wikimedia 
events, see an example of usage as part of serach code: https://github.com/wikimedia/mediawiki-extensions-WikimediaEvent... [15:58:01] RoanKattouw, also, the fatalmonitor filtered to that host hardly shows anything: https://logstash.wikimedia.org/goto/19b4f80e5dd3424547be879aab037839 [15:58:05] (03PS1) 10Andrew Bogott: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 [15:58:11] Right [15:58:23] RoanKattouw, maybe it is including infos. Again, filtered to that host: https://logstash.wikimedia.org/goto/6e574443cb8cee47598223c76f732827 [15:58:28] (03CR) 10jerkins-bot: [V: 04-1] add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [15:59:42] (03PS2) 10Andrew Bogott: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 [15:59:53] (03CR) 10Ema: [C: 032] setup.py: bump version number (1.14) [debs/pybal] - 10https://gerrit.wikimedia.org/r/378938 (owner: 10Ema) [16:00:03] (03PS1) 10Ema: setup.py: bump version number (1.14) [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/378942 [16:00:04] godog, moritzm, and _joe_: (Dis)respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170919T1600). Please do the needful. [16:00:04] No patches in the queue for this window. Wheeee! 
[16:00:08] (03CR) 10jerkins-bot: [V: 04-1] add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [16:00:45] (03PS3) 10Catrope: RCFilters: Enable on watchlist for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374382 (owner: 10Jforrester) [16:01:04] (03CR) 10Catrope: [C: 032] RCFilters: Enable on watchlist for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374382 (owner: 10Jforrester) [16:01:18] (03PS2) 10Catrope: Enable structured change filters by default on cawiki, frwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378264 (https://phabricator.wikimedia.org/T157642) [16:01:23] jouncebot: good bot [16:01:38] (03PS3) 10Andrew Bogott: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 [16:02:05] (03CR) 10jerkins-bot: [V: 04-1] add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [16:03:15] (03CR) 10Jcrespo: "This is cool to me (at least, the method, I haven't checked every IP), but maybe some people will not like the single role on site.pp :-)" [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [16:04:21] (03PS4) 10Andrew Bogott: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 [16:04:53] (03CR) 10Ema: [V: 032 C: 032] setup.py: bump version number (1.14) [debs/pybal] (1.14) - 10https://gerrit.wikimedia.org/r/378942 (owner: 10Ema) [16:06:53] (03CR) 10Jcrespo: [C: 032] dbtools: Add db1101 to s2 [software] - 10https://gerrit.wikimedia.org/r/378916 (https://phabricator.wikimedia.org/T172679) (owner: 10Jcrespo) [16:07:14] (03CR) 10jerkins-bot: [V: 04-1] RCFilters: Enable on watchlist for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374382 (owner: 10Jforrester) [16:07:50] (03CR) 10Jcrespo: [C: 032] mariadb: Pool db1101 as new s2 host 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/378914 (https://phabricator.wikimedia.org/T172679) (owner: 10Jcrespo) [16:08:27] (03PS4) 10Catrope: RCFilters: Enable on watchlist for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374382 (owner: 10Jforrester) [16:08:30] (03CR) 10Catrope: [C: 032] RCFilters: Enable on watchlist for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374382 (owner: 10Jforrester) [16:08:55] (03PS3) 10Catrope: Enable structured change filters by default on cawiki, frwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378264 (https://phabricator.wikimedia.org/T157642) [16:08:57] (03CR) 10jerkins-bot: [V: 04-1] Enable structured change filters by default on cawiki, frwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378264 (https://phabricator.wikimedia.org/T157642) (owner: 10Catrope) [16:10:19] (03PS3) 10Jcrespo: mariadb: Pool db1101 as new s2 host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378914 (https://phabricator.wikimedia.org/T172679) [16:10:21] (03PS1) 10Jcrespo: mariadb: Remove references to db1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378945 (https://phabricator.wikimedia.org/T176215) [16:11:37] !log pybal 1.14.0 uploaded to apt.w.o [16:11:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:40] (03PS10) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 [16:12:58] (03PS5) 10Rush: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [16:13:24] (03CR) 10Paladox: [C: 031] "This has been running successfully for the last 24 hours." [puppet] - 10https://gerrit.wikimedia.org/r/378768 (owner: 10Paladox) [16:13:55] (03CR) 10Paladox: [C: 031] "@Chad would you be able to +1 or -1 please? 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/378768 (owner: 10Paladox) [16:14:34] (03CR) 10Rush: [C: 031] add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [16:15:05] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [16:15:52] (03Merged) 10jenkins-bot: RCFilters: Enable on watchlist for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374382 (owner: 10Jforrester) [16:16:21] (03CR) 10jenkins-bot: RCFilters: Enable on watchlist for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/374382 (owner: 10Jforrester) [16:18:32] (03PS1) 10Jcrespo: mariadb: Decommission db1018 and removing references to it [puppet] - 10https://gerrit.wikimedia.org/r/378947 (https://phabricator.wikimedia.org/T176215) [16:18:35] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:18:45] (03CR) 10Catrope: [C: 032] Enable structured change filters by default on cawiki, frwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378264 (https://phabricator.wikimedia.org/T157642) (owner: 10Catrope) [16:19:18] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable structured filters on watchlist on all wikis (duration: 00m 45s) [16:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:12] (03CR) 10Jcrespo: [C: 032] mariadb: Remove references to db1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378945 (https://phabricator.wikimedia.org/T176215) (owner: 10Jcrespo) [16:22:14] (03CR) 10jenkins-bot: mariadb: Pool db1101 as new s2 host [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378914 (https://phabricator.wikimedia.org/T172679) (owner: 10Jcrespo) [16:22:17] (03PS2) 10Jcrespo: mariadb: Remove references to db1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378945 (https://phabricator.wikimedia.org/T176215) [16:23:08] (03Merged) 10jenkins-bot: Enable structured 
change filters by default on cawiki, frwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378264 (https://phabricator.wikimedia.org/T157642) (owner: 10Catrope) [16:23:25] (03CR) 10Jcrespo: [C: 032] mariadb: Decommission db1018 and removing references to it [puppet] - 10https://gerrit.wikimedia.org/r/378947 (https://phabricator.wikimedia.org/T176215) (owner: 10Jcrespo) [16:24:42] jynus: https://gerrit.wikimedia.org/r/#/c/378941/ [16:24:44] (03CR) 10jenkins-bot: Enable structured change filters by default on cawiki, frwiki and hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378264 (https://phabricator.wikimedia.org/T157642) (owner: 10Catrope) [16:24:56] 10Operations, 10OCG-General, 10Reading-Community-Engagement, 10Epic, and 3 others: [EPIC] (Proposal) Replicate core OCG features and sunset OCG service - https://phabricator.wikimedia.org/T150871#3618031 (10ovasileva) [16:26:08] (03CR) 10Zoranzoki21: [C: 031] mariadb: Remove references to db1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378945 (https://phabricator.wikimedia.org/T176215) (owner: 10Jcrespo) [16:26:14] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable RCFilters by default on cawiki, frwiki and hewiki (T157642) (duration: 00m 59s) [16:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:26] T157642: Graduate New Filters UX out of beta on Recent Changes - https://phabricator.wikimedia.org/T157642 [16:26:28] RoanKattouw: I noticed that recentchangeslinked has been giving query errors sometimes about missing a join on the page table (e.g. https://fr.wikipedia.org/wiki/Sp%C3%A9cial:Suivi_des_liens/Cat%C3%A9gorie:Bon_article?userExpLevel=unregistered%3Bnewcomer%3Blearner&hidepreviousrevisions=1&hidenewpages=1&hidecategorization=1&hideWikibase=1&hidelog=1&namespace=0&limit=50&days=7&urlversion=2&action=re [16:26:31] nder&enhanced=0 ). Do you think its RC filter related [16:26:38] Is that recent? 
[16:26:57] Yeah that's likely a regression we caused [16:27:05] bawolff: Could you file a task? [16:27:17] I'll make it UBN and get it looked at today [16:27:19] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Trying that again, got an error the first time (duration: 00m 52s) [16:27:22] I was looking at logs for something else and saw it, so I'm not sure if it's new or not [16:27:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:29] chasemp: let's disable puppet on db1009 and deploy it [16:28:57] jynus: andrewbogott is running it through the puppet compiler and updating one of the rules and then we'll roll [16:29:20] (03CR) 10Rush: [C: 04-1] "toolsadmin does not resolve to the same IP as californium" [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [16:29:23] RoanKattouw: https://phabricator.wikimedia.org/T176228 [16:29:30] Thanks [16:29:34] !log disable puppet on db1009 [16:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:51] andrewbogott: I'm already there if you want me to update... 
[16:30:01] sure, please do [16:30:11] horizon should use profile::openstack::main::horizon_host [16:30:46] ack [16:31:28] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.14 (duration: 03m 15s) [16:31:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:45] (03CR) 10Jcrespo: [V: 032 C: 032] mariadb: Remove references to db1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378945 (https://phabricator.wikimedia.org/T176215) (owner: 10Jcrespo) [16:33:03] (03CR) 10jenkins-bot: mariadb: Remove references to db1018 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378945 (https://phabricator.wikimedia.org/T176215) (owner: 10Jcrespo) [16:33:18] (03PS1) 10Elukey: partman: restore previous recipe for mw13* [puppet] - 10https://gerrit.wikimedia.org/r/378951 (https://phabricator.wikimedia.org/T165519) [16:33:37] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618086 (10RobH) It looks like these are different versions though: | <03:00:00> BCM57810 - EC:B1:D7:7B:C6:D8 MBA:v7.10.71 CCM:v7.10.71 | | <03:00:01> BCM57810 - EC:... 
[16:34:11] (03CR) 10Elukey: [C: 032] partman: restore previous recipe for mw13* [puppet] - 10https://gerrit.wikimedia.org/r/378951 (https://phabricator.wikimedia.org/T165519) (owner: 10Elukey) [16:34:16] !log demon@tin Pruned MediaWiki: 1.30.0-wmf.17 [keeping static files] (duration: 01m 18s) [16:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:54] (03PS6) 10Rush: mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [16:34:59] (03PS7) 10Rush: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [16:36:28] (03PS1) 10Giuseppe Lavagetto: Fixes to the build script: [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/378952 [16:36:30] (03PS1) 10Giuseppe Lavagetto: Makefile: make "clean" fault-tolerant [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/378953 [16:37:17] !log jynus@tin Synchronized wmf-config/db-eqiad.php: decom db1018 (duration: 01m 01s) [16:37:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:38] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] Fix container references [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/378714 (owner: 10Giuseppe Lavagetto) [16:38:11] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618104 (10BBlack) I think @Cmjohnson said before that they're at different revs because they're different pieces of hardware (onboard vs card), and those are the latest revs for eac... 
[16:38:26] !log jynus@tin Synchronized wmf-config/db-codfw.php: decom db1018 (duration: 00m 51s) [16:38:34] gerrit is being weird andrewbogott sec [16:38:36] (03PS8) 10Rush: mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [16:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:51] (03PS9) 10Rush: mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [16:40:50] chasemp: puppet compiler looks reasonable: https://puppet-compiler.wmflabs.org/compiler02/7926/db1009.eqiad.wmnet/ [16:40:55] agreed [16:40:56] http://puppet-compiler.wmflabs.org/7927/db1009.eqiad.wmnet/ [16:41:02] (last version) [16:41:11] let's try it? [16:41:12] (03PS1) 10Ottomata: Add LVS service for druid-broker [puppet] - 10https://gerrit.wikimedia.org/r/378956 (https://phabricator.wikimedia.org/T176223) [16:41:58] andrewbogott: take a last look and see if you +1 https://gerrit.wikimedia.org/r/#/c/378941/? [16:42:30] chasemp: yep! jynus we're ready to merge https://gerrit.wikimedia.org/r/#/c/378941/ now [16:42:33] (03PS1) 10Chad: group0 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378957 [16:42:37] (03CR) 10Andrew Bogott: [C: 031] mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [16:42:44] !log demon@tin Started scap: bootstrap wmf.19 [16:42:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:44] We have this UBN! 
bug that needs immediate deploy: https://phabricator.wikimedia.org/T175984 [16:44:01] (03PS1) 10Jcrespo: dbtools: Remove db1018 from s2 [software] - 10https://gerrit.wikimedia.org/r/378958 [16:44:05] thcipriani: greg-g^ [16:44:12] jynus: or if you're off for the day we can try again tomorrow, same time [16:44:17] Grrrr [16:44:18] I'm making the patches but since we don't have the morning SWAT and evening swat is 3 am in my time, can I deploy now? [16:44:31] I'm mid-scap [16:44:32] But ok [16:44:37] !log demon@tin scap aborted: bootstrap wmf.19 (duration: 01m 52s) [16:44:43] no_justification: after your deploy [16:44:47] I aborted. [16:44:49] I can wait for half an hour [16:44:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:01] Right now != half an hour :) [16:45:09] "immediate deploy" [16:45:09] !log demon@tin Started scap: bootstrap wmf.19 [16:45:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:27] sorry, long day :/ [16:45:52] andrewbogott: I am here [16:45:58] puppet is disabled on db1009 [16:46:11] deploy and check iptables on m5 on codfw (passive) [16:46:29] (03CR) 10Ottomata: Add LVS service for druid-broker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/378956 (https://phabricator.wikimedia.org/T176223) (owner: 10Ottomata) [16:46:35] jynus: what is the actual m5/codfw hostname? 
[16:46:43] (03CR) 10Andrew Bogott: [C: 032] mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378941 (owner: 10Andrew Bogott) [16:47:07] jynus: ok I'll merge https://gerrit.wikimedia.org/r/#/c/378941/ and run puppet on db1009 [16:47:10] andrewbogott: ^ [16:47:12] (the patch is merged) [16:47:16] ah [16:47:26] chasemp: wait, I think there's a different host to test on before db1009 [16:47:41] there is a codfw host I think in site.pp [16:47:53] no, 1009 puppet is disabled [16:47:58] on purpose [16:48:08] so you can see what will happen on [16:48:19] db2030 [16:48:41] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618162 (10RobH) Well, my thought about the multi function mode being set incorrectly doesn't work. I make it match all the other ports and it still doesn't pxe boot on eth0. Settin... [16:48:56] andrewbogott: hm profile::openstack::main::nova_controller [16:49:03] Could not find data item profile::openstack::main::nova_controller in [16:49:05] let me look [16:49:21] (03CR) 10Elukey: Add LVS service for druid-broker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/378956 (https://phabricator.wikimedia.org/T176223) (owner: 10Ottomata) [16:49:52] (03PS1) 10Rush: Revert "mariadb: add firewall exceptions for m5 and cloud services" [puppet] - 10https://gerrit.wikimedia.org/r/378959 [16:49:57] revert first [16:50:11] jynus: sorry, something is wacky there hiera-wise; we'll dig in [16:50:18] no problem [16:50:32] (03CR) 10Rush: [C: 032] Revert "mariadb: add firewall exceptions for m5 and cloud services" [puppet] - 10https://gerrit.wikimedia.org/r/378959 (owner: 10Rush) [16:50:38] 10Operations, 10Mail, 10OTRS: E-mails from Qualtrics to OTRS not delivered - https://phabricator.wikimedia.org/T170427#3618169 (10akosiaris) p:05Triage>03Low [16:51:09] oh I know what :) [16:51:24] I think [16:51:32] PROBLEM - puppet last run on db2030 
is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:51:59] (03CR) 10Ottomata: Add LVS service for druid-broker (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/378956 (https://phabricator.wikimedia.org/T176223) (owner: 10Ottomata) [16:53:20] andrewbogott: the short version is those keys are defined in per-site hiera (under eqiad/) and the db2030 host is in codfw and so never finds them [16:53:32] thinking on easiest option to get these rules at least persisted [16:54:39] (03PS1) 10Jcrespo: icinga: Disable notifications on db2078, enable them on db1101 [puppet] - 10https://gerrit.wikimedia.org/r/378962 (https://phabricator.wikimedia.org/T172679) [16:55:37] jynus: how late are you hoping to work today? This might be a while since we have a meeting in 5 [16:56:02] no longer than 20 CEST, that is in one hour [16:56:17] you do not actually need me, I think [16:56:28] s5 is used exclusively by you [16:56:33] I meant m5 [16:56:50] so if it breaks, it only affects you :-) [16:57:47] (03PS1) 10Rush: mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378963 [16:57:48] jynus: we'll have this sorted in a minute [16:58:06] (03PS2) 10Rush: mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378963 [16:58:11] (03CR) 10jerkins-bot: [V: 04-1] mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378963 (owner: 10Rush) [16:58:13] (03PS1) 10Muehlenhoff: Remove hack configuring eventlogging/eventlogging to use trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/378964 [16:58:23] what I mean is that I am not in a rush, and you can do it later on your own [16:58:29] (03CR) 10jerkins-bot: [V: 04-1] mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378963 (owner: 10Rush) [16:59:04] jynus: I'm out until next mon, so would rather 
merge a simple thing devoid of hiera for the moment [16:59:19] sure, that works, too [16:59:29] no_justification: has the branch cut for wmf.19 happened? [16:59:59] (03PS3) 10Rush: mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378963 [17:00:27] (03CR) 10jerkins-bot: [V: 04-1] mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378963 (owner: 10Rush) [17:00:42] (03CR) 10Ottomata: [C: 031] "1 nit, +1 otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/378964 (owner: 10Muehlenhoff) [17:00:42] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [17:00:48] Amir1: yes [17:00:57] thanks :) [17:01:29] one of these tries.. [17:01:38] (03PS4) 10Rush: mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378963 [17:02:09] curly braces [17:02:20] yeah [17:02:31] (03PS5) 10Rush: mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378963 [17:03:51] (03PS2) 10Muehlenhoff: Remove hack configuring eventlogging/eventlogging to use trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/378964 [17:04:16] andrewbogott: https://gerrit.wikimedia.org/r/#/c/378963/ [17:04:32] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 2 others: Unable to delete file on Commons (Fatal exception of type "Wikimedia\Rdbms\DBQueryError") - https://phabricator.wikimedia.org/T176185#3618250 (10Bawolff) [17:04:43] (03CR) 10Andrew Bogott: [C: 031] mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378963 (owner: 10Rush) [17:04:53] (03CR) 10Rush: [C: 032] mariadb: add firewall exceptions for m5 and cloud services [puppet] - 10https://gerrit.wikimedia.org/r/378963 (owner: 10Rush) [17:06:06] jeez ferm has an issue still [17:06:08] ok [17:07:42] RECOVERY - 
Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [17:08:43] PROBLEM - Check systemd state on db2030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:08:53] (03PS2) 10Jcrespo: icinga: Disable notifications on db2078, enable them on db1101 [puppet] - 10https://gerrit.wikimedia.org/r/378962 (https://phabricator.wikimedia.org/T172679) [17:09:34] (03CR) 10Jcrespo: [C: 032] icinga: Disable notifications on db2078, enable them on db1101 [puppet] - 10https://gerrit.wikimedia.org/r/378962 (https://phabricator.wikimedia.org/T172679) (owner: 10Jcrespo) [17:09:43] RECOVERY - Check systemd state on db2030 is OK: OK - running: The system is fully operational [17:10:57] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618266 (10RobH) So, PXE isn't working now for eth0, mac address ec:b1:d7:7b:c6:d8. This is the eth0 that is also detected in the OS, so it doesn't appear to be an issue where the B... [17:12:23] (03PS1) 10Rush: mariadb: ferm for m5 use () for multiple source hosts [puppet] - 10https://gerrit.wikimedia.org/r/378966 [17:12:47] (03CR) 10Rush: [C: 032] mariadb: ferm for m5 use () for multiple source hosts [puppet] - 10https://gerrit.wikimedia.org/r/378966 (owner: 10Rush) [17:15:43] ok andrewbogott and jynus here I go [17:15:45] on db1009 [17:15:51] afaict db2030 looks ok [17:16:04] did you compare the ips with the list I gave you? [17:16:13] !log demon@tin Finished scap: bootstrap wmf.19 (duration: 31m 04s) [17:16:15] no ip missing? 
[17:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:51] jynus: one of them (from labtest) seemed wrong, the rest are in there [17:16:55] jenkins is on strike I guess [17:17:05] andrewbogott: ok [17:17:31] we will iterate later on this, as long as the base stuff works [17:17:57] jynus: I think we are good [17:18:12] PROBLEM - puppet last run on db2030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:18:13] cool, thank you [17:18:16] https://phabricator.wikimedia.org/P6025 [17:18:19] mm [17:18:21] ^ [17:18:23] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:18:26] (03PS1) 10Ottomata: Add druid LVS svc name [dns] - 10https://gerrit.wikimedia.org/r/378967 (https://phabricator.wikimedia.org/T176223) [17:18:55] db2030...I'm running puppet now but I think that's a delayed alert or something [17:19:02] ok, cool [17:19:12] RECOVERY - puppet last run on db2030 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [17:19:19] andrewbogott: can you look at nova-fullstack? and I'll check on nodepool [17:19:28] yep [17:22:25] yeah nodepool seems in order and instances can run puppet afaict [17:23:26] chasemp: my by-hand test worked fine, instance is up and I can ssh [17:23:32] I'm still waiting for fullstack to complete a cycle [17:23:40] I'm not crazy about the hardcoded hosts but even the secondaries are there so it's not scary just lame, I'll think on best remedy [17:23:47] but I think we're good [17:23:50] great [17:23:54] tx jynus andrewbogott [17:24:00] no, thanks to you [17:24:52] chasemp: ok, that's it, fullstack cycle completed [17:24:53] let's meet! [17:26:13] 10Operations, 10ops-codfw, 10DBA: Degraded RAID on db2010 - https://phabricator.wikimedia.org/T175228#3618357 (10jcrespo) 05Open>03Resolved Let's consider this fixed and let's focus on T175685. 
[17:28:02] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [17:32:53] (03CR) 10Thcipriani: [C: 031] Add git-fat to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/378931 (owner: 10Muehlenhoff) [17:34:03] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:34:35] (03CR) 10Chad: "Didn't we talk about maybe just including git-fat in scap? It's only one file, and we can ditch the cruft we don't make use of." [puppet] - 10https://gerrit.wikimedia.org/r/378931 (owner: 10Muehlenhoff) [17:34:43] (03CR) 10Chad: "(Otherwise this is fine by me)" [puppet] - 10https://gerrit.wikimedia.org/r/378931 (owner: 10Muehlenhoff) [17:39:27] 10Operations, 10Discovery, 10Elasticsearch, 10Discovery-Search (Current work): Use SSL certificates with discovery entry for elasticsearch - https://phabricator.wikimedia.org/T162037#3618409 (10debt) [17:41:07] 10Operations, 10Patch-For-Review: allow rsyncing between build host and install hosts - https://phabricator.wikimedia.org/T176178#3618411 (10Dzahn) @akosiaris Here is what happened, that fragment you found was from me, from last time i wanted to upload a gerrit build. So this time i build it again, find my own... [17:42:41] RECOVERY - Host lvs1007 is UP: PING OK - Packet loss = 0%, RTA = 0.33 ms [17:43:23] (03CR) 10Pmiazga: "can we use the existing permissions system, and use some administrator/wmf role? TBH this hack is bit ugly." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [17:43:25] !log stopping db2010 and cloning to db2078 [17:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:21] PROBLEM - Host lvs1007 is DOWN: PING CRITICAL - Packet loss = 100% [17:47:36] 10Operations, 10Discovery-Search (Current work): Investigate I/O limits on elasticsearch servers - https://phabricator.wikimedia.org/T153083#3618422 (10debt) 05Open>03Resolved We've fixed the overall issue - so closing. [17:47:49] Amir1: I finished quite some time ago: are you going to deploy your fix right away now? [17:48:00] !log arlolra@tin Started deploy [parsoid/deploy@a01064d]: Updating Parsoid to 05a0965 [17:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:48:52] no_justification: after deploy of services [17:49:02] jenkins is very slow [17:49:25] Amir1: patch? [17:49:30] url I mean [17:50:11] https://gerrit.wikimedia.org/r/#/c/378960/ (wmf.18) https://gerrit.wikimedia.org/r/378961 (wmf.19) [17:50:25] greg-g: I think I need to do two deploys but wmf.19 is not important [17:51:09] 17 minutes isn't really "very slow" during a busy part of the day, but OK :P [17:51:38] and why the delay in merging them when this was an "immediate deploy" an hour ago? 
[17:52:02] just trying to understand where the delay came in [17:52:32] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban: Reinstall Analytics Hadoop Cluster with Debian Jessie - https://phabricator.wikimedia.org/T157807#3618446 (10Nuria) [17:54:01] !log T173464 start cirrussearch reindex of all wikis starting with zh [17:54:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:16] T173464: Re-index Chinese Wikis - https://phabricator.wikimedia.org/T173464 [17:54:20] 10Operations, 10Patch-For-Review: allow rsyncing between build host and install hosts - https://phabricator.wikimedia.org/T176178#3618460 (10Dzahn) >>! In T176178#3617296, @akosiaris wrote: > This is already implemented, albeit in the other direction. See https://phabricator.wikimedia.org/source/operations-pup... [17:54:43] 10Operations, 10Patch-For-Review: allow rsyncing between build host and install hosts - https://phabricator.wikimedia.org/T176178#3618462 (10Dzahn) 05Open>03Resolved [17:55:44] (03Abandoned) 10Dzahn: aptrepo: allow rsyncing from package build hosts [puppet] - 10https://gerrit.wikimedia.org/r/378810 (https://phabricator.wikimedia.org/T176178) (owner: 10Dzahn) [17:56:19] (03PS1) 10Smalyshev: Tag WDQS labs tole with labs project tag [puppet] - 10https://gerrit.wikimedia.org/r/378970 [17:58:31] (03PS3) 10Dzahn: cassandra: bump dev cluster version to 3.11.0-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/378926 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [17:58:36] (03PS2) 10Smalyshev: Tag WDQS labs tole with labs project tag [puppet] - 10https://gerrit.wikimedia.org/r/378970 [17:59:13] (03PS4) 10Dzahn: cassandra: bump dev cluster version to 3.11.0-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/378926 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [17:59:45] greg-g: hmm, mostly because I had to make a patch for Wikibase too [18:00:01] not one, but two (wmf.18 and wmf.19) [18:00:01] (03CR) 10Dzahn: [C: 032] cassandra: 
bump dev cluster version to 3.11.0-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/378926 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [18:00:26] !log arlolra@tin Finished deploy [parsoid/deploy@a01064d]: Updating Parsoid to 05a0965 (duration: 12m 26s) [18:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:40] I'm going next [18:01:33] urandom: ^ if you want to re-enable puppet in restbase-dev, go ahead [18:02:52] (03PS3) 10Krinkle: Enable jQuery 3 on meta.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378802 (https://phabricator.wikimedia.org/T124742) [18:05:29] 10Operations, 10Patch-For-Review: rack and setup wtp1025-1048 - https://phabricator.wikimedia.org/T165520#3618530 (10Arlolra) There're a couple problems with this change, https://gerrit.wikimedia.org/r/#/c/377966/ `/etc/dsh/group/parsoid` contains `ruthenium.eqiad.wmnet` resulting in, ``` 17:55:07 ['/usr/bi... [18:07:03] !log ladsgroup@tin Synchronized php-1.30.0-wmf.19/extensions/Wikidata/extensions/Wikibase/repo/includes/Content/EntityContent.php: Fix undoing merge operations that turned Items into a redirects, part I (T175984) (duration: 00m 50s) [18:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:15] T175984: Unable to undo a Wikidata merge - https://phabricator.wikimedia.org/T175984 [18:08:27] !log ladsgroup@tin Synchronized php-1.30.0-wmf.19/extensions/Wikidata/extensions/Wikibase/repo/tests/phpunit/includes/Content/EntityContentTest.php: Fix undoing merge operations that turned Items into a redirects, part II (T175984) (duration: 00m 50s) [18:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:01] 10Operations, 10Analytics, 10ChangeProp, 10EventBus, and 4 others: Select candidate jobs for transferring to the new infrastucture - https://phabricator.wikimedia.org/T175210#3618572 (10GWicke) I honestly don't have a strong preference between the other 
"hearted" tasks. Given that all of them are fairly lo... [18:12:19] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618576 (10RobH) Ok, so there is something wrong with lvs1007 network firmware/settings. lvs1007 boot order is missing the network device option. When selecting it in the one time... [18:12:22] (03PS1) 10Herron: Revert "Remove mx2001 MX record from dns for OS upgrade" [dns] - 10https://gerrit.wikimedia.org/r/378971 [18:12:48] (03CR) 10Herron: [C: 032] Revert "Remove mx2001 MX record from dns for OS upgrade" [dns] - 10https://gerrit.wikimedia.org/r/378971 (owner: 10Herron) [18:13:35] confirming it works [18:13:46] (on mwdebug1002) [18:13:54] syncing everywhere [18:15:37] !log ladsgroup@tin Synchronized php-1.30.0-wmf.18/extensions/Wikidata/extensions/Wikibase/repo/includes/Content/EntityContent.php: Fix undoing merge operations that turned Items into a redirects, part I (T175984) (duration: 00m 49s) [18:15:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:52] T175984: Unable to undo a Wikidata merge - https://phabricator.wikimedia.org/T175984 [18:17:02] !log ladsgroup@tin Synchronized php-1.30.0-wmf.18/extensions/Wikidata/extensions/Wikibase/repo/tests/phpunit/includes/Content/EntityContentTest.php: Fix undoing merge operations that turned Items into a redirects, part II (T175984) (duration: 00m 48s) [18:17:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:54] (03CR) 10Thcipriani: [C: 031] "Confirmed, nothing in there but deploy:* keys" [puppet] - 10https://gerrit.wikimedia.org/r/378925 (owner: 10Muehlenhoff) [18:19:54] deploy is done [18:20:03] I stay around for a while to monitor fatalmonitor [18:20:33] mutante: thanks! 
[18:25:54] !log Updated Parsoid to 05a0965 (T122965, T175101, T173384, T173061, T172896, T174977, T159894) [18:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:17] T173061: False Lint-error on Pipe Entity taken for Pipe - https://phabricator.wikimedia.org/T173061 [18:26:17] T159894: Add support to Parsoid's cite code for `responsive` parameter - https://phabricator.wikimedia.org/T159894 [18:26:17] T175101: Parsoid error pages use excessive HTML entity encoding - https://phabricator.wikimedia.org/T175101 [18:26:17] T174977: {{REVISIONID}} returns null, causing templates to be rendered in Preview mode (when using REST API) - https://phabricator.wikimedia.org/T174977 [18:26:18] T173384: Parsing error: Expecting : in parser function definiton - https://phabricator.wikimedia.org/T173384 [18:26:18] T122965: Provide a way to support HTML5 elements in browsers that don't support it and don't have JS shipped, like IE8 and below - https://phabricator.wikimedia.org/T122965 [18:26:18] T172896: Empty li item in external links list - https://phabricator.wikimedia.org/T172896 [18:26:55] jouncebot: now [18:26:55] No deployments scheduled for the next 0 hour(s) and 33 minute(s) [18:27:00] jouncebot: next [18:27:01] In 0 hour(s) and 32 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170919T1900) [18:27:53] (03PS1) 10Jcrespo: mariadb: Remove all references to db2010 on production [puppet] - 10https://gerrit.wikimedia.org/r/378976 (https://phabricator.wikimedia.org/T175685) [18:28:13] Am I the only one with testwiki page loading problems? 
37,30 s to load https://test.wikipedia.org/wiki/Tolla [18:28:55] (03CR) 10Jcrespo: [C: 032] mariadb: Remove all references to db2010 on production [puppet] - 10https://gerrit.wikimedia.org/r/378976 (https://phabricator.wikimedia.org/T175685) (owner: 10Jcrespo) [18:29:14] it's loading slowly for me too [18:35:21] !log T160570, T169940: Upgrade restbase-ng environment to Cassandra 3.11.0-wmf5 [18:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:35] T169940: End of September milestone: Start migration of production use cases. - https://phabricator.wikimedia.org/T169940 [18:35:35] T160570: Cassandra 3.x Tracking - https://phabricator.wikimedia.org/T160570 [18:37:04] (03PS2) 10Jcrespo: dbtools: Remove db1018 from s2 [software] - 10https://gerrit.wikimedia.org/r/378958 [18:37:06] (03PS1) 10Jcrespo: dbtools: Remove db2010, add db2078 from m1 dblist [software] - 10https://gerrit.wikimedia.org/r/378977 (https://phabricator.wikimedia.org/T175685) [18:37:28] (03CR) 10Jcrespo: [C: 032] dbtools: Remove db1018 from s2 [software] - 10https://gerrit.wikimedia.org/r/378958 (owner: 10Jcrespo) [18:37:46] (03CR) 10Jcrespo: [V: 032 C: 032] dbtools: Remove db2010, add db2078 from m1 dblist [software] - 10https://gerrit.wikimedia.org/r/378977 (https://phabricator.wikimedia.org/T175685) (owner: 10Jcrespo) [18:46:31] (03PS11) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 [18:50:50] (03CR) 10Ayounsi: "Note that librenms uses .htaccess:" [puppet] - 10https://gerrit.wikimedia.org/r/378858 (owner: 10Elukey) [18:52:30] !log upgrading nodejs to 6.11 on maps-test2004 for testing - T171707 [18:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:45] T171707: Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707 [18:54:16] (03PS9) 10Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - 
10https://gerrit.wikimedia.org/r/378360 [18:54:41] (03CR) 10jerkins-bot: [V: 04-1] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - 10https://gerrit.wikimedia.org/r/378360 (owner: 10Dzahn) [18:56:12] 10Operations, 10Maps-Sprint, 10Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3618868 (10Gehel) kartotherian logs (`tail-kartotherian`) after the nodejs 6.11 upgrade show errors connecting to postgresql. This might be unrelated to the upgrad... [18:58:05] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618870 (10RobH) Per @bblack's request, I've done a show config script on both lvs1007 and lvs1008 for comparison. P6026 shows both. The only difference is the virtualization is fl... [19:00:05] no_justification: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170919T1900). [19:00:05] No patches in the queue for this window. Wheeee! [19:00:42] (03PS10) 10Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - 10https://gerrit.wikimedia.org/r/378360 [19:01:14] lol is jouncebot developing AI,.. hooked up to ORES? [19:03:24] (03CR) 10Chad: [C: 032] group0 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378957 (owner: 10Chad) [19:05:04] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint, 10Traffic: Make maps active / active - https://phabricator.wikimedia.org/T162362#3618894 (10Gehel) a:03Gehel [19:05:29] (03Merged) 10jenkins-bot: group0 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378957 (owner: 10Chad) [19:05:39] mutante: Niharika is having fun with it :) [19:06:18] mutante: It's being trained on a dataset on no_justification's IRC message history. 
[19:06:19] (03CR) 10jenkins-bot: group0 to wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/378957 (owner: 10Chad) [19:06:34] Niharika: lol, that's hilarious :) [19:07:37] uh, that's not a wise idea :P as long as you have a bad words filter it'll just be snark, I guess? [19:08:38] !log demon@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to wmf.19 [19:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:49] (03PS1) 10Eevans: Upgrade restbase-ng env to Cassandra 3.11.0-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/378983 (https://phabricator.wikimedia.org/T160570) [19:09:57] 10Operations, 10Patch-For-Review: letsencrypt::cert::integrated and non-http servers - https://phabricator.wikimedia.org/T174720#3618904 (10herron) The mx cert renewal T174081 that initiated this task will need to be taken care of pretty soon. If there aren't any major objections to the standalone nginx appro... [19:10:16] (03CR) 10Dzahn: [C: 031] "http://puppet-compiler.wmflabs.org/7930/" [puppet] - 10https://gerrit.wikimedia.org/r/378360 (owner: 10Dzahn) [19:10:17] Let's just say it's not gonna go out of its way to be nice to anyone. :P [19:10:25] Niharika: uh oh :) [19:11:36] who is running "SELECT /* Wikimedia\Rdbms\Database::query */ /*+ MAX_EXECUTION_TIME(1000) */ 1 FROM recentchanges WHERE SLEEP(2000)" from terbium? [19:11:47] (03CR) 10Eevans: [C: 031] "Only the new RESTBase Cassandra cluster uses '3.x', and it has already been upgraded to -wmf5; This will just allow us to re-enable Puppet" [puppet] - 10https://gerrit.wikimedia.org/r/378983 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [19:11:52] (03CR) 10Paladox: "LGTM, will test it." 
[puppet] - 10https://gerrit.wikimedia.org/r/378360 (owner: 10Dzahn) [19:14:45] 10Operations, 10Discovery, 10Maps, 10Maps-Sprint, 10monitoring: Map caches metrics look broken - https://phabricator.wikimedia.org/T141186#3618970 (10debt) 05Open>03declined Declining this - Ganglia is already depreciated and there is a notice about it. [19:15:23] (03CR) 10Paladox: [C: 031] gerrit: fix host for TLS cert/monitoring if on slave [puppet] - 10https://gerrit.wikimedia.org/r/378360 (owner: 10Dzahn) [19:15:26] 10Operations, 10Maps-Sprint, 10Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3618974 (10Gehel) a:03Pnorman [19:15:34] (03PS2) 10Dzahn: cassandra: Upgrade restbase-ng env to Cassandra 3.11.0-wmf5 [puppet] - 10https://gerrit.wikimedia.org/r/378983 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [19:15:38] (03CR) 10Paladox: [C: 031] "passes puppet on labs." [puppet] - 10https://gerrit.wikimedia.org/r/378360 (owner: 10Dzahn) [19:15:59] (03CR) 10Dzahn: [C: 032] "yea, per "already upgraded" and "dev" was done first earlier today" [puppet] - 10https://gerrit.wikimedia.org/r/378983 (https://phabricator.wikimedia.org/T160570) (owner: 10Eevans) [19:17:17] (03CR) 10Dzahn: [C: 032] "per compiler, only really changes gerrit2001, not cobalt" [puppet] - 10https://gerrit.wikimedia.org/r/378360 (owner: 10Dzahn) [19:17:34] (03PS11) 10Dzahn: gerrit: fix host for TLS cert/monitoring if on slave [puppet] - 10https://gerrit.wikimedia.org/r/378360 [19:21:05] 10Operations, 10ops-eqiad, 10Traffic, 10netops: Upgrade BIOS/RBSU/etc on lvs1007 - https://phabricator.wikimedia.org/T167299#3618988 (10RobH) IRC update: @cmjohnson went ahead and reset bios settings to defaults, and after power cycling the server, it hasn't resolved the network device not showing in the... 
[19:28:11] PROBLEM - HTTPS on gerrit2001 is CRITICAL: SSL CRITICAL - failed to verify gerrit.wikimedia.org against gerrit-slave.wikimedia.org [19:28:20] ^ me.. my change is about fixing that [19:28:37] (not cobalt the active server) [19:29:03] https://gerrit-slave.wikimedia.org/r/ [19:29:03] heh [19:29:04] (03CR) 10Ottomata: [C: 031] Remove hack configuring eventlogging/eventlogging to use trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/378964 (owner: 10Muehlenhoff) [19:29:13] mutante it's accessible :) [19:29:21] paladox: that's correct :) [19:29:29] :) [19:31:23] (03PS12) 10Paladox: Gerrit: Use systemd::service for systemd [puppet] - 10https://gerrit.wikimedia.org/r/378768 [19:31:31] (03Draft1) 10Paladox: Revert "Add symlinks for Debian-packaged Bouncycastle Jars" [puppet] - 10https://gerrit.wikimedia.org/r/350446 [19:31:34] (03Draft2) 10Paladox: Revert "Add symlinks for Debian-packaged Bouncycastle Jars" [puppet] - 10https://gerrit.wikimedia.org/r/350446 [19:31:37] (03Draft3) 10Paladox: Revert "Add symlinks for Debian-packaged Bouncycastle Jars" [puppet] - 10https://gerrit.wikimedia.org/r/350446 [19:33:17] RECOVERY - HTTPS on gerrit2001 is OK: SSL OK - Certificate gerrit-slave.wikimedia.org valid until 2017-12-18 18:25:00 +0000 (expires in 89 days) [19:33:28] ah :) nice [19:33:39] i was starting to think i need a follow-up change.. but no [19:33:57] no_justification: ^ that stuff is fixed now [19:34:10] gerrit2001 has proper cert and ServerAlias etc [19:34:50] (03CR) 10Gergő Tisza: "There is no such thing as a wmf role.
I could limit to staff but few people have that right (https://meta.wikimedia.org/wiki/Special:Globa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/377929 (https://phabricator.wikimedia.org/T175868) (owner: 10Gergő Tisza) [19:35:48] (03CR) 10Dzahn: "no-op on cobalt - fixed things on gerrit2001" [puppet] - 10https://gerrit.wikimedia.org/r/378360 (owner: 10Dzahn) [19:36:40] 10Operations, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): labvirt1015 crashes - https://phabricator.wikimedia.org/T171473#3619052 (10Cmjohnson) 05Open>03Resolved Resolving this, if you have any further issues please reopen the task [19:38:01] mutante: woot! [19:39:25] 10Operations, 10Maps-Sprint, 10Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3619064 (10Pnorman) The last error is ``` {"name":"kartotherian","hostname":"maps-test2004","pid":124,"level":50,"levelPath":"error","msg":"geoshapes support fail... [19:39:49] Still serving 503 for now, but at least we have a secure connection to it :) [19:40:41] (03PS4) 10Andrew Bogott: fullstack: add a 'success' stat [puppet] - 10https://gerrit.wikimedia.org/r/378175 [19:41:40] (03CR) 10Andrew Bogott: [C: 032] fullstack: add a 'success' stat [puppet] - 10https://gerrit.wikimedia.org/r/378175 (owner: 10Andrew Bogott) [19:41:58] no_justification: yea, but a nice message on it. 
to make it even better we should say "this is the slave" or something :) [19:42:56] (03CR) 10Krinkle: [C: 031] Remove hack configuring eventlogging/eventlogging to use trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/378964 (owner: 10Muehlenhoff) [19:46:05] different error message content based on "if $slave" [19:49:13] !log bblack@neodymium conftool action : set/pooled=no; selector: cluster=cache_text,name=cp4008.ulsfo.wmnet [19:49:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:54] (03PS1) 10Madhuvishy: device_backup: Update cron MAILTo to new mailing list address [puppet] - 10https://gerrit.wikimedia.org/r/378989 (https://phabricator.wikimedia.org/T168480) [19:49:59] !log bblack@neodymium conftool action : set/pooled=no; selector: cluster=cache_upload,name=cp4005.ulsfo.wmnet [19:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:52:51] (03PS1) 10Ottomata: Set druid.processing.numMergeBuffers: 10 for druid broker and historical [puppet] - 10https://gerrit.wikimedia.org/r/378992 [19:53:46] (03PS2) 10Ottomata: Set druid.processing.numMergeBuffers: 10 for druid broker and historical [puppet] - 10https://gerrit.wikimedia.org/r/378992 [19:55:23] 10Operations, 10Maps-Sprint, 10Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3619176 (10Pnorman) Everything tests okay to me. I checked serving tiles on 6533 and regenerating tiles through tileratorui. Even though everything tests okay, I... [19:59:21] (03CR) 10Joal: "Code duplication I think." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/378992 (owner: 10Ottomata) [19:59:49] (03PS1) 10BBlack: Revert "depool codfw front edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/378994 [19:59:53] (03PS2) 10BBlack: Revert "depool codfw front edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/378994 [19:59:59] (03CR) 10Ottomata: [C: 032] Set druid.processing.numMergeBuffers: 10 for druid broker and historical [puppet] - 10https://gerrit.wikimedia.org/r/378992 (owner: 10Ottomata) [20:00:01] (03CR) 10BBlack: [V: 032 C: 032] Revert "depool codfw front edge traffic" [dns] - 10https://gerrit.wikimedia.org/r/378994 (owner: 10BBlack) [20:00:05] (03CR) 10Joal: [C: 031] "Good for me! Thanks @ottomata" [puppet] - 10https://gerrit.wikimedia.org/r/378992 (owner: 10Ottomata) [20:00:08] (03PS3) 10Ottomata: Set druid.processing.numMergeBuffers: 10 for druid broker and historical [puppet] - 10https://gerrit.wikimedia.org/r/378992 [20:00:11] (03CR) 10Ottomata: [V: 032 C: 032] Set druid.processing.numMergeBuffers: 10 for druid broker and historical [puppet] - 10https://gerrit.wikimedia.org/r/378992 (owner: 10Ottomata) [20:03:56] !log bblack@neodymium conftool action : set/pooled=no; selector: cluster=cache_text,name=cp4016.ulsfo.wmnet [20:03:59] !log bblack@neodymium conftool action : set/pooled=no; selector: cluster=cache_upload,name=cp4013.ulsfo.wmnet [20:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:27] (03Abandoned) 10Ottomata: Genericize ca-manager [puppet] - 10https://gerrit.wikimedia.org/r/355782 (https://phabricator.wikimedia.org/T166167) (owner: 10Ottomata) [20:14:34] 10Operations, 10Analytics-Kanban, 10Discovery, 10Discovery-Analysis (Current work), and 2 others: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#3619208 (10debt) [20:14:38] PROBLEM - Ulsfo HTTP 
5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [20:15:07] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [20:15:20] Request from I via cp4027 cp4027, Varnish XID 1029701701 Error: 503, Backend fetch failed at Tue, 19 Sep 2017 20:12:44 GMT [20:16:30] rxy: thanks, we're investigating [20:16:49] ema: thx [20:22:49] 10Operations, 10Fundraising-Backlog, 10fundraising-tech-ops: Port fundraising stats off Ganglia - https://phabricator.wikimedia.org/T152562#3619242 (10cwdent) [20:24:29] mutante: thanks again! [20:28:42] 10Operations, 10Maps-Sprint, 10Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3473775 (10Yurik) @Pnorman which sources config file are you using? [20:29:43] 10Operations, 10Analytics, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#3619277 (10Tbayer) >>! In T135762#3617324, @Jan_Dittrich wrote: >> You can use eventlogging and wikimediaevents code at this time , there are quite >> a bit of examples of how to run ab tests on dis... [20:33:54] rxy: the situation should be back to normal now, thanks again! 
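[editor's note] The "5xx reqs/min" alerts above report the share of recent datapoints that exceed a threshold; 44.44% is what 4 of 9 samples would give, and 30.00% is 3 of 10. The following is only a rough Python sketch of that arithmetic, not the actual check used here. The function name and the sample values are made up for illustration, and how the real check windows or classifies its data is not shown in the log.

```python
def percent_above(datapoints, threshold):
    """Return the percentage of non-null datapoints strictly above threshold."""
    # A time series may contain gaps (None); they are skipped, not counted as 0.
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    over = sum(1 for v in values if v > threshold)
    return round(100.0 * over / len(values), 2)


# Hypothetical per-minute 5xx counts: 4 of the 9 samples exceed 1000.
samples = [1200, 1500, 1100, 1300, 40, 35, 20, 15, 10]
print(f"{percent_above(samples, 1000.0):.2f}% of data above the critical threshold [1000.0]")
```

Because the percentage is computed over a window, a spike can keep the check in PROBLEM for a few intervals after the underlying errors have stopped.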
[20:35:36] (03CR) 10Krinkle: VCL: stabilize backend storage patterns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/376751 (https://phabricator.wikimedia.org/T145661) (owner: 10BBlack) [20:36:19] (03PS3) 10Gehel: Tag WDQS labs tole with labs project tag [puppet] - 10https://gerrit.wikimedia.org/r/378970 (owner: 10Smalyshev) [20:36:40] (03PS1) 10Ottomata: [WIP] Allow admin module to ensure system user membership in managed groups [puppet] - 10https://gerrit.wikimedia.org/r/379004 (https://phabricator.wikimedia.org/T174465) [20:36:52] (03CR) 10Gehel: [C: 032] Tag WDQS labs tole with labs project tag [puppet] - 10https://gerrit.wikimedia.org/r/378970 (owner: 10Smalyshev) [20:37:10] (03CR) 10Ottomata: "Chase, let me know what you think! Is this way off?" [puppet] - 10https://gerrit.wikimedia.org/r/379004 (https://phabricator.wikimedia.org/T174465) (owner: 10Ottomata) [20:39:07] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:39:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:41:13] I think the continuing "PROBLEM" there is just the strange aftermath effects of that 5xx check [20:41:27] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:41:40] (03PS1) 1020after4: Phabricator: configure notification server [puppet] - 10https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) [20:42:09] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: configure notification server [puppet] - 10https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) (owner: 1020after4) [20:43:50] (03PS2) 1020after4: Phabricator: configure notification server [puppet] - 10https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) [20:44:07] 10Operations, 10Analytics-Kanban, 10Patch-For-Review: Puppet admin module should support adding
system users to managed groups - https://phabricator.wikimedia.org/T174465#3619321 (10Ottomata) [20:44:15] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: configure notification server [puppet] - 10https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) (owner: 1020after4) [20:46:02] (03PS3) 1020after4: Phabricator: configure notification server [puppet] - 10https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) [20:46:26] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: configure notification server [puppet] - 10https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) (owner: 1020after4) [20:48:26] 10Operations, 10Phabricator, 10Traffic, 10Zero: Missing IP addresses for Maroc Telecom - https://phabricator.wikimedia.org/T174342#3619326 (10Dispenser) The account was registered on August 15 and blocked on August 23. According to [[https://meta.wikimedia.org/wiki/CheckUser_policy|CheckUser policy]] this... [20:50:47] (03PS4) 1020after4: Phabricator: configure notification server [puppet] - 10https://gerrit.wikimedia.org/r/379005 (https://phabricator.wikimedia.org/T765) [20:55:58] 10Operations, 10Phabricator, 10Traffic, 10Patch-For-Review: Phabricator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#3619353 (10mmodell) >>! In T112765#2509512, @BBlack wrote: > There's a little bit of refactoring work (already in-progress) to do on the Varnis... 
[21:00:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [21:01:00] jouncebot, now [21:01:01] For the next 1 hour(s) and 58 minute(s): ip_changes saga (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170919T2100) [21:01:11] musikanimal, woo [21:02:00] 👍 [21:02:47] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [21:06:28] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [1000.0] [21:06:43] !log cp4009: varnish-be restart, 503 spike [21:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:47] 10Operations, 10Release Pipeline, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): Provision Docker >= 17.05 on contint1001 - https://phabricator.wikimedia.org/T175293#3589181 (10hashar) [21:09:26] (03CR) 10Chad: "Looks good I guess. As I said I don't really know systemd but I'll trust it's been tested and let others ok it :)" [puppet] - 10https://gerrit.wikimedia.org/r/378768 (owner: 10Paladox) [21:12:24] mutante: Yeah +1 to making the error message more useful. 
A) We don't actually use the planned downtime version [too much overhead] and B) The slave message seems like an outage (although it should actually just work as read-only) [21:15:55] !log maxsem@tin Synchronized php-1.30.0-wmf.19/maintenance/populateIpChanges.php: https://gerrit.wikimedia.org/r/#/c/379057/1 (duration: 00m 49s) [21:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:16:18] (03PS13) 10Andrew Bogott: WMCS: install Cumin for WMCS admins [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [21:16:56] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3619453 (10Tgr) [21:18:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:18:40] !log re-ran populateIpChanges.php on group0 wikis [21:18:51] (03CR) 10Andrew Bogott: [C: 032] WMCS: install Cumin for WMCS admins [puppet] - 10https://gerrit.wikimedia.org/r/377787 (https://phabricator.wikimedia.org/T175712) (owner: 10Volans) [21:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:47] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:21:25] 10Operations, 10Traffic: cp1066 unexplained 503 spikes - https://phabricator.wikimedia.org/T175319#3619481 (10BBlack) Going to repool this today on the assumption it was genuinely part of T175803 [21:21:42] musikanimal, don't see any stray revisions on mw or testwiki [21:22:03] that's group0 right? 
[21:22:25] !log bblack@neodymium conftool action : set/pooled=yes; selector: cluster=cache_text,dc=eqiad,name=cp1066.* [21:22:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:39] yup [21:22:46] that version of the script didn't hard code the batch size, so I think it correctly copied all the revisions [21:23:37] !log bblack@neodymium conftool action : set/pooled=yes; selector: cluster=cache_text,name=cp4016.ulsfo.wmnet [21:23:48] !log bblack@neodymium conftool action : set/pooled=yes; selector: cluster=cache_text,name=cp4008.ulsfo.wmnet [21:23:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:04] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3619506 (10Tgr) [21:24:12] I don't think enwikivoyage is that huge (considering), we might run it just on that one to see if rev_id 1507693 shows up [21:24:25] if it does give it a go for all of group1 [21:25:47] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [21:25:51] group0 should be fine but for group1 wikis you'll have to use --force [21:27:02] well, I had ho use it for group0 too, right? ;) [21:27:08] s/ho/to/ [21:27:24] !log on terbium: mwscript populateIpChanges.php --wiki=enwikivoyage --force [21:27:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:42] 18469 IP revisions copied. 
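[editor's note] The exchange above is about a maintenance script that copies IP (anonymous) revisions into ip_changes in batches, and about checking afterwards whether one specific rev_id made it across. Below is a loose Python/SQLite sketch of that general pattern, not the populateIpChanges.php code itself. The two-table schema is a simplified stand-in, "user id 0 means an IP edit" is an assumption, the helper names are hypothetical, and only IPv4 addresses are handled.

```python
import ipaddress
import sqlite3


def ipv4_to_hex(ip_text):
    """Hex-encode an IPv4 address as 8 uppercase hex digits (IPv6 is not handled)."""
    return format(int(ipaddress.IPv4Address(ip_text)), "08X")


def copy_ip_revisions(conn, batch_size=200):
    """Copy anonymous revisions into ip_changes by rev_id range; return rows inserted."""
    (max_id,) = conn.execute("SELECT COALESCE(MAX(rev_id), 0) FROM revision").fetchone()
    before = conn.total_changes
    start = 1
    while start <= max_id:
        end = start + batch_size - 1
        rows = conn.execute(
            "SELECT rev_id, rev_timestamp, rev_user_text FROM revision "
            "WHERE rev_id BETWEEN ? AND ? AND rev_user = 0",
            (start, end),
        ).fetchall()
        # OR IGNORE skips rows whose ipc_rev_id already exists, so a re-run is harmless.
        conn.executemany(
            "INSERT OR IGNORE INTO ip_changes (ipc_rev_id, ipc_rev_timestamp, ipc_hex) "
            "VALUES (?, ?, ?)",
            [(rev_id, ts, ipv4_to_hex(ip)) for rev_id, ts, ip in rows],
        )
        conn.commit()
        start = end + 1
    return conn.total_changes - before


def make_demo_db():
    """Build an in-memory database with a few made-up revisions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_timestamp TEXT,"
        " rev_user INTEGER, rev_user_text TEXT);"
        "CREATE TABLE ip_changes (ipc_rev_id INTEGER PRIMARY KEY,"
        " ipc_rev_timestamp TEXT NOT NULL, ipc_hex TEXT NOT NULL);"
    )
    conn.executemany(
        "INSERT INTO revision VALUES (?, ?, ?, ?)",
        [
            (1, "20170919000001", 0, "10.0.0.1"),
            (2, "20170919000002", 42, "SomeUser"),
            (3, "20170919000003", 0, "192.0.2.7"),
            (5, "20170919000005", 0, "198.51.100.20"),
        ],
    )
    conn.commit()
    return conn


demo = make_demo_db()
print(copy_ip_revisions(demo, batch_size=2), "IP revisions copied.")
```

Checking a single row afterwards, as done in the chat, is then a plain `SELECT * FROM ip_changes WHERE ipc_rev_id = ?` against the same connection.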
[21:27:50] woah [21:28:16] this sounds suspiciously fast [21:28:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [21:28:27] real 0m3.546 [21:28:33] `SELECT * FROM ip_changes WHERE ipc_rev_id = 1507693` => empty set :( [21:30:39] so maybe it's not the mBatchSize thing [21:30:44] are you on research replica? [21:31:16] I wonder if it could be lagged [21:31:34] yeah, it's usually in sync, but lemme try prod [21:32:26] it's not there on db1078 [21:33:17] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [21:33:53] MaxSem: and we can definitely mass-copy rows like this? https://phabricator.wikimedia.org/source/mediawiki/browse/master/maintenance/populateIpChanges.php;617d05ae4c195d5edfb14ab21286234e8e79d888$113 [21:34:14] cuz I've never done it like that before. That change happened after we ran it on group0 [21:34:18] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:35:02] we didn't get an SQL error, right? [21:35:57] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:38:05] don't think so, but it is doing IGNORE [21:40:49] ignore is for duplicate keys, not syntax errors [21:41:49] that's what I thought [21:42:09] in any case, try the following in eval.php: $db=wfGetDB(DB_MASTER); $db->query('create temporary table foo (bar text, baz text)'); $db->insert('foo', [['bar'=>'as', 'baz'=>'df'], ['bar'=>'gh', 'baz'=>'jk']]); $res = $db->query('select * from foo'); foreach ( $res as $row ) { var_dump($row); } [21:43:20] on a dev wiki, of course :) [21:44:03] so that's not it either [21:44:08] (03PS1) 10Ayounsi: Adding a list of bogus ifNames [puppet] - 10https://gerrit.wikimedia.org/r/379122 [21:44:56] musikanimal, huh [21:45:06] A database query error has occurred.
Did you forget to run your application's database schema updater after upgrading? [21:45:06] Query: INSERT INTO `ip_changes` (ipc_rev_id,ipc_rev_timestamp,ipc_hex) VALUES (NULL,'20050427185943','A666886F') [21:45:06] Function: PageArchive::undeleteRevisions [21:45:06] Error: 1048 Column 'ipc_rev_id' cannot be null (10.64.16.77) [21:45:49] uh oh, is that in production? [21:45:49] that's from enwiki prod [21:46:02] oh dear [21:46:05] namely, from Special:Undelete [21:46:18] forget the script :P [21:46:18] PageArchiveTest covered that workflow [21:46:38] (PS13) Paladox: Gerrit: Use systemd::service for systemd [puppet] - https://gerrit.wikimedia.org/r/378768 [21:47:20] (PS14) Paladox: Gerrit: Use systemd::service for systemd [puppet] - https://gerrit.wikimedia.org/r/378768 [21:47:25] err, I guess PageArchiveTest didn't cover undeleteRevisions, specifically, just undelete [21:47:28] (PS15) Paladox: Gerrit: Use systemd::service for systemd [puppet] - https://gerrit.wikimedia.org/r/378768 [21:47:48] ACKNOWLEDGEMENT - Router interfaces on pfw3-codfw is CRITICAL: CRITICAL: host 208.80.153.197, interfaces up: 68, down: 1, dormant: 0, excluded: 0, unused: 0: Ayounsi https://gerrit.wikimedia.org/r/#/c/379122/ [21:48:13] which calls undeleteRevisions [21:49:18] (CR) EBernhardson: [C: 1] Switch elasticsearch active cluster to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/378850 (owner: DCausse) [21:49:38] umm this should be ar_rev_user === 0 https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/page/PageArchive.php;617d05ae4c195d5edfb14ab21286234e8e79d888$740 [21:49:42] not ar_rev_id!!! 
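A minimal sketch of the guard being discussed above, not the actual MediaWiki code: an archived row would only be copied to `ip_changes` when the edit was anonymous (user ID 0) and a revision ID is present, which avoids the `ipc_rev_id` NULL insert seen in the error. The function names and the `user_id` / `rev_id` keys are hypothetical, and the hex helper assumes IPv4 only (it produces 8 uppercase hex digits, the same shape as the `ipc_hex` value in the failing query).

```python
import ipaddress


def ipv4_to_hex(ip: str) -> str:
    """Return an IPv4 address as 8 uppercase hex digits (illustrative helper)."""
    return format(int(ipaddress.IPv4Address(ip)), "08X")


def should_copy_to_ip_changes(row: dict) -> bool:
    """Hypothetical guard: copy only anonymous edits that have a revision ID."""
    is_anonymous = row.get("user_id") == 0          # check the user, not the rev ID
    has_rev_id = row.get("rev_id") is not None      # a NULL rev ID cannot be inserted
    return is_anonymous and has_rev_id
```

With this shape, a row with `user_id` 0 and no `rev_id` is skipped instead of triggering error 1048.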
[21:51:31] !log disabled puppet on labpuppetmaster1001, testing cumin T175712 [21:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:51:45] T175712: Install cumin in the WMCS infrastructure - https://phabricator.wikimedia.org/T175712 [21:55:02] (CR) Dzahn: "This introduces a config change as well:" [puppet] - https://gerrit.wikimedia.org/r/378768 (owner: Paladox) [21:55:34] no_justification, just letting you know that ^^ also increases bin/gerrit.sh limit for open files. though we have the same limit on the systemd script [21:55:58] it just applies it to bin/gerrit.sh too, but should not really affect anything unless we use the init script or call the shell script directly. [21:56:01] Is that ok? [21:56:08] PROBLEM - SSH access on gerrit2001 is CRITICAL: connect to address 208.80.153.106 and port 29418: Connection refused [21:56:30] eh, that's cause i restarted the service ? [21:56:48] but it was active (running) .. so what [21:58:26] Operations, fundraising-tech-ops, netops, Patch-For-Review: Move codfw frack to new infra - https://phabricator.wikimedia.org/T171970#3619622 (ayounsi) Open>Resolved > The fix has been committed on version 15.1X49-D110 which currently is expected to be released on 13th of Sept 2017. Firew... [21:59:21] paladox: yeah let's be consistent here [21:59:27] that is no good [21:59:32] why did the ssh service die [21:59:43] just because i restarted the gerrit service.. which is still shown as active [21:59:51] (all on 2001) [21:59:57] i think we can do systemctl restart gerrit? [22:00:20] that looked as if you were talking about the year, mutante xD [22:00:21] that's what i did [22:00:38] hmm [22:00:39] well, NOW it's failed.. 
wth [22:00:46] look in the logs [22:00:56] /var/lib/gerrit2/review_site/logs/error_log [22:00:57] (Result: exit-code) [22:01:03] those are the best results [22:01:09] lol [22:01:56] 4 [2017-09-19 21:53:50,802] [ShutdownCallback] INFO com.google.gerrit.sshd.SshDaemon : Stopped Gerrit SSHD [22:02:04] well. it informs me that it stopped [22:02:33] does it say why? [22:02:45] !log gerrit2001 - systemctl start gerrit [22:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:28] :\ [22:07:04] did that fix your issue mutante? [22:08:17] gerrit is running, I dunno about :29418 though [22:09:13] Last thing in error_log was about catching shutdown [22:11:08] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:11:58] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [22:12:03] I guess give this a try [22:12:07] bin/gerrit.sh run [22:12:50] did we do scap deploy to gerrit2001? [22:15:40] !log gerrit2001 /bin/sh /var/lib/gerrit2/review_site/bin/gerrit.sh start [22:15:44] sorry, got distracted [22:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:58] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [22:16:37] mutante heh i meant bin/gerrit.sh run [22:16:41] different syntax :) [22:16:47] run != start ? [22:16:56] but .. it recovered [22:17:07] Starting Gerrit Code Review: FAILED [22:17:08] lolwut [22:17:30] "run" tells me "Already Running!!" :) [22:17:46] Do we still have the dumb initd script too? 
[22:17:54] We hit this last time [22:18:42] stopping it with the init script [22:18:44] yes [22:18:48] Operations, TemplateStyles, Traffic, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3619784 (Anomie) [22:18:58] Stop with init.d -- remove init.d -- start with sysctl [22:19:01] starts it with the init.d script [22:19:10] just to see if the 29418 comes back [22:19:12] This is what blocked us last time :) [22:19:13] ok, will do that next [22:19:34] Which makes me wonder if the icinga check is too strict and only matches on one of them? /me shrugs [22:19:47] jouncebot: refresh [22:19:50] I refreshed my knowledge about deployments. [22:19:50] there was actually nothing listening on 29418 when i looked [22:19:57] jouncebot: next [22:19:59] In 0 hour(s) and 40 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170919T2300) [22:20:29] it fails to start with the old way [22:20:32] I wonder if the "only listen on X port" is to blame? [22:20:43] er s/port/IP/ [22:20:52] Is it trying to listen on the master's IP? [22:20:54] * no_justification guessing [22:20:56] uh [22:20:59] * paladox checks [22:21:35] looks like a no [22:21:36] https://github.com/wikimedia/puppet/blob/production/modules/gerrit/templates/gerrit.config.erb#L169 [22:21:37] *:port [22:21:45] Ok, no biggie then [22:21:49] * is fine for now [22:22:14] !log gerrit2001 - stopped gerrit with init.d script. moved init.d script to /root, started with systemctl [22:22:22] :) [22:22:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:41] Process: 31891 ExecStart=/var/lib/gerrit2/review_site/bin/gerrit.sh start (code=exited, status=1/FAILURE) [22:23:02] Hmmmm [22:23:35] is it at the same deploy status as cobalt? 
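The "nothing listening on 29418" observation above can be checked without netstat. A minimal sketch, assuming only that the service accepts plain TCP connections; the helper name `is_port_open` is illustrative and not part of any Gerrit or icinga tooling.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run on the host itself, `is_port_open("localhost", 29418)` would be expected to return False while the SSH daemon is down. A successful connect only shows that something is listening, not that it is Gerrit.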
[22:23:59] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [22:24:14] ls -la /var/lib/gerrit2/review_site/bin/gerrit.sh [22:24:38] Operations, TemplateStyles, Traffic, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#3619820 (Anomie) [22:24:55] -rwxr-xr-x 1 gerrit2 gerrit2 15414 May 2 22:20 /var/lib/gerrit2/review_site/bin/gerrit.sh [22:24:58] ? [22:25:12] mutante: Eh, with --slave [22:25:37] GerritCodeReview -Xloggc:/srv/gerrit/jvmlogs/jvm_gc.%p.log -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=2M -Xmx20g -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site --slave --run-id=1505858540.25634 [22:25:41] Was before you killed it ^ [22:26:01] oh [22:26:04] oh heh, so the unit file needs to have different content "if slave" [22:26:06] right [22:26:08] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [22:26:11] ah [22:26:22] if it needs a different start command, yea [22:26:24] Yes [22:26:25] mutante should i do that in my change adding the unit [22:26:28] That'd be a problem [22:26:30] or separate change? 
[22:26:35] It'll bail almost immediately cuz it can't write to the DB [22:27:04] paladox: if you give me the option, i will pick more smaller changes :) [22:27:10] ok [22:27:18] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational [22:27:51] should we just have 2 templates, one for master one for slave [22:28:02] or are we doing "if $slave" check in the template itself [22:28:10] or are we passing a variable [22:28:14] if checks [22:28:59] If checks probably easiest cuz nothing else would be different [22:29:22] ok :) [22:29:37] Is this /var/lib/gerrit2/review_site/bin/gerrit.sh start --slave? [22:29:42] no_justification ^^ [22:30:08] I'm not sure [22:30:41] It *should* pull it from the container.slave config [22:30:52] oh [22:30:53] wait [22:30:54] if test "`get_config --bool container.slave`" = "true" ; then [22:30:54] RUN_ARGS="$RUN_ARGS --slave" [22:30:54] fi [22:30:54] if test "`get_config --bool container.slave`" = "true" ; then [22:30:54] RUN_ARGS="$RUN_ARGS --slave" [22:30:54] fi [22:31:02] hmm [22:31:31] Although tbh we hadn't been picking up the git open files config from that script either [22:31:36] Probably doesn't even work right :p [22:31:56] note: we did not merge the new systemd change, this is all before it :) [22:32:05] Best part of this all is systemd is basically trying to wrap a silly init.d-style script :\ [22:32:43] Operations, Maps-Sprint, Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3619891 (Pnorman) > @Pnorman which sources config file are you using? I'm not using any directly - this is from whatever is running. Looking as `ps`, the relev... 
[22:33:39] i wonder is it because it's meant to be a string [22:33:43] no_justification ^^ [22:33:46] mutante ^^ [22:33:59] since it's doing "`get_config --bool container.slave`" = "true" [22:34:05] which means it's looking for a string [22:34:10] it's trying to be all smart about that, and set slave status based on config contents [22:34:20] slave = <%= @slave %> -> slave = "<%= @slave %>" [22:34:26] Operations, TemplateStyles, Traffic, Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to svwiki - https://phabricator.wikimedia.org/T176082#3619897 (Johan) (Removing tag I assume was here simply because it was in the parent task and automatically included; this shouldn't be in Tech... [22:34:28] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:35:07] Considering get_config() doesn't even take --bool as an option it should already return a string [22:35:11] It only takes --int [22:35:13] That's fun! [22:35:18] so how about we stop using this script [22:35:26] (Draft1) Paladox: Gerrit: Use strings in container.slave [puppet] - https://gerrit.wikimedia.org/r/379134 [22:35:29] (Draft2) Paladox: Gerrit: Use strings in container.slave [puppet] - https://gerrit.wikimedia.org/r/379134 [22:35:31] and we use ExecStart and ExecStop commands in systemd unit files [22:35:39] just like this script _would_ create them [22:35:42] but don't use it [22:35:42] I don't think that's going to do anything paladox [22:35:50] k [22:35:52] mutante: We coulddddddd [22:35:55] (Abandoned) Paladox: Gerrit: Use strings in container.slave [puppet] - https://gerrit.wikimedia.org/r/379134 (owner: Paladox) [22:36:03] Basically the whole java ....... 
goes straight into systemd unit file [22:36:06] Skip the middleman [22:36:09] yea, that [22:36:12] Could work [22:36:15] and then it's easy to add the --slave or not [22:36:18] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [22:36:22] Operations, Maps-Sprint, Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3619901 (Yurik) So it seems the sources & variables file specified in the /etc/tilerator/config.yaml has incorrectly specifying the username/password, most likel... [22:36:40] what would be the full command if we used the war directly? [22:37:00] Basically what we both pasted from earlier [22:37:01] take a copy of gerrit.sh, change the last line to output the full command instead of running it [22:37:05] then run it [22:37:21] mutante: Only wonky bit might be the --run-id parameter [22:37:24] Dunno how important that is though [22:37:49] that's not important i don't think [22:38:14] tbh we could drop all those jvm options for gc logging, we're not debugging that these days [22:38:16] Would simplify it [22:38:28] ok [22:39:55] 527 exec "$RUN_EXEC" $RUN_Arg1 "$RUN_Arg2" $RUN_Arg3 $RUN_ARGS --console-log [22:40:13] if you change that "exec" to "echo" [22:40:21] what would the reload command be? 
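The "skip the middleman" idea above, putting the java command straight into the unit file and adding --slave conditionally, could look roughly like the sketch below. It only renders text and is not the unit that was deployed: the `/usr/bin/java` path and the function name are assumptions, the war and site paths come from the process listing pasted earlier, and the gerrit2 user from the `ls -la` output. JVM tuning flags and `--run-id` are left out, as suggested in the discussion.

```python
def render_gerrit_unit(slave: bool, java: str = "/usr/bin/java") -> str:
    """Render a minimal systemd unit that runs gerrit.war directly (sketch)."""
    site = "/var/lib/gerrit2/review_site"
    cmd = [java, "-jar", f"{site}/bin/gerrit.war", "daemon", "-d", site]
    if slave:
        # The only difference between master and slave is this one flag.
        cmd.append("--slave")
    return "\n".join([
        "[Unit]",
        "Description=Gerrit Code Review (sketch)",
        "",
        "[Service]",
        "User=gerrit2",
        "Group=gerrit2",
        f"ExecStart={' '.join(cmd)}",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
        "",
    ])
```

In a puppet template the same decision would be a single "if slave" check around the flag, which matches the "if checks" preference voiced earlier.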
[22:41:28] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [22:42:04] paladox: "restart" no "reload" and it's just "stop" sleep5 "start" [22:42:20] oh [22:42:27] ExecReload=/var/lib/gerrit2/review_site/bin/gerrit.sh reload [22:42:38] i have start set as ExecStart=java -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site [22:42:42] Usage: gerrit.sh {start|stop|restart|check|status|run|supervise|threads} [-d site] [22:42:50] there is no reload , these are what you have ^ [22:42:56] ah [22:42:58] lol [22:43:07] restart then [22:43:34] what about stopping it? [22:45:14] it uses start-stop-daemon -K -p -s KILL .. .ehm [22:45:30] but if it cant find that ..it resorts to just [22:45:33] kill $PID [22:45:35] slave works for me [22:45:36] with the systemd scritp [22:45:37] script [22:45:38] and the icinga check has not failed either. [22:45:54] paladox: did you use --slave in command ? [22:46:02] i used slave = true [22:46:04] in gerrit.config [22:46:15] bin/gerrit.sh status show [22:46:17] RUN_ARGS = -Xloggc:/srv/gerrit/jvmlogs/jvm_gc.%p.log -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCCause -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=2M -Xmx4g -jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site --slave [22:46:29] so that magic of "check in config if i should start as slave" would be gone [22:46:34] if we stop using this script [22:46:42] but we (puppet) also knows if slave or not [22:47:04] that is your entire command line there that you wanted [22:47:05] mutante, try [22:47:08] systemctl stop gerrit [22:47:11] But, systemctl should be able to just use the script. We can't be the only software that does this. [22:47:12] systemctl start gerrit [22:47:22] Ahhhhh, yes. statuses stuck! 
[22:47:26] force stop, then start? [22:47:29] yeh [22:47:39] since i did not add our restart command [22:47:44] i thought reload was it [22:47:46] I hateeeeeee systemd [22:47:56] but mutante pointed out that gerrit doesn't have a reload command [22:48:00] do we have nodepool issues? jenkins is still taking too much to build things [22:48:20] tabbycat: Nope, just a fair number of jobs [22:48:23] i used "stop", checked there was no more process, used "start" [22:48:26] it is failed again [22:48:26] Oldest one only 25 minutes [22:48:38] mutante what does it say in the logs? [22:48:42] * paladox wonders [22:48:56] it's been working for me [22:49:10] Sep 19 22:47:56 gerrit2001 gerrit2[7258]: Starting Gerrit Code Review: start-stop-daemon: unable to open pidfile '/var/lib/gerrit2/review_site/logs/ge...n denied) [22:49:18] unable to open pidfile [22:49:31] wrote pid file as root or something [22:49:33] There we go [22:49:54] aha [22:49:58] how does that path continue? [22:50:02] ./logs/ge... [22:50:08] ls -la /var/lib/gerrit2/review_site/logs/ [22:50:20] yes, and then? [22:50:25] /ge...n [22:50:26] i wonder is this [22:50:26] User=gerrit2 [22:50:27] Group=gerrit2 [22:50:28] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [22:50:31] in the systemd script [22:50:36] just "gerrit"? 
[22:50:41] gerrit2 [22:50:44] I think it's gerrit_pid [22:50:45] Checking [22:51:14] pid should be [22:51:14] gerrit.pid and gerrit.run [22:51:14] -rw-r--r-- 1 gerrit2 gerrit2 17 Sep 19 22:44 gerrit.run [22:51:16] Two things we can [22:51:18] *want [22:51:34] -rw-r--r-- 1 gerrit2 gerrit2 6 Sep 19 22:44 gerrit.pid [22:51:59] user gerrit2 _can_ write to that dir [22:52:05] it should [22:52:08] has to for the logs [22:52:09] i just did [22:52:09] Yeah, I wonder if one of your manual start/stops happened as root [22:52:15] ah [22:52:23] that may be the init script [22:52:28] the old script [22:52:32] there is no existing pid file to delete though [22:52:39] and it can write to that place.. [22:52:48] .... [22:52:49] Hmm [22:52:59] what permissions are set on ls -la /var/lib/gerrit2/review_site/logs ? [22:53:14] systemd[1]: gerrit.service: main process exited, code=exited, status=2/INVALIDARGUMENT [22:53:19] INVALIDARGUMENT.. hmm [22:53:36] was that reload? [22:53:37] paladox: It's 0644, as it should be [22:53:45] i meant user :) [22:53:51] i just became gerrit2 and did "touch foo" there [22:53:56] Permissions are fine, it's gerrit2:gerrit2 [22:54:00] ok [22:54:39] i wonder is the systemd script the same on gerrit2001 as in the repo? [22:55:04] https://github.com/wikimedia/operations-debs-gerrit/blob/master/debian/gerrit.service [22:55:31] installs "locate" so we can ... locate stuff [22:56:58] it should be /lib/systemd/system/gerrit.service [22:56:58] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational [22:57:38] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [22:57:42] paladox: yea, that's what is there [22:57:54] hmm [22:58:07] well, now it's running [22:58:08] systemd seemed to recover per icinga [22:58:14] So, what's wrong then? 
[22:58:31] paladox: by the way, once it is running, in "systemctl status gerrit" you can also see the command line [22:58:44] yeh [22:58:48] init.d script gone, running via systemd [22:58:52] What's broken then? [22:59:16] Operations, Maps-Sprint, Maps (Kartotherian): Upgrade kartotherian and tilerator to nodejs 6.11 - https://phabricator.wikimedia.org/T171707#3619951 (Pnorman) > So it seems the sources & variables file specified in the /etc/tilerator/config.yaml has incorrectly specifying the username/password, most l... [22:59:44] not sure, that it failed earlier [22:59:50] maybe it's all just not waiting long enough [22:59:56] (Draft1) Paladox: Gerrit: Update systemd script [puppet] - https://gerrit.wikimedia.org/r/379136 [22:59:59] (PS2) Paladox: Gerrit: Fix systemd script [puppet] - https://gerrit.wikimedia.org/r/379136 [22:59:59] Probably because the mismatch of init.d / systemd [23:00:02] and thinking it should show the right status when you have to wait longer [23:00:03] Plus icinga just lagging [23:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Dear deployers, time to do the Evening SWAT (Max 8 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170919T2300). [23:00:05] Krinkle, RoanKattouw, and ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break the wikis, you will be rewarded with a sticker. [23:00:12] I'm here [23:00:32] repeats the stop/start again [23:01:14] (CR) Krinkle: "I guess it should also be in one of the project dblists (in this case wikimedia.dblist, or special.dblist, probably wikimedia.dblist)." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup) [23:01:17] (CR) Krinkle: [C: -1] Add amwikimedia to s3 [mediawiki-config] - https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup) [23:01:34] the open question is .. why did the gerrit-ssh service die and never come back [23:01:45] during all this we have not seen that recovered.. and no 29418 in netstat [23:02:01] hmm [23:02:07] check logs for failed to start [23:02:10] mutante ^^ [23:02:25] /var/lib/gerrit2/review_site/logs/error_log [23:02:25] Also, I have no idea where to look for logs anymore [23:02:38] /var/lib/gerrit2/review_site/logs/sshd_log [23:02:46] \o [23:02:48] That's not it [23:02:59] sshd_log = 0 bytes [23:03:02] That's a log of ssh /connections/ [23:03:08] error_log = yea, it stopped [23:03:10] /var/lib/gerrit2/review_site/logs/error_log [23:03:12] Which we don't actually log (or would there be any on the slave right now) [23:03:18] PROBLEM - Check Varnish expiry mailbox lag on cp1066 is CRITICAL: CRITICAL: expiry mailbox lag is 2124648 [23:03:20] error_log is useless with systemd [23:03:27] So...back to my question....where do I look now? [23:03:30] (CR) Krinkle: [C: -1] Add config for amwikimedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/378400 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup) [23:03:36] that one should tell u [23:03:37] us [23:03:38] /var/log/syslog? [23:03:43] systemd always logs to ^^ [23:03:51] I can SWAT [23:04:05] I'm here. [23:04:08] (for SWAT) [23:04:13] paladox: yea, we have "systemd[1]: Unit gerrit.service entered failed state." there from earlier issue [23:04:18] Nope, not all of it. 
Just the start/stop stuff [23:04:24] I'm looking for the running java log [23:04:25] oh hmm [23:04:33] There's probably an option to the *.war we need [23:04:41] to tell it to log to stdout or w/e for systemd to grab it? [23:05:01] * paladox looks on gogole [23:05:02] google [23:05:03] (PS4) Thcipriani: Enable jQuery 3 on meta.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/378802 (https://phabricator.wikimedia.org/T124742) (owner: Krinkle) [23:05:22] (CR) Krinkle: "We still want to remove the old config, right? And presumably still need some version of the new config?" [mediawiki-config] - https://gerrit.wikimedia.org/r/369861 (https://phabricator.wikimedia.org/T172356) (owner: Hashar) [23:05:54] Probably want gerrit.war to dump to stdout, then configure the systemd unit to pick it up [23:05:56] i wish gerrit-ssh was its own service [23:06:05] but it's just started by gerrit.war, right [23:06:10] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/378802 (https://phabricator.wikimedia.org/T124742) (owner: Krinkle) [23:06:11] yep [23:06:15] Well, what does gerrit.war do if you don't enable ssh? [23:06:21] gets confused by phab-ssh [23:06:43] I don't like how we set phab-ssh up :\ [23:06:44] But anyway [23:06:47] Another topic [23:06:51] heh :) [23:06:59] do we have a ssh key in etc? [23:07:04] i wonder if it could be that [23:07:11] What on earth would that have to do with it? [23:07:29] And yes, we have the same ssh key [23:07:34] ie [23:07:35] -rw------- 1 gerrit2 gerrit2 452 Sep 11 21:18 ssh_host_ed25519_key [23:07:36] tgr: There's four things in https://gerrit.wikimedia.org/r/#/q/status:open+(+branch:wmf/1.30.0-wmf.19+OR+branch:wmf/1.30.0-wmf.18+) from you for Collection – do they still need back-porting? [23:07:36] ok [23:07:47] Wait. Different IPs. Duh. That might be an issue. [23:07:56] * no_justification shrugs [23:07:58] Oh well [23:08:08] it tries to bind to a hardcoded IP? 
[23:08:18] No, it tries to bind to *:29418 right now [23:08:25] Wait, ignore me on IP [23:08:27] Red herring [23:08:31] * no_justification is a little tired [23:09:19] James_F: I was waiting for https://gerrit.wikimedia.org/r/#/c/377929/ getting reviewed, they should go together [23:09:31] so on cobalt, if you do "netstat -tulpen | grep Gerrit" you get 2 lines [23:09:37] 29418, and 8080 [23:09:39] tgr: OK, just checking. :-) [23:09:40] I guess if it didn't happen today, there is not much point in the backports anymore [23:09:44] here we have neither of them [23:09:46] (Merged) jenkins-bot: Enable jQuery 3 on meta.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/378802 (https://phabricator.wikimedia.org/T124742) (owner: Krinkle) [23:09:55] (CR) jenkins-bot: Enable jQuery 3 on meta.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/378802 (https://phabricator.wikimedia.org/T124742) (owner: Krinkle) [23:09:58] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [23:10:00] the 8080 is also not there [23:10:08] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:10:09] and .. i didn't do anything to stop that again ^ [23:10:22] thcipriani: Oh, I added a couple of minor things late, hope that's OK. [23:10:37] thcipriani, is it OK if I added a couple more patches to SWAT? [23:11:11] James_F: should be ok as long as jenkins is on our side [23:11:18] I vote not moving gerrit to systemd yet [23:11:22] So… it'll break. ;-) [23:11:24] It's clearly not going to work via playing whack a mole [23:11:30] (CR) MarcoAurelio: "Yes, it needs to be added to wikimedia.dblist. Special.dblist is for, well, special wikis not being chapter wikis." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/378401 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup) [23:11:39] Clearly more testing needs doing [23:12:04] MaxSem: we're getting close to maxing out, but as long as "a couple" 2 in this instance should be ok :) [23:12:10] :O [23:12:15] thanks [23:12:28] hmm, no_justification but it works with prod. So some how it is failing to start on a slave. [23:12:33] I found some docs https://gerrit-review.googlesource.com/Documentation/pgm-daemon.html [23:12:51] That's nice. But it doesn't work everywhere [23:12:59] And this wasn't exactly how I planned to spend my afternoon [23:13:13] thing is.. we didn't merge anything [23:13:14] Krinkle: metawiki jquery3 is on mwdebug1002, check please [23:13:33] yea, let's just keep looking later / labs [23:13:46] definitely not adding a second change on it [23:13:53] before we have it working on gerrit2001 [23:14:20] Well if we didn't merge anything then clearly that's why cobalt is fine [23:14:39] all that happened is gerrit2001 broke [23:15:07] Let's acknowledge it down and I'll dig more thoroughly tomorrow [23:15:37] thcipriani: Looks good. [23:15:43] (CR) MarcoAurelio: Add config for amwikimedia (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/378400 (https://phabricator.wikimedia.org/T176042) (owner: Ladsgroup) [23:15:45] yea, continue tomorrow sounds good, no_justification .. done [23:15:50] Krinkle: ok, going live [23:17:36] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:378802|Enable jQuery 3 on meta.wikimedia.org]] T124742 (duration: 00m 50s) [23:17:40] ^ Krinkle is live [23:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:17:52] T124742: Upgrade to jQuery 3 - https://phabricator.wikimedia.org/T124742 [23:18:33] ACKNOWLEDGEMENT - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn debugging in progress [23:18:34] ACKNOWLEDGEMENT - SSH access on gerrit2001 is CRITICAL: connect to address 208.80.153.106 and port 29418: Connection refused daniel_zahn debugging in progress [23:18:34] ACKNOWLEDGEMENT - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daniel_zahn debugging in progress [23:19:11] thcipriani: confirmed, still looks good. [23:20:52] MaxSem: oh, didn't realize your patches were already merged. I pulled them down, they're live on wmf.18 and wmf.19 on mwdebug1002, check that everything looks right please [23:21:40] yeah, I tried to deploy them in my window but a meeting prevented me from doing that [23:21:48] ah, gotcha [23:24:52] thcipriani, tested [23:25:10] MaxSem: ok, going live, wmf.19 first [23:27:28] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational [23:27:55] lol, it breaks AND fixes itself [23:27:55] !log thcipriani@tin Synchronized php-1.30.0-wmf.19/includes/page/PageArchive.php: SWAT: [[gerrit:379132|Do not attempt to copy to ip_changes in PageArchive class]] (duration: 00m 50s) [23:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:28:09] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [23:29:16] except the ssh daemon doesnt come back.. oh well.. 
not actively used and we will keep debugging that later [23:29:59] !log thcipriani@tin Synchronized php-1.30.0-wmf.18/includes/page/PageArchive.php: SWAT: [[gerrit:379133|Do not attempt to copy to ip_changes in PageArchive class]] (duration: 00m 49s) [23:30:04] ^ MaxSem live everywhere [23:30:10] whee [23:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:14] thanks thcipriani [23:30:19] yw :) [23:30:59] (CR) Dzahn: "that's right, gerrit.sh doesn't have a reload action, just restart" [puppet] - https://gerrit.wikimedia.org/r/379136 (owner: Paladox) [23:32:51] RoanKattouw: SpecialRecentchangeslinked fixes should be live on mwdebug1002 for wmf.18 and wmf.19, check please [23:33:00] Thanks, checking [23:33:55] thcipriani: Working [23:34:01] ok, going live [23:34:03] wmf.19 first [23:34:08] Confirmed that it throws a DB error in prod too [23:36:19] PROBLEM - gerrit process on gerrit2001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [23:36:38] PROBLEM - Check systemd state on gerrit2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[23:36:39] !log thcipriani@tin Synchronized php-1.30.0-wmf.19/includes/specials/SpecialRecentchangeslinked.php: SWAT: [[gerrit:378973|SpecialRecentchangeslinked: Unconditionally join on the page table]] T176228 (duration: 00m 49s) [23:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:51] T176228: Special:RelatedChanges with edit filters has query error due to missing join on page table - https://phabricator.wikimedia.org/T176228 [23:39:25] !log thcipriani@tin Synchronized php-1.30.0-wmf.18/includes/specials/SpecialRecentchangeslinked.php: SWAT: [[gerrit:378973|SpecialRecentchangeslinked: Unconditionally join on the page table]] T176228 (duration: 00m 49s) [23:39:33] ^ RoanKattouw live now [23:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:41:58] ebernhardson: wikimediaevents update is live for wmf.18 and wmf.19 on mwdebug1002, check please [23:42:03] thcipriani: looking [23:42:49] thcipriani: looks great [23:42:56] ok, going live [23:45:17] !log thcipriani@tin Synchronized php-1.30.0-wmf.19/extensions/WikimediaEvents/modules/ext.wikimediaEvents.humanSearchRelevance.js: SWAT: [[gerrit:379119|Javascript timestamps are in ms, not s]] T174106 (duration: 00m 50s) [23:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:45:30] T174106: Search Relevance Survey test #3: action items - https://phabricator.wikimedia.org/T174106 [23:46:42] !log thcipriani@tin Synchronized php-1.30.0-wmf.18/extensions/WikimediaEvents/modules/ext.wikimediaEvents.humanSearchRelevance.js: SWAT: [[gerrit:379118|Javascript timestamps are in ms, not s]] T174106 (duration: 00m 48s) [23:46:51] ^ ebernhardson live everywhere [23:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:49:10] thcipriani: still looks good, thanks! 
[23:49:35] awesome, yw :) [23:49:45] mutante then it breaks again :) [23:50:09] James_F: visualeditor changes are live on mwdebug1002 for wmf.18 and wmf.19, check please [23:50:46] Checking. [23:53:04] thcipriani: Yeah, LGTM. [23:53:35] alright, going live, wmf.19 first [23:54:25] paladox: yea, but that should give us hints.. it is slowly flapping [23:54:56] yeh [23:55:59] !log thcipriani@tin Synchronized php-1.30.0-wmf.19/extensions/VisualEditor/ApiVisualEditor.php: SWAT: [[gerrit:378991|ApiVisualEditor: Fix checkbox label message handling with Message objects]] T176249 (duration: 00m 48s) [23:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:14] T176249: FlaggedRevs' checkbox label displayed weirdly in VisualEditor save dialog - https://phabricator.wikimedia.org/T176249 [23:57:13] !log thcipriani@tin Synchronized php-1.30.0-wmf.18/extensions/VisualEditor/ApiVisualEditor.php: SWAT: [[gerrit:378990|ApiVisualEditor: Fix checkbox label message handling with Message objects]] T176249 (duration: 00m 49s) [23:57:22] ^ James_F live now [23:57:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:34] thcipriani: Thanks! [23:57:45] yw :) [23:57:48] RECOVERY - gerrit process on gerrit2001 is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [23:57:58] RECOVERY - Check systemd state on gerrit2001 is OK: OK - running: The system is fully operational [23:58:31] swat done with 3 minutes to spare. when the jenkins queue is empty, miracles are possible. [23:58:51] or maybe when the jenkins queue is empty it's a miracle. [23:58:59] * James_F grins.