[00:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170208T0000).
[00:00:05] ebernhardson, Jdlrobson, and kaldari: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:09] \o
[00:00:17] \o here
[00:01:53] I can SWAT
[00:04:28] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/334729 (https://phabricator.wikimedia.org/T149324) (owner: Tjones)
[00:07:27] (Merged) jenkins-bot: Deploy TextCat Improvements [mediawiki-config] - https://gerrit.wikimedia.org/r/334729 (https://phabricator.wikimedia.org/T149324) (owner: Tjones)
[00:07:35] (CR) jenkins-bot: Deploy TextCat Improvements [mediawiki-config] - https://gerrit.wikimedia.org/r/334729 (https://phabricator.wikimedia.org/T149324) (owner: Tjones)
[00:09:40] ebernhardson: change is live on mwdebug1002 if there's anything to check there.
[00:09:56] thcipriani: yup, sec
[00:10:54] thcipriani: looks to be working
[00:12:09] ebernhardson: ok, does CirrusSearch-common need to go out before IS? Or can they go out in a random order?
[00:13:31] thcipriani: initialize settings should go first, -common refers to values that are new in IS
[00:13:46] hmm, actually no it doesn't, any order should be fine
[00:14:16] ebernhardson: ok, going to sync-dir wmf-config
[00:16:27] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:334729|Deploy TextCat Improvements]] T149324 T142140 (duration: 00m 45s)
[00:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:16:32] T149324: TextCat Improvement Deployment - https://phabricator.wikimedia.org/T149324
[00:16:32] T142140: Lang ID Eval Sets for Polish, Arabic, Chinese, and Dutch - https://phabricator.wikimedia.org/T142140
[00:16:33] ^ ebernhardson should be live
[00:17:21] thcipriani: thanks! everything looks happy
[00:17:40] cool, thanks for checking on things :)
[00:18:41] (PS3) Thcipriani: Update footer logos on mobile site for various projects [mediawiki-config] - https://gerrit.wikimedia.org/r/336466 (https://phabricator.wikimedia.org/T157476) (owner: Jdlrobson)
[00:19:54] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/336466 (https://phabricator.wikimedia.org/T157476) (owner: Jdlrobson)
[00:22:02] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#3007850 (EBernhardson) I created some indices locally with `$wgCirrusSearchSimilarityProfile = 'wmf_defaults'` so they would have the...
[00:23:03] (Merged) jenkins-bot: Update footer logos on mobile site for various projects [mediawiki-config] - https://gerrit.wikimedia.org/r/336466 (https://phabricator.wikimedia.org/T157476) (owner: Jdlrobson)
[00:23:20] (CR) jenkins-bot: Update footer logos on mobile site for various projects [mediawiki-config] - https://gerrit.wikimedia.org/r/336466 (https://phabricator.wikimedia.org/T157476) (owner: Jdlrobson)
[00:23:49] RECOVERY - puppet last run on mw1262 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[00:23:59] jdlrobson: your patch is live on mwdebug1002, check please
[00:24:27] Operations, DBA, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#2223067 (jcrespo)
[00:24:30] Operations, DBA, Patch-For-Review: Restart pending mysql hosts with old TLS cert - https://phabricator.wikimedia.org/T152188#3007856 (jcrespo) Open>Resolved All hosts with the old expiring cert have been reimaged or (if scheduled for decommission), restarted: ``` sudo salt --output=txt -C 'G...
[00:24:44] thcipriani: looks like i need a small followup
[00:25:24] okie doke
[00:25:28] thcipriani: Are you still planning on SWATing https://gerrit.wikimedia.org/r/#/c/335732/ ?
[00:25:47] kaldari: yup, just the last one on the list :(
[00:25:53] cool, no rush :)
[00:27:57] (PS1) Jdlrobson: Add missing images and copyright should default to false [mediawiki-config] - https://gerrit.wikimedia.org/r/336556
[00:27:58] thcipriani: ^
[00:28:32] Operations, DBA, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#3007873 (jcrespo) After resolving T152188, pending hosts: ``` $ sudo salt --output=txt -C 'G@cluster:mysql' cmd.run 'mysql -BN --skip-ssl -e "SELECT @@ssl_ca"' | grep NULL db1020.eqiad....
[00:28:35] (PS2) Thcipriani: Add missing images and copyright should default to false [mediawiki-config] - https://gerrit.wikimedia.org/r/336556 (owner: Jdlrobson)
[00:28:52] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/336556 (owner: Jdlrobson)
[00:29:39] cool, we'll try that one :)
[00:29:58] thcipriani: also our designer put an accent over a P instead of an e ;-)
[00:30:10] so we may need a 3rd patch haha
[00:30:21] jdlrobson nice one mate :D
[00:30:32] aww :(
[00:30:56] check out the footer https://fr.m.wikipedia.org/wiki/Wikip%C3%A9dia:Accueil_principal if you want to see for yourself ;-)
[00:31:56] I don't speak French
[00:31:59] looks ok to me
[00:32:16] (Merged) jenkins-bot: Add missing images and copyright should default to false [mediawiki-config] - https://gerrit.wikimedia.org/r/336556 (owner: Jdlrobson)
[00:32:18] maybe it's because my lang is set to English :)
[00:32:28] (CR) jenkins-bot: Add missing images and copyright should default to false [mediawiki-config] - https://gerrit.wikimedia.org/r/336556 (owner: Jdlrobson)
[00:33:13] I definitely see it
[00:33:13] (PS1) Jdlrobson: Follow up - correct accent in French Wikipedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/336557
[00:33:15] hmm, still looks OK
[00:33:25] (CR) jerkins-bot: [V: -1] Follow up - correct accent in French Wikipedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/336557 (owner: Jdlrobson)
[00:33:33] it's only on mwdebug1002 for right now
[00:33:56] (PS2) Jdlrobson: Follow up - correct accent in French Wikipedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/336557
[00:34:06] ^ thcipriani that should be the last one
[00:34:20] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#3007877 (EBernhardson) Confirmed the settings name changes will be problematic, elasticsearch 5 will scream bloody murder about them...
[00:34:24] (touch wood)
[00:34:36] i think it's knock jdlrobson
[00:34:57] jdlrobson: other patch is on mwdebug1002
[00:35:02] thanks thcipriani looking
[00:35:31] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/336557 (owner: Jdlrobson)
[00:35:49] (PS5) Dzahn: phabricator: Block IPs for recent attempts to upload offtopic files [puppet] - https://gerrit.wikimedia.org/r/334683 (owner: Aklapper)
[00:37:22] (Merged) jenkins-bot: Follow up - correct accent in French Wikipedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/336557 (owner: Jdlrobson)
[00:37:25] thcipriani: i spoke too soon.. there's something wrong with the svg for Hindi - https://hi.m.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-hi.svg - I'm just going to remove that from the config for the time being :/
[00:38:07] (CR) jenkins-bot: Follow up - correct accent in French Wikipedia logo [mediawiki-config] - https://gerrit.wikimedia.org/r/336557 (owner: Jdlrobson)
[00:38:11] jdlrobson i can see thcipriani just wanting to tear your head off right about now :P
[00:39:01] #blamenirzar
[00:39:08] (PS1) Jdlrobson: Remove problematic SVG [mediawiki-config] - https://gerrit.wikimedia.org/r/336558
[00:39:22] why the sudden update of svgs?
[00:39:23] (CR) jerkins-bot: [V: -1] Remove problematic SVG [mediawiki-config] - https://gerrit.wikimedia.org/r/336558 (owner: Jdlrobson)
[00:39:28] jdlrobson: cool, I see the acute over the E for frwiki https://fr.m.wikipedia.org/static/images/mobile/copyright/wikipedia-wordmark-fr.svg
[00:40:12] jdlrobson: could you review all the svgs before this next patch to ensure this is the last patch, please :)
[00:40:36] i swear that's the last one of them. I swear!
[00:40:43] yes i'm beating Nirzar with a stick as we speak
[00:40:49] :)
[00:40:53] i can send you a gif if that helps
[00:41:24] yes. I will require said gif.
[00:41:30] :D
[00:42:15] jdlrobson: last patch may need a rebase. jenkins is mad about it :(
[00:42:59] thcipriani jerkins*
[00:43:04] think i can cherrypick
[00:43:15] (PS2) Thcipriani: Remove problematic SVG [mediawiki-config] - https://gerrit.wikimedia.org/r/336558 (owner: Jdlrobson)
[00:43:52] ah you beat me to it :)
[00:44:29] quick on a rebase
[00:45:36] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/336558 (owner: Jdlrobson)
[00:45:44] (CR) Dzahn: [C: 2] phabricator: Block IPs for recent attempts to upload offtopic files [puppet] - https://gerrit.wikimedia.org/r/334683 (owner: Aklapper)
[00:45:46] (PS1) Dzahn: icinga: add SSL cert monitoring for benefactorevents [puppet] - https://gerrit.wikimedia.org/r/336559 (https://phabricator.wikimedia.org/T156850)
[00:45:50] Operations, CirrusSearch, Discovery, Elasticsearch, and 2 others: Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#3007891 (EBernhardson) Another thing i've just noticed while testing (we probably would have noticed this on relforge anyways), but t...
[00:46:59] PROBLEM - puppet last run on mw1294 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:47:19] (Merged) jenkins-bot: Remove problematic SVG [mediawiki-config] - https://gerrit.wikimedia.org/r/336558 (owner: Jdlrobson)
[00:47:27] (CR) jenkins-bot: Remove problematic SVG [mediawiki-config] - https://gerrit.wikimedia.org/r/336558 (owner: Jdlrobson)
[00:47:51] Operations, fundraising-tech-ops: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#3007896 (Dzahn)
[00:47:59] jdlrobson: ok! everything live on mwdebug1002
[00:48:05] thcipriani: all working
[00:48:08] go ahead and push
[00:48:13] perfect, going live
[00:48:13] and thanks again for your patience
[00:48:36] jouncebot next
[00:48:36] In 13 hour(s) and 11 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170208T1400)
[00:49:22] twentyafterfour: i merged https://gerrit.wikimedia.org/r/#/c/334683/ which changes phabbanlist.conf but puppet run does not deploy change ?
[00:49:47] is that part of regular phab deploy then?
[00:50:12] !log thcipriani@tin Synchronized static/images/mobile/copyright: SWAT: [[gerrit:336466|Update footer logos on mobile site for various projects]] PART I T157476 (duration: 00m 41s)
[00:50:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:17] T157476: Prepare logos for new branding deployment - https://phabricator.wikimedia.org/T157476
[00:51:31] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:336466|Update footer logos on mobile site for various projects]] PART II T157476 (duration: 00m 41s)
[00:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:36] ^ jdlrobson all live
[00:51:50] yup! Thank you :D
[00:51:58] jdlrobson: np, yw :)
[00:52:06] kaldari: sorry for the delay, merging yours now
[00:52:12] yay
[00:52:16] (PS3) Thcipriani: Setting $wgPageAssessmentsSubprojects to true on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/335732 (owner: Kaldari)
[00:52:16] twentyafterfour: please ignore me, PEBCAK :)
[00:52:24] (CR) Thcipriani: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/335732 (owner: Kaldari)
[00:53:04] thcipriani: did you purge static cache?
[00:53:29] jdlrobson: no I didn't. I figured if they were new images I wouldn't have to?
[00:53:31] oh ignore they're all new images
[00:53:33] yeh you are right
[00:53:37] cool :)
[00:53:44] (Merged) jenkins-bot: Setting $wgPageAssessmentsSubprojects to true on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/335732 (owner: Kaldari)
[00:53:48] it's cached HTML that's the problem
[00:53:51] but that's expected :)
[00:53:53] (CR) jenkins-bot: Setting $wgPageAssessmentsSubprojects to true on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/335732 (owner: Kaldari)
[00:55:46] !log iridum - apache graceful'ed
[00:55:47] thcipriani my designer has been beaten https://usercontent.irccloud-cdn.com/file/2JGxk51w/hitnirzar.gif
[00:55:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:55:56] !log mw1189 service hhvm restart
[00:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:07] jdlrobson: hahaha
[00:56:42] (CR) Dzahn: [C: 2] icinga: add SSL cert monitoring for benefactorevents [puppet] - https://gerrit.wikimedia.org/r/336559 (https://phabricator.wikimedia.org/T156850) (owner: Dzahn)
[00:56:49] (PS2) Dzahn: icinga: add SSL cert monitoring for benefactorevents [puppet] - https://gerrit.wikimedia.org/r/336559 (https://phabricator.wikimedia.org/T156850)
[00:56:57] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings-labs.php: SWAT [[gerrit:335732|Setting $wgPageAssessmentsSubprojects to true on beta cluster]] (housekeeping sync) (duration: 00m 40s)
[00:57:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:48] ^ kaldari I've merged and sync'd your file for house keeping, but it should go out to beta cluster shortly
[00:57:59] RECOVERY - Nginx local proxy to apache on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.028 second response time
[00:57:59] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 73309 bytes in 0.151 second response time
[00:58:08] thcipriani: OK, will check it shortly
[00:58:09] RECOVERY - Apache HTTP on mw1189 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.025 second response time
[01:01:59] RECOVERY - puppet last run on mw1294 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[01:02:06] !log mw1294 - run puppet because it popped up in Icinga as failed - removes a bunch of /var/tmp/core/../rsvg-convert.*, all else normal
[01:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:10:59] PROBLEM - Check correctness of the icinga configuration on einsteinium is CRITICAL: Icinga configuration contains errors
[01:12:54] icinga-wm is me and i'm already fixing
[01:13:55] (PS1) Dzahn: icinga: fr-tech-ops contact group for benefactorevents [puppet] - https://gerrit.wikimedia.org/r/336564 (https://phabricator.wikimedia.org/T156850)
[01:17:22] (PS2) Dzahn: icinga: fr-tech-ops contact group for benefactorevents [puppet] - https://gerrit.wikimedia.org/r/336564 (https://phabricator.wikimedia.org/T156850)
[01:18:33] (CR) Dzahn: [C: 2] icinga: fr-tech-ops contact group for benefactorevents [puppet] - https://gerrit.wikimedia.org/r/336564 (https://phabricator.wikimedia.org/T156850) (owner: Dzahn)
[01:28:27] Operations, fundraising-tech-ops, Patch-For-Review: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#2987822 (Dzahn) check added to Icinga, and it became CRIT right away because the cert expires in 22 days https://icinga.wikimedia.org/cgi-bin/icinga/ex...
[01:31:00] RECOVERY - Check correctness of the icinga configuration on einsteinium is OK: Icinga configuration is correct
[01:32:05] Operations, fundraising-tech-ops, procurement: benefactorevents.wikimedia.org SSL cert expires 2017-03-02 - https://phabricator.wikimedia.org/T157520#3007991 (Dzahn)
[01:33:43] Operations, fundraising-tech-ops: set up SSL cert monitoring for benefactorevents.wm.o - https://phabricator.wikimedia.org/T156850#2987822 (Dzahn)
[01:34:23] (PS7) Dzahn: etcd: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334282 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[01:39:56] (CR) Dzahn: [C: 2] "http://puppet-compiler.wmflabs.org/5372/" [puppet] - https://gerrit.wikimedia.org/r/334282 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[01:45:00] (CR) Dzahn: "i think all you need in addition to this is a file{} that installs check_nodepool_states on the monitored host. And it should be in /usr/l" [puppet] - https://gerrit.wikimedia.org/r/336404 (owner: Rush)
[01:46:38] (PS6) Dzahn: Phabricator: Allow us to disable opcache validate in php.ini [puppet] - https://gerrit.wikimedia.org/r/336329 (owner: Paladox)
[01:48:00] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 1803.847811 Seconds
[01:48:00] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 1806.599931 Seconds
[01:49:00] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 13.851071 Seconds
[01:49:01] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 16.504247 Seconds
[01:50:02] (CR) Dzahn: [C: 2] Phabricator: Allow us to disable opcache validate in php.ini [puppet] - https://gerrit.wikimedia.org/r/336329 (owner: Paladox)
[01:53:41] (PS6) Dzahn: eventlogging/eventstreams: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334283 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[01:56:50] PROBLEM - puppet last run on mw1260 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:02:12] (CR) Ottomata: [C: 1] eventlogging/eventstreams: Linting fixes [puppet] - https://gerrit.wikimedia.org/r/334283 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[02:02:47] (CR) Dzahn: [C: 2] "thanks ottomata, also compiled it here http://puppet-compiler.wmflabs.org/5373/" [puppet] - https://gerrit.wikimedia.org/r/334283 (https://phabricator.wikimedia.org/T93645) (owner: Juniorsys)
[02:04:49] (PS4) Dzahn: trebuchet: Fully qualify hostname [puppet] - https://gerrit.wikimedia.org/r/328457 (https://phabricator.wikimedia.org/T153608) (owner: Tim Landscheidt)
[02:05:27] bd808: ^ this is correct, right
[02:05:33] it does look good to me
[02:05:41] missing the project name in there
[02:06:25] mutante: seems right. Would be nicer to just rm the trebuchet module ;)
[02:06:40] world isn't quite ready for that yet though
[02:06:49] heh, ok :)
[02:06:54] (CR) Dzahn: [C: 2] trebuchet: Fully qualify hostname [puppet] - https://gerrit.wikimedia.org/r/328457 (https://phabricator.wikimedia.org/T153608) (owner: Tim Landscheidt)
[02:07:28] bd808: thanks for the tutorial btw
[02:07:47] I have to figure out how to import the app into the main script though
[02:07:53] JustBerry: yw. did it get you to a working state?
[02:08:04] (CR) Dzahn: "@joe it's been amended, better now?" [puppet] - https://gerrit.wikimedia.org/r/333358 (owner: Paladox)
[02:08:14] bd808: well, the instructions are certainly clearer, just the issue I mentioned above
[02:08:20] (importing the Flask app into the main script)
[02:08:30] i.e. NOT app.py
[02:08:35] JustBerry: let's go over to -labs and see if I can help you figure it out
[02:09:22] (PS9) Dzahn: hiera override to skip base icinga for test/decom hosts [puppet] - https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632)
[02:17:02] (CR) Dzahn: [C: 2] openstack: switch installserver to install1002 [puppet] - https://gerrit.wikimedia.org/r/336356 (owner: Dzahn)
[02:22:20] PROBLEM - puppet last run on analytics1036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:25:50] RECOVERY - puppet last run on mw1260 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[02:29:06] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.10) (duration: 07m 35s)
[02:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:51] (CR) Dzahn: [C: 2] wmflib/tests: replace install1001 with install1002 [puppet] - https://gerrit.wikimedia.org/r/336355 (owner: Dzahn)
[02:32:50] (PS2) Dzahn: wmflib/tests: replace install1001 with install1002 [puppet] - https://gerrit.wikimedia.org/r/336355
[02:34:04] (CR) Dzahn: [C: 1] "compiles and diff only on cp1008 http://puppet-compiler.wmflabs.org/5375/ @bblack is this in your interest as well to get the test ho" [puppet] - https://gerrit.wikimedia.org/r/327388 (https://phabricator.wikimedia.org/T151632) (owner: Dzahn)
[02:49:20] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:51:20] RECOVERY - puppet last run on analytics1036 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[03:02:22] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.11) (duration: 15m 32s)
[03:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:08:05] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Feb 8 03:08:05 UTC 2017 (duration 5m 43s)
[03:08:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:17:20] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[03:23:30] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 659.23 seconds
[03:25:30] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 265.29 seconds
[03:33:10] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:40:40] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:58:57] Operations, ArchCom-RfC, Commons, MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3008159 (Tgr) I think the third-party MediaWiki concerns are somewhat understated and are not actually third party as we use the same arrangement via...
[04:01:10] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[04:04:12] mutante: I don't know about phabbanlist, probably needs apache restart?
[04:04:25] oh nvm I saw the second message
[04:06:14] (PS1) Yuvipanda: Use same 'web' image for interactive shell too [software/tools-webservice] - https://gerrit.wikimedia.org/r/336570
[04:06:24] bd808: ^ wanna +1?
[04:06:34] bah, idk why it said 'brb' now
[04:07:54] (CR) BryanDavis: [C: 2] Use same 'web' image for interactive shell too [software/tools-webservice] - https://gerrit.wikimedia.org/r/336570 (owner: Yuvipanda)
[04:08:19] yuvipanda: hope that actually works :)
[04:08:33] bd808: me too!
[04:08:34] I always hope it actually works :D
[04:08:43] bd808: want me to build and deploy or do you wanna? I haven't done it in a long time
[04:08:53] (Merged) jenkins-bot: Use same 'web' image for interactive shell too [software/tools-webservice] - https://gerrit.wikimedia.org/r/336570 (owner: Yuvipanda)
[04:09:40] RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[04:10:04] yuvipanda: hmmm... I should learn, but maybe tomorrow morning? Not sure if I want to break everything this late in my day :)
[04:10:34] bd808: I'll be gone for 3 days starting tomorrow so I'll just roll it out now ;)
[04:11:38] k. where are the docs for doing the docker builds?
[04:16:44] bd808: oh you don't need a docker build
[04:16:45] bd808: you just need a deb build + deploy
[04:16:49] and there are docs for that somewhere
[04:17:07] bd808: since we aren't building new images but just changing which image webservice shell uses
[04:17:17] ah. ok
[04:20:00] so it's basically the same process as a new build of jsub.
[04:20:22] bd808: yup. I finished it and put it on aptly now
[04:20:49] bd808: should be upgraded on tools-dev now, let me do on -login
[04:22:26] bd808: upgraded on -login too
[04:22:33] bd808: lmk if it doesn't work :)
[04:22:39] "IOError: [Errno 2] No such file or directory: '/proc/0/cmdline'"
[04:23:16] that happened when running `webservice status` inside a k8s php shell
[04:23:43] bah
[04:26:09] yuvipanda: lol. I see. it's the eventlogging hook
[04:26:17] (PS1) Yuvipanda: Fix running webservice inside a k8s container [software/tools-webservice] - https://gerrit.wikimedia.org/r/336571
[04:26:21] bd808: ^ patch
[04:27:14] (CR) BryanDavis: [C: 1] Fix running webservice inside a k8s container [software/tools-webservice] - https://gerrit.wikimedia.org/r/336571 (owner: Yuvipanda)
[04:28:00] (CR) Yuvipanda: [C: 2] Fix running webservice inside a k8s container [software/tools-webservice] - https://gerrit.wikimedia.org/r/336571 (owner: Yuvipanda)
[04:28:48] (Merged) jenkins-bot: Fix running webservice inside a k8s container [software/tools-webservice] - https://gerrit.wikimedia.org/r/336571 (owner: Yuvipanda)
[04:52:50] PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:59:12] (PS1) Yuvipanda: tools: Fix docker builder host [puppet] - https://gerrit.wikimedia.org/r/336572
[04:59:31] (CR) jerkins-bot: [V: -1] tools: Fix docker builder host [puppet] - https://gerrit.wikimedia.org/r/336572 (owner: Yuvipanda)
[05:01:26] (PS2) Yuvipanda: tools: Fix docker builder host [puppet] - https://gerrit.wikimedia.org/r/336572
[05:03:49] (CR) Yuvipanda: [V: 2 C: 2] tools: Fix docker builder host [puppet] - https://gerrit.wikimedia.org/r/336572 (owner: Yuvipanda)
[05:05:40] (PS2) Yuvipanda: k8s: Use same logic for systemd and upstart configuration [puppet] - https://gerrit.wikimedia.org/r/336238 (owner: Tim Landscheidt)
[05:05:49] (CR) Yuvipanda: [V: 2 C: 2] k8s: Use same logic for systemd and upstart configuration [puppet] - https://gerrit.wikimedia.org/r/336238 (owner: Tim Landscheidt)
[05:11:51] bd808: building complete, pushing is happening
[05:12:17] push it -- https://www.youtube.com/watch?v=vCadcBR95oU
[05:20:50] RECOVERY - puppet last run on ms-be1009 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[05:24:15] (PS1) Yuvipanda: tools: Upgrade docker on tools k8s workers [puppet] - https://gerrit.wikimedia.org/r/336573 (https://phabricator.wikimedia.org/T157180)
[05:32:44] (PS2) Yuvipanda: tools: Upgrade docker on tools k8s workers [puppet] - https://gerrit.wikimedia.org/r/336573 (https://phabricator.wikimedia.org/T157180)
[05:34:54] brb
[05:49:40] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=810.30 Read Requests/Sec=1858.30 Write Requests/Sec=5.80 KBytes Read/Sec=33622.40 KBytes_Written/Sec=2741.60
[05:59:40] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.60 Read Requests/Sec=0.60 Write Requests/Sec=0.40 KBytes Read/Sec=39.60 KBytes_Written/Sec=17.60
[06:30:00] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata]
[06:31:30] PROBLEM - Disk space on ms-be1012 is CRITICAL: DISK CRITICAL - free space: / 2024 MB (3% inode=90%)
[06:42:30] RECOVERY - Disk space on ms-be1012 is OK: DISK OK
[06:43:13] (CR) Giuseppe Lavagetto: [C: 1] "FTR, debug is triggered by adding --debug on the command line, and is not supported in our current unit file as it's intended for use by h" [debs/pybal] (1.13) - https://gerrit.wikimedia.org/r/334567 (owner: Ema)
[06:59:00] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:15:30] RECOVERY - haproxy failover on dbproxy1011 is OK: OK check_failover servers up 2 down 0
[07:20:10] PROBLEM - check_mysql on frdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2402
[07:20:10] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 2 down 1
[07:20:22] that is me ^
[07:20:27] <_joe_> marostegui: heh I was asking
[07:20:40] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed
[07:20:42] <_joe_> log these actions to the SAL :P
[07:20:50] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[07:21:11] Yeah, I was doing maintenance on the labsdb servers since yesterday (restarting etc mysql) and that also affects dbproxies :_(
[07:21:14] Sorry!
[07:23:02] !log Restart MySQL db1095 and labsdb1009 for maintenance - T153743
[07:23:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:23:08] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743
[07:25:10] RECOVERY - check_mysql on frdb2001 is OK: Uptime: 743752 Threads: 1 Questions: 10466091 Slow queries: 3841 Opens: 5972 Flush tables: 1 Open tables: 565 Queries per second avg: 14.072 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[07:31:14] (PS2) Muehlenhoff: Remove madhuvishy from statistics-users [puppet] - https://gerrit.wikimedia.org/r/335787 (https://phabricator.wikimedia.org/T142836)
[07:35:10] (PS1) Urbanecm: [throttle] New rule for today [mediawiki-config] - https://gerrit.wikimedia.org/r/336582 (https://phabricator.wikimedia.org/T154312)
[07:35:41] (PS2) Urbanecm: [throttle] New rule for 2017-02-08 [mediawiki-config] - https://gerrit.wikimedia.org/r/336582 (https://phabricator.wikimedia.org/T154312)
[07:36:57] (CR) Muehlenhoff: [C: 2] Remove madhuvishy from statistics-users [puppet] - https://gerrit.wikimedia.org/r/335787 (https://phabricator.wikimedia.org/T142836) (owner: Muehlenhoff)
[07:39:40] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active
[07:39:50] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational
[07:40:33] Hi all, is there anybody who can deploy 336582 right now? It's for the recently announced T154312. The event is at 10:30 UTC+5:30, which has already passed. Or should I abandon the patch? I'll ask at -releng too.
[07:40:58] T154312
[07:40:58] T154312: Request for a temporary lift of account creation cap on IPs (2017-02-08) - https://phabricator.wikimedia.org/T154312
[07:49:29] Operations, Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3008335 (Samtar) a:Ocaasi_WMF @RobH 'statistics-users' seems to be what I need! I'll assign this to Ocaasi for endorsement :-)
[07:54:08] Operations, Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3006952 (MoritzMuehlenhoff) >>! In T157483#3007481, @RobH wrote: > @Samtar: Would you be able to review https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Access_Groups and i...
[07:59:40] PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:59:43] !log Adding 100G to the lv on dbstore1001
[07:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:05] (CR) Muehlenhoff: "@dzahn: That's fine: It doesn't matter what's in their LDAP attribute, the authoritative source is the yaml file." [puppet] - https://gerrit.wikimedia.org/r/336417 (https://phabricator.wikimedia.org/T142826) (owner: Muehlenhoff)
[08:01:21] (PS4) Muehlenhoff: More email addresses of WMF staff/contractors with LDAP access [puppet] - https://gerrit.wikimedia.org/r/336417 (https://phabricator.wikimedia.org/T142826)
[08:06:50] (CR) Muehlenhoff: [C: 2] More email addresses of WMF staff/contractors with LDAP access [puppet] - https://gerrit.wikimedia.org/r/336417 (https://phabricator.wikimedia.org/T142826) (owner: Muehlenhoff)
[08:08:47] (CR) Ema: [V: 2 C: 2] "> FTR, debug is triggered by adding --debug on the command line, and" [debs/pybal] (1.13) - https://gerrit.wikimedia.org/r/334567 (owner: Ema)
[08:13:24] (PS4) Ema: Log etcd connection status [debs/pybal] (1.13) - https://gerrit.wikimedia.org/r/335844 (https://phabricator.wikimedia.org/T134893)
[08:14:00] PROBLEM - puppet last run on wtp1007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:14:09] (PS1) Gehel: elasticsearch - reimage elastic200[12] to jessie and move data to /srv [puppet] - https://gerrit.wikimedia.org/r/336589 (https://phabricator.wikimedia.org/T151326)
[08:15:51] (CR) Gehel: [C: 2] elasticsearch - reimage elastic200[12] to jessie and move data to /srv [puppet] - https://gerrit.wikimedia.org/r/336589 (https://phabricator.wikimedia.org/T151326) (owner: Gehel)
[08:16:45] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2002.codfw.wmnet
[08:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:52] moritzm: unmerged puppet change about emails / ldap access, should I merge it?
[08:18:10] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic2003.codfw.wmnet [08:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:30] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic2004.codfw.wmnet [08:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:43] (03PS2) 10Ema: base::service_unit: chmod -x systemd overrides [puppet] - 10https://gerrit.wikimedia.org/r/336381 [08:18:58] (03CR) 10Ema: [V: 032 C: 032] base::service_unit: chmod -x systemd overrides [puppet] - 10https://gerrit.wikimedia.org/r/336381 (owner: 10Ema) [08:19:12] ah, sorry, forgot to press y, please go ahead [08:19:25] moritzm: done, thanks! [08:20:25] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3008416 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2003.codfw.wmnet'] ```... [08:20:38] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3008417 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2004.codfw.wmnet'] ```... [08:26:40] RECOVERY - puppet last run on rdb1005 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [08:29:10] PROBLEM - Unmerged changes on repository puppet on rhodium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [08:29:10] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
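The conftool entries logged above share a fixed shape (`set/<key>=<value>; selector: name=<host>`), so the pool state they leave behind can be replayed from the log text alone. The following is a minimal sketch, assuming that shape holds for every entry; the pattern is inferred from the lines quoted in this log rather than from conftool documentation, and `parse_conftool_entry` and `final_state` are hypothetical helper names.

```python
import re

# Pattern inferred from the log lines above (an assumption, not a conftool spec), e.g.
# "conftool action : set/pooled=no; selector: name=elastic2003.codfw.wmnet"
CONFTOOL_RE = re.compile(
    r"conftool action : set/(?P<key>\w+)=(?P<value>\w+); "
    r"selector: name=(?P<host>\S+)"
)


def parse_conftool_entry(message):
    """Return (host, key, value) for a conftool log message, or None."""
    match = CONFTOOL_RE.search(message)
    if match is None:
        return None
    return match["host"], match["key"], match["value"]


def final_state(messages):
    """Replay entries in order, keeping the last value seen per host and key."""
    state = {}
    for message in messages:
        parsed = parse_conftool_entry(message)
        if parsed is not None:
            host, key, value = parsed
            state.setdefault(host, {})[key] = value
    return state
```

Feeding it the log messages in chronological order gives the last recorded `pooled` value per host, which is a quick way to spot a host that was depooled for a reimage and not yet repooled.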
[08:29:10] PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [08:30:34] uh, that's my change ^ [08:31:20] apergos: sorry for pinging, do you think today is okay or we should wait until next week [08:31:24] I've merged it on puppetmaster1001, it should end up on codfw puppetmasters too right? [08:31:50] PROBLEM - puppet last run on meitnerium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:33:51] ema: any error in puppet-merge? Like connection problems, etc.. [08:35:26] elukey: oh actually yes [08:35:32] https://phabricator.wikimedia.org/P4909 [08:37:11] how to proceed in these cases? Is it ok to puppet-merge on puppetmaster2001 or is that Wrong? [08:38:32] ema: theoretically yes, practically I've never done it :D [08:39:45] (03PS1) 10Muehlenhoff: Add another user to list of LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/336590 [08:40:23] _joe_: what's the right course of action if puppet-merge fails halfway through on eqiad? 
https://phabricator.wikimedia.org/P4909 [08:40:29] ema: I'll simply merge my patch, then yours should catch up [08:40:47] moritzm: good plan [08:40:51] +1 [08:42:00] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [08:42:28] (03CR) 10Muehlenhoff: [C: 032] Add another user to list of LDAP users [puppet] - 10https://gerrit.wikimedia.org/r/336590 (owner: 10Muehlenhoff) [08:42:41] !log drain shards from elastic200[5678] in preparation for reimage - T151326 [08:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:45] T151326: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326 [08:43:34] (03PS5) 10Ema: Log etcd connection status [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/335844 (https://phabricator.wikimedia.org/T134893) [08:43:50] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:43:56] (03Abandoned) 10Juniorsys: role: Linting changes (backup,bastionhost+others) [puppet] - 10https://gerrit.wikimedia.org/r/334310 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [08:44:49] (03Abandoned) 10Juniorsys: ifttt/imagemagick/initramfs/interface lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/334320 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [08:45:10] RECOVERY - Unmerged changes on repository puppet on rhodium is OK: No changes to merge. [08:45:10] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. [08:45:20] RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge.
[08:45:25] nice [08:45:57] (03CR) 10Ema: [V: 032 C: 032] Log etcd connection status [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/335844 (https://phabricator.wikimedia.org/T134893) (owner: 10Ema) [08:45:58] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3008443 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2004.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic200... [08:49:02] ema: I doublechecked the service_unit change on mw1261; as expected hhvm remains unchanged and only prometheus-node-exporter gets restarted [08:49:20] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3008451 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2003.codfw.wmnet'] ``` and were **ALL** successful. [08:50:04] moritzm: yeah I've seen that. Happy times! [08:51:30] <_joe_> ema: usually just re-run puppet-merge on the hosts where it failed [08:56:24] !log cache_upload: upgrade to jessie 8.7 and reboot into kernel 4.4.2-3+wmf8 T155401 [08:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:56:30] T155401: Integrate jessie 8.7 point release - https://phabricator.wikimedia.org/T155401 [08:58:31] Amir1: today what? [08:58:46] apergos: the UI dump patch merge :D [08:59:02] I was planning tomorrow at end of day, is that a problem? [08:59:09] Thursday I think we said [08:59:36] apergos: no, it's okay. I didn't just know. I don't think you said it (or you said and I missed it) [08:59:50] RECOVERY - puppet last run on meitnerium is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [09:00:17] You missed it (I just checked my scrollback), no worries [09:00:28] Thanks! 
[09:04:58] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:05:08] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:05:18] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:05:38] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:05:38] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:05:38] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:05:38] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:05:48] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:05:48] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:05:48] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:05:48] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:05:48] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:05:48] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:05:49] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:05:49] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:05:58] PROBLEM - IPsec on cp2002 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:05:58] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:05:58] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:05:58] PROBLEM 
- IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3038_v4, cp3038_v6 [09:05:58] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:06:02] looking [09:06:28] PROBLEM - Host cp3038 is DOWN: PING CRITICAL - Packet loss = 100% [09:06:57] nice, first reboot, first host not coming up :) [09:07:16] can anybody access the management interface? I can't though it responds to ping [09:07:18] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3038_v4, cp3038_v6 [09:07:39] cp3038 ? [09:07:43] yes [09:08:51] yeah, it doesn't respond to me either [09:08:56] sigh [09:09:03] jynus: thanks [09:09:15] others around do [09:09:22] ema: it was not in the list of failing ones... :( T150160 [09:09:22] T150160: Remote IPMI doesn't work for ~2% of the fleet - https://phabricator.wikimedia.org/T150160 [09:10:10] 06Operations, 10ops-eqiad: Heating alerts for mw servers in eqiad - https://phabricator.wikimedia.org/T149287#3008496 (10hashar) [09:10:12] 06Operations, 10ops-eqiad: mw1198.eqiad.wmnet kernel reports temperature issues - https://phabricator.wikimedia.org/T157459#3008498 (10hashar) [09:10:24] ema: have you tried ipmi to see if you can get the chassis status? 
[09:10:43] volans: nope, let me try [09:11:08] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [09:12:41] (03PS1) 10Muehlenhoff: Two additional LDAP accounts confirmed by Legal/HR [puppet] - 10https://gerrit.wikimedia.org/r/336593 [09:12:48] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [09:13:19] volans: that seems to work fine [09:13:57] 06Operations, 10Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3008500 (10Samtar) @MoritzMuehlenhoff Sure can - samtar.on.en.wp@gmail.com [09:13:59] ema: then try ipmi-console maybe :D [09:14:38] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1005 is CRITICAL: CRITICAL - Expecting active but unit maintain-dbusers is failed [09:14:38] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [09:14:48] PROBLEM - Check systemd state on labstore1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:22:53] ACKNOWLEDGEMENT - Check systemd state on elastic2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Gehel fresh reimage, Im checking... [09:24:39] volans: sometimes it fails to connect, other times it hangs with SOL Established [09:24:54] nice :( [09:25:43] enough roadblocks, not enough coffee [09:25:51] ema: have you tried to turn it off and on again? 
(TM) ;) [09:31:07] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2003.codfw.wmnet [09:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:18] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2004.codfw.wmnet [09:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:42] (03CR) 10Muehlenhoff: [C: 032] Two additional LDAP accounts confirmed by Legal/HR [puppet] - 10https://gerrit.wikimedia.org/r/336593 (owner: 10Muehlenhoff) [09:33:09] (03PS2) 10Elukey: Enable aqs1009-b (AQS Cassandra cluster) [puppet] - 10https://gerrit.wikimedia.org/r/336430 (https://phabricator.wikimedia.org/T155654) [09:36:11] (03CR) 10Elukey: [C: 032] Enable aqs1009-b (AQS Cassandra cluster) [puppet] - 10https://gerrit.wikimedia.org/r/336430 (https://phabricator.wikimedia.org/T155654) (owner: 10Elukey) [09:39:38] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1005 is OK: OK - maintain-dbusers is active [09:39:48] RECOVERY - Check systemd state on labstore1005 is OK: OK - running: The system is fully operational [09:42:38] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [09:44:56] !log boostrapping aqs1009-b (last new AQS Cassandra instance) [09:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:57] (03PS1) 10Gehel: elasticsearch - reimage elastic200[5678] to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336595 (https://phabricator.wikimedia.org/T151326) [09:50:58] (03CR) 10Gehel: [C: 032] elasticsearch - reimage elastic200[5678] to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336595 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [09:55:00] !log Updated the Wikidata property suggester with data from last Monday's JSON dump and applied the 
T132839 workarounds [09:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:55:04] T132839: [RfC] Property suggester suggests human properties for non-human items - https://phabricator.wikimedia.org/T132839 [09:55:12] (03PS1) 10Giuseppe Lavagetto: role::configcluster: enable replication from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/336596 (https://phabricator.wikimedia.org/T156009) [09:55:19] sjoerddebruin: FYI ^ [09:55:29] Thanks again. [09:56:04] (03CR) 10jerkins-bot: [V: 04-1] role::configcluster: enable replication from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/336596 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [09:59:04] (03CR) 10Filippo Giunchedi: "LGTM, you'll need to add both to hieradata/eqiad.yaml too" [puppet] - 10https://gerrit.wikimedia.org/r/336354 (https://phabricator.wikimedia.org/T152504) (owner: 10Dzahn) [10:00:29] (03PS2) 10Giuseppe Lavagetto: role::configcluster: enable replication from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/336596 (https://phabricator.wikimedia.org/T156009) [10:01:13] (03CR) 10Filippo Giunchedi: [C: 031] icinga: remove pre-jessie conditional from monitoring::group [puppet] - 10https://gerrit.wikimedia.org/r/318442 (https://phabricator.wikimedia.org/T125023) (owner: 10Dzahn) [10:02:54] 06Operations: Broken IPMI on cp3038 - https://phabricator.wikimedia.org/T157537#3008548 (10ema) [10:03:36] 06Operations: Broken IPMI on cp3038, host failed coming back online after a reboot - https://phabricator.wikimedia.org/T157537#3008563 (10ema) [10:04:40] 06Operations, 06DC-Ops: Broken IPMI on cp3038, host failed coming back online after a reboot - https://phabricator.wikimedia.org/T157537#3008548 (10ema) [10:06:46] (03PS3) 10Giuseppe Lavagetto: role::configcluster: enable replication from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/336596 (https://phabricator.wikimedia.org/T156009) [10:11:00] !log upgrading 
hhvm on codfw mediawiki cluster [10:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:20] (03CR) 10Filippo Giunchedi: [C: 04-1] "I've run PCC at https://puppet-compiler.wmflabs.org/5378/xenon.eqiad.wmnet/ though the jar that will be added doesn't exist (the -SNAPSHOT" [puppet] - 10https://gerrit.wikimedia.org/r/335826 (https://phabricator.wikimedia.org/T155120) (owner: 10Eevans) [10:18:53] (03PS1) 10Marostegui: check_private_data.py: Add missing quote [puppet] - 10https://gerrit.wikimedia.org/r/336600 (https://phabricator.wikimedia.org/T153743) [10:20:46] (03PS1) 10Jcrespo: mariadb: Depool db1045 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336601 (https://phabricator.wikimedia.org/T111654) [10:23:10] (03PS2) 10Muehlenhoff: Make the experimental archive section generally available [puppet] - 10https://gerrit.wikimedia.org/r/336420 [10:23:54] (03CR) 10Ema: [C: 031] Make the experimental archive section generally available [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [10:24:05] (03PS4) 10Giuseppe Lavagetto: role::configcluster: enable replication from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/336596 (https://phabricator.wikimedia.org/T156009) [10:24:08] (03CR) 10Jcrespo: [C: 031] check_private_data.py: Add missing quote [puppet] - 10https://gerrit.wikimedia.org/r/336600 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [10:25:50] (03CR) 10Marostegui: [C: 032] check_private_data.py: Add missing quote [puppet] - 10https://gerrit.wikimedia.org/r/336600 (https://phabricator.wikimedia.org/T153743) (owner: 10Marostegui) [10:27:01] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [10:27:01] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [10:27:11] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [10:27:11] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [10:27:11] RECOVERY - Host cp3038 is UP: PING OK - 
Packet loss = 0%, RTA = 83.69 ms [10:27:21] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [10:27:21] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [10:27:41] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [10:27:41] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [10:27:41] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [10:27:41] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [10:27:51] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [10:27:51] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [10:27:51] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [10:27:51] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [10:27:51] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [10:27:51] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [10:28:01] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [10:28:01] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [10:28:01] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [10:28:01] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [10:28:01] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [10:28:57] cp3038 back up? [10:29:03] how did you fix it? 
[10:30:05] https://i.imgur.com/iZcUNxH.mp4 [10:30:13] sudo ipmitool -I lanplus -H cp3038.mgmt.esams.wmnet -U root mc reset cold [10:30:50] so ssh was broken but ipmi not [10:30:52] which I don't think should fix it really, but that's the last command I've tried to get the drac back [10:31:17] well ipmi is at least a bit broken, ipmiconsole hangs [10:31:24] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1045 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336601 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [10:32:41] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1045 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336601 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [10:34:15] (03Merged) 10jenkins-bot: mariadb: Depool db1045 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336601 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [10:34:26] (03CR) 10jenkins-bot: mariadb: Depool db1045 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336601 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [10:35:21] ema: great, make sure to open a task for ops-esams to take a look at it next time ;) [10:36:16] volans: great is something else but yeah :P [10:36:30] great that you got it back ;) [10:37:07] (03CR) 10Volans: "-1 this is not the right way to generate cassandra dummy secrets, there is a cassandra-ca-manager tool for that." [labs/private] - 10https://gerrit.wikimedia.org/r/336462 (owner: 10Andrew Bogott) [10:37:26] 06Operations, 06DC-Ops: Broken IPMI on cp3038, host failed coming back online after a reboot - https://phabricator.wikimedia.org/T157537#3008693 (10ema) I've just tried to get the drac back as follows: ``` sudo ipmitool -I lanplus -H cp3038.mgmt.esams.wmnet -U root mc reset cold ``` And that resulted in the... [10:37:51] can we document this magic somewhere ? 
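The DRAC recovery described above can be sketched as a small dry-run helper. This is a sketch under assumptions, not a documented procedure: only the `mc reset cold` invocation is taken from the log; the `chassis status` checks are an assumed way of doing the check volans suggested, and `ipmitool_cmd` and `recovery_steps` are hypothetical names. The script only prints the commands and never runs them.

```python
import shlex

# Host and user mirror the command quoted in the log above.
MGMT_HOST = "cp3038.mgmt.esams.wmnet"
IPMI_USER = "root"


def ipmitool_cmd(host, user, *subcommand):
    """Build an ipmitool argv list for a remote BMC over the lanplus interface."""
    return ["sudo", "ipmitool", "-I", "lanplus", "-H", host, "-U", user, *subcommand]


def recovery_steps(host=MGMT_HOST, user=IPMI_USER):
    """Ordered steps: check the chassis, cold-reset the BMC, check again."""
    return [
        # Assumption: "chassis status" is how the IPMI check was done.
        ipmitool_cmd(host, user, "chassis", "status"),
        # From the log: the last command tried before the host came back.
        ipmitool_cmd(host, user, "mc", "reset", "cold"),
        ipmitool_cmd(host, user, "chassis", "status"),
    ]


if __name__ == "__main__":
    # Dry run: print each command instead of executing it.
    for argv in recovery_steps():
        print(shlex.join(argv))
```

A cold reset restarts the management controller, so a status check issued right afterwards may fail until the controller is reachable again. As noted in the conversation, it is not clear that the reset is what actually brought the host back.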
[10:38:00] 06Operations, 10ops-esams, 06DC-Ops: Broken IPMI/drac on cp3038 - https://phabricator.wikimedia.org/T157537#3008694 (10ema) [10:38:11] elukey: godog did already https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/Dell_PowerEdge_RN20_Gen8#Troubleshooting [10:38:27] !log upgrading openssl, libgd, lcms, gnutls, sqlite, libxpm and glibc in codfw mediawiki cluster (so get get effected by the restart during the HHVM upgrade) [10:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:38:54] nice! [10:39:54] !log upgrading and restarting db1045 T111654 [10:39:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:59] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [10:49:05] (03PS2) 10Volans: Add missing dummy secrets from production [labs/private] - 10https://gerrit.wikimedia.org/r/335643 [10:52:16] (03PS5) 10Giuseppe Lavagetto: role::configcluster: enable replication from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/336596 (https://phabricator.wikimedia.org/T156009) [10:52:37] (03PS3) 10Volans: Add missing dummy secrets from production [labs/private] - 10https://gerrit.wikimedia.org/r/335643 [10:53:21] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:55:55] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1045 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336604 [11:03:01] (03PS2) 10Elukey: Replace Memcached/Redis codfw shard12->16 [puppet] - 10https://gerrit.wikimedia.org/r/336419 (https://phabricator.wikimedia.org/T155755) [11:03:41] PROBLEM - puppet last run on db1057 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Exec[pt-heartbeat-kill] [11:07:14] (03CR) 10Filippo Giunchedi: [C: 031] Add missing dummy secrets from production [labs/private] - 10https://gerrit.wikimedia.org/r/335643 (owner: 10Volans) [11:07:26] (03CR) 10Hashar: "What I am wondering though is how the apt priority will work and which packages are going to be installed." [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [11:08:37] (03PS3) 10Elukey: Replace Memcached/Redis codfw shard12->16 [puppet] - 10https://gerrit.wikimedia.org/r/336419 (https://phabricator.wikimedia.org/T155755) [11:08:50] (03CR) 10Volans: [V: 032 C: 032] Add missing dummy secrets from production [labs/private] - 10https://gerrit.wikimedia.org/r/335643 (owner: 10Volans) [11:12:36] (03PS4) 10Elukey: Replace Memcached/Redis codfw shard12->16 [puppet] - 10https://gerrit.wikimedia.org/r/336419 (https://phabricator.wikimedia.org/T155755) [11:18:58] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#2961839 (10MoritzMuehlenhoff) With T153488 two job runners in eqiad were reimaged as vid... 
[11:19:18] (03PS1) 10Jcrespo: mariadb: Disable TLS on redact_sanitarium.sh [puppet] - 10https://gerrit.wikimedia.org/r/336607 [11:19:46] (03CR) 10Marostegui: [C: 031] mariadb: Disable TLS on redact_sanitarium.sh [puppet] - 10https://gerrit.wikimedia.org/r/336607 (owner: 10Jcrespo) [11:20:09] (03CR) 10Jcrespo: [C: 032] mariadb: Disable TLS on redact_sanitarium.sh [puppet] - 10https://gerrit.wikimedia.org/r/336607 (owner: 10Jcrespo) [11:21:11] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [11:21:35] 06Operations, 05DC-Switchover-Prep-Q3-2016-17, 07Epic, 07Wikimedia-Multiple-active-datacenters: Check the size of every cluster in codfw to see if it matches eqiad's capacity - https://phabricator.wikimedia.org/T156023#3008799 (10elukey) >>! In T156023#3008794, @MoritzMuehlenhoff wrote: > With T153488 two... [11:29:51] PROBLEM - puppet last run on mw1259 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:31:41] RECOVERY - puppet last run on db1057 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [11:39:42] _joe_ hi, I'm wondering, could you remove your -2 on https://gerrit.wikimedia.org/r/#/c/333358/ and could you re-review please? [11:39:50] I've amended it since you did your -2 [11:39:51] :) [11:40:05] <_joe_> paladox: I'll take a look later [11:40:12] ok thanks.
[11:40:15] <_joe_> thanks for working on it [11:40:23] you're welcome :) [11:42:37] I can actually test that puppet change now, I have puppet master on the phabricator (labs instance) :) [11:43:59] (03PS14) 10Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - 10https://gerrit.wikimedia.org/r/333358 [11:48:23] (03PS6) 10Giuseppe Lavagetto: role::configcluster: enable replication from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/336596 (https://phabricator.wikimedia.org/T156009) [11:57:07] (03PS15) 10Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - 10https://gerrit.wikimedia.org/r/333358 [11:57:58] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - 10https://gerrit.wikimedia.org/r/333358 (owner: 10Paladox) [11:58:21] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:58:28] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1045 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336604 (owner: 10Jcrespo) [11:58:47] (03PS16) 10Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - 10https://gerrit.wikimedia.org/r/333358 [11:59:44] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - 10https://gerrit.wikimedia.org/r/333358 (owner: 10Paladox) [11:59:51] RECOVERY - puppet last run on mw1259 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [12:00:06] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1045 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336604 (owner: 10Jcrespo) [12:00:13] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1045 for maintenance" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/336604 (owner: 10Jcrespo) [12:01:13] (03PS17) 10Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - 10https://gerrit.wikimedia.org/r/333358 [12:01:26] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1045 after maintenance (duration: 00m 42s) [12:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:02] (03PS18) 10Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - 10https://gerrit.wikimedia.org/r/333358 [12:04:01] (03PS19) 10Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - 10https://gerrit.wikimedia.org/r/333358 [12:06:09] (03PS20) 10Paladox: Phabricator: Fix phd init script, also use systemd script if the os is cable of it [puppet] - 10https://gerrit.wikimedia.org/r/333358 [12:08:15] (03PS7) 10Giuseppe Lavagetto: role::configcluster: enable replication from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/336596 (https://phabricator.wikimedia.org/T156009) [12:09:18] (03CR) 10Giuseppe Lavagetto: [C: 032] role::configcluster: enable replication from eqiad to codfw [puppet] - 10https://gerrit.wikimedia.org/r/336596 (https://phabricator.wikimedia.org/T156009) (owner: 10Giuseppe Lavagetto) [12:09:27] (03PS1) 10Jcrespo: mariadb: Depool db1037 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336609 (https://phabricator.wikimedia.org/T111654) [12:12:26] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1037 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336609 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [12:12:41] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1037 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336609 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [12:12:55] (03CR) 10jenkins-bot: mariadb: Depool db1037 for 
maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336609 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [12:15:12] PROBLEM - puppet last run on conf2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/var/log/etcdmirror-conftool-eqiad-wmnet/syslog.log] [12:15:28] <_joe_> that's me ^^ (or better, a bug in a module I used [12:15:56] (03PS1) 10Giuseppe Lavagetto: systemd::syslog: fix owner and group of logfile [puppet] - 10https://gerrit.wikimedia.org/r/336611 [12:16:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1037 for maintenance (duration: 00m 40s) [12:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:30] !log upgrading and restarting db1037 T111654 [12:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:17:34] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [12:18:00] (03PS1) 10Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336612 (https://phabricator.wikimedia.org/T156126) [12:19:40] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "This might have unintended consequences." 
[puppet] - 10https://gerrit.wikimedia.org/r/336611 (owner: 10Giuseppe Lavagetto) [12:21:28] (03PS21) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 [12:22:31] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 (owner: 10Paladox) [12:25:06] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1037 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336614 [12:25:19] (03PS22) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 [12:26:22] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 (owner: 10Paladox) [12:26:58] (03CR) 10Marostegui: [C: 031] Revert "mariadb: Depool db1037 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336614 (owner: 10Jcrespo) [12:27:21] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [12:28:05] (03PS23) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 [12:31:19] (03CR) 10Giuseppe Lavagetto: [C: 032] "On second thoughts and after analyzing the code, this will only change ownership of a few log files without serious consequences. 
It is al" [puppet] - 10https://gerrit.wikimedia.org/r/336611 (owner: 10Giuseppe Lavagetto) [12:32:01] (03PS24) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 [12:32:24] (03PS25) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 [12:32:48] (03CR) 10Jcrespo: [C: 031] db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336612 (https://phabricator.wikimedia.org/T156126) (owner: 10Marostegui) [12:33:11] RECOVERY - puppet last run on conf2002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [12:33:24] (03CR) 10jerkins-bot: [V: 04-1] Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 (owner: 10Paladox) [12:33:42] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1037 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336614 (owner: 10Jcrespo) [12:35:12] PROBLEM - Check systemd state on conf2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:35:28] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1037 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336614 (owner: 10Jcrespo) [12:36:56] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1037 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336614 (owner: 10Jcrespo) [12:37:20] (03PS26) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 [12:37:58] (03PS2) 10Marostegui: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336612 (https://phabricator.wikimedia.org/T156126) [12:38:30] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1037 after maintenance (duration: 00m 41s) [12:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336612 (https://phabricator.wikimedia.org/T156126) (owner: 10Marostegui) [12:41:48] (03PS1) 10Giuseppe Lavagetto: etcdmirror::instance: fix binary path, do not declare the service [puppet] - 10https://gerrit.wikimedia.org/r/336617 [12:42:33] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336612 (https://phabricator.wikimedia.org/T156126) (owner: 10Marostegui) [12:42:44] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1073 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336612 (https://phabricator.wikimedia.org/T156126) (owner: 10Marostegui) [12:43:20] (03CR) 10Giuseppe Lavagetto: [C: 032] etcdmirror::instance: fix binary path, do not declare the service [puppet] - 10https://gerrit.wikimedia.org/r/336617 (owner: 10Giuseppe Lavagetto) [12:43:36] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 2 others: Deploy InterwikiSorting extension to production - 
https://phabricator.wikimedia.org/T150183#3008863 (10Addshore) @Lea_Lacroix_WMDE @Lydia_Pintscher this should be ready to go next week technically... [12:43:43] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1073 - T156126 (duration: 00m 40s) [12:43:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:48] T156126: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126 [12:44:01] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3008866 (10Addshore) [12:44:21] 06Operations, 10DBA, 13Patch-For-Review: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126#3008868 (10Marostegui) Server has been depooled and downtimed - ready to shut it down whenever @Cmjohnson is at the DC. [12:48:03] (03PS27) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 [12:48:55] (03PS2) 10ArielGlenn: cleanup old files after dataset100 rsync of dumps to labs [puppet] - 10https://gerrit.wikimedia.org/r/336451 [12:49:11] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [12:49:44] (03PS1) 10Muehlenhoff: Add three users to absent group [puppet] - 10https://gerrit.wikimedia.org/r/336619 (https://phabricator.wikimedia.org/T142836) [12:50:12] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3381577 keys, up 100 days 4 hours - replication_delay is 0 [12:50:12] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [12:50:18] (03CR) 10jerkins-bot: [V: 04-1] cleanup old files after dataset100 rsync of dumps to labs [puppet] - 
10https://gerrit.wikimedia.org/r/336451 (owner: 10ArielGlenn) [12:51:08] (03PS1) 10Jcrespo: mariadb: Depool db1026 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336620 (https://phabricator.wikimedia.org/T111654) [12:51:11] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3381594 keys, up 100 days 4 hours - replication_delay is 0 [12:52:06] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1026 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336620 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [12:55:34] (03PS28) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 [13:06:42] icinga--, why did backup4001 disk alerts get reenabled? [13:09:44] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic2005.codfw.wmnet [13:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:57] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic2006.codfw.wmnet [13:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:09] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic2007.codfw.wmnet [13:10:12] RECOVERY - Check systemd state on conf2002 is OK: OK - running: The system is fully operational [13:10:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:23] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic2008.codfw.wmnet [13:10:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:46] Jeff_Green: maybe an earlier downtime expired? 
[13:10:48] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3008894 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2005.codfw.wmnet'] ```... [13:11:06] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3008895 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2006.codfw.wmnet'] ```... [13:11:14] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3008896 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2007.codfw.wmnet'] ```... [13:11:30] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3008897 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2008.codfw.wmnet'] ```... [13:12:14] moritzm: i guess there was an acknowledgement from the day before, maybe that's it [13:14:01] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:14:58] (03PS5) 10Elukey: Replace Memcached/Redis codfw shard12->16 [puppet] - 10https://gerrit.wikimedia.org/r/336419 (https://phabricator.wikimedia.org/T155755) [13:16:13] moritzm: I replied on the addition of the 'experimental' component to our apt config. 
Not sure how the priorities are going to be handled and which packages will end up being installed by default [13:16:22] ^ the salt-minion alert above is me doing multiple reimages in parallel, check is probably not specific enough, having a look [13:18:10] yeah I think we got it in the past [13:18:49] (03CR) 10Muehlenhoff: "There's no need to meddle with apt priorities, the repository can be selectively enabled on the test hosts via Hiera." [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [13:25:02] (03CR) 10Hashar: "What if in experimental we have Linux Kernel 5 and Jenkins 2. Is that going to cause both to be installed or would one still need to exp" [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [13:26:56] (03CR) 10Muehlenhoff: "Packages in experimental are ultimately in there for a short term anyway, but also it's still entirely in our control what we're updating." [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [13:27:32] ACKNOWLEDGEMENT - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion Gehel multiple reimaging in progress (T151326), which generates multiple salt-minion processes [13:28:02] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:29:10] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3008998 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2006.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic200... 
[13:29:49] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3009002 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2005.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic200... [13:36:24] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1026 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336620 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [13:37:30] (03CR) 10Hashar: "> Packages in experimental are ultimately in there for a short term anyway, but also it's still entirely in our control what we're updatin" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/336420 (owner: 10Muehlenhoff) [13:37:56] (03Merged) 10jenkins-bot: mariadb: Depool db1026 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336620 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [13:38:04] (03CR) 10jenkins-bot: mariadb: Depool db1026 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336620 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [13:39:14] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1026 for maintenance (duration: 00m 41s) [13:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:45] !log Enable replication between db1095 and db1064 - T153743 [13:40:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:49] T153743: Add and sanitize s2, s4, s5, s6 and s7 to sanitarium2 and new labsdb hosts - https://phabricator.wikimedia.org/T153743 [13:40:55] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3009082 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2007.codfw.wmnet'] ``` and were **ALL** successful. 
[13:41:19] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3009083 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2008.codfw.wmnet'] ``` and were **ALL** successful. [13:41:29] !log Start replication on db1064 - T153743 [13:41:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:12] (03CR) 10Elukey: [C: 032] Replace Memcached/Redis codfw shard12->16 [puppet] - 10https://gerrit.wikimedia.org/r/336419 (https://phabricator.wikimedia.org/T155755) (owner: 10Elukey) [13:46:10] !log replacing the codfw memcached/redis shards 12->16 [13:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:23] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3009098 (10ArielGlenn) I'm reviving this old ticket since now's the time to talk about hardwre and spend some money. I'm also adding labs (soon to be Cloud!) folks @chasemp and @... [13:50:31] jouncebot: next [13:50:32] In 0 hour(s) and 9 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170208T1400) [13:50:50] zeljkof: swat is empty today \O/ [13:58:27] all right all new mc hosts are up and running, replication works in all of them (checked via redis-cli) [13:58:49] now I'll restart nutcracker across the mw servers in codfw [14:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170208T1400). Please do the needful. [14:01:31] Hashar: Nice [14:04:34] Currently testing mc2017 with apache-fast-test just in case [14:04:44] err mw2017 [14:05:56] all good [14:06:38] elukey: when exactly? 
I'm currently upgrading hhvm there [14:06:48] so just tell me when do you it and I'll stop [14:06:55] so that it doesn't clash [14:07:30] moritzm: sorry just finished, I was testing if the new memcached settings were hitting some errors [14:07:36] normal http requests [14:07:46] should I wait to restart nutcracker in codfw? [14:08:07] (I also need to wait for puppet to run anyway) [14:08:36] ok, then it's just fine [14:14:48] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3009172 (10Ottomata) > I'd like a solution for the stats hosts that involves someting better than an nfs mount. Maybe putting the files they use directly into hdfs? cc @Mforns as... [14:17:32] !log upgrading and restarting db1026 T111654 [14:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:36] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [14:19:03] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [14:19:13] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [14:19:13] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [14:19:14] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:14] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:14] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:14] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:14] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:16] PROBLEM - 
IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:16] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:17] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:17] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:17] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:17] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:23] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [14:19:23] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:23] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:23] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:23] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [14:19:23] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:23] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1071_v4, cp1071_v6, cp2022_v4, cp2022_v6 [14:19:43] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [14:19:43] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [14:19:43] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [14:19:53] PROBLEM 
- IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [14:19:53] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2022_v4, cp2022_v6 [14:20:14] 1071, 2022 [14:20:19] looking [14:20:23] PROBLEM - Host cp2022 is DOWN: PING CRITICAL - Packet loss = 100% [14:23:39] moritzm: fine for me to run puppet on all the appservers in codfw? Just want to make sure that they pick up the nutcracker config before restarting (will do in batches) [14:23:52] !log drain shards from elastic20(09|10|11|12) in preparation for reimage - T151326 [14:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:56] T151326: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326 [14:23:56] !log cp2022 stuck rebooting, power-cycled [14:23:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:53] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [14:24:53] RECOVERY - Host cp2022 is UP: PING OK - Packet loss = 0%, RTA = 36.72 ms [14:25:03] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [14:25:13] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [14:25:13] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [14:25:13] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [14:25:14] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK [14:25:14] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [14:25:14] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [14:25:14] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [14:25:14] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [14:25:15] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [14:25:16] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [14:25:16] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [14:25:16] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP 
OK [14:25:17] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [14:25:20] Wew. [14:25:23] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [14:25:23] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [14:25:23] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [14:25:23] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [14:25:23] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [14:25:23] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [14:25:23] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [14:25:32] elukey: yeah, that's fine, I'm also running various puppet runs in batches (since hhvm needs one to update the php.ini after the upgrade), but it's not an issue [14:25:43] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [14:25:43] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [14:25:43] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [14:25:53] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [14:25:55] moritzm: super [14:26:47] !log restarting nutcracker in all the codfw mw servers to pick up the new shards [14:26:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:06] (03PS1) 10Gehel: elasticsearch - reimage elastic20(09|10|11|12) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336625 (https://phabricator.wikimedia.org/T151326) [14:29:55] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2005.codfw.wmnet [14:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:06] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2006.codfw.wmnet [14:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:18] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2007.codfw.wmnet [14:30:23] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:04] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2008.codfw.wmnet [14:31:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:37] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1026 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336626 [14:34:23] (03CR) 10Marostegui: [C: 031] Revert "mariadb: Depool db1026 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336626 (owner: 10Jcrespo) [14:39:13] PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm-dbg] [14:41:13] RECOVERY - puppet last run on mw2190 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:54:00] !log Shutdown db1073 for maintenance - https://phabricator.wikimedia.org/T156126  [14:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:48] (03PS29) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 [14:54:54] _joe_: morning! sorry for the lame puppet swat change yesterday. Could not figure out the .erb expansion but I guess I solved it now. [14:55:02] (03CR) 10Andrew Bogott: "My patch made the puppet compiler work, which was all I really cared about. Thank you for improving things though!" [labs/private] - 10https://gerrit.wikimedia.org/r/336462 (owner: 10Andrew Bogott) [14:55:07] _joe_: even added a rspec spec :} ( https://gerrit.wikimedia.org/r/#/c/290895/ ) [14:55:18] <_joe_> hashar: cool, I need your help on a change btw [14:55:25] I am there! 
[14:55:38] <_joe_> hashar: specifically, https://gerrit.wikimedia.org/r/#/c/336230/ [14:55:50] <_joe_> we should skip validation of some test files there [14:55:59] <_joe_> they're written for puppet 4.x [14:56:13] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm-dbg] [14:56:23] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[hhvm-dbg] [14:56:30] (03PS3) 10ArielGlenn: cleanup old files after dataset100 rsync of dumps to labs [puppet] - 10https://gerrit.wikimedia.org/r/336451 [14:57:05] <_joe_> hashar: actually, I'd skip any verification in modules/stdlib/spec and modules/stdlib/types if possible [14:58:11] arGHGhghg [14:58:29] <_joe_> it's a third-party module after all :) [14:58:34] upgrading stdlib by several versions right ? [14:58:38] <_joe_> yes [14:58:41] <_joe_> to the latest [14:58:51] <_joe_> we weren't doing it since forever [14:58:54] I am pretty sure our stdlib copy is a fork with some custom hacks [14:58:56] (03PS7) 10Giuseppe Lavagetto: rsync: allow extra settings in rsyncd.conf [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [14:59:15] <_joe_> hashar: I am pretty sure there aren't many left [14:59:26] <_joe_> and those, are now irrelevant more or less [14:59:31] your call :-} [14:59:33] regardless [14:59:54] pplint-HEAD just runs "puppet parser validate" on any .pp file changed in HEAD [15:00:01] with no ignore [15:00:04] hoo: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170208T1500). Please do the needful. 
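Editorial sketch of the CI behaviour discussed above: pplint-HEAD validates every changed .pp file with no ignore list, while the request is to skip modules/stdlib/spec and modules/stdlib/types. The function name, the glob list, and the sample paths below are hypothetical and are not taken from the actual CI job; the sketch only selects files and does not call puppet itself.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion globs, based on the two directories named in the chat.
EXCLUDED_GLOBS = [
    "modules/stdlib/spec/*",
    "modules/stdlib/types/*",
]


def manifests_to_validate(changed_files, excluded_globs=EXCLUDED_GLOBS):
    """Return the changed .pp files that are not under an excluded path."""
    selected = []
    for path in changed_files:
        # Only Puppet manifests are candidates for "puppet parser validate".
        if not path.endswith(".pp"):
            continue
        # fnmatchcase's "*" also matches "/", so one glob covers a whole subtree.
        if any(fnmatchcase(path, glob) for glob in excluded_globs):
            continue
        selected.append(path)
    return selected


# Illustrative file list (made-up paths).
changed = [
    "modules/rsync/manifests/server.pp",
    "modules/stdlib/spec/fixtures/test/manifests/init.pp",
    "modules/stdlib/types/absolutepath.pp",
    "modules/rsync/templates/rsyncd.conf.erb",
]
print(manifests_to_validate(changed))
```

Each path returned would then be handed to "puppet parser validate"; manifests elsewhere under modules/stdlib would still be checked under this assumption.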
[15:00:22] (03PS30) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 [15:00:35] I have a few patches to remove that job entirely in favor of using the gem puppet-syntax that has a rake task so one can just "rake syntax:manifests" and get everything linted [15:00:35] On it [15:00:36] (03PS31) 10Paladox: Phabricator: Fix phd init and systemd script also update ssh-phab to use base class [puppet] - 10https://gerrit.wikimedia.org/r/333358 [15:00:42] but that would still run with whatever puppet version [15:00:49] is defined in Gemfile [15:01:13] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [15:01:23] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:02:08] <_joe_> hashar: so I could just merge my change and it will never affect anyone not updating stdlib, right? [15:02:31] for CI yeah [15:02:40] but whenever the rake task using puppet-syntax lands [15:02:45] people running rake syntax [15:03:01] would end up running puppet parser validate on all of them and that will fail/error out the same way [15:03:51] _joe_: my patch is https://gerrit.wikimedia.org/r/#/c/331239/ if you wanna try it locally on top of yours [15:04:00] bundle install [15:04:15] bundle exec rake lint_head # what CI will run [15:04:30] <_joe_> hashar: we will need to exclude stdlib from that process [15:04:53] PROBLEM - puppet last run on labstore1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [15:05:38] (03CR) 10Giuseppe Lavagetto: [C: 032] rsync: allow extra settings in rsyncd.conf [puppet] - 10https://gerrit.wikimedia.org/r/290895 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [15:06:48] _joe_: and the child change https://gerrit.wikimedia.org/r/#/c/290896/ adds "forward lookup = no" to the CI rsync server :) [15:06:56] <_joe_> yeah I know [15:07:22] (03PS4) 10Giuseppe Lavagetto: contint: disable DNS lookup for castor rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/290896 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [15:07:37] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] contint: disable DNS lookup for castor rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/290896 (https://phabricator.wikimedia.org/T136276) (owner: 10Hashar) [15:15:26] !log hoo@tin Synchronized php-1.29.0-wmf.11/extensions/Wikidata: Wikibase uses multiple EntityPrefetchers (T157380) (duration: 02m 11s) [15:15:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:31] T157380: Wikibase uses multiple instances of PrefetchingWikiPageEntityMetaDataAccessor - https://phabricator.wikimedia.org/T157380 [15:15:52] (03PS1) 10Marostegui: Update db1072 IP to match rack change to B3 [dns] - 10https://gerrit.wikimedia.org/r/336631 (https://phabricator.wikimedia.org/T156126) [15:20:05] (03PS6) 10Hashar: contint: move from /mnt to /srv [puppet] - 10https://gerrit.wikimedia.org/r/312523 (https://phabricator.wikimedia.org/T146381) [15:20:56] !log hoo@tin Synchronized php-1.29.0-wmf.10/extensions/Wikidata: Wikibase uses multiple EntityPrefetchers (T157380) (duration: 02m 07s) [15:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:00] T157380: Wikibase uses multiple instances of PrefetchingWikiPageEntityMetaDataAccessor - https://phabricator.wikimedia.org/T157380 [15:21:58] (03PS2) 10Marostegui: Update db1073 IP to match rack change to B3 [dns] - 
10https://gerrit.wikimedia.org/r/336631 (https://phabricator.wikimedia.org/T156126) [15:23:09] AHH Paths can be excluded with: PuppetSyntax.exclude_paths = ["vendor/**/*"] [15:23:59] _joe_ Hi again, I've done a lot more updates since this morning + I tested it too :). It also includes some fixes for the ssh-phab service running on labs, which are needed, otherwise it will prevent me from ssh'ing into it. Prod is unaffected by the changes I did for labs. :) :) [15:26:00] (03PS3) 10Marostegui: Update db1073 IP to match rack change to B3 [dns] - 10https://gerrit.wikimedia.org/r/336631 (https://phabricator.wikimedia.org/T156126) [15:27:03] PROBLEM - puppet last run on dubnium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:27:14] (03CR) 10Cmjohnson: [C: 031] Update db1073 IP to match rack change to B3 [dns] - 10https://gerrit.wikimedia.org/r/336631 (https://phabricator.wikimedia.org/T156126) (owner: 10Marostegui) [15:28:03] !log Eqiad cr1/cr2 - Updated analytics-in4 for new aqs nodes and removed decommed ones [15:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:30] (03CR) 10Hashar: "The pplint-HEAD job is running "puppet parser validate" against any .pp file changed in a patch. That job is going to be removed in favor" [puppet] - 10https://gerrit.wikimedia.org/r/336230 (owner: 10Giuseppe Lavagetto) [15:29:05] (03CR) 10Marostegui: [C: 032] Update db1073 IP to match rack change to B3 [dns] - 10https://gerrit.wikimedia.org/r/336631 (https://phabricator.wikimedia.org/T156126) (owner: 10Marostegui) [15:32:57] (03PS1) 10Marostegui: db-codfw,db-eqiad.php: Change db1073 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336633 (https://phabricator.wikimedia.org/T156126) [15:33:53] RECOVERY - puppet last run on labstore1004 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [15:36:32] (03CR) 10Jcrespo: "Let me merge first, please." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/336633 (https://phabricator.wikimedia.org/T156126) (owner: 10Marostegui) [15:36:43] PROBLEM - puppet last run on kafka1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:36:48] jynus: no problem! [15:37:02] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1026 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336626 (owner: 10Jcrespo) [15:38:03] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:38:13] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:38:13] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2011_v4, cp2011_v6 [15:38:13] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2011_v4, cp2011_v6 [15:38:14] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2011_v4, cp2011_v6 [15:38:14] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2011_v4, cp2011_v6 [15:38:14] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2011_v4, cp2011_v6 [15:38:14] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2011_v4, cp2011_v6 [15:38:23] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1072_v4, cp1072_v6, cp2011_v4, cp2011_v6 [15:38:23] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1072_v4, cp1072_v6, cp2011_v4, cp2011_v6 [15:38:23] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1072_v4, cp1072_v6, cp2011_v4, cp2011_v6 [15:38:23] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:38:23] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1072_v4, cp1072_v6, cp2011_v4, cp2011_v6 [15:38:23] 
PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1072_v4, cp1072_v6, cp2011_v4, cp2011_v6 [15:38:23] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1072_v4, cp1072_v6, cp2011_v4, cp2011_v6 [15:38:24] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1072_v4, cp1072_v6, cp2011_v4, cp2011_v6 [15:38:24] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:38:25] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1072_v4, cp1072_v6, cp2011_v4, cp2011_v6 [15:38:25] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1072_v4, cp1072_v6, cp2011_v4, cp2011_v6 [15:38:26] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1072_v4, cp1072_v6, cp2011_v4, cp2011_v6 [15:38:42] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1026 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336626 (owner: 10Jcrespo) [15:38:43] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:38:43] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:38:53] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:38:53] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:38:53] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2011_v4, cp2011_v6 [15:39:33] PROBLEM - Host cp2011 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:08] looking [15:40:33] !log jynus@tin Synchronized wmf-config/db-eqiad.php: repool db1026 (duration: 00m 41s) [15:40:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:43] (03CR) 10Jcrespo: [C: 031] "good to go" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336633 
(https://phabricator.wikimedia.org/T156126) (owner: 10Marostegui) [15:40:48] (03PS2) 10Jcrespo: db-codfw,db-eqiad.php: Change db1073 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336633 (https://phabricator.wikimedia.org/T156126) (owner: 10Marostegui) [15:40:52] thanks [15:41:11] oh you rebased it for me, how nice :) [15:41:19] it is the least I could do [15:41:27] after blocking you [15:41:44] not a big deal :) [15:41:59] !log cp2011 stuck rebooting, power-cycled [15:42:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:42] (03CR) 10Marostegui: [C: 032] db-codfw,db-eqiad.php: Change db1073 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336633 (https://phabricator.wikimedia.org/T156126) (owner: 10Marostegui) [15:43:53] RECOVERY - Host cp2011 is UP: PING WARNING - Packet loss = 44%, RTA = 36.54 ms [15:43:59] 06Operations, 10Dumps-Generation: determine hardware needs for dumps in eqiad and codfw - https://phabricator.wikimedia.org/T118154#3009548 (10chasemp) Thanks @ArielGlenn, I've been meaning to email you to followup :) To recap a bit of our conversation (and really my limited insight here): * Labstore1003 is... 
[15:44:03] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [15:44:13] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [15:44:13] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [15:44:14] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [15:44:14] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [15:44:14] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [15:44:14] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [15:44:14] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [15:44:20] (03Merged) 10jenkins-bot: db-codfw,db-eqiad.php: Change db1073 IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336633 (https://phabricator.wikimedia.org/T156126) (owner: 10Marostegui) [15:44:23] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [15:44:23] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [15:44:23] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [15:44:23] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [15:44:23] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK [15:44:23] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [15:44:24] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [15:44:24] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [15:44:25] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [15:44:25] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [15:44:26] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [15:44:27] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [15:44:27] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [15:44:27] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [15:44:43] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [15:44:43] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [15:44:53] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [15:44:53] RECOVERY 
- IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [15:44:53] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [15:45:47] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: db1073 change IP - T156126 (duration: 00m 40s) [15:45:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:51] T156126: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126 [15:46:37] !log marostegui@tin Synchronized wmf-config/db-codfw.php: db1073 change IP - T156126 (duration: 00m 40s) [15:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:07] 06Operations, 10Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3009583 (10RobH) I stand corrected. So this access request still is outstanding two items: * approval from @Ocaasi_WMF * NDA verification from Legal (which is why @MoritzMuehlenho... [15:54:45] 06Operations, 10ops-esams, 06DC-Ops: Broken IPMI/drac on cp3038 - https://phabricator.wikimedia.org/T157537#3009586 (10RobH) a:05Volans>03mark cp3038 is under warranty until 2018-03-04. It has a service tag of 45MYV42. When the DRAC stops working, often simply removing ALL power to the system will rese... [15:55:03] RECOVERY - puppet last run on dubnium is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:55:54] 06Operations, 10MediaWiki-extensions-InterwikiSorting, 10Wikidata, 10Wikimedia-Extension-setup, and 3 others: Deploy InterwikiSorting extension to production - https://phabricator.wikimedia.org/T150183#3009589 (10Lydia_Pintscher) Nothing should change for editors/readers with this, right? So I don't care w... 
[16:02:27] (03PS1) 10Jcrespo: mariadb: Depool db1023 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336636 (https://phabricator.wikimedia.org/T111654) [16:04:43] RECOVERY - puppet last run on kafka1022 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [16:05:29] (03CR) 10Marostegui: [C: 031] "Good job!!!! :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336636 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [16:06:20] (03PS2) 10Jcrespo: mariadb: Depool db1030 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336636 (https://phabricator.wikimedia.org/T111654) [16:08:17] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db1030 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336636 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [16:09:42] (03Merged) 10jenkins-bot: mariadb: Depool db1030 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336636 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [16:09:50] (03CR) 10jenkins-bot: mariadb: Depool db1030 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336636 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [16:10:30] !log maintain-views and maintain-meta_p full runs on labsdb1009/10/11 [16:10:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:18] !log jynus@tin Synchronized wmf-config/db-eqiad.php: depool db1030 (duration: 00m 41s) [16:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:37] !log pybal 1.13.4 built and uploaded to carbon [16:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:51] 06Operations, 10DBA, 13Patch-For-Review: Move db1073 to B3 - https://phabricator.wikimedia.org/T156126#3009621 (10Marostegui) a:03Marostegui db1073 has been moved. 
DNS updated db-eqiad,codfw files updated mysql and replication started finely. tendril updated I will pool it in back slowly to once it is wa... [16:17:23] PROBLEM - IPsec on cp4014 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [16:17:23] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [16:17:23] PROBLEM - IPsec on cp3038 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [16:17:23] PROBLEM - IPsec on cp3039 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [16:17:23] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [16:17:23] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [16:17:23] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [16:17:24] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [16:17:24] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [16:17:25] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [16:17:25] PROBLEM - IPsec on cp4015 is CRITICAL: Strongswan CRITICAL - ok: 52 not-conn: cp2017_v4, cp2017_v6 [16:17:42] come on codfw [16:17:43] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [16:17:43] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [16:17:53] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [16:17:53] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [16:18:03] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [16:18:04] (03PS4) 10Rush: nodepool: active check for node pool 
instance states [puppet] - 10https://gerrit.wikimedia.org/r/336404 [16:18:13] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [16:18:13] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [16:18:13] PROBLEM - IPsec on cp4013 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1050_v4, cp1050_v6, cp2017_v4, cp2017_v6 [16:18:14] PROBLEM - IPsec on cp4006 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1050_v4, cp1050_v6, cp2017_v4, cp2017_v6 [16:18:14] PROBLEM - IPsec on cp4005 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1050_v4, cp1050_v6, cp2017_v4, cp2017_v6 [16:18:14] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1050_v4, cp1050_v6, cp2017_v4, cp2017_v6 [16:18:14] PROBLEM - IPsec on cp3035 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1050_v4, cp1050_v6, cp2017_v4, cp2017_v6 [16:18:14] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1050_v4, cp1050_v6, cp2017_v4, cp2017_v6 [16:18:23] PROBLEM - IPsec on cp4007 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1050_v4, cp1050_v6, cp2017_v4, cp2017_v6 [16:18:23] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1050_v4, cp1050_v6, cp2017_v4, cp2017_v6 [16:18:23] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 50 not-conn: cp1050_v4, cp1050_v6, cp2017_v4, cp2017_v6 [16:18:23] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp2017_v4, cp2017_v6 [16:19:03] PROBLEM - Host cp2017 is DOWN: PING CRITICAL - Packet loss = 100% [16:19:06] !log upgrading and restarting db1030 T111654 [16:19:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:10] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [16:19:47] (03CR) 10Rush: [C: 032] nodepool: active check for node pool instance states [puppet] 
- 10https://gerrit.wikimedia.org/r/336404 (owner: 10Rush) [16:20:41] !log cp2017 stuck rebooting, power-cycled [16:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:43] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 56 ESP OK [16:21:43] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [16:21:53] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [16:21:53] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [16:21:54] RECOVERY - Host cp2017 is UP: PING OK - Packet loss = 0%, RTA = 36.05 ms [16:22:03] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 56 ESP OK [16:22:13] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [16:22:13] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [16:22:13] RECOVERY - IPsec on cp4013 is OK: Strongswan OK - 54 ESP OK [16:22:13] RECOVERY - IPsec on cp4006 is OK: Strongswan OK - 54 ESP OK [16:22:14] RECOVERY - IPsec on cp4005 is OK: Strongswan OK - 54 ESP OK [16:22:14] RECOVERY - IPsec on cp3035 is OK: Strongswan OK - 54 ESP OK [16:22:14] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 54 ESP OK [16:22:14] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 54 ESP OK [16:22:23] RECOVERY - IPsec on cp4007 is OK: Strongswan OK - 54 ESP OK [16:22:23] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 54 ESP OK [16:22:23] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [16:22:23] RECOVERY - IPsec on cp4014 is OK: Strongswan OK - 54 ESP OK [16:22:23] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 54 ESP OK [16:22:24] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 54 ESP OK [16:22:24] RECOVERY - IPsec on cp3039 is OK: Strongswan OK - 54 ESP OK [16:22:26] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [16:22:26] RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 54 ESP OK [16:22:26] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 54 ESP OK [16:22:26] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 54 ESP OK [16:22:27] 
RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 54 ESP OK [16:22:27] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [16:22:28] RECOVERY - IPsec on cp4015 is OK: Strongswan OK - 54 ESP OK [16:23:08] (03PS10) 10Rush: openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 [16:23:18] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 54 ESP OK [16:26:34] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1030 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336637 [16:27:11] (03CR) 10Rush: [C: 032] openstack: nova fullstack testing [puppet] - 10https://gerrit.wikimedia.org/r/336413 (owner: 10Rush) [16:27:28] (03CR) 10Marostegui: [C: 031] Revert "mariadb: Depool db1030 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336637 (owner: 10Jcrespo) [16:28:28] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:30:48] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3045_v4, cp3045_v6 [16:30:48] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3045_v4, cp3045_v6 [16:30:48] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3045_v4, cp3045_v6 [16:30:48] PROBLEM - IPsec on cp1063 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3045_v4, cp3045_v6 [16:30:48] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3045_v4, cp3045_v6 [16:30:58] PROBLEM - IPsec on cp2005 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3045_v4, cp3045_v6 [16:30:58] PROBLEM - IPsec on cp2011 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3045_v4, cp3045_v6 [16:30:58] PROBLEM - IPsec on cp2017 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3045_v4, cp3045_v6 [16:30:58] PROBLEM - IPsec on cp2022 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3045_v4, cp3045_v6 [16:30:58] PROBLEM - IPsec on cp2002 
is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3045_v4, cp3045_v6 [16:30:59] PROBLEM - IPsec on cp2008 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3045_v4, cp3045_v6 [16:30:59] PROBLEM - IPsec on cp2024 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3045_v4, cp3045_v6 [16:31:08] PROBLEM - IPsec on cp2014 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3045_v4, cp3045_v6 [16:31:08] PROBLEM - IPsec on cp2020 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3045_v4, cp3045_v6 [16:31:08] PROBLEM - IPsec on cp2026 is CRITICAL: Strongswan CRITICAL - ok: 68 not-conn: cp3045_v4, cp3045_v6 [16:31:08] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3045_v4, cp3045_v6 [16:31:08] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3045_v4, cp3045_v6 [16:31:18] PROBLEM - IPsec on cp1062 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3045_v4, cp3045_v6 [16:31:28] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3045_v4, cp3045_v6 [16:31:28] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 54 not-conn: cp3045_v4, cp3045_v6 [16:32:08] PROBLEM - Host cp3045 is DOWN: PING CRITICAL - Packet loss = 100% [16:32:26] !log cp3045 stuck rebooting, power-cycled [16:32:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:38] (03PS1) 10Giuseppe Lavagetto: etcdmirror::instance: fix unit file [puppet] - 10https://gerrit.wikimedia.org/r/336638 [16:34:18] RECOVERY - IPsec on cp1062 is OK: Strongswan OK - 56 ESP OK [16:34:19] (03PS2) 10Giuseppe Lavagetto: etcdmirror::instance: fix unit file [puppet] - 10https://gerrit.wikimedia.org/r/336638 [16:34:28] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 56 ESP OK [16:34:28] RECOVERY - Host cp3045 is UP: PING OK - Packet loss = 0%, RTA = 83.71 ms [16:34:28] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 56 ESP OK [16:34:48] RECOVERY - IPsec on cp1099 is OK: Strongswan 
OK - 56 ESP OK [16:34:48] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 56 ESP OK [16:34:48] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 56 ESP OK [16:34:48] RECOVERY - IPsec on cp1063 is OK: Strongswan OK - 56 ESP OK [16:34:48] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 56 ESP OK [16:35:08] RECOVERY - IPsec on cp2014 is OK: Strongswan OK - 70 ESP OK [16:35:08] RECOVERY - IPsec on cp2020 is OK: Strongswan OK - 70 ESP OK [16:35:08] RECOVERY - IPsec on cp2026 is OK: Strongswan OK - 70 ESP OK [16:35:09] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 56 ESP OK [16:35:09] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 56 ESP OK [16:35:58] RECOVERY - IPsec on cp2005 is OK: Strongswan OK - 70 ESP OK [16:35:58] RECOVERY - IPsec on cp2011 is OK: Strongswan OK - 70 ESP OK [16:35:58] RECOVERY - IPsec on cp2017 is OK: Strongswan OK - 70 ESP OK [16:35:58] RECOVERY - IPsec on cp2022 is OK: Strongswan OK - 70 ESP OK [16:35:58] RECOVERY - IPsec on cp2008 is OK: Strongswan OK - 70 ESP OK [16:35:58] RECOVERY - IPsec on cp2002 is OK: Strongswan OK - 70 ESP OK [16:35:58] (03CR) 10Giuseppe Lavagetto: [C: 032] etcdmirror::instance: fix unit file [puppet] - 10https://gerrit.wikimedia.org/r/336638 (owner: 10Giuseppe Lavagetto) [16:35:59] RECOVERY - IPsec on cp2024 is OK: Strongswan OK - 70 ESP OK [16:37:28] PROBLEM - Varnish HTTP upload-backend - port 3128 on cp3045 is CRITICAL: connect to address 10.20.0.180 and port 3128: Connection refused [16:37:28] PROBLEM - Check systemd state on cp3045 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:37:28] PROBLEM - puppet last run on cp3045 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Service[varnish] [16:38:31] looking [16:41:28] RECOVERY - Varnish HTTP upload-backend - port 3128 on cp3045 is OK: HTTP OK: HTTP/1.1 200 OK - 178 bytes in 0.167 second response time [16:41:28] RECOVERY - Check systemd state on cp3045 is OK: OK - running: The system is fully operational [16:42:28] RECOVERY - puppet last run on cp3045 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [16:51:38] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#3009760 (10jcrespo) TLS is now deployed on all core servers: ``` root@neodymium:~$ sudo salt --output=txt -C 'G@cluster:mysql and G@mysql_group:core' cmd.run 'mysql -BN --skip-ssl -e "SELEC... [17:00:25] (03PS1) 10Giuseppe Lavagetto: Use etcd_index for the initial replication index in dump and load [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/336640 [17:00:50] (03CR) 10Giuseppe Lavagetto: [C: 032] Use etcd_index for the initial replication index in dump and load [software/etcd-mirror] - 10https://gerrit.wikimedia.org/r/336640 (owner: 10Giuseppe Lavagetto) [17:01:02] (03PS1) 10Giuseppe Lavagetto: Use etcd_index for the initial replication index in dump and load [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/336641 [17:01:16] (03Abandoned) 10Giuseppe Lavagetto: Use etcd_index for the initial replication index in dump and load [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/336641 (owner: 10Giuseppe Lavagetto) [17:01:27] <_joe_> ouch [17:01:33] (03Restored) 10Giuseppe Lavagetto: Use etcd_index for the initial replication index in dump and load [software/etcd-mirror] (debian) - 10https://gerrit.wikimedia.org/r/336641 (owner: 10Giuseppe Lavagetto) [17:02:05] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic20(09|10|11|12).codfw.wmnet [17:02:09] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:21] (03PS2) 10Gehel: elasticsearch - reimage elastic20(09|10|11|12) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336625 (https://phabricator.wikimedia.org/T151326) [17:03:40] jouncebot: next [17:03:40] In 1 hour(s) and 56 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170208T1900) [17:03:46] (03CR) 10Gehel: [C: 032] elasticsearch - reimage elastic20(09|10|11|12) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336625 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [17:04:01] !log rolling restart of replication thread of 29 mysql hosts T111654 [17:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:06] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654 [17:05:38] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3009791 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2009.codfw.wmnet'] ```... [17:10:46] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3009803 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2011.codfw.wmnet'] ```... [17:11:41] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3009804 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2010.codfw.wmnet'] ```... 
[17:13:19] (03PS3) 10EBernhardson: Update elasticsearch module for es5 compatability [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) [17:16:52] (03CR) 10jenkins-bot: [V: 04-1] Update elasticsearch module for es5 compatability [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) (owner: 10EBernhardson) [17:17:48] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#3009812 (10jcrespo) Enabled everywhere except on db1034 and db2057, which probably require a package upgrade. ``` $ sudo salt --output=txt -C 'G@cluster:mysql and G@mysql_group:core' cmd.... [17:18:00] (03PS4) 10EBernhardson: Update elasticsearch module for es5 compatability [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) [17:20:00] (03PS1) 10ArielGlenn: fix typo in dumps table config checking [dumps] - 10https://gerrit.wikimedia.org/r/336643 [17:23:57] (03PS1) 10Jcrespo: mariadb: Depool db2057 for mariadb upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336644 (https://phabricator.wikimedia.org/T111654) [17:24:45] what is swat status, can I merge a couple of things? [17:25:32] oh, I misread the calendar [17:25:48] I thought there was something now [17:25:48] 06Operations, 10DBA, 13Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#3009867 (10Marostegui) >>! In T111654#3009760, @jcrespo wrote: > TLS is now deployed on all core servers: Congratulations, that was a massive and tedious effort. 
[17:26:07] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1030 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336637 (owner: 10Jcrespo) [17:27:46] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3009868 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2012.codfw.wmnet'] ```... [17:28:14] (03CR) 10BryanDavis: [C: 031] "LGTM. exec is in the existing `elsif $::lsbdistcodename == 'trusty'` block" [puppet] - 10https://gerrit.wikimedia.org/r/324957 (https://phabricator.wikimedia.org/T97857) (owner: 10Tim Landscheidt) [17:28:36] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db1030 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336637 (owner: 10Jcrespo) [17:28:47] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1030 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336637 (owner: 10Jcrespo) [17:29:01] (03PS2) 10Jcrespo: mariadb: Depool db2057 for mariadb upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336644 (https://phabricator.wikimedia.org/T111654) [17:30:04] !log jynus@tin Synchronized wmf-config/db-eqiad.php: repool db1030 (duration: 00m 40s) [17:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:32] (03CR) 10BryanDavis: [C: 031] Tools: Do not install adminbot [puppet] - 10https://gerrit.wikimedia.org/r/336351 (https://phabricator.wikimedia.org/T157400) (owner: 10Tim Landscheidt) [17:34:26] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3009900 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2009.codfw.wmnet'] ``` and were **ALL** successful. 
[17:36:00] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3009910 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2011.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic201... [17:38:54] (03PS2) 10ArielGlenn: fix private table dumping in dumps [dumps] - 10https://gerrit.wikimedia.org/r/336643 [17:40:29] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3009921 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2010.codfw.wmnet'] ``` and were **ALL** successful. [17:41:56] 06Operations, 10Collection, 10Traffic, 07HTTPS, 13Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3009925 (10Ckepper) We have installed letsencrypt/certbot. You can now start testing on https://tools.pediapress.com/ [17:42:06] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2057 for mariadb upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336644 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [17:43:45] (03Merged) 10jenkins-bot: mariadb: Depool db2057 for mariadb upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336644 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [17:43:56] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:43:58] (03CR) 10jenkins-bot: mariadb: Depool db2057 for mariadb upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336644 (https://phabricator.wikimedia.org/T111654) (owner: 10Jcrespo) [17:45:57] !log added some annotations to the aqs analytics ACLs on cr1/cr2 [17:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:10] !log jynus@tin Synchronized wmf-config/db-codfw.php: depool db2057 (duration: 00m 41s) [17:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:13] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Puppet changes required for elasticsearch 5.x upgrade - https://phabricator.wikimedia.org/T155578#3009952 (10EBernhardson) I've updated the patch with the above findings, it now requires a hiera variable to specify which version of e... [17:51:15] 06Operations, 06Analytics-Kanban, 10netops: Review ACLs for the Analytics VLAN - https://phabricator.wikimedia.org/T157435#3010000 (10elukey) Completed the AQS work due to T157533 (under Brandon's supervision). I am going to keep working on this task during the next days to fix the remaining items. Caveat:... [17:52:26] (03PS5) 10EBernhardson: Update elasticsearch module for es5 compatability [puppet] - 10https://gerrit.wikimedia.org/r/333969 (https://phabricator.wikimedia.org/T155578) [17:52:54] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3010005 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2012.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic201... [17:56:50] (03CR) 10Paladox: [C: 031] "@Jcrespo could you remove your -1 please?" 
[debs/gerrit] - https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: Paladox)
[17:57:16] (PS1) Mobrovac: Zotero: Restart the service every three days [puppet] - https://gerrit.wikimedia.org/r/336647
[18:02:21] !log upgrading and restarting db2057 T111654
[18:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:02:27] T111654: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654
[18:04:54] (CR) Jcrespo: [C: -1] "No, I plan to package this myself- I have to package the mysql command line client anyway." [debs/gerrit] - https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: Paladox)
[18:05:38] (CR) Paladox: [C: 1] "Oh ok, so i can put your name down on the requested repo to be created? (Maintainer)" [debs/gerrit] - https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: Paladox)
[18:05:47] (CR) Paladox: [C: 1] "@Jcrespo ^^" [debs/gerrit] - https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: Paladox)
[18:05:53] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic20(09|10|11|12).codfw.wmnet
[18:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:08:06] (CR) Jcrespo: [C: -1] "> i can put your name down on the requested repo to be created? (Maintainer)" [debs/gerrit] - https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: Paladox)
[18:08:25] (CR) Paladox: [C: 1] "> > i can put your name down on the requested repo to be created?" [debs/gerrit] - https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: Paladox)
[18:09:04] (CR) Paladox: [C: 1] "Do you know when this will be done please?"
[debs/gerrit] - https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: Paladox)
[18:09:16] (CR) Paladox: [C: 1] "@Jcrespo ^^" [debs/gerrit] - https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: Paladox)
[18:10:53] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[18:11:19] Operations, Performance-Team, DC-Switchover-Prep-Q3-2016-17, Epic, Wikimedia-Multiple-active-datacenters: Prepare a reasonably performant warmup tool for MediaWiki caches (memcached/apc) - https://phabricator.wikimedia.org/T156922#3010078 (elukey) >>! In T156922#2992765, @elukey wrote: > Warn...
[18:12:08] Operations, ops-codfw, Patch-For-Review, User-Elukey: codfw:rack/setup mc2019-mc2036 - https://phabricator.wikimedia.org/T155755#3010085 (elukey) Open>Resolved Work completed, all the nutcrackers in codfw have been restarted to pick up the change. Please note: after https://gerrit.wikimed...
[18:12:35] (CR) Jcrespo: [C: -1] "> Do you know when this will be done please?" [debs/gerrit] - https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: Paladox)
[18:13:14] (CR) Paladox: [C: 1] "> > Do you know when this will be done please?"
[debs/gerrit] - https://gerrit.wikimedia.org/r/336002 (https://phabricator.wikimedia.org/T145885) (owner: Paladox)
[18:14:28] (PS7) Anomie: Set $wgSoftBlockRanges [mediawiki-config] - https://gerrit.wikimedia.org/r/324215
[18:19:43] (PS2) Rush: Tools: Do not install adminbot [puppet] - https://gerrit.wikimedia.org/r/336351 (https://phabricator.wikimedia.org/T157400) (owner: Tim Landscheidt)
[18:20:11] (PS3) Krinkle: multiversion: add bin/expanddblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334459
[18:20:13] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479
[18:20:16] (CR) Krinkle: [C: 2] multiversion: add bin/expanddblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334459 (owner: Krinkle)
[18:21:13] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3395707 keys, up 100 days 9 hours - replication_delay is 0
[18:21:13] PROBLEM - Redis replication status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[18:21:14] (PS1) Cmjohnson: Removing mgmt entry for palladium [dns] - https://gerrit.wikimedia.org/r/336652
[18:21:28] Krinkle: Yay, new multiversion code to support!
[18:21:33] (half-joking)
[18:22:00] (Merged) jenkins-bot: multiversion: add bin/expanddblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334459 (owner: Krinkle)
[18:22:09] (CR) jenkins-bot: multiversion: add bin/expanddblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334459 (owner: Krinkle)
[18:22:13] RECOVERY - Redis replication status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3395591 keys, up 100 days 10 hours - replication_delay is 0
[18:22:15] (PS2) Cmjohnson: Removing mgmt entry for palladium [dns] - https://gerrit.wikimedia.org/r/336652
[18:22:32] (PS3) Krinkle: (no-op) Move comment about flow.dblist in settings to the dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334460
[18:22:44] (CR) Krinkle: [C: 2] (no-op) Move comment about flow.dblist in settings to the dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334460 (owner: Krinkle)
[18:24:50] (CR) Chad: [C: 2] multiversion: Drop remaining MWVersion.php shim [mediawiki-config] - https://gerrit.wikimedia.org/r/336461 (owner: Chad)
[18:25:11] (Merged) jenkins-bot: (no-op) Move comment about flow.dblist in settings to the dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334460 (owner: Krinkle)
[18:25:20] (CR) jenkins-bot: (no-op) Move comment about flow.dblist in settings to the dblist [mediawiki-config] - https://gerrit.wikimedia.org/r/334460 (owner: Krinkle)
[18:25:23] PROBLEM - puppet last run on prometheus2001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[18:26:25] (PS1) Cmjohnson: Adding an entry for an elastic server as a test...I will abandon this change [puppet] - https://gerrit.wikimedia.org/r/336653
[18:26:55] Operations: rename shell user 'volkere' to 'volker-e' - https://phabricator.wikimedia.org/T157591#3010165 (Dzahn)
[18:27:06] Operations: rename shell user 'volkere' to 'volker-e' - https://phabricator.wikimedia.org/T157591#3010180 (Dzahn)
[18:27:31] (Merged) jenkins-bot: multiversion: Drop remaining MWVersion.php shim [mediawiki-config] - https://gerrit.wikimedia.org/r/336461 (owner: Chad)
[18:27:48] (CR) jenkins-bot: multiversion: Drop remaining MWVersion.php shim [mediawiki-config] - https://gerrit.wikimedia.org/r/336461 (owner: Chad)
[18:28:48] Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#3010185 (Dzahn)
[18:28:50] Operations: rename shell user 'volkere' to 'volker-e' - https://phabricator.wikimedia.org/T157591#3010165 (Dzahn)
[18:31:11] (CR) Cmjohnson: [C: 2] Removing mgmt entry for palladium [dns] - https://gerrit.wikimedia.org/r/336652 (owner: Cmjohnson)
[18:32:17] (Abandoned) Cmjohnson: Adding an entry for an elastic server as a test...I will abandon this change [puppet] - https://gerrit.wikimedia.org/r/336653 (owner: Cmjohnson)
[18:32:20] we do 1 billion mysql queries a day on db1089
[18:32:48] (CR) Rush: [C: 2] Tools: Do not install adminbot [puppet] - https://gerrit.wikimedia.org/r/336351 (https://phabricator.wikimedia.org/T157400) (owner: Tim Landscheidt)
[18:33:04] !log demon@tin Synchronized multiversion/: Dropping old MWVersion shim (duration: 00m 57s)
[18:33:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:14] jynus: that's ...pretty amazing
[18:34:39] (PS1) Dzahn: admin: rename user volkere to volker-e [puppet] - https://gerrit.wikimedia.org/r/336655
(https://phabricator.wikimedia.org/T157591)
[18:34:40] impressive number
[18:35:27] Operations, netops: Add firewall exception to get to wdqs[12]003.(codfw|eqiad).wmnet:8888 from analytics cluster - https://phabricator.wikimedia.org/T157593#3010219 (Gehel)
[18:35:32] not really, it is an average of 12K QPS
[18:36:28] I have seen about that (and +) I guess but not /sustained/ over long periods
[18:36:35] Operations, netops, Goal: Decomission palladium - https://phabricator.wikimedia.org/T147320#3010238 (Cmjohnson)
[18:36:37] Operations, ops-eqiad: decom palladium (datacenter) - https://phabricator.wikimedia.org/T149395#3010236 (Cmjohnson) Open>Resolved Palladium is wiped, off rack, racktables update and dns removed.
[18:36:50] (Abandoned) Dereckson: [throttle] New rule for 2017-02-08 [mediawiki-config] - https://gerrit.wikimedia.org/r/336582 (https://phabricator.wikimedia.org/T154312) (owner: Urbanecm)
[18:37:04] QPS is a bit meaningless
[18:37:34] it is not the same set sql_mode='' than INSERT INTO revision
[18:38:22] (CR) Mobrovac: "PCC OK https://puppet-compiler.wmflabs.org/5382/" [puppet] - https://gerrit.wikimedia.org/r/336647 (owner: Mobrovac)
[18:38:39] (PS6) Madhuvishy: labstore: Diamond collector to track directory sizes [puppet] - https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623)
[18:39:22] jouncebot: next
[18:39:22] In 0 hour(s) and 20 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170208T1900)
[18:40:31] (PS1) Jcrespo: Revert "mariadb: Depool db2057 for mariadb upgrade" [mediawiki-config] - https://gerrit.wikimedia.org/r/336657
[18:40:47] (PS2) Dereckson: Enable Quiz on Spanish Wikibooks [mediawiki-config] - https://gerrit.wikimedia.org/r/336553 (https://phabricator.wikimedia.org/T157513) (owner: Platonides)
[18:40:59] (CR) Dereckson: [C: 1] Enable Quiz on Spanish Wikibooks [mediawiki-config] -
https://gerrit.wikimedia.org/r/336553 (https://phabricator.wikimedia.org/T157513) (owner: Platonides)
[18:42:51] (PS7) Madhuvishy: labstore: Diamond collector to track directory sizes [puppet] - https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623)
[18:43:40] (CR) Dereckson: [C: -1] Increase account creation limit for a couple of schools on it.wikiversity (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/336552 (https://phabricator.wikimedia.org/T157504) (owner: Platonides)
[18:45:11] (CR) Dereckson: [C: -1] "The commit message current title could be interpreted differently, implying you want to remove (undeploy) it from labs." [mediawiki-config] - https://gerrit.wikimedia.org/r/336446 (owner: Milimetric)
[18:47:01] (PS1) Chad: Gerrit: Add gerrit-roots to new gerrit2001 in Dallas [puppet] - https://gerrit.wikimedia.org/r/336658
[18:47:24] (CR) jerkins-bot: [V: -1] labstore: Diamond collector to track directory sizes [puppet] - https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) (owner: Madhuvishy)
[18:48:03] (CR) Dzahn: [C: -1] "let's add admin groups in role/common/gerrit/server.yaml and not worry about hostnames" [puppet] - https://gerrit.wikimedia.org/r/336658 (owner: Chad)
[18:48:23] (PS8) Madhuvishy: labstore: Diamond collector to track directory sizes [puppet] - https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623)
[18:48:27] (CR) VolkerE: [C: 1] admin: rename user volkere to volker-e [puppet] - https://gerrit.wikimedia.org/r/336655 (https://phabricator.wikimedia.org/T157591) (owner: Dzahn)
[18:48:35] (CR) Dzahn: [C: 1] "oh, i see the "temporary" part now.
yea , got it" [puppet] - https://gerrit.wikimedia.org/r/336658 (owner: Chad)
[18:49:29] (PS2) Dzahn: Gerrit: Add gerrit-roots to new gerrit2001 in Dallas [puppet] - https://gerrit.wikimedia.org/r/336658 (https://phabricator.wikimedia.org/T152525) (owner: Chad)
[18:50:17] (CR) jerkins-bot: [V: -1] labstore: Diamond collector to track directory sizes [puppet] - https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) (owner: Madhuvishy)
[18:50:52] (CR) Milimetric: "I disagree. If I had said "Remove labs enable of Dashiki", that would mean what you suggest. But I'll clean up the message anyway." [mediawiki-config] - https://gerrit.wikimedia.org/r/336446 (owner: Milimetric)
[18:50:54] (CR) Dzahn: [C: 2] Gerrit: Add gerrit-roots to new gerrit2001 in Dallas [puppet] - https://gerrit.wikimedia.org/r/336658 (https://phabricator.wikimedia.org/T152525) (owner: Chad)
[18:51:53] (CR) Krinkle: [C: 1] Drop www.*.org symlinks to wwwportal [mediawiki-config] - https://gerrit.wikimedia.org/r/334450 (owner: Chad)
[18:52:02] (PS2) Milimetric: Fix labs-specific Dashiki hack with generic enable [mediawiki-config] - https://gerrit.wikimedia.org/r/336446
[18:52:04] (PS9) Madhuvishy: labstore: Diamond collector to track directory sizes [puppet] - https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623)
[18:52:23] PROBLEM - puppet last run on ms-be1009 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[18:52:26] (CR) Dzahn: "[gerrit2001:~] $ id demon" [puppet] - https://gerrit.wikimedia.org/r/336658 (https://phabricator.wikimedia.org/T152525) (owner: Chad)
[18:52:50] (PS2) Dereckson: Define category collation for olo.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/334424 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[18:52:55] (CR) Dereckson: [C: 1] Define category collation for olo.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/334424 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[18:53:08] Dereckson: scheduleable for this swat?
[18:53:22] one patch left for the window
[18:53:24] (CR) Krinkle: "These symlinks may he useful to keep so that it's easy to expand a docroot if it becomes needed in the future (like for wikipedia.org the " [puppet] - https://gerrit.wikimedia.org/r/334449 (owner: Chad)
[18:54:00] (PS2) Dzahn: admin: rename user volkere to volker-e [puppet] - https://gerrit.wikimedia.org/r/336655 (https://phabricator.wikimedia.org/T157591)
[18:54:09] (CR) Dereckson: Disable RelatedSites on English, French and Italian Wikivoyages (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/335830 (https://phabricator.wikimedia.org/T128326) (owner: Jdlrobson)
[18:54:15] mafk: yes
[18:54:23] RECOVERY - puppet last run on prometheus2001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[18:54:42] Dereckson: okay, will need guidance while checking
[18:55:02] mafk: okay, I'll take care of the SWAT so
[18:55:16] parfait
[18:55:47] (PS3) MarcoAurelio: Define category collation for olo.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/334424 (https://phabricator.wikimedia.org/T146612)
[18:57:05] (PS2) Dereckson: Enable Echo on loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/336548 (https://phabricator.wikimedia.org/T157105)
(owner: Kaldari)
[18:57:16] (PS1) Jcrespo: mariadb: Depool db1034 for maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/336661 (https://phabricator.wikimedia.org/T111654)
[18:57:36] (CR) Dzahn: [C: 2] "i'll manually move home and cleanup the older user since this is just on bastion hosts and rutherfordium" [puppet] - https://gerrit.wikimedia.org/r/336655 (https://phabricator.wikimedia.org/T157591) (owner: Dzahn)
[18:59:14] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479
[18:59:15] (PS3) MarcoAurelio: Short aliases for Module/Module_talk for Malayalam Wikimedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/336397 (https://phabricator.wikimedia.org/T56951)
[18:59:25] (PS3) MarcoAurelio: Create autopatrolled and rollbacker permissions for fa.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/336401 (https://phabricator.wikimedia.org/T156163)
[19:00:04] addshore, hashar, anomie, ostriches, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170208T1900).
[19:00:04] mafk, kaldari, MatmaRex, and Dereckson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[19:00:13] So I can SWAT.
[19:00:14] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3397160 keys, up 100 days 10 hours - replication_delay is 0
[19:00:18] hi.
[19:00:19] Seems kaldari isn't available.
[19:00:22] I'm here
[19:00:23] Hi mafk
[19:00:23] Hi MatmaRex
[19:00:28] hi Dereckson
[19:00:47] MatmaRex: seems your change are merged, let's start with them
[19:02:20] !log temp.
disabling puppet and doing some debugging on bastion hosts, renaming a user
[19:02:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:02:30] MatmaRex: live on mwdebug1002
[19:03:18] ACKNOWLEDGEMENT - puppet last run on bast1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): User[volker-e] daniel_zahn T157591
[19:03:18] ACKNOWLEDGEMENT - puppet last run on bast2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 44 seconds ago with 1 failures. Failed resources (up to 3 shown): User[volker-e] daniel_zahn T157591
[19:03:59] Dereckson: looks good
[19:04:16] brb 1 minute
[19:04:19] urgent thing
[19:04:23] mafk: MatmaRex: k
[19:05:14] MatmaRex: syncing
[19:05:26] back
[19:05:49] !log dereckson@tin Synchronized php-1.29.0-wmf.10/extensions/UploadWizard/UploadWizard.config.php: Disable Firefogg support (T157201) (duration: 00m 46s)
[19:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:05:53] T157201: Stop supporting Firefogg in Upload Wizard now that Firefox is making it impossible - https://phabricator.wikimedia.org/T157201
[19:06:03] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): User[volker-e]
[19:06:42] (CR) Urbanecm: "Thank you for abandoning @Dereckson. Tried to catch deployer, without sucess. They must give bigger notice."
[mediawiki-config] - https://gerrit.wikimedia.org/r/336582 (https://phabricator.wikimedia.org/T154312) (owner: Urbanecm)
[19:06:53] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[19:06:58] (PS1) Jdlrobson: wgMinervaUseFooterV2 config flag no longer necessary [mediawiki-config] - https://gerrit.wikimedia.org/r/336664 (https://phabricator.wikimedia.org/T157075)
[19:07:05] Urbanecm: I've seen the change on my phone at 10:30 CET, ie 5h30 after event start
[19:07:14] thanks
[19:07:15] I told myself it was a little late
[19:07:31] !log bastion hosts, people.wm: deluser volkere, let puppet create volker-e, move data, delete old home dir (T157591)
[19:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:07:35] T157591: rename shell user 'volkere' to 'volker-e' - https://phabricator.wikimedia.org/T157591
[19:07:37] s/5:30/4:30
[19:07:52] Operations, DBA, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#3010354 (jcrespo) db1034 is left, pending of the reimage marked above^. Of the non core hosts, only the following are left: db1020.eqiad.wmnet: NULL - **m2 master** db1009.eqiad.wmnet:...
[19:08:04] !log dereckson@tin Synchronized php-1.29.0-wmf.11/extensions/UploadWizard/UploadWizard.config.php: Disable Firefogg support (T157201) (duration: 00m 44s)
[19:08:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:08:59] Reedy: on https://phabricator.wikimedia.org/T157105 you wrote "Probably wants a little deeper digging first though", but Kaldari deployed it for this SWAT, you've already done the verification you wanted to?
[19:09:12] Nope, lol
[19:09:32] It shouldn't be a problem, but...
[19:10:01] (CR) Dereckson: "Deployment should be coordinated with security."
[mediawiki-config] - https://gerrit.wikimedia.org/r/336548 (https://phabricator.wikimedia.org/T157105) (owner: Kaldari)
[19:10:14] MatmaRex: ping me when you're back
[19:10:19] mafk: ^
[19:10:29] * Dereckson notes: 3 letters before tab
[19:10:37] 20:05] mafk back
[19:11:02] Dereckson: ^
[19:11:24] mafk: for https://gerrit.wikimedia.org/r/#/c/323651/ you can follow regular process: have it merged to master, then l10nupdate will pick it the next day 3am
[19:11:43] PROBLEM - puppet last run on es1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:11:50] hmmmm l10nupdate updates *existing* messages
[19:12:05] well yeah, we'll probably need to SWAT it too
[19:12:21] Operations, Patch-For-Review: rename shell user 'volkere' to 'volker-e' - https://phabricator.wikimedia.org/T157591#3010370 (Dzahn) @Volker_E This should all be done now. Your new user is "volker-e". I checked on the 4 bastion hosts and the "people.wikimedia.org" server. There was no data to move except...
[19:12:28] Dereckson: that one is for master, after that's merged I can cherry-pick to the current wmf.
branch
[19:12:36] if that's how it is supposed to work
[19:12:43] Operations: rename shell user 'volkere' to 'volker-e' - https://phabricator.wikimedia.org/T157591#3010371 (Dzahn)
[19:12:50] mafk: you're also supposed to have it merged to master before SWAT scheduling
[19:13:13] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/336401 (https://phabricator.wikimedia.org/T156163) (owner: MarcoAurelio)
[19:13:59] Operations: Enhance account handling (meta bug) - https://phabricator.wikimedia.org/T142815#3010374 (Dzahn)
[19:14:01] Operations: rename shell user 'volkere' to 'volker-e' - https://phabricator.wikimedia.org/T157591#3010165 (Dzahn) Open>Resolved ``` [rutherfordium:~] $ id volker-e uid=12186(volker-e) gid=500(wikidev) groups=500(wikidev),600(all-users) [rutherfordium:/home/volker-e/public_html] $ ls 4.0K -rw-r--r--...
[19:14:55] (Merged) jenkins-bot: Create autopatrolled and rollbacker permissions for fa.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/336401 (https://phabricator.wikimedia.org/T156163) (owner: MarcoAurelio)
[19:15:36] mafk: autopatrolled rollbacker live on mwdebug1002
[19:15:49] checking
[19:16:34] Operations, Gerrit, Release-Engineering-Team, Patch-For-Review: setup/install gerrit2001/WMF6408 - https://phabricator.wikimedia.org/T152525#3010396 (Dzahn) @demon @20after4 and @catrope can now SSH to gerrit2001 and have root like on the current prod server
[19:16:36] looks good to me at listgrouprights config as expected
[19:16:40] (CR) jenkins-bot: Create autopatrolled and rollbacker permissions for fa.wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/336401 (https://phabricator.wikimedia.org/T156163) (owner: MarcoAurelio)
[19:17:29] Dereckson: lgtm
[19:17:47] ok syncing
[19:17:54] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Create autopatrolled and rollbacker permissions for fa.wikiquote (T156163) (duration: 00m
43s)
[19:17:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:17:59] T156163: Add Rollback, Autopatrolled, Template Editor, Patroller and Rollback to Persian Wikiquote - https://phabricator.wikimedia.org/T156163
[19:18:02] also, that wikimediamessages patch we can leave it for another time if we want the changes merged first on the extension + avalaible on translatewiki
[19:18:11] k
[19:18:22] (PS4) Dereckson: Short aliases for Module/Module_talk for Malayalam Wikimedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/336397 (https://phabricator.wikimedia.org/T56951) (owner: MarcoAurelio)
[19:18:28] I'll poke Raymond_ about it
[19:18:37] I think he has +2 there
[19:20:19] (PS5) Dereckson: Short aliases for Module/Module_talk for Malayalam Wikimedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/336397 (https://phabricator.wikimedia.org/T56951) (owner: MarcoAurelio)
[19:20:23] mafk: https://gerrit.wikimedia.org/r/#/c/336397/4/wmf-config/InitialiseSettings.php check line 3478
[19:20:37] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/336397 (https://phabricator.wikimedia.org/T56951) (owner: MarcoAurelio)
[19:20:58] mafk: "git diff" also such extraneous spaces in red
[19:21:01] Operations, Collection, Traffic, HTTPS, Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3010423 (Dzahn) @Ckepper very cool :) thank you looks good to me and gets A rating here https://www.ssllabs.com/ssltest/analyze.htm...
[19:21:05] damn whitespace
[19:21:06] ...also highlights such...
[19:21:10] not sure where it came
[19:21:15] fixing in gerrit just quick
[19:21:16] Some editors allow to kill spaces at end of line
[19:21:19] generally ons ave
[19:21:23] RECOVERY - puppet last run on ms-be1009 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[19:21:31] notepad++ does it
[19:21:36] I guess I forgot to run it
[19:21:37] (I've removed it in PS5)
[19:21:43] ah good
[19:21:49] well, it's better to have an editor doing it *automatically* :p
[19:22:15] (CR) Mattflaschen: [C: 1] Enable Echo on loginwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/336548 (https://phabricator.wikimedia.org/T157105) (owner: Kaldari)
[19:22:23] Some editors, like PhpStorm, have a little strange behavior: they do it on explicit save, but not on autosave
[19:22:24] (Merged) jenkins-bot: Short aliases for Module/Module_talk for Malayalam Wikimedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/336397 (https://phabricator.wikimedia.org/T56951) (owner: MarcoAurelio)
[19:22:37] (CR) jenkins-bot: Short aliases for Module/Module_talk for Malayalam Wikimedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/336397 (https://phabricator.wikimedia.org/T56951) (owner: MarcoAurelio)
[19:22:45] you ask me a lot of things for a noob dev :P
[19:22:53] duly noted of course
[19:22:56] and appreciated
[19:23:11] Operations, Ops-Access-Requests: Access to people.wikimedia.org for Volker_E - https://phabricator.wikimedia.org/T143465#3010456 (Dzahn) user has been renamed to "volker-e" in T157591. The new URL is now https://people.wikimedia.org/~volker-e/
[19:23:11] What editor do you use by the way? Notepad++?
[19:23:25] Dereckson: yes, told you previously :)
[19:23:28] Operations: rename shell user 'volkere' to 'volker-e' - https://phabricator.wikimedia.org/T157591#3010165 (Dzahn) new URL https://people.wikimedia.org/~volker-e/
[19:23:47] Short aliases for Module/Module_talk for Malayalam Wikimedia projects live on mwdebug1002.eqiad.wmnet
[19:24:03] checking
[19:24:34] http://superuser.com/questions/699382/is-there-a-way-in-notepadto-remove-white-spaces-in-empty-lines-automatically so you can do an alt + shift + s
[19:24:58] (and carefully git diff to avoid to create a frankenstein patch fixing whitespaces + doing your change)
[19:25:22] Dereckson: ഘ as aliase on mlwiki works as expected, doing the talk one
[19:25:23] (in such case, `git add -p` allows to select precisely the part of the file you want to commit)
[19:25:49] (CR) Jdlrobson: [C: -1] "merge on 9th February or after" [mediawiki-config] - https://gerrit.wikimedia.org/r/336664 (https://phabricator.wikimedia.org/T157075) (owner: Jdlrobson)
[19:26:04] (PS4) Dereckson: Define category collation for olo.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/334424 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[19:26:42] Dereckson: Module (talk) aliases working for mlwiki so should be working for the other projects as well
[19:26:50] good
[19:27:38] :)
[19:29:28] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Namespace configuration for ml.
projects (T56951) (duration: 00m 41s)
[19:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:29:32] T56951: Rename namespace module to Malayalam in Malayalam language wikis - https://phabricator.wikimedia.org/T56951
[19:29:39] (CR) Dereckson: [C: 2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/334424 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[19:30:03] IF it doesn't work, it's simple to check: we'll see a fatal exception on any Category: page ^
[19:30:21] (CR) Chad: "Well you gave me a +1 on https://gerrit.wikimedia.org/r/#/c/334450/." [puppet] - https://gerrit.wikimedia.org/r/334449 (owner: Chad)
[19:30:27] (Abandoned) Chad: Beta: Just use standard docroot directly for most sites [puppet] - https://gerrit.wikimedia.org/r/334449 (owner: Chad)
[19:30:29] (Abandoned) Chad: Drop www.*.org symlinks to wwwportal [mediawiki-config] - https://gerrit.wikimedia.org/r/334450 (owner: Chad)
[19:31:05] (PS3) Dzahn: interface: rps::modparams, aggregate_member not in autoload layout [puppet] - https://gerrit.wikimedia.org/r/332959
[19:31:09] (Merged) jenkins-bot: Define category collation for olo.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/334424 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[19:31:23] Dereckson: on mwdebug you mean right?
[19:31:27] not yet
[19:31:29] now yes
[19:31:29] (CR) jenkins-bot: Define category collation for olo.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/334424 (https://phabricator.wikimedia.org/T146612) (owner: MarcoAurelio)
[19:31:46] (I waited the merge to sync there)
[19:31:53] so lets pull it to mwdebug and we'll see
[19:32:03] yeah, it's on mwdebug1002 now
[19:32:37] checking
[19:33:07] Dereckson: https://olo.wikipedia.org/wiki/Kategourii:P%C3%A4iv%C3%A4t looks the same on mwdebug
[19:33:18] maybe it needs time to re-sort?
[19:34:16] you need something with Å ideally
[19:34:18] (CR) Dzahn: [C: 2] "nothing inside the classes is changing, they are just moved to a separate file" [puppet] - https://gerrit.wikimedia.org/r/332959 (owner: Dzahn)
[19:34:27] special:categories is our friend here
[19:35:10] Dereckson: no Å
[19:35:27] https://olo.wikipedia.org/wiki/Erikoine:Kaikki_sivut?from=%C3%85&to=&namespace=0 <- I confirm
[19:35:28] (PS5) Dzahn: contint/zuul: skip Icinga monitoring if server not master [puppet] - https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771)
[19:36:14] ah ! https://en.wikipedia.org/wiki/Finnish_orthography#Collation_order
[19:36:17] er https://olo.wikipedia.org/wiki/Kategourii:Uskondot
[19:36:26] Uskondo and Šintolaižus
[19:36:52] No fatals for me?
[19:37:43] for me neither
[19:38:06] good to ship?
[19:38:28] Yes, but I'm not sure we've a sample where the collation has been observed to be working
[19:38:43] RECOVERY - puppet last run on es1018 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[19:39:15] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Set category collation for olo.wikipedia (T146612, T147064) (duration: 00m 43s)
[19:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:21] T147064: Determine category collation for Livvi-Karelian Wikipedia (olo.wikipedia.org) - https://phabricator.wikimedia.org/T147064
[19:39:21] T146612: Create Livvi-Karelian Wikipedia at olo.wikipedia.org - https://phabricator.wikimedia.org/T146612
[19:40:07] Dereckson: any maintenancescript needed?
[19:40:15] yes [19:40:15] (03PS2) 10Dereckson: Enable VE on fr.wiktionary Projet: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336347 (https://phabricator.wikimedia.org/T156660) [19:40:19] indeed [19:40:32] (03CR) 10Dzahn: "re: "PPC is still confused on cont1001:" [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [19:40:42] updateCollation.php [19:40:54] (03CR) 10Dzahn: [C: 032] contint/zuul: skip Icinga monitoring if server not master [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [19:41:05] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336347 (https://phabricator.wikimedia.org/T156660) (owner: 10Dereckson) [19:41:05] https://www.mediawiki.org/wiki/Manual:$wgCategoryCollation#Details [19:41:27] mwscript updateCollation.php --wiki=olowiki [19:43:38] (03PS10) 10Madhuvishy: labstore: Diamond collector to track directory sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) [19:43:39] mafk: done [19:44:15] !log mwscript updateCollation.php --wiki=olowiki --previous-collation=uppercase (T147064, 4238 rows processed) [19:44:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:23] PROBLEM - Redis replication status tcp_6479 on rdb2005 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.32.133 on port 6479 [19:44:40] 72 Command Line Error: Wrong page range given: the first page (1) can not be after the last page (0).
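Why the category needed a re-sort after the collation change can be sketched in a few lines. This is only an illustration under stated assumptions: `TOY_ALPHABET` is an ordering made up for the example, not the collation data MediaWiki actually uses, and both key functions are hypothetical helpers. The sketch compares plain code point ordering of uppercased titles (roughly what an `uppercase` collation amounts to) with an ordering that places Š next to S, using page names mentioned in the discussion.

```python
# Toy illustration only: this alphabet is an assumption made for the
# example, not the collation data MediaWiki ships for any language.
TOY_ALPHABET = "ABCDEFGHIJKLMNOPQRSŠTUVWXYZŽÅÄÖ"
RANK = {ch: i for i, ch in enumerate(TOY_ALPHABET)}


def uppercase_key(title):
    """Sort key: the uppercased title, compared by code point."""
    return title.upper()


def toy_collation_key(title):
    """Sort key: position of each letter in TOY_ALPHABET.

    Characters outside the toy alphabet sort after every known letter,
    in code point order.
    """
    return [RANK.get(ch, len(TOY_ALPHABET) + ord(ch)) for ch in title.upper()]


# Page names taken from the conversation.
titles = ["Uskondo", "Šintolaižus", "Päivät"]

by_codepoint = sorted(titles, key=uppercase_key)
by_toy_alphabet = sorted(titles, key=toy_collation_key)

print(by_codepoint)      # Š lands after U, because U+0160 is greater than "U"
print(by_toy_alphabet)   # Š lands between S and T, so before "Uskondo"
```

Because sort keys are stored with each category entry, changing the configured collation does not reorder existing entries by itself; the stored keys have to be regenerated, which is what the `updateCollation.php` run logged above does.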
[19:44:49] I wonder if it's updateCollation or another issue [19:45:05] https://olo.wikipedia.org/wiki/Kategourii:P%C3%A4iv%C3%A4t looks better [19:45:14] RECOVERY - Redis replication status tcp_6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 10.192.32.133:6479 has 1 databases (db0) with 3396540 keys, up 100 days 11 hours - replication_delay is 0 [19:47:29] (03PS3) 10Dereckson: Enable VE on fr.wiktionary Projet: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336347 (https://phabricator.wikimedia.org/T156660) [19:47:35] (03CR) 10Rush: labstore: Diamond collector to track directory sizes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) (owner: 10Madhuvishy) [19:47:51] (03CR) 10Dereckson: [C: 032] "SWAT, take two" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336347 (https://phabricator.wikimedia.org/T156660) (owner: 10Dereckson) [19:49:11] (03PS3) 10Andrew Bogott: Horizon: Only display puppet roles that have filtertags in the puppet comments. [puppet] - 10https://gerrit.wikimedia.org/r/335593 (https://phabricator.wikimedia.org/T149589) [19:50:38] So for VE change, we're waiting operations-mw-config-composer-hhvm-jessie [19:50:58] mafk: indeed [19:51:15] hm? [19:51:22] phab ticket close ? [19:51:23] numeric collation looks better [19:51:27] yes we can [19:51:34] (and so we can also close olo wikipedia creation) [19:51:46] (03Merged) 10jenkins-bot: Enable VE on fr.wiktionary Projet: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336347 (https://phabricator.wikimedia.org/T156660) (owner: 10Dereckson) [19:51:51] (03CR) 10Andrew Bogott: [C: 032] Horizon: Only display puppet roles that have filtertags in the puppet comments. 
[puppet] - 10https://gerrit.wikimedia.org/r/335593 (https://phabricator.wikimedia.org/T149589) (owner: 10Andrew Bogott) [19:51:55] (03CR) 10jenkins-bot: Enable VE on fr.wiktionary Projet: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336347 (https://phabricator.wikimedia.org/T156660) (owner: 10Dereckson) [19:52:18] VE change on mwdebug1002 [19:53:37] (03PS1) 10Dzahn: bastionhost,bugzilla: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/336668 (https://phabricator.wikimedia.org/T93645) [19:53:46] https://fr.wiktionary.org/w/index.php?title=Projet:Prononciation&veaction=edit works [19:54:01] (03CR) 10Dzahn: "doing the bastionhost part @ https://gerrit.wikimedia.org/r/#/c/336668/" [puppet] - 10https://gerrit.wikimedia.org/r/334310 (https://phabricator.wikimedia.org/T93645) (owner: 10Juniorsys) [19:54:36] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable VE on fr.wiktionary Projet: namespace (T156660) (duration: 00m 44s) [19:54:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:40] T156660: Enable the Visual Editor for Projet namespace on frwiktionary - https://phabricator.wikimedia.org/T156660 [19:54:48] (03PS2) 10Dzahn: bastionhost,bugzilla: lint fixes [puppet] - 10https://gerrit.wikimedia.org/r/336668 (https://phabricator.wikimedia.org/T93645) [19:54:55] And with that, SWAT done. 
[19:55:30] (with wikimediamessages pending to +2 by their maintainers which we will cherry-pick afterwards) [19:57:31] (03Abandoned) 10Nuria: Adding uaprser to eventlogging deps [puppet] - 10https://gerrit.wikimedia.org/r/335854 (https://phabricator.wikimedia.org/T153207) (owner: 10Nuria) [19:57:45] "grouppage-shell": "{{ns:project}}:Shell users", [19:57:56] there is a special rule for plural by the way [19:58:06] er well for, "group-shell": "Shell users", [19:58:09] not for the page [19:59:39] hmmmm, the rule is only to add a plural when we know how many (to choose among singular, dual, plural, etc.) [20:00:03] PROBLEM - puppet last run on db1049 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [20:00:05] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170208T2000). Please do the needful. [20:00:15] * thcipriani does needful [20:00:23] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5384/" [puppet] - 10https://gerrit.wikimedia.org/r/336668 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [20:00:27] (03PS11) 10Madhuvishy: labstore: Diamond collector to track directory sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) [20:04:54] (03PS3) 10Andrew Bogott: Keystone hooks: Set up default security groups for new projects. [puppet] - 10https://gerrit.wikimedia.org/r/332899 (https://phabricator.wikimedia.org/T136871) [20:06:09] jdlrobson: thanks for the quick work on T157515 :) Could I get you to fix the line-width in your patch and then I'll backport to wmf.11 for the train?
[20:06:09] T157515: Notice: Undefined variable: latest in /srv/mediawiki/php-1.29.0-wmf.11/extensions/MobileFrontend/includes/api/ApiMobileView.php on line 499 - https://phabricator.wikimedia.org/T157515 [20:09:40] (03PS12) 10Madhuvishy: labstore: Diamond collector to track directory sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) [20:09:43] (03PS4) 10Andrew Bogott: Keystone hooks: Set up default security groups for new projects. [puppet] - 10https://gerrit.wikimedia.org/r/332899 (https://phabricator.wikimedia.org/T136871) [20:11:33] (03PS1) 10Dzahn: multiple roles: lint-fix role::backup::host includes [puppet] - 10https://gerrit.wikimedia.org/r/336672 (https://phabricator.wikimedia.org/T93645) [20:11:37] (03CR) 10jerkins-bot: [V: 04-1] labstore: Diamond collector to track directory sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) (owner: 10Madhuvishy) [20:13:11] (03CR) 10Dzahn: "Before we had 3 CRIT/ACKed services in Icinga on contint2001, jenkins_zmq_publisher," [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [20:14:53] (03CR) 10Dzahn: "can we do the same in modules/role/manifests/ci/master.pp for nrpe::monitor_service { 'jenkins_zmq_publisher': now ?:)" [puppet] - 10https://gerrit.wikimedia.org/r/327695 (https://phabricator.wikimedia.org/T150771) (owner: 10Dzahn) [20:17:40] (03PS13) 10Madhuvishy: labstore: Diamond collector to track directory sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) [20:20:05] (03CR) 10Andrew Bogott: [C: 032] Keystone hooks: Set up default security groups for new projects. [puppet] - 10https://gerrit.wikimedia.org/r/332899 (https://phabricator.wikimedia.org/T136871) (owner: 10Andrew Bogott) [20:21:12] PROBLEM - puppet last run on mw1295 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [20:24:56] (03PS14) 10Madhuvishy: labstore: Diamond collector to track directory sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) [20:27:28] jdlrobson: thank you! [20:28:02] RECOVERY - puppet last run on db1049 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:47:07] !log thcipriani@tin Synchronized php-1.29.0-wmf.11/extensions/MobileFrontend/includes/api/ApiMobileView.php: [[gerrit:336676|Pass revision id to parseSectionsData to avoid warnings]] T157515 (duration: 00m 42s) [20:47:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:13] T157515: Notice: Undefined variable: latest in /srv/mediawiki/php-1.29.0-wmf.11/extensions/MobileFrontend/includes/api/ApiMobileView.php on line 499 - https://phabricator.wikimedia.org/T157515 [20:50:12] RECOVERY - puppet last run on mw1295 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:50:44] * halfak excitedly waits for the deployment time for ORES :D [20:50:44] (03PS15) 10Rush: labstore: Diamond collector to track directory sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) (owner: 10Madhuvishy) [20:51:48] (03CR) 10Jdlrobson: Disable RelatedSites on English, French and Italian Wikivoyages (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/335830 (https://phabricator.wikimedia.org/T128326) (owner: 10Jdlrobson) [20:52:02] !log gehel@puppetmaster1001 conftool action : set/pooled=no; selector: name=elastic20(13|14|15|16).codfw.wmnet [20:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:25] (03PS1) 10Thcipriani: group1 wikis to 1.29.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336678 [20:52:27] (03CR) 10Thcipriani: [C: 032] group1 wikis to 1.29.0-wmf.11 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/336678 (owner: 10Thcipriani) [20:53:48] (03PS1) 10Gehel: elasticsearch - reimage elastic20(13|14|15|16) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336679 (https://phabricator.wikimedia.org/T151326) [20:53:50] (03Merged) 10jenkins-bot: group1 wikis to 1.29.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336678 (owner: 10Thcipriani) [20:54:13] (03CR) 10jenkins-bot: group1 wikis to 1.29.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336678 (owner: 10Thcipriani) [20:54:18] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.29.0-wmf.11 [20:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:56:23] (03CR) 10Rush: labstore: Diamond collector to track directory sizes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) (owner: 10Madhuvishy) [20:57:13] (03CR) 10Gehel: [C: 032] elasticsearch - reimage elastic20(13|14|15|16) to jessie and move data to /srv [puppet] - 10https://gerrit.wikimedia.org/r/336679 (https://phabricator.wikimedia.org/T151326) (owner: 10Gehel) [20:58:15] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3010991 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2013.codfw.wmnet'] ```... [20:58:44] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3010995 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2014.codfw.wmnet'] ```... 
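The depool entry logged at 20:52 selects hosts with `name=elastic20(13|14|15|16).codfw.wmnet`. A rough way to see which hosts such a selector covers is sketched below. It assumes the value is treated as a regular expression that must match the whole host name; that matching rule is an assumption of this sketch, not something the log confirms, and `select` is a hypothetical helper rather than part of conftool.

```python
import re

# Selector value copied from the log. Treating it as a regular expression
# that must match the entire host name is an assumption of this sketch.
SELECTOR = r"elastic20(13|14|15|16).codfw.wmnet"

# Host names that appear in the log around the reimage.
hosts = [f"elastic{n}.codfw.wmnet" for n in range(2012, 2017)]


def select(pattern, names):
    """Return the names that the pattern matches in full (hypothetical helper)."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name)]


selected = select(SELECTOR, hosts)
print(selected)
```

Under that assumption, elastic2012 is left out and the four hosts being reimaged are selected. The unescaped dots in the pattern match any character, which is harmless here but worth escaping in stricter use.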
[20:58:48] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3010996 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2015.codfw.wmnet'] ```... [20:58:51] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3010997 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2016.codfw.wmnet'] ```... [21:00:04] gwicke, cscott, arlolra, subbu, bearND, halfak, and Amir1: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170208T2100). Please do the needful. [21:01:42] (03CR) 10Rush: [C: 031] "nvmd size is set to None further up :)" [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) (owner: 10Madhuvishy) [21:01:57] (03PS16) 10Rush: labstore: Diamond collector to track directory sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) (owner: 10Madhuvishy) [21:01:59] o/ [21:02:07] I'm ready to start with ORES. [21:03:34] Looks like a host key changed. I'm double checking before I move forward [21:05:12] RECOVERY - Host eventdonations.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.91 ms [21:06:00] Updating the deploy repo on deployment.eqiad.wmnet [21:06:28] Jeff_Green: how did you do that ^ :) [21:06:41] (03PS17) 10Madhuvishy: labstore: Diamond collector to track directory sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) [21:06:53] mutante: i did nothing :-) [21:07:22] PROBLEM - puppet last run on labvirt1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [21:08:34] Jeff_Green: weird, eventdonations was down in Icinga since forever and we had it acked as "yea, ICMP is dropped somewhere, the service on it is reachable, just Icinga cant ping the host" [21:08:48] well, now it works [21:09:18] weird, I have no idea [21:09:31] benefactorevents is there now, i added the cert check [21:09:42] but as a host, that is also down for Icinga [21:10:48] https://phabricator.wikimedia.org/T156850#3007986 [21:10:58] did you see the part about cert in 22 days? [21:11:02] i can't ping either of them from home fwiw [21:11:03] Looks like I'm getting a permission issue for "vcs@git-ssh.wikimedia.org" but I don't get it on deployment-tin. [21:11:17] yeah, Rob and I have already talked about that renewal [21:11:42] Jeff_Green: cant ping it either, it's cloudapp.net. ok cool @ renewal [21:12:04] thanks for the heads up! [21:12:07] welcome [21:12:52] Jeff_Green: is the ticket resolved? [21:13:02] Can anyone tell me if there's a good reason that ssh-agent might not work with deployment.eqiad.wmnet, but it would work on deployment-tin? [21:13:33] *deployment-tin.eqiad.wmflabs [21:13:44] maybe a known host conflict halfak? [21:13:50] not sure how it would surface in all cases [21:14:04] agent forwarding isn't allowed in production. do you mean ssh -A ? [21:14:08] Hmm.. I don't think that's it. I'm trying to use git-ssh. [21:14:12] Yes [21:14:22] PROBLEM - Host eventdonations.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [21:14:25] it's disabled on the sshd side [21:14:28] git-ssh is necessary for pulling my git repo for ORES [21:14:37] (03PS4) 10Eevans: Enable JMX exporter on RESTBase Staging nodes in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/335826 (https://phabricator.wikimedia.org/T155120) [21:14:56] diffusion https doesn't handle repos above a pretty small size [21:15:28] Any ideas for getting my key in the right place for going a pull via git-ssh? 
[21:16:27] Generate a unique key on that host? [21:17:04] As in, generate a key on tin and then add that key to phab so that vcs@git-ssh.wikimedia.org will be able to recognize it? [21:17:19] I guess I could do that and then delete the key from tin -- or would it be OK it just leave it there? [21:17:19] Yep that'd work [21:17:46] I guess putting it in your homedir and properly chmod'ing it is fine [21:17:55] kk [21:17:56] doing [21:18:14] RainbowSprinkles, I find it an interesting problem to follow your usernames :) [21:18:18] Not a complaint [21:18:19] :D [21:18:33] As soon as you get used to this one.... :p [21:18:42] Jeff_Green: well, that was a short change "Host eventdonations.wikimedia.org is DOWN" .. shrug , ok :) [21:18:52] ha [21:19:14] halfak: you could forward a ssh-agent configured to ask before use [21:20:06] It worked! [21:20:12] \o/ [21:21:36] Platonides, can you forward an agent when sshd refuses it? [21:21:44] jouncebot: next [21:21:45] In 2 hour(s) and 38 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170209T0000) [21:22:18] This update is taking a while :/ [21:22:24] modules/ssh/templates/sshd_config.erb:AllowAgentForwarding no [21:24:56] This is an inane amount of time to wait. This "submodule update" might prevent today's deployment. [21:25:01] That would be sad. [21:25:26] * mutante wants to add patches to swat via a command to jouncebot [21:27:12] Aha! I needed to sync the switch sway from https to git-ssh. [21:27:15] Continuing now [21:28:06] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3011046 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2013.codfw.wmnet'] ``` and were **ALL** successful. [21:28:22] PROBLEM - Check systemd state on elastic2012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
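The "In 2 hour(s) and 38 minute(s)" reply at 21:21:45 is simple clock arithmetic against the start of the next calendar item (20170209T0000). A minimal sketch of that calculation follows; `time_until` is a hypothetical helper written for this example and is not taken from jouncebot's code, and the rounding down to whole minutes is an assumption that happens to reproduce the reply.

```python
from datetime import datetime


def time_until(now, start):
    """Return (hours, minutes) from now until start, rounded down to whole minutes."""
    total_minutes = int((start - now).total_seconds() // 60)
    return divmod(total_minutes, 60)


# Timestamps taken from the log: the request time and the start of the
# next Evening SWAT window.
asked_at = datetime(2017, 2, 8, 21, 21, 45)
window_start = datetime(2017, 2, 9, 0, 0, 0)

hours, minutes = time_until(asked_at, window_start)
print(f"In {hours} hour(s) and {minutes} minute(s)")
```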
[21:28:27] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3011047 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2014.codfw.wmnet'] ``` and were **ALL** successful. [21:29:07] Note to self et al.: update deploy directory before the deployment window next time [21:31:29] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler: Special:TimedMediaHandler doe not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011063 (10Josve05a) [21:31:48] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011075 (10Josve05a) [21:31:54] Old deployment is 228b9b4ff925851bcb36bbeafe54359433ea1e92 [21:32:21] RECOVERY - Check systemd state on elastic2012 is OK: OK - running: The system is fully operational [21:32:21] New deployment is 7c80636313b088928c8eba5d5bdf0b62b8db7f76 [21:32:33] !log halfak@tin Started deploy [ores/deploy@7c80636]: (no justification provided) [21:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:45] halfak: https://phabricator.wikimedia.org/T157621 You know anything about this? [21:33:23] Josve05a, not at all. Should I? [21:34:14] (03PS2) 10Dzahn: multiple roles: lint-fix role::backup::host includes [puppet] - 10https://gerrit.wikimedia.org/r/336672 (https://phabricator.wikimedia.org/T93645) [21:34:19] No. 
You just seem to know everything that's going on in here, so took a chance that you might have known of some problems which was causing this [21:36:18] !log halfak@tin Finished deploy [ores/deploy@7c80636]: (no justification provided) (duration: 03m 45s) [21:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:36:30] RECOVERY - puppet last run on labvirt1003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [21:37:28] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011091 (10Paladox) p:05Triage>03Unbreak! [21:37:51] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 15 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#3011094 (10GWicke) @Tgr, the concerns you raise are primarily about the implementation, and not really about the API. I think it is important to separa... [21:38:35] Deployment of ORES failed at the canary and I've rolled back. [21:38:44] I'll look into the error and try again at another window. [21:38:52] All done. Status: sad :( [21:40:56] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3011114 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2015.codfw.wmnet'] ```... [21:41:15] !log halfak@tin Started deploy [ores/deploy@7c80636]: (no justification provided) [21:41:19] Woops [21:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:27] It's OK. 
I'll stop it at the canary [21:42:41] !log halfak@tin Finished deploy [ores/deploy@7c80636]: (no justification provided) (duration: 01m 26s) [21:42:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:49] Rolled back again. Closing that terminal. [21:43:47] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic20(13|14).codfw.wmnet [21:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:15] (03PS1) 10Platonides: Remove outdated comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336704 [21:45:06] (03CR) 10Dzahn: [C: 031] "http already redirects to https now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336340 (https://phabricator.wikimedia.org/T157398) (owner: 10Platonides) [21:46:56] (03CR) 10Platonides: "PediaPress is not redirecting http: accesses to https:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336340 (https://phabricator.wikimedia.org/T157398) (owner: 10Platonides) [21:47:20] /23/23 [21:50:02] (03CR) 10Dzahn: [C: 032] multiple roles: lint-fix role::backup::host includes [puppet] - 10https://gerrit.wikimedia.org/r/336672 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [21:50:23] jouncebot now [21:50:23] For the next 0 hour(s) and 9 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170208T2000) [21:50:24] For the next 0 hour(s) and 9 minute(s): Services – Parsoid / OCG / Citoid / Mobileapps / ORES / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170208T2100) [21:51:31] (03CR) 10Dzahn: [C: 031] "ah:) ok! 
yep" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/336340 (https://phabricator.wikimedia.org/T157398) (owner: 10Platonides) [21:52:07] 06Operations, 10Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3006952 (10RStallman-legalteam) The NDA for Samtar has been signed and is on file with legal. Thank you! [21:52:45] (03PS18) 10Zppix: labstore: Diamond collector to track directory sizes [puppet] - 10https://gerrit.wikimedia.org/r/335855 (https://phabricator.wikimedia.org/T126623) (owner: 10Madhuvishy) [21:54:22] (03PS1) 10Ottomata: Fix for missing _regexes.py, release 0.7.2-2 [debs/python-ua-parser] - 10https://gerrit.wikimedia.org/r/336712 (https://phabricator.wikimedia.org/T156821) [21:58:14] !log mholloway-shell@tin Started deploy [mobileapps/deploy@0efa7b8]: Update service-mobileapp-node to f45bfff [21:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:57] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3011188 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['elastic2015.codfw.wmnet'] ``` Of which those **FAILED**: ``` set(['elastic201... 
[22:01:10] !log mholloway-shell@tin Finished deploy [mobileapps/deploy@0efa7b8]: Update service-mobileapp-node to f45bfff (duration: 02m 55s) [22:01:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:02:34] (03PS1) 10Dzahn: multiple roles: lint-fix base::firewall includes [puppet] - 10https://gerrit.wikimedia.org/r/336715 (https://phabricator.wikimedia.org/T93645) [22:06:40] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011217 (10Paladox) I Can't reproduce this error locally, i updated my wiki to wmf 11 and it still works i even updated... [22:08:42] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/5387/" [puppet] - 10https://gerrit.wikimedia.org/r/336715 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [22:17:28] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011229 (10Paladox) I doint see it on https://commons.wikimedia.org/wiki/Special:SpecialPages [22:18:53] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011232 (10Josve05a) >>! In T157621#3011217, @Paladox wrote: > I Can't reproduce this error locally, i updated my wiki... [22:19:26] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011235 (10Paladox) oh [22:19:37] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011236 (10Josve05a) >>! 
In T157621#3011229, @Paladox wrote: > I doint see it on https://commons.wikimedia.org/wiki/Spe... [22:20:09] (03PS1) 10Dzahn: multiple roles: lint-fix standard/base::firewall includes [puppet] - 10https://gerrit.wikimedia.org/r/336720 (https://phabricator.wikimedia.org/T93645) [22:20:25] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011238 (10Paladox) i only see Media reports and uploads File list File usage on other wikis List of files with duplic... [22:21:07] paladox: might be too many files to parse the page for the broswer, so it kills the process to load the page...or something? [22:21:20] Wikipedia works [22:21:22] !log elastic2016 not coming up after reimage - powercycling [22:21:24] https://en.wikipedia.org/wiki/Special:TimedMediaHandler [22:21:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:30] though that's using wmf 10 [22:22:27] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011240 (10Josve05a) >>! In T157621#3011238, @Paladox wrote: > i only see > > Media reports and uploads > File list >... [22:23:13] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011242 (10Paladox) @mmodell helped me look into this, and he found this [0e81126f] PHP Fatal Error: request has excee... [22:23:27] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler, 07Wikimedia-log-errors: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011244 (10mmodell) Ok I found the error & stack trace in kibana: `PHP Fatal Error: request... 
[22:23:41] (03PS2) 10Dzahn: multiple roles: lint-fix standard/base::firewall includes [puppet] - 10https://gerrit.wikimedia.org/r/336720 (https://phabricator.wikimedia.org/T93645) [22:23:46] 06Operations, 06Commons, 10MediaWiki-Special-pages, 10TimedMediaHandler, 07Wikimedia-log-errors: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621#3011259 (10mmodell) [22:25:15] (03Abandoned) 10Ottomata: Fix for missing _regexes.py, release 0.7.2-2 [debs/python-ua-parser] - 10https://gerrit.wikimedia.org/r/336712 (https://phabricator.wikimedia.org/T156821) (owner: 10Ottomata) [22:25:32] (03PS1) 10Ottomata: Fix for missing _regexes.py, release 0.7.2-2 [debs/python-ua-parser] (debian) - 10https://gerrit.wikimedia.org/r/336722 (https://phabricator.wikimedia.org/T156821) [22:26:41] 06Operations, 10CirrusSearch, 06Discovery, 10Elasticsearch, and 2 others: Upgrade cirrus / elasticsearch to Jessie - https://phabricator.wikimedia.org/T151326#3011277 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['elastic2016.codfw.wmnet'] ```... [22:26:59] !log demon@tin Synchronized php-1.29.0-wmf.11/includes/WebResponse.php: Debugging fun times (duration: 00m 50s) [22:27:00] (03CR) 10Ottomata: [V: 032 C: 032] Fix for missing _regexes.py, release 0.7.2-2 [debs/python-ua-parser] (debian) - 10https://gerrit.wikimedia.org/r/336722 (https://phabricator.wikimedia.org/T156821) (owner: 10Ottomata) [22:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:29:54] (03CR) 10Dzahn: [C: 032] multiple roles: lint-fix standard/base::firewall includes [puppet] - 10https://gerrit.wikimedia.org/r/336720 (https://phabricator.wikimedia.org/T93645) (owner: 10Dzahn) [22:29:54] thcipriani: You haz logs ^ [22:30:17] RainbowSprinkles: weee! awesome! Thanks :) [22:31:46] Also, in what batshit world is this possible? 
https://logstash.wikimedia.org/goto/3b916ba69ef87993d2fedcce272dcab3 [22:32:01] (03PS2) 10Dzahn: openstack: switch installserver to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/336356 [22:32:08] Something called a wmf.5 codepath on terbium? [22:32:38] Somebody calling initImageData? [22:32:42] (03CR) 10Dzahn: [C: 032] "this will only affect "bare metal" labs which is currently not used" [puppet] - 10https://gerrit.wikimedia.org/r/336356 (owner: 10Dzahn) [22:33:40] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2015.codfw.wmnet [22:33:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:50] (03PS3) 10Dzahn: openstack: switch installserver to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/336356 [22:34:58] (03CR) 10Dzahn: [V: 032 C: 032] openstack: switch installserver to install1002 [puppet] - 10https://gerrit.wikimedia.org/r/336356 (owner: 10Dzahn) [22:37:04] ACKNOWLEDGEMENT - Host eventdonations.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn Amazon AWS, filters ICMP (most of the time :) [22:38:28] PROBLEM - Check systemd state on elastic2015 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:39:13] (03PS2) 10Dzahn: DHCP: switch install1001->1002, 2001->2002 as TFTP server [puppet] - 10https://gerrit.wikimedia.org/r/336364 [22:43:15] 06Operations, 10Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3011352 (10RobH) @RStallman-legalteam: Thanks! If we can get endorsement from @Ocaasi_WMF, and then have this assigned back to me, and I'll merge access live. 
[22:43:17] !log rolling back for wmf.11 from group1 due to T157621
[22:43:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:23] T157621: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621
[22:44:04] (CR) Dzahn: [C: 2] DHCP: switch install1001->1002, 2001->2002 as TFTP server [puppet] - https://gerrit.wikimedia.org/r/336364 (owner: Dzahn)
[22:44:52] (PS1) Thcipriani: Revert "group1 wikis to 1.29.0-wmf.11" [mediawiki-config] - https://gerrit.wikimedia.org/r/336723
[22:45:09] (CR) Dzahn: [C: 2] netboot: remove install2001 [puppet] - https://gerrit.wikimedia.org/r/336363 (owner: Dzahn)
[22:45:45] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 back to 1.29.0-wmf.10
[22:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:45:54] (CR) Thcipriani: [C: 2] Revert "group1 wikis to 1.29.0-wmf.11" [mediawiki-config] - https://gerrit.wikimedia.org/r/336723 (owner: Thcipriani)
[22:47:16] (Merged) jenkins-bot: Revert "group1 wikis to 1.29.0-wmf.11" [mediawiki-config] - https://gerrit.wikimedia.org/r/336723 (owner: Thcipriani)
[22:47:25] (CR) jenkins-bot: Revert "group1 wikis to 1.29.0-wmf.11" [mediawiki-config] - https://gerrit.wikimedia.org/r/336723 (owner: Thcipriani)
[22:48:18] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[22:48:31] (PS2) Dzahn: netboot/partman: remove install2001 [puppet] - https://gerrit.wikimedia.org/r/336363 (https://phabricator.wikimedia.org/T84380)
[22:49:14] (PS3) Dzahn: netboot/partman: remove install2001 [puppet] - https://gerrit.wikimedia.org/r/336363 (https://phabricator.wikimedia.org/T84380)
[22:49:26] (CR) Dzahn: [V: 2 C: 2] netboot/partman: remove install2001 [puppet] - https://gerrit.wikimedia.org/r/336363 (https://phabricator.wikimedia.org/T84380) (owner: Dzahn)
[22:49:32] (PS4) Dzahn: netboot/partman: remove install2001 [puppet] - https://gerrit.wikimedia.org/r/336363 (https://phabricator.wikimedia.org/T84380)
[22:49:38] (CR) Dzahn: [V: 2 C: 2] netboot/partman: remove install2001 [puppet] - https://gerrit.wikimedia.org/r/336363 (https://phabricator.wikimedia.org/T84380) (owner: Dzahn)
[22:50:51] (PS3) Dzahn: add netmon1002 to site [puppet] - https://gerrit.wikimedia.org/r/333780 (https://phabricator.wikimedia.org/T156040)
[22:51:06] (CR) Dzahn: [C: -2] "stalled" [puppet] - https://gerrit.wikimedia.org/r/333780 (https://phabricator.wikimedia.org/T156040) (owner: Dzahn)
[22:51:33] (PS1) RobH: fix ganglia.w.o ssl expiry check [puppet] - https://gerrit.wikimedia.org/r/336725
[22:52:08] (CR) Dzahn: "@Robh @Bblack can we delete ecc-uni.wm.org" [puppet] - https://gerrit.wikimedia.org/r/334209 (owner: Dzahn)
[22:52:33] (CR) Dzahn: "@Robh @bblack another one here, uni.wikimedia.org" [puppet] - https://gerrit.wikimedia.org/r/334210 (owner: Dzahn)
[22:52:33] mutante: bblack will need to field that one
[22:52:57] (CR) Dzahn: [C: 1] "! yes, thanks!"
[puppet] - https://gerrit.wikimedia.org/r/336725 (owner: RobH)
[22:52:59] he handled the latest round of renewals and purchases so he'd know best what is needed
[22:53:14] robh: ok:) and thanks ^
[22:53:19] forgot about that
[22:53:38] yeah fixing two of the recent le migrations
[22:53:44] they are showing with old check intervals
[22:53:53] (CR) RobH: [C: 2] fix ganglia.w.o ssl expiry check [puppet] - https://gerrit.wikimedia.org/r/336725 (owner: RobH)
[22:54:00] yep, i should have done it right away
[22:56:14] Operations, Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3011396 (Ocaasi_WMF) Approved by me! Thanks!
[22:56:27] (PS1) Thcipriani: Revert "Revert "group1 wikis to 1.29.0-wmf.11"" [mediawiki-config] - https://gerrit.wikimedia.org/r/336726
[22:56:33] hmmm is mwscript still around in production boxen?
[22:56:51] trying to get into eval.php to test some stuff on that tmh bug
[22:57:51] mwrepl is on prod boxes
[22:58:09] https://wikitech.wikimedia.org/wiki/Debugging_in_production#Debugging_in_shell
[22:58:13] usr/local/bin/mwrepl: line 32: expanddblist: command not found
[22:59:37] does it only work on a specific machine?
[22:59:42] hrm. this works on mwdebug1002 anyway
[22:59:47] i shelled into an app server
[22:59:50] lemme try that
[23:00:28] I think it should work wherever, but I could be/probably am wrong.
[23:00:33] "welcome to hiphop debugger" woo
[23:00:39] :)
[23:01:17] alright. I'm going to re-roll forward to wmf.11.
[23:01:35] (CR) Thcipriani: [C: 2] Revert "Revert "group1 wikis to 1.29.0-wmf.11"" [mediawiki-config] - https://gerrit.wikimedia.org/r/336726 (owner: Thcipriani)
[23:02:52] (Merged) jenkins-bot: Revert "Revert "group1 wikis to 1.29.0-wmf.11"" [mediawiki-config] - https://gerrit.wikimedia.org/r/336726 (owner: Thcipriani)
[23:03:04] (CR) jenkins-bot: Revert "Revert "group1 wikis to 1.29.0-wmf.11"" [mediawiki-config] - https://gerrit.wikimedia.org/r/336726 (owner: Thcipriani)
[23:04:10] (PS1) RobH: fixing librenms.w.o check to LE interval [puppet] - https://gerrit.wikimedia.org/r/336727
[23:04:24] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 to 1.29.0-wmf.11 -- T157621 is not code-change related
[23:04:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:04:30] T157621: Special:TimedMediaHandler does not exist and won't even load a webpage - https://phabricator.wikimedia.org/T157621
[23:04:51] mutante: can you +1 https://gerrit.wikimedia.org/r/#/c/336727/1 ?
[23:05:04] if you are about that is =]
[23:05:11] (CR) Dzahn: [C: 1] fixing librenms.w.o check to LE interval [puppet] - https://gerrit.wikimedia.org/r/336727 (owner: RobH)
[23:05:14] thx!
[23:05:18] np
[23:05:25] (CR) RobH: [C: 2] fixing librenms.w.o check to LE interval [puppet] - https://gerrit.wikimedia.org/r/336727 (owner: RobH)
[23:05:41] (PS1) Dzahn: mirrors: lint-ignore 'passwords::mirrors not in autoload module layout' [puppet] - https://gerrit.wikimedia.org/r/336728 (https://phabricator.wikimedia.org/T93645)
[23:06:35] (PS2) Dzahn: mirrors: lint-ignore 'passwords::mirrors not in autoload module layout' [puppet] - https://gerrit.wikimedia.org/r/336728 (https://phabricator.wikimedia.org/T93645)
[23:12:31] RECOVERY - Check systemd state on elastic2015 is OK: OK - running: The system is fully operational
[23:16:01] RECOVERY - puppet last run on labsdb1005 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[23:16:08] aha
[23:19:18] thcipriani: I can't get VE to load on wikitech and am wondering if that is .11 related.
[23:19:21] PROBLEM - puppet last run on elastic2015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[23:19:46] bd808: hrm hadn't seen that one, checking other wikis
[23:19:51] I'm having a hard time figuring out why it is hanging part way through the loading process
[23:21:57] commons seems ok, so maybe something strangely specific to wikitech
[23:23:17] hrm wmgVisualEditorAccessRESTbaseDirectly is a setting I just noticed
[23:23:18] (PS1) Chad: Scap clean: Rework --l10n-only into --keep-static [mediawiki-config] - https://gerrit.wikimedia.org/r/336730 (https://phabricator.wikimedia.org/T157631)
[23:23:26] that is set to false for wikitech
[23:24:05] !log gehel@puppetmaster1001 conftool action : set/pooled=yes; selector: name=elastic2016.codfw.wmnet
[23:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:34] bd808: I can rollback labswiki to see if wmf.11 is the culprit.
[23:24:53] thcipriani: maybe wait a bit. James_F is taking a look
[23:24:56] ok
[23:25:25] bd808: Loads for me.
[23:25:33] bd808: It was seriously slow, but it loaded.
[23:25:53] James_F: hmmm... ok. Let me see if I've got other weirdness in my site js then
[23:26:27] Seriously slow ~= 51 seconds to load https://wikitech.wikimedia.org/wiki/Deployments?veaction=edit
[23:26:56] Of course on second load that was ~7 seconds (yay caching), still too slow.
[23:28:36] blerg. still no joy for me
[23:28:40] (CR) Dzahn: [C: 2] mirrors: lint-ignore 'passwords::mirrors not in autoload module layout' [puppet] - https://gerrit.wikimedia.org/r/336728 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[23:29:19] VisualEditor works for me.
[23:29:49] FWIW, I was able to load Deployment in VE fairly quickly.
[23:30:03] *Deployments
[23:32:25] bd808: Anything in the console? Is the XHR taking forever or is it aborting?
[23:32:36] weird. That one won't load for me either
[23:32:52] no error logs and I can't really see any network request that is outstanding
[23:32:56] * bd808 tries Chrome
[23:33:37] (PS1) Dzahn: mariadb: ingore lint warnings about autoloader layout [puppet] - https://gerrit.wikimedia.org/r/336731 (https://phabricator.wikimedia.org/T93645)
[23:34:10] (PS2) Dzahn: mariadb: ingore lint warnings about autoloader layout [puppet] - https://gerrit.wikimedia.org/r/336731 (https://phabricator.wikimedia.org/T93645)
[23:34:10] bd808: If only you were in the office I could debug on your device. ;-)
[23:34:29] RECOVERY - puppet last run on elastic2015 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[23:34:54] chrome works as anon. trying FF incognito
[23:35:29] PROBLEM - puppet last run on labvirt1014 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues
[23:37:14] (PS3) Dzahn: mariadb: ignore lint warnings about autoloader layout [puppet] - https://gerrit.wikimedia.org/r/336731 (https://phabricator.wikimedia.org/T93645)
[23:37:16] James_F: it seems to be specific to my normal Firefox profile. It works fine (although slow) in a new profile with no add-ons
[23:37:34] thcipriani: ^ local problem. Sorry for the noise
[23:37:38] bd808: Interesting.
[23:37:51] bd808: Possibly an ad blocker is killing the XHR?
[23:37:53] bd808: np, good looking-out :)
[23:37:57] (CR) Dzahn: [C: 2] "comments-only that make it ignore warnings" [puppet] - https://gerrit.wikimedia.org/r/336731 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[23:38:42] (PS4) Dzahn: mariadb: ignore lint warnings about autoloader layout [puppet] - https://gerrit.wikimedia.org/r/336731 (https://phabricator.wikimedia.org/T93645)
[23:38:52] (PS1) Jdlrobson: Beta cluster should show related pages 100% of time [mediawiki-config] - https://gerrit.wikimedia.org/r/336732 (https://phabricator.wikimedia.org/T157372)
[23:38:54] (PS1) Jdlrobson: Labs instances should reflect production value for RelatedArticlesShowInFooter [mediawiki-config] - https://gerrit.wikimedia.org/r/336733
[23:38:56] VisualEditor on firefox works for me
[23:39:00] But its slow.
[23:39:06] On safari it's fast.
[23:39:20] James_F: possibly. I do have a pile of blocking extensions.
[23:39:33] * bd808 tries removing layers of tin foil
[23:39:42] hey, to update config on labs only do I need to use the SWAT window?
[23:39:42] * James_F grins.
[23:39:49] or does that just cause a distraction?
[23:40:46] James_F: and... now its magically working with no config changes :/
[23:41:07] Oy.
[23:42:54] (CR) Jdlrobson: [C: 1] "I am wrong. Since this defaults to true anyway in MobileFrontend this is not a problem and can be swatted immediately."
[mediawiki-config] - https://gerrit.wikimedia.org/r/336664 (https://phabricator.wikimedia.org/T157075) (owner: Jdlrobson)
[23:44:42] (PS1) Dzahn: puppet-lint: remove exception for wrong autoloader layout [puppet] - https://gerrit.wikimedia.org/r/336735 (https://phabricator.wikimedia.org/T93645)
[23:45:45] Operations, Puppet, Epic, Need-volunteer, Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#3011636 (Dzahn) https://gerrit.wikimedia.org/r/#/q/topic:puppet-lint.rc+%28status:open+OR+status:merged%29
[23:46:36] (CR) Dzahn: [C: 2] "Jenkins saying Verified+2 is proof that we can remove this now. :)" [puppet] - https://gerrit.wikimedia.org/r/336735 (https://phabricator.wikimedia.org/T93645) (owner: Dzahn)
[23:50:56] Operations, Ops-Access-Requests: Request for access to stat1003 for Sam Tarling - https://phabricator.wikimedia.org/T157483#3011652 (RobH) a:Ocaasi_WMF>RobH
[23:53:37] Operations, Puppet, Epic, Need-volunteer, Patch-For-Review: align puppet-lint config with coding style - https://phabricator.wikimedia.org/T93645#3011658 (Dzahn)
[23:55:04] Operations, Collection, Traffic, HTTPS, Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3011661 (Dzahn) a:Dzahn
[23:55:30] Operations, Collection, Traffic, HTTPS, Patch-For-Review: Book collections communicate with pediapress using http: - https://phabricator.wikimedia.org/T157398#3003811 (Dzahn) https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1480106&oldid=1478688