[00:00:04] RoanKattouw ostriches Krenair MaxSem: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160309T0000).
[00:00:04] odder ebernhardson AaronSchulz RoanKattouw jdlrobson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:09] I'll do it
[00:00:10] \e
[00:00:21] I have a fatal fix to add to that
[00:00:35] present
[00:00:36] A cherry-pick of https://gerrit.wikimedia.org/r/#/c/276053/ to wmf16 once Jenkins finishes with it
[00:00:45] so, like 20 minutes ;P
[00:01:08] RoanKattouw, not reviewed? :D
[00:01:47] jdlrobson, your change is already in wmf16?
[00:02:35] MaxSem: lemme double check
[00:04:30] okay, sent all the extension changes to h^H Zuul, let's do config meanwhile
[00:06:35] yay
[00:06:50] (CR) MaxSem: [C: 2] Modify wgImportSources for plwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/275275 (https://phabricator.wikimedia.org/T129015) (owner: Odder)
[00:07:22] 12 minutes for the cherrypick, since the mediawiki-extensions-hhvm job takes 6 minutes (and it needed to merge to master first)
[00:07:40] Operations, MobileFrontend, Traffic, Reading-Web-Sprint-67-If, Then, Else...?, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2101562 (Jdlrobson)
[00:08:05] (Merged) jenkins-bot: Modify wgImportSources for plwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/275275 (https://phabricator.wikimedia.org/T129015) (owner: Odder)
[00:09:09] * AaronSchulz can smell the regularly scheduled cannabis again...some one is having fun every day...
[00:09:16] nice one
[00:09:42] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/275275 (duration: 00m 29s)
[00:09:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:09:56] odder, please check ^
[00:10:03] Operations, Labs, wikitech.wikimedia.org, Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2101565 (Krenair) a:Krenair * Copied my usual files across (basically what I have in prod puppet modules/admin/files/home/krenair), installed git (and bas...
[00:10:23] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[00:10:43] (CR) MaxSem: [C: 2] Stop pushing ES updates to nobelium [mediawiki-config] - https://gerrit.wikimedia.org/r/275858 (owner: EBernhardson)
[00:11:43] Operations, Wikimedia-General-or-Unknown, User-bd808: Update Wikimedia Debug extensions for Chrome and Firefox for configurable backend selection - https://phabricator.wikimedia.org/T129283#2101573 (ori) a:bd808 Chrome extension done in https://github.com/wikimedia/ChromeWikimediaDebug/commit/2e7c8...
[00:12:07] (Merged) jenkins-bot: Stop pushing ES updates to nobelium [mediawiki-config] - https://gerrit.wikimedia.org/r/275858 (owner: EBernhardson)
[00:13:10] MaxSem: Can I add https://gerrit.wikimedia.org/r/276062 ? (I added it to the wiki page too)
[00:13:19] !log maxsem@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/275858/ (duration: 00m 28s)
[00:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:13:25] yeah, RoanKattouw
[00:13:54] MaxSem: Looks OK, thank you
[00:14:18] ebernhardson, ^^
[00:14:19] Thanks
[00:14:57] (CR) MaxSem: [C: 2] Enable async swift writes for remaining backends [mediawiki-config] - https://gerrit.wikimedia.org/r/272922 (owner: Aaron Schulz)
[00:15:40] (Merged) jenkins-bot: Enable async swift writes for remaining backends [mediawiki-config] - https://gerrit.wikimedia.org/r/272922 (owner: Aaron Schulz)
[00:16:20] MaxSem: verified no writes coming into nobelium anymore
[00:16:55] !log maxsem@tin Synchronized wmf-config/filebackend-production.php: https://gerrit.wikimedia.org/r/272922 (duration: 00m 27s)
[00:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:17:05] AaronSchulz, ^^
[00:17:13] ok
[00:17:52] <3 mediawiki-extensions-php55
[00:19:50] MaxSem: you can do https://gerrit.wikimedia.org/r/#/c/275378/ to ;)
[00:19:58] probably won't break anything
[00:20:34] Operations, Mail: status of studentgroups@ and studentclubs@ mail aliases? - https://phabricator.wikimedia.org/T127550#2101592 (FloorKoudijs) Hi @Dzahn for asking. I have never heard of those email aliases. They don't sound familiar to me at all, I'm afraid. Sorry I couldn't be of more help. Best of luck!
[00:22:36] (CR) MaxSem: [C: 2] Added some jobqueue comments [mediawiki-config] - https://gerrit.wikimedia.org/r/275378 (owner: Aaron Schulz)
[00:23:47] (Merged) jenkins-bot: Added some jobqueue comments [mediawiki-config] - https://gerrit.wikimedia.org/r/275378 (owner: Aaron Schulz)
[00:25:54] !log maxsem@tin Synchronized wmf-config/: https://gerrit.wikimedia.org/r/275378 - comment only (duration: 00m 27s)
[00:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:26:05] (PS1) Rillke: Add pixabay.com to wgCopyUploadsDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/276069
[00:26:43] (CR) Rillke: [C: -1] "Needs community consensus" [mediawiki-config] - https://gerrit.wikimedia.org/r/276069 (owner: Rillke)
[00:29:12] !log maxsem@tin Synchronized php-1.27.0-wmf.16/extensions/Echo/: SWAT (duration: 00m 38s)
[00:29:15] RoanKattouw, ^
[00:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:30:10] !log maxsem@tin Synchronized php-1.27.0-wmf.16/extensions/MobileFrontend/: SWAT (duration: 00m 29s)
[00:30:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:30:18] jdlrobson, ^
[00:31:39] ori: do you use clush from your workstation, or from an internal host?
[00:32:16] !log maxsem@tin Synchronized php-1.27.0-wmf.15/extensions/MobileFrontend/: SWAT (duration: 00m 32s)
[00:32:20] jdlrobson, ^
[00:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:32:44] MaxSem: boom
[00:32:45] confirmed
[00:32:53] RoanKattouw, ^
[00:33:01] urandom: from my workstation, but not by choice. Need to figure out an authentication scheme for production usage, since key forwarding is not permitted.
[00:33:03] !log maxsem@tin Synchronized php-1.27.0-wmf.15/extensions/Echo/: SWAT (duration: 00m 32s)
[00:33:06] (CR) Rillke: "https://commons.wikimedia.org/wiki/Commons:Village_pump#Add_a_source_of_public_domain_images_to_the_upload-by-url_configuration" [mediawiki-config] - https://gerrit.wikimedia.org/r/276069 (owner: Rillke)
[00:33:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:33:29] ori: does it work reliably?
[00:33:36] Operations, MobileFrontend, Traffic, Reading-Web-Sprint-67-If, Then, Else...?, and 3 others: Incorrect TOC and section edit links rendering in Vector due to ParserCache corruption via ParserOutput::setText( ParserOutput::getText() ) - https://phabricator.wikimedia.org/T124356#2101616 (Jdlrobson) SW...
[00:33:52] ori: i get key exchange errors
[00:34:09] ori: to be fair, i get them from dsh too, if i enable concurrent execution
[00:34:42] i don't, but I use a custom-written ssh agent I wrote for playing nicely with the yubikey
[00:34:43] maybe something to do with ProxyCommand?
[00:35:27] https://gist.github.com/atdt/bac712b0a45ebe06b614
[00:36:00] maybe! I don't use it very much; I use salt. (Salt is not as good, and I want to replace it with clush, but salt is installed in production..)
[00:36:57] (PS1) Aaron Schulz: Set cross-DC swift writes to be sync for originals for switchover testing [mediawiki-config] - https://gerrit.wikimedia.org/r/276071
[00:36:58] ori: but you still proxy commands through bast1001.wikimedia.org?
[00:40:26] urandom: bast4001 since it's closer to me but yeah
[00:40:44] (PS9) BBlack: WIP: cache_app_route() w/ split [puppet] - https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484)
[00:42:38] (CR) jenkins-bot: [V: -1] WIP: cache_app_route() w/ split [puppet] - https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) (owner: BBlack)
[00:45:05] (PS10) BBlack: WIP: cache_app_route() w/ split [puppet] - https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484)
[00:45:26] Operations, Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2101674 (RobH)
[00:46:28] (CR) jenkins-bot: [V: -1] WIP: cache_app_route() w/ split [puppet] - https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) (owner: BBlack)
[00:51:06] (PS11) BBlack: WIP: cache_app_route() w/ split [puppet] - https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484)
[00:51:09] ori: so that was fun. i changed my ssh config to point to bast4001, and clush spawned thousands of ssh process and took down my machine
[00:51:22] *hard*
[00:54:59] (CR) Josve05a: "I am the one that started fixing with the categorization and tagging of these images that we have on Commons and..." [mediawiki-config] - https://gerrit.wikimedia.org/r/276069 (owner: Rillke)
[00:56:57] (PS12) BBlack: WIP: cache_app_route() w/ split [puppet] - https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484)
[00:56:59] (PS1) BBlack: Test commit for CI only: codfw->direct [puppet] - https://gerrit.wikimedia.org/r/276077
[00:57:01] (PS1) BBlack: Test commit for CI only: rb split, appservers codfw [puppet] - https://gerrit.wikimedia.org/r/276078
[01:03:20] (Abandoned) BBlack: Test commit for CI only: rb split, appservers codfw [puppet] - https://gerrit.wikimedia.org/r/276078 (owner: BBlack)
[01:03:32] (Abandoned) BBlack: Test commit for CI only: codfw->direct [puppet] - https://gerrit.wikimedia.org/r/276077 (owner: BBlack)
[01:07:37] Blocked-on-Operations, Operations, Wikipedia-iOS-App-Product-Backlog: Provide access to iOS team for piwik production server - https://phabricator.wikimedia.org/T124218#2101737 (JMinor) Thanks. I was able to log in this afternoon, no problems.
[01:10:09] Operations, Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2101739 (RobH)
[01:11:20] (PS13) BBlack: cache_app_route(): parser func for cache->app routing [puppet] - https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484)
[01:12:05] (CR) BBlack: [C: 1] "Works in CI, including some abandoned test commits to switch routes to direct:direct split." [puppet] - https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) (owner: BBlack)
[01:22:33] Operations, Labs, wikitech.wikimedia.org, Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2101749 (Krenair) * Copied the wikitech import cron across * Copied MW config * Set up new copy of MediaWiki * Installed memcached and mysql-server-core-5.5/...
[01:33:21] RECOVERY - cassandra-a CQL 10.64.0.117:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on port 9042
[01:39:56] !log csteipp@tin Synchronized php-1.27.0-wmf.16/includes/api/ApiQueryInfo.php: (no message) (duration: 00m 33s)
[01:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:40:30] !log redeploy patch for T129120
[01:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:51:51] Operations, Patch-For-Review: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2101765 (Dzahn) ``` parted_server: OUT: 1 0-7998537727 7998537728 primary linux-swap /dev/mapper/bast2001--vg-swap_1 parted_server: Partitions printed parted_server: OUT: parted_server: Closing i...
[02:01:42] Operations, Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2101805 (RobH)
[02:02:52] Operations, Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2099212 (RobH) a:RobH>mobrovac @mobrovac, I'm assigning this task to you for service implementation, as you were the initial requestor on T128475. If this isn't correct, feel free to state such and assign...
[02:05:27] Operations, ops-codfw: update physical label on sc[ab]200[1-2] - https://phabricator.wikimedia.org/T129305#2101814 (RobH)
[02:06:09] Operations, Language-Engineering, Services, Patch-For-Review, and 2 others: Prepare cxserver/zotero for the codfw switchover - https://phabricator.wikimedia.org/T125065#2101829 (RobH)
[02:06:12] Operations, Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2099212 (RobH)
[02:06:20] Operations, Mathoid, Services, codfw-rollout, codfw-rollout-Jan-Mar-2016: Prepare mathoid for the codfw switchover - https://phabricator.wikimedia.org/T125058#2101831 (RobH)
[02:06:23] Operations, Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2099212 (RobH)
[02:06:29] Operations, Graphoid, Services, codfw-rollout, codfw-rollout-Jan-Mar-2016: Prepare graphoid for the codfw switchover - https://phabricator.wikimedia.org/T125060#2101833 (RobH)
[02:06:35] Operations, Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2099212 (RobH)
[02:06:43] Operations, Services: setup/deploy sc[a-b]200[1-2] - https://phabricator.wikimedia.org/T129234#2099212 (RobH)
[02:06:46] Operations, Services, Mobile-Content-Service, codfw-rollout, codfw-rollout-Jan-Mar-2016: Prepare mobileapps for the codfw switchover - https://phabricator.wikimedia.org/T125061#2101836 (RobH)
[02:07:01] Operations, Services: setup/deploy sc[ab]200[1-2] - https://phabricator.wikimedia.org/T129234#2099212 (RobH)
[02:07:18] (PS1) Dzahn: netboot: bast2001->raid1-lvm, no more wildcard [puppet] - https://gerrit.wikimedia.org/r/276085
[02:07:38] (PS2) Dzahn: netboot: bast2001->raid1-lvm, no more wildcard [puppet] - https://gerrit.wikimedia.org/r/276085 (https://phabricator.wikimedia.org/T128899)
[02:07:53] Operations, Services, Mobile-Content-Service, codfw-rollout, codfw-rollout-Jan-Mar-2016: Prepare mobileapps for the codfw switchover - https://phabricator.wikimedia.org/T125061#1973542 (RobH)
[02:07:55] Operations, Graphoid, Services, codfw-rollout, codfw-rollout-Jan-Mar-2016: Prepare graphoid for the codfw switchover - https://phabricator.wikimedia.org/T125060#1973519 (RobH)
[02:07:59] Operations, Mathoid, Services, codfw-rollout, codfw-rollout-Jan-Mar-2016: Prepare mathoid for the codfw switchover - https://phabricator.wikimedia.org/T125058#1973500 (RobH)
[02:08:03] Operations, Services, hardware-requests: codfw: (2+2) sca & scb service clusters - https://phabricator.wikimedia.org/T128475#2101841 (RobH) Open>Resolved This #hardware-requests has been filled with the setup on task T129234. As service implementation is now the only remaining step, I'm resolvin...
[02:08:09] ticket dependency spam echoooooooo
[02:08:34] maybe we shouldnt echo when we just add blockers?
[02:08:41] (PS3) Dzahn: netboot: bast2001->raid1-lvm, no more wildcard [puppet] - https://gerrit.wikimedia.org/r/276085 (https://phabricator.wikimedia.org/T128899)
[02:08:48] (CR) Dzahn: [C: 2] netboot: bast2001->raid1-lvm, no more wildcard [puppet] - https://gerrit.wikimedia.org/r/276085 (https://phabricator.wikimedia.org/T128899) (owner: Dzahn)
[02:09:02] oh, i suppose its the reference echoing, not the task that has the blocker, meh.
[02:09:19] mutante: huzzah, down with wildcards in there, its always a bad idea.
[02:11:13] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0]
[02:12:40] robh: ok! yea, i'm just trying to get bast2001 to work again, next trying the same one like on the other bastion and wildcard mixed with the other lines.. yea
[02:12:58] !log ori@tin Synchronized php-1.27.0-wmf.16/includes/diff/DairikiDiff.php: I4d4b8f81c: Dont quote assert expressions in DairikiDiff (duration: 00m 31s)
[02:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:13:26] !log ori@tin Synchronized php-1.27.0-wmf.15/includes/diff/DairikiDiff.php: I4d4b8f81c: Dont quote assert expressions in DairikiDiff (duration: 00m 27s)
[02:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:18:11] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[02:20:09] robh: ..and the installer got past partitioning.. now i just need it to not show those RAID errors anymore
[02:20:52] hopefully that fixes it, seems like it could have caused the issue though
[02:21:03] yep
[02:21:59] Krenair: also, i'm back if you need anything for wt-static-jessie
[02:22:34] nope, everything is going well at the moment. thanks for asking
[02:22:51] ok, cool
[02:29:25] (PS1) Dzahn: (WIP) switch wikitech-static to new jessie VM [dns] - https://gerrit.wikimedia.org/r/276088 (https://phabricator.wikimedia.org/T126385)
[02:31:54] what the heck.. so the install was almost done and then switches to black screen on console again
[02:32:01] way after base system
[02:32:55] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.15) (duration: 13m 28s)
[02:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:34:15] sc-dhcp-client...
[02:34:35] that's literally what i see now
[02:35:12] it's still doing things but console output got messed up at a certain point
[02:46:29] (CR) Dzahn: Add a deployment source & target class for phabricator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 20after4)
[02:47:50] (CR) Dzahn: [C: 1] "i'm not that familiar with scap and keyholder but definitely +1 for the "key_content => secret"/cleaner pattern to specify private key. we" [puppet] - https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 20after4)
[02:49:56] Operations, Labs, wikitech.wikimedia.org, Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2101906 (Dzahn) once we are ready to switch i was going to do this https://gerrit.wikimedia.org/r/#/c/276088/
[02:51:26] (CR) 20after4: [C: 1] Add a deployment source & target class for phabricator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 20after4)
[02:54:01] (CR) Dzahn: Add a deployment source & target class for phabricator (1 comment) [puppet] - https://gerrit.wikimedia.org/r/274502 (https://phabricator.wikimedia.org/T114363) (owner: 20after4)
[02:55:46] Operations, Mail: status of studentgroups@ and studentclubs@ mail aliases? - https://phabricator.wikimedia.org/T127550#2101912 (Dzahn) Hey @FloorKoudijs thank you! I have a feeling these are not used since a long time, but i'll check logfiles to make sure.
[03:08:20] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.16) (duration: 17m 51s)
[03:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:16:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Mar 9 03:16:21 UTC 2016 (duration 8m 2s)
[03:16:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:33:44] (PS1) Yuvipanda: tools: Add specific paging monitoring for PAWS too [puppet] - https://gerrit.wikimedia.org/r/276089 (https://phabricator.wikimedia.org/T129209)
[03:35:39] (CR) Yuvipanda: [C: 2] tools: Add specific paging monitoring for PAWS too [puppet] - https://gerrit.wikimedia.org/r/276089 (https://phabricator.wikimedia.org/T129209) (owner: Yuvipanda)
[03:37:48] Operations, Labs, wikitech.wikimedia.org, Patch-For-Review: Wikitechwiki has 4xx responses to requests for some static assets inc. poweredby_mediawiki_88x31.png and WikiEditor's button-sprite.svg - https://phabricator.wikimedia.org/T128747#2101935 (Krenair) Open>Resolved a:Krenair>None
[03:42:51] (PS1) Yuvipanda: tools: Add paws as a separate host [puppet] - https://gerrit.wikimedia.org/r/276090 (https://phabricator.wikimedia.org/T129209)
[03:44:38] (CR) Yuvipanda: [C: 2] tools: Add paws as a separate host [puppet] - https://gerrit.wikimedia.org/r/276090 (https://phabricator.wikimedia.org/T129209) (owner: Yuvipanda)
[03:45:35] Operations, Labs, wikitech.wikimedia.org, Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2101940 (Krenair) * Fiddled with apache config some more (`Require all granted`) to make this work under Apache 2.4 * Installed php5-mysql * `mysqldump` on o...
[03:51:33] Operations, Labs, wikitech.wikimedia.org, Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2101941 (Krenair) a:Krenair>Andrew @Andrew, do you think this is OK to switch now?
[04:01:11] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[04:01:50] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors
[04:04:52] aha!
[04:04:54] that explains it
[04:05:02] fucking strontium
[04:06:47] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[04:13:05] (PS1) Yuvipanda: Revert "tools: Add specific paging monitoring for PAWS too" [puppet] - https://gerrit.wikimedia.org/r/276092
[04:13:27] (CR) jenkins-bot: [V: -1] Revert "tools: Add specific paging monitoring for PAWS too" [puppet] - https://gerrit.wikimedia.org/r/276092 (owner: Yuvipanda)
[04:16:29] (PS1) Yuvipanda: Revert "tools: Add paws as a separate host" [puppet] - https://gerrit.wikimedia.org/r/276093
[04:16:34] (Abandoned) Yuvipanda: Revert "tools: Add specific paging monitoring for PAWS too" [puppet] - https://gerrit.wikimedia.org/r/276092 (owner: Yuvipanda)
[04:16:45] (CR) Yuvipanda: [C: 2 V: 2] Revert "tools: Add paws as a separate host" [puppet] - https://gerrit.wikimedia.org/r/276093 (owner: Yuvipanda)
[04:19:59] PROBLEM - dhclient process on bast2001 is CRITICAL: Connection refused by host
[04:20:07] PROBLEM - puppet last run on bast2001 is CRITICAL: Connection refused by host
[04:20:29] PROBLEM - Check size of conntrack table on bast2001 is CRITICAL: Connection refused by host
[04:20:48] PROBLEM - DPKG on bast2001 is CRITICAL: Connection refused by host
[04:20:49] PROBLEM - RAID on bast2001 is CRITICAL: Connection refused by host
[04:20:58] PROBLEM - NTP on bast2001 is CRITICAL: NTP CRITICAL: No response from NTP server
[04:21:08] PROBLEM - salt-minion processes on bast2001 is CRITICAL: Connection refused by host
[04:21:28] PROBLEM - Disk space on bast2001 is CRITICAL: Connection refused by host
[04:21:37] PROBLEM - configured eth on bast2001 is CRITICAL: Connection refused by host
[04:23:38] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct
[04:28:33] Operations, Labs, Patch-For-Review: Setup private docker registry with authentication support in tools - https://phabricator.wikimedia.org/T118758#2102004 (yuvipanda)
[04:29:59] (PS1) JGirault: Bump portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/276094 (https://phabricator.wikimedia.org/T125472)
[04:36:04] Operations, Patch-For-Review: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2102029 (Dzahn) after this ^ change and reinstalling again the installer went past the partitioning and looked all good, then console output became messed up, became extremely slow but was apparently sti...
[04:43:33] Operations, Patch-For-Review: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2102031 (Dzahn) "reboot" didn't work from here. "kill 1" finally got me out. additionally, "console com2" randomly stops working on this DRAC and i had to reset it twice, then it worked again.. ser...
[04:44:29] PROBLEM - SSH on bast2001 is CRITICAL: Connection timed out
[04:46:17] ok, bast2001, it breaks more while i try to fix it
[04:46:28] now dont even see it boot up anymore
[04:48:38] PROBLEM - Host bast2001 is DOWN: PING CRITICAL - Packet loss = 100%
[04:52:08] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:53:41] ACKNOWLEDGEMENT - Host bast2001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T128899
[04:54:38] RECOVERY - Host bast2001 is UP: PING OK - Packet loss = 0%, RTA = 36.48 ms
[05:11:23] tried again.. installer starts, works normal and quick. at some random time half way thru installing base system becomes super slow.. then moves on .. appears to freeze
[05:12:22] lol, just wanted to give up and it switched from 58 to 59%
[05:12:52] super slow but not dead
[05:19:59] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:37:09] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [24.0]
[05:40:29] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[05:45:45] (PS1) Ori.livneh: Allow X-Wikimedia-Debug header to contain multiple fields [puppet] - https://gerrit.wikimedia.org/r/276100
[05:56:56] (PS2) Ori.livneh: Allow X-Wikimedia-Debug header to contain multiple fields [puppet] - https://gerrit.wikimedia.org/r/276100
[05:57:03] (CR) Ori.livneh: [C: 2] Allow X-Wikimedia-Debug header to contain multiple fields [puppet] - https://gerrit.wikimedia.org/r/276100 (owner: Ori.livneh)
[05:57:15] (CR) Ori.livneh: [V: 2] Allow X-Wikimedia-Debug header to contain multiple fields [puppet] - https://gerrit.wikimedia.org/r/276100 (owner: Ori.livneh)
[06:07:08] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-0/0/1: down - Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave]BR
[06:07:18] PROBLEM - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave]BR
[06:21:12] Operations, Patch-For-Review: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2102089 (Dzahn) ``` 21:15 < mutante> tried again.. installer starts, works normal and quick. at some random time half way thru installing base system becomes super slo...
[06:26:10] !log bast2001 - still installing in snail mode - please feel free to check if it's done, and if so re-add to puppet so users get created.thx
[06:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:30:38] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:30:59] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:48] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:29] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:38] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:39] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:57:28] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:58:09] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:10] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:58:19] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:49:15] <_joe_> !log uploading hhvm 3.12, extensions to reprepro
[07:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:13:25] (PS1) Giuseppe Lavagetto: Reroute jobqueue writes from rdb1003 to rdb1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/276105 (https://phabricator.wikimedia.org/T123675)
[08:13:57] Operations, Patch-For-Review: Reinstall redis servers (Job queues) with Jessie (NOTE: rdb1002 is special and is excluded!) - https://phabricator.wikimedia.org/T123675#2102137 (elukey)
[08:13:59] Operations, Performance-Team, codfw-rollout, codfw-rollout-Jan-Mar-2016: Figure out how to migrate the jobqueues - https://phabricator.wikimedia.org/T124673#2102136 (elukey)
[08:16:17] (PS1) Giuseppe Lavagetto: jobqueue_redis: remove temporarily rdb1003 for reimaging [puppet] - https://gerrit.wikimedia.org/r/276106 (https://phabricator.wikimedia.org/T124675)
[08:21:16] (PS1) Jcrespo: Repool db2035 after partitioning [mediawiki-config] - https://gerrit.wikimedia.org/r/276107
[08:23:05] (CR) Jcrespo: [C: 2] Repool db2035 after partitioning [mediawiki-config] - https://gerrit.wikimedia.org/r/276107 (owner: Jcrespo)
[08:23:22] Hey, I need a little help from ops for this task: https://phabricator.wikimedia.org/T129112
[08:23:37] I made all wheels
[08:23:49] but I need a repo in gerrit to put them
[08:24:15] can anyone help on making the repo?
[08:24:34] (PS2) Giuseppe Lavagetto: Reroute jobqueue writes from rdb1003 to rdb1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/276105 (https://phabricator.wikimedia.org/T123675)
[08:26:25] (CR) Giuseppe Lavagetto: [C: 2] Reroute jobqueue writes from rdb1003 to rdb1005 [mediawiki-config] - https://gerrit.wikimedia.org/r/276105 (https://phabricator.wikimedia.org/T123675) (owner: Giuseppe Lavagetto)
[08:26:37] <_joe_> jynus: are you deploying your change?
[08:26:45] <_joe_> because I need to deploy mine
[08:27:07] Your branch is behind 'origin/master' by 2 commits, and can be fast-forwarded.
[08:27:14] I will rebase both
[08:27:26] deploy mine, deploy yours if you tell me so
[08:27:41] <_joe_> just get it to tin, and sync-file yours
[08:27:49] ok, doing
[08:27:50] <_joe_> I'll do mine afterwards :)
[08:29:03] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2035 (duration: 00m 39s)
[08:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:29:11] all yours
[08:29:30] (PS1) ArielGlenn: make wget skip urls with query params when retrieving wikitech dumps [puppet] - https://gerrit.wikimedia.org/r/276109
[08:31:12] (CR) ArielGlenn: [C: 2] make wget skip urls with query params when retrieving wikitech dumps [puppet] - https://gerrit.wikimedia.org/r/276109 (owner: ArielGlenn)
[08:31:27] Operations, ops-codfw: Check bast2001 for hardware problems - https://phabricator.wikimedia.org/T129316#2102153 (MoritzMuehlenhoff)
[08:31:52] <_joe_> elukey: I'm syncing now
[08:32:02] ok!
[08:32:11] !log oblivian@tin Synchronized wmf-config/jobqueue-eqiad.php: re-routing writes from rdb1003 to rdb1005 for reimaging (duration: 00m 36s)
[08:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:33:10] hhvm is not screaming from what I can see, good :)
[08:33:42] <_joe_> connections are going down slow
[08:35:48] _joe_ rdb1005 commands as expected have increased
[08:36:05] <_joe_> but most connections I see now are from jobrunners, we can proceed I think
[08:36:33] <_joe_> all are from jobrunners and videoscalers
[08:36:44] <_joe_> heh I just found a problem in our procedure, fixing it
[08:36:50] <_joe_> I'll go on with the puppet change
[08:37:35] (PS2) Giuseppe Lavagetto: jobqueue_redis: remove temporarily rdb1003 for reimaging [puppet] - https://gerrit.wikimedia.org/r/276106 (https://phabricator.wikimedia.org/T124675)
[08:37:44] (CR) Giuseppe Lavagetto: [C: 2 V: 2] jobqueue_redis: remove temporarily rdb1003 for reimaging [puppet] - https://gerrit.wikimedia.org/r/276106 (https://phabricator.wikimedia.org/T124675) (owner: Giuseppe Lavagetto)
[08:42:18] (PS3) DCausse: Enable completion suggester as default on all but top 12 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/275605 (https://phabricator.wikimedia.org/T128775) (owner: EBernhardson)
[08:43:15] Operations, Performance-Team, codfw-rollout, codfw-rollout-Jan-Mar-2016: Dedicate 1/2 codfw jobrunners to gwtoolset jobs - https://phabricator.wikimedia.org/T129317#2102170 (Joe)
[08:43:33] (CR) DCausse: [C: 1] "it was just a rebase" [mediawiki-config] - https://gerrit.wikimedia.org/r/275605 (https://phabricator.wikimedia.org/T128775) (owner: EBernhardson)
[08:48:39] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: Puppet has 56 failures
[08:55:33] (PS1) Muehlenhoff: Fix duplicate declaration of Package[python-designateclient] [puppet] - https://gerrit.wikimedia.org/r/276112
[09:05:38] (PS8) Jcrespo: Prepare db-codfw.php for a live deployment [mediawiki-config] - https://gerrit.wikimedia.org/r/267659
[09:07:09] Operations, ops-codfw: Check bast2001 for hardware problems - https://phabricator.wikimedia.org/T129316#2102153 (ArielGlenn) Sshing over there lands me in busybox, dmesg shows a lot of [14686.650278] sd 0:0:0:0: [sda] Unhandled sense code [14686.650287] sd 0:0:0:0: [sda] [14686.650290] Result: hostbyte...
[09:26:41] !log puppet disabled on rdb1003 for debian re-image. Redis servers going to be stopped too as pre-step for backup.
[09:26:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:30:10] PROBLEM - Apache HTTP on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [09:30:18] PROBLEM - HHVM rendering on mw1119 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.005 second response time [09:32:45] !log rebooting iron for kernel update [09:32:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:34:20] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [09:43:48] (03CR) 10Giuseppe Lavagetto: "While it's obvious that:" [puppet] - 10https://gerrit.wikimedia.org/r/275443 (https://phabricator.wikimedia.org/T125673) (owner: 10Giuseppe Lavagetto) [09:44:38] is there a current issue, I get an error "Errors were encountered while undeleting the file: [09:44:39] The file "mwstore://local-multiwrite/local-public/6/61/GERMAN_TROOPS_ADVANCING_PAST_ABANDONED_AMERICAN_EQUIPMENT.jpg" is in an inconsistent state within the internal storage backends" when trying to undelete at Commons [09:45:00] (03PS1) 10Muehlenhoff: Have linux-meta point to 4.4 and add a new meta package linux-meta-3.19 to explicitly select 3.19 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/276121 [09:45:32] (03CR) 10Mobrovac: [C: 031] parsoid::testing: use master_dc variables [puppet] - 10https://gerrit.wikimedia.org/r/275814 (https://phabricator.wikimedia.org/T124670) (owner: 10Giuseppe Lavagetto) [09:45:57] sDrewth: there shouldn't, but I'm checking [09:46:03] thx godog [09:46:29] PROBLEM - Apache HTTP on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:47:28] PROBLEM - HHVM rendering on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:47:29] PROBLEM - SSH on mw1140 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:47:32] (03CR) 10Mobrovac: 
[C: 031] iegreview: use $parsoid_primary [puppet] - 10https://gerrit.wikimedia.org/r/275539 (https://phabricator.wikimedia.org/T125673) (owner: 10Giuseppe Lavagetto) [09:47:40] (03CR) 10Muehlenhoff: [C: 032 V: 032] Have linux-meta point to 4.4 and add a new meta package linux-meta-3.19 to explicitly select 3.19 [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/276121 (owner: 10Muehlenhoff) [09:47:48] PROBLEM - Check size of conntrack table on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:49] PROBLEM - DPKG on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:08] PROBLEM - Disk space on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:08] PROBLEM - RAID on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:09] PROBLEM - salt-minion processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:18] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [09:48:38] PROBLEM - dhclient process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:38] PROBLEM - HHVM processes on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:48:46] (03CR) 10Mobrovac: [C: 031] mobileapps: point to $rb_primary, not to the local restbase cluster [puppet] - 10https://gerrit.wikimedia.org/r/275538 (owner: 10Giuseppe Lavagetto) [09:48:49] PROBLEM - configured eth on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:49:18] PROBLEM - nutcracker process on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:51:39] 6Operations, 10media-storage: Unable to undelete file - https://phabricator.wikimedia.org/T129212#2102319 (10fgiunchedi) 5Resolved>3Open reopening as reported on irc for commons ``` 09:44 is there a current issue, I get an error "Errors were encountered while undeleting the fi... 
[09:51:48] (03CR) 10Mobrovac: cxserver: use $rb_primary in configuring restbase urls (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/275537 (https://phabricator.wikimedia.org/T125065) (owner: 10Giuseppe Lavagetto) [09:52:20] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: puppet fail [09:52:29] PROBLEM - nutcracker port on mw1140 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:53:49] RECOVERY - dhclient process on mw1140 is OK: PROCS OK: 0 processes with command name dhclient [09:53:49] RECOVERY - HHVM processes on mw1140 is OK: PROCS OK: 6 processes with command name hhvm [09:53:58] RECOVERY - configured eth on mw1140 is OK: OK - interfaces up [09:54:08] RECOVERY - nutcracker port on mw1140 is OK: TCP OK - 0.000 second response time on port 11212 [09:54:19] RECOVERY - SSH on mw1140 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.6 (protocol 2.0) [09:54:20] RECOVERY - nutcracker process on mw1140 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [09:54:38] RECOVERY - Check size of conntrack table on mw1140 is OK: OK: nf_conntrack is 0 % full [09:54:40] RECOVERY - DPKG on mw1140 is OK: All packages OK [09:54:59] RECOVERY - Disk space on mw1140 is OK: DISK OK [09:54:59] RECOVERY - RAID on mw1140 is OK: OK: no RAID installed [09:54:59] RECOVERY - salt-minion processes on mw1140 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:56:26] 6Operations, 10media-storage: Unable to undelete file - https://phabricator.wikimedia.org/T129212#2098179 (10Billinghurst) Godog asks that I put the fuller information that I have. Trying to undelete https://commons.wikimedia.org/w/index.php?title=File:GERMAN_TROOPS_ADVANCING_PAST_ABANDONED_AMERICAN_EQUIPMENT... 
[09:57:47] (03CR) 10Mobrovac: [C: 031] restbase: make restbase configuration $master_dc (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/275536 (https://phabricator.wikimedia.org/T126235) (owner: 10Giuseppe Lavagetto) [09:58:57] 6Operations, 6Commons, 10media-storage: Unable to undelete file - https://phabricator.wikimedia.org/T129212#2102342 (10Billinghurst) [10:01:15] (03CR) 10Mobrovac: [C: 031] "Makes sense, _joe_. +1'ing the plan :)" [puppet] - 10https://gerrit.wikimedia.org/r/275443 (https://phabricator.wikimedia.org/T125673) (owner: 10Giuseppe Lavagetto) [10:11:15] 6Operations, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2102371 (10Southparkfan) Not very important, but is it possible to enable OPcache in PHP? Even if you only allow 64MB of cache usage, this should make the wiki... [10:13:16] (03PS1) 10Jcrespo: Avoid infinite loops when using circular replication [software/tendril] - 10https://gerrit.wikimedia.org/r/276127 (https://phabricator.wikimedia.org/T119642) [10:14:20] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [10:14:51] (03PS2) 10Jcrespo: Avoid infinite loops when using circular replication [software/tendril] - 10https://gerrit.wikimedia.org/r/276127 (https://phabricator.wikimedia.org/T119642) [10:15:12] I am thinking of breaking tendril to solve an issue, everybody ok with that? 
[10:15:19] ^volans [10:16:53] * volans looking [10:17:12] not with the fix, with breaking tendril [10:19:39] (03CR) 10Volans: [C: 04-1] Avoid infinite loops when using circular replication (031 comment) [software/tendril] - 10https://gerrit.wikimedia.org/r/276127 (https://phabricator.wikimedia.org/T119642) (owner: 10Jcrespo) [10:20:29] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:20:38] true, nice catch [10:20:55] with breaking tendril I'm ok, we cannot be blocked by tendril, although we should find a solution :) [10:21:09] (03PS3) 10Jcrespo: Avoid infinite loops when using circular replication [software/tendril] - 10https://gerrit.wikimedia.org/r/276127 (https://phabricator.wikimedia.org/T119642) [10:21:24] well, that is the solution^ [10:21:50] I just expect unexpected consequences when I deploy it [10:22:28] (03CR) 10Volans: [C: 031] "LGTM" [software/tendril] - 10https://gerrit.wikimedia.org/r/276127 (https://phabricator.wikimedia.org/T119642) (owner: 10Jcrespo) [10:22:40] (03CR) 10Jcrespo: [C: 032] Avoid infinite loops when using circular replication [software/tendril] - 10https://gerrit.wikimedia.org/r/276127 (https://phabricator.wikimedia.org/T119642) (owner: 10Jcrespo) [10:22:53] (03CR) 10Jcrespo: [V: 032] Avoid infinite loops when using circular replication [software/tendril] - 10https://gerrit.wikimedia.org/r/276127 (https://phabricator.wikimedia.org/T119642) (owner: 10Jcrespo) [10:22:55] (03CR) 10Addshore: [C: 031] Whitelist feeds included on Wikimedia Germany Engineering page on mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) (owner: 10WMDE-leszek) [10:23:38] I do not remember at all how that is deployed, will try to discover it [10:27:10] ACKNOWLEDGEMENT - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0; xe-0/0/1: down - 
Core: cr1-ulsfo:xe-1/2/0 (Telia, IC-313592, 51ms) {#11372} [10Gbps wave] Filippo Giunchedi telia planned maint 6-12 UTC [10:27:10] ACKNOWLEDGEMENT - Router interfaces on cr1-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/2/0: down - Core: cr1-eqord:xe-0/0/1 (Telia, IC-313592, 51ms) {#1502} [10Gbps wave] Filippo Giunchedi telia planned maint 6-12 UTC [10:28:39] puppet seems to deploy the latest version automatically [10:29:13] WCGW! [10:29:34] it is tendril [10:29:36] I mean [10:35:45] !log setting up master-master replication on tools (labsdb100[45]) [10:35:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:37:36] I think with the recursive tree walking implementation, the algorithm is depth-first [10:37:48] which may lead to some interesting graphs [10:39:13] !log puppet disabled on rdb1004 (plus Redis instances) as precautionary step for master reimage - rdb1003 [10:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:40:33] (03PS1) 10Filippo Giunchedi: cassandra: add restbase1011-b instance [puppet] - 10https://gerrit.wikimedia.org/r/276130 [10:42:01] labsdb1004 -> masters: labsdb1005 slaves: labsdb1005 [10:42:10] labsdb1005 -> masters: labsdb1004 slaves: labsdb1004 [10:42:16] yay, it works [10:42:43] neat! good job [10:43:12] I knew boring "algorithms/graph theory" uni classes would be useful one day [10:44:42] if someone has suggestions for simple graph plotting libraries to substitute the current one, please speak up [10:45:14] jynus: I'll quote Mr. Perlis: "Computation has made the tree flower."
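(Editor's illustration, not part of the channel traffic.) The tendril/dbtree change discussed above is about walking a replication topology that now contains a cycle. The following is a minimal sketch of that idea, not the actual tendril code: a depth-first walk that remembers which hosts it has already seen. The `topology` mapping and the `walk_replicas` function are hypothetical names; only the two hostnames come from the log.

```python
def walk_replicas(topology, root):
    """Return hosts reachable from root in depth-first order, each one once."""
    visited = set()
    order = []

    def visit(host):
        visited.add(host)
        order.append(host)
        for replica in topology.get(host, []):
            if replica in visited:
                # Already seen (for example the other half of a master-master
                # pair): skip this one replica, keep examining its siblings.
                continue
            visit(replica)

    visit(root)
    return order


# Hypothetical topology: the two hosts replicate from each other.
topology = {
    "labsdb1004": ["labsdb1005"],
    "labsdb1005": ["labsdb1004"],
}
print(walk_replicas(topology, "labsdb1004"))
```

Without the `visited` check the walk would bounce between the two hosts until the recursion limit is hit. Using `continue` rather than returning early matters as well: an already-visited replica should not prevent the rest of the list from being examined.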
[10:46:08] PROBLEM - Host rdb1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:46:16] ahhhhh sorryyy icingaaa [10:46:19] this is me [10:46:20] all good [10:46:33] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1011-b instance [puppet] - 10https://gerrit.wikimedia.org/r/276130 (owner: 10Filippo Giunchedi) [10:47:29] RECOVERY - Host rdb1003 is UP: PING OK - Packet loss = 0%, RTA = 0.48 ms [10:56:33] (03PS1) 10Jcrespo: Avoid infinite loops when using circular replication [software/dbtree] - 10https://gerrit.wikimedia.org/r/276134 (https://phabricator.wikimedia.org/T119642) [10:57:26] (03CR) 10Jcrespo: [C: 032 V: 032] Avoid infinite loops when using circular replication [software/dbtree] - 10https://gerrit.wikimedia.org/r/276134 (https://phabricator.wikimedia.org/T119642) (owner: 10Jcrespo) [11:01:16] I am running puppet on mw1152 [11:04:23] !log deployed manually new code at dbtree (noc.wikimedia.org) [11:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:05:37] PROBLEM - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is CRITICAL: Connection refused [11:05:47] that's me ^ silencing [11:06:19] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is CRITICAL: Connection refused Filippo Giunchedi bootstrapping [11:08:00] 6Operations, 6Commons, 10media-storage: Unable to undelete file - https://phabricator.wikimedia.org/T129212#2102502 (10fgiunchedi) 5Open>3Resolved resolving again, this should be fully fixed now. The issue was that after merging https://gerrit.wikimedia.org/r/272922 another `swiftrepl` run was needed to... [11:08:40] 6Operations, 10DBA, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a Master-master topology between datacenters for easier failover (setup circular replication dallas -> eqiad for mysql databases) - https://phabricator.wikimedia.org/T119642#2102505 (10jcrespo) Tendril and dbtree... 
[11:11:57] 6Operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a Master-master topology between datacenters for easier failover (setup circular replication dallas -> eqiad for mysql databases) - https://phabricator.wikimedia.org/T119642#2102514 (10jcrespo) [11:16:48] 6Operations, 10Traffic: Fix puppet on deployment-cache* hosts in beta labs - https://phabricator.wikimedia.org/T129270#2102515 (10ema) I've seen the same issue on my test instance in labs. @Ottomata try adding codfw: 127.0.0.1 to cache::text::nodes in ./hieradata/labs.yaml [11:20:21] !log hhvm restarted on mw1140 [11:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:20:47] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail [11:21:27] RECOVERY - HHVM rendering on mw1140 is OK: HTTP OK: HTTP/1.1 200 OK - 67640 bytes in 2.382 second response time [11:21:36] RECOVERY - Apache HTTP on mw1140 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.534 second response time [11:24:16] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [11:27:13] !log setting up master-master cross-datacenter replication for s2 (db1018-db2017) [11:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:27:59] 6Operations, 10Traffic: Fix puppet on deployment-cache* hosts in beta labs - https://phabricator.wikimedia.org/T129270#2102532 (10ema) p:5Triage>3Normal [11:29:32] I said I was going to break tendril, and I delivered! https://dbtree.wikimedia.org/ [11:32:25] 6Operations, 10RESTBase, 13Patch-For-Review: install restbase1010-restbase1015 - https://phabricator.wikimedia.org/T128107#2102560 (10fgiunchedi) status update: restbase1010 is in service with two instances restbase1011 is bootstrapping its second instance (ETA 7-8h) after that's done we can start decommiss... 
[11:35:25] lol [11:35:28] (03PS1) 10Jcrespo: Fix other slaves not being examined when one was already visited [software/tendril] - 10https://gerrit.wikimedia.org/r/276137 (https://phabricator.wikimedia.org/T119642) [11:35:58] (03CR) 10Ema: "This one should also be a 405 then:" [puppet] - 10https://gerrit.wikimedia.org/r/275916 (owner: 10BBlack) [11:36:00] (03PS2) 10Filippo Giunchedi: (temporarily) enable thrift rpc in staging [puppet] - 10https://gerrit.wikimedia.org/r/275917 (https://phabricator.wikimedia.org/T125906) (owner: 10Eevans) [11:36:06] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] (temporarily) enable thrift rpc in staging [puppet] - 10https://gerrit.wikimedia.org/r/275917 (https://phabricator.wikimedia.org/T125906) (owner: 10Eevans) [11:36:26] (03CR) 10Jcrespo: [C: 032 V: 032] Fix other slaves not being examined when one was already visited [software/tendril] - 10https://gerrit.wikimedia.org/r/276137 (https://phabricator.wikimedia.org/T119642) (owner: 10Jcrespo) [11:39:49] !log rolling restart cassandra in staging after merging https://gerrit.wikimedia.org/r/#/c/275917/ [11:39:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:43:28] (03PS1) 10Jcrespo: Fix other slaves not being examined when one was already visited [software/dbtree] - 10https://gerrit.wikimedia.org/r/276140 (https://phabricator.wikimedia.org/T119642) [11:43:49] (03CR) 10Jcrespo: [C: 032 V: 032] Fix other slaves not being examined when one was already visited [software/dbtree] - 10https://gerrit.wikimedia.org/r/276140 (https://phabricator.wikimedia.org/T119642) (owner: 10Jcrespo) [11:44:40] (03PS1) 10Muehlenhoff: * Add a versioned dependency on the updated firmware package [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/276141 [11:45:09] (03CR) 10Muehlenhoff: [C: 032 V: 032] * Add a versioned dependency on the updated firmware package [debs/linux-meta] - 10https://gerrit.wikimedia.org/r/276141 (owner: 10Muehlenhoff) [11:45:35] PROBLEM - puppet 
last run on db2046 is CRITICAL: CRITICAL: puppet fail [11:46:42] 6Operations, 10DBA, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Create a Master-master topology between datacenters for easier failover (setup circular replication dallas -> eqiad for mysql databases) - https://phabricator.wikimedia.org/T119642#2102740 (10jcrespo) [11:47:51] !log uploaded firmware-nonfree 20151018 to jessie-wikimedia on carbon [11:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:55:22] 6Operations, 6Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2102752 (10fgiunchedi) @kaldari @tgr I think we might be close to unblocking thumbnail generation in beta since there's now a swift cluster there. I've outlined what I think the next steps are... [11:58:20] (03Abandoned) 10Elukey: Remove rdb1003.eqiad from the Redis Job Queues for maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275585 (https://phabricator.wikimedia.org/T128730) (owner: 10Elukey) [12:00:51] !log re-enabled puppet on rdb1004 [12:00:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:04:25] (03PS1) 10Elukey: Revert "Reroute jobqueue writes from rdb1003 to rdb1005" to put back rdb1003 in service after maintenance. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276143 [12:05:52] (03CR) 10Elukey: [C: 032] Revert "Reroute jobqueue writes from rdb1003 to rdb1005" to put back rdb1003 in service after maintenance. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/276143 (owner: 10Elukey) [12:07:42] !log elukey@tin Synchronized wmf-config/jobqueue-eqiad.php: Add rdb1003 back to the Redis JobQueue pool after maintenance (duration: 00m 34s) [12:07:44] (03PS1) 10ArielGlenn: filter out known output for wikitech dumps copy cron job [puppet] - 10https://gerrit.wikimedia.org/r/276144 [12:07:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:09:34] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /page/revision/{revision} (Get rev by ID) is CRITICAL: Test Get rev by ID returned the unexpected status 500 (expecting: 200) [12:10:17] (03CR) 10ArielGlenn: [C: 032] filter out known output for wikitech dumps copy cron job [puppet] - 10https://gerrit.wikimedia.org/r/276144 (owner: 10ArielGlenn) [12:10:29] godog: probably because of the cass restart ^ ? [12:12:01] mobrovac: yeah could be [12:12:05] RECOVERY - puppet last run on db2046 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [12:14:03] PROBLEM - RAID on sca2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:14:54] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [12:15:31] 6Operations, 10Mail, 10Wikipedia-Store: why is shop@ -> board@ ? - https://phabricator.wikimedia.org/T127503#2102788 (10Aklapper) Adding Pats, maybe she can answer the last question. [12:15:45] RECOVERY - RAID on sca2001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [12:17:50] is someone working on sc[ab]200x ? [12:18:23] reports of network issues from germany [12:18:31] in -tech and #wikipedia [12:19:30] 6Operations, 13Patch-For-Review: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2102795 (10faidon) Just checked dmesg. Lots of medium errors — sda is clearly failed which is what makes the installer be so slow. 
[12:20:36] (03PS1) 10Elukey: Revert "jobqueue_redis: remove temporarily rdb1003 for reimaging" to put rdb1003 back into the game after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/276147 [12:20:57] 6Operations, 13Patch-For-Review: reinstall bast2001 with jessie - https://phabricator.wikimedia.org/T128899#2089630 (10MoritzMuehlenhoff) JFTR, I opened https://phabricator.wikimedia.org/T129316 for this earlier the day. [12:21:10] (03PS2) 10Elukey: Revert "jobqueue_redis: remove temporarily rdb1003 for reimaging" to put rdb1003 back into the game after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/276147 [12:23:27] (03CR) 10Elukey: [C: 032] Revert "jobqueue_redis: remove temporarily rdb1003 for reimaging" to put rdb1003 back into the game after maintenance. [puppet] - 10https://gerrit.wikimedia.org/r/276147 (owner: 10Elukey) [12:24:34] PROBLEM - RAID on sca2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:25:00] !log added rdb1003 back to the Job Runners queue. All the jobchron processes on jobrunners/videoscalers need to be restarted. 
[12:25:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:26:18] (03CR) 10Filippo Giunchedi: [C: 031] Add ferm rules for carbon (python) [puppet] - 10https://gerrit.wikimedia.org/r/275833 (owner: 10Muehlenhoff) [12:27:33] (03CR) 10Filippo Giunchedi: [C: 031] Add ferm rules for carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/275830 (owner: 10Muehlenhoff) [12:28:00] (03CR) 10Filippo Giunchedi: [C: 031] Add ferm rules for statsdlb [puppet] - 10https://gerrit.wikimedia.org/r/275829 (owner: 10Muehlenhoff) [12:28:05] RECOVERY - RAID on sca2001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0 [12:28:54] 6Operations, 10Traffic: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2102820 (10faidon) [12:35:08] 6Operations, 10Traffic: varnishkafka logrotate cronspam - https://phabricator.wikimedia.org/T129344#2102834 (10elukey) a:3elukey [12:36:08] (03PS2) 10Muehlenhoff: Add ferm rules for statsdlb [puppet] - 10https://gerrit.wikimedia.org/r/275829 [12:36:30] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for statsdlb [puppet] - 10https://gerrit.wikimedia.org/r/275829 (owner: 10Muehlenhoff) [12:37:13] (03PS2) 10Muehlenhoff: Add ferm rules for carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/275830 [12:37:31] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for carbon-c-relay [puppet] - 10https://gerrit.wikimedia.org/r/275830 (owner: 10Muehlenhoff) [12:38:00] (03PS2) 10Muehlenhoff: Add ferm rules for carbon (python) [puppet] - 10https://gerrit.wikimedia.org/r/275833 [12:38:14] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for carbon (python) [puppet] - 10https://gerrit.wikimedia.org/r/275833 (owner: 10Muehlenhoff) [12:46:20] (03PS1) 10Muehlenhoff: Enable ferm on graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/276152 [12:46:54] !log set disable on cr2-knams:xe-1/2/0 (Init7), issues with at least one large network [12:46:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:51:16] (03PS1) 10Mobrovac: Introduce the SC[AB] clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) [12:55:08] (03CR) 10Mobrovac: "This patch definitely needs OpsEns eyes." [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) (owner: 10Mobrovac) [12:58:39] 6Operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 7WorkType-NewFunctionality: Xvfb service does not start on Nodepool instances - https://phabricator.wikimedia.org/T129345#2102862 (10hashar) [13:02:49] 6Operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 7WorkType-NewFunctionality: Xvfb service does not start on Nodepool instances - https://phabricator.wikimedia.org/T129345#2102878 (10hashar) Indeed after rebooting integration-slave-jessie1001: ``` $ systemctl status xvf... [13:11:12] (03CR) 10Alexandros Kosiaris: "One more round of comments. Tbh, I am not sure why the CI environment just needs to build the dependencies and not even run some checks ag" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [13:16:37] 6Operations, 6Editing-Department, 6Parsing-Team, 6Services: Services team goals April - June 2016 (Q4 2015/16) - https://phabricator.wikimedia.org/T118871#2102891 (10mobrovac) [13:17:16] 6Operations, 6Editing-Department, 6Parsing-Team, 6Services: Services team goals April - June 2016 (Q4 2015/16) - https://phabricator.wikimedia.org/T118871#2102895 (10mobrovac) I added the //Focus (2)// goal. Now, we need to decide on which goal to drop. 
[13:19:27] !log installing beanshell security updates [13:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:22:14] 6Operations, 6Editing-Department, 6Parsing-Team, 6Services: Services team goals April - June 2016 (Q4 2015/16) - https://phabricator.wikimedia.org/T118871#2102899 (10mobrovac) [13:24:38] 6Operations: 4.4 Linux kernel - https://phabricator.wikimedia.org/T126320#2102902 (10MoritzMuehlenhoff) 4.4 kernel uploaded to carbon. The only thing still missing in perf for 4.4. [13:32:41] (03PS1) 10Muehlenhoff: Fix role analytics_cluster::client in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/276156 [13:36:05] RECOVERY - Router interfaces on cr1-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 72, down: 0, dormant: 0, excluded: 0, unused: 0 [13:37:25] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [13:46:41] (03CR) 10Faidon Liambotis: [C: 031] VCL: switch some 403s to 404 or 405 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/275916 (owner: 10BBlack) [13:48:02] (03PS1) 10Joal: Add rsync job for unique_devices dataset. 
[puppet] - 10https://gerrit.wikimedia.org/r/276158 (https://phabricator.wikimedia.org/T126767) [13:57:36] no [13:57:39] oops :) [13:57:44] (03PS1) 10Muehlenhoff: Remove Django from statistic::compute package list [puppet] - 10https://gerrit.wikimedia.org/r/276160 [14:04:41] (03CR) 10Filippo Giunchedi: [C: 031] Enable ferm on graphite2001 [puppet] - 10https://gerrit.wikimedia.org/r/276152 (owner: 10Muehlenhoff) [14:04:50] (03PS1) 10Joal: Use generic job class to rsync cron datasets [puppet] - 10https://gerrit.wikimedia.org/r/276163 [14:05:05] (03CR) 10Hashar: Services: introduce service::packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [14:08:23] (03CR) 10Krinkle: VCL: switch some 403s to 404 or 405 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/275916 (owner: 10BBlack) [14:09:07] (03PS4) 10Mobrovac: Services: introduce service::packages [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) [14:09:23] (03PS2) 10Volans: Repool es200[124] after data migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275166 (https://phabricator.wikimedia.org/T127330) [14:09:41] 6Operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 7WorkType-NewFunctionality: Xvfb service does not start on Nodepool instances - https://phabricator.wikimedia.org/T129345#2103037 (10hashar) And the xvfb puppet stanza in modules/xvfb/manifests/init.pp is: ``` lang=ruby... 
[14:10:34] (03CR) 10Mobrovac: Services: introduce service::packages (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [14:14:46] !log installing Django security updates [14:14:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:15:04] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: Puppet has 1 failures [14:16:47] (03CR) 10Ottomata: [C: 032 V: 032] Fix role analytics_cluster::client in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/276156 (owner: 10Muehlenhoff) [14:17:07] thanks moritzm^ :) [14:18:53] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:19:27] ^that's the currently ongoing installation of the django update on labmon1001 [14:19:29] (03CR) 10Filippo Giunchedi: Set cross-DC swift writes to be sync for originals for switchover testing (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276071 (owner: 10Aaron Schulz) [14:20:18] (03PS1) 10Joal: Upgrade camus and camus-checker jar versions [puppet] - 10https://gerrit.wikimedia.org/r/276164 [14:21:05] (03CR) 10Ottomata: [C: 031] "+1, Ariel?" [puppet] - 10https://gerrit.wikimedia.org/r/276158 (https://phabricator.wikimedia.org/T126767) (owner: 10Joal) [14:22:44] (03CR) 10Ottomata: [C: 031] "Nice! Ariel?" [puppet] - 10https://gerrit.wikimedia.org/r/276163 (owner: 10Joal) [14:24:48] (03PS1) 10Joal: Make camus overwrite exisiting files [puppet] - 10https://gerrit.wikimedia.org/r/276165 (https://phabricator.wikimedia.org/T128611) [14:25:01] 6Operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#2103076 (10jcrespo) Tickets to read before starting this: T22757. [14:27:20] (03CR) 10ArielGlenn: "Can this not use the exact same minute as other jobs? It would be nice to stagger them a bit." 
[puppet] - 10https://gerrit.wikimedia.org/r/276158 (https://phabricator.wikimedia.org/T126767) (owner: 10Joal) [14:27:41] (03CR) 10Andrew Bogott: [C: 032 V: 032] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/276112 (owner: 10Muehlenhoff) [14:27:47] (03PS2) 10Andrew Bogott: Fix duplicate declaration of Package[python-designateclient] [puppet] - 10https://gerrit.wikimedia.org/r/276112 (owner: 10Muehlenhoff) [14:27:59] 6Operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#2103083 (10jcrespo) p:5Normal>3Low @Krenair: yes, then we can close the ticket. [14:29:26] (03PS3) 10Volans: Rebalance external storage in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275166 (https://phabricator.wikimedia.org/T127330) [14:29:31] (03CR) 10Andrew Bogott: "(feel free to merge without me, otherwise I will merge as soon as I'm sitting someplace)" [puppet] - 10https://gerrit.wikimedia.org/r/276112 (owner: 10Muehlenhoff) [14:29:45] 6Operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 7WorkType-NewFunctionality: Xvfb service does not start on Nodepool instances - https://phabricator.wikimedia.org/T129345#2103087 (10hashar) I have dig in the systemctl documentation and the unit needs to be enabled: ```... [14:30:22] andrewbogott: ok, will go ahead and merge [14:30:33] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix duplicate declaration of Package[python-designateclient] [puppet] - 10https://gerrit.wikimedia.org/r/276112 (owner: 10Muehlenhoff) [14:31:29] (03CR) 10ArielGlenn: "Is the pageviews job meant to be removed?" [puppet] - 10https://gerrit.wikimedia.org/r/276163 (owner: 10Joal) [14:31:31] (03CR) 10Ottomata: "How do you know its not used? 
It is probably a historic leftover, but folks mostly wanted these things installed for their own projects i" [puppet] - 10https://gerrit.wikimedia.org/r/276160 (owner: 10Muehlenhoff) [14:33:43] (03CR) 10Muehlenhoff: "I checked for running processes using Django whenever I was installing security on stat* and there were any around. But maybe these are al" [puppet] - 10https://gerrit.wikimedia.org/r/276160 (owner: 10Muehlenhoff) [14:33:55] RECOVERY - puppet last run on californium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:34:44] RECOVERY - DPKG on labmon1001 is OK: All packages OK [14:36:13] moritzm: isn't python-django just a lib that folks could use to build a python webapp? [14:36:36] would think processes would not be called 'django' [14:36:39] but still, probably no one uses it [14:36:46] 6Operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 7WorkType-NewFunctionality: Xvfb service does not start on Nodepool instances - https://phabricator.wikimedia.org/T129345#2103101 (10hashar) Now that I have manually enabled the unit, it is started on boot (was not the c... 
[14:38:19] (03PS2) 10Joal: Make camus overwrite exisiting files [puppet] - 10https://gerrit.wikimedia.org/r/276165 (https://phabricator.wikimedia.org/T128611) [14:38:44] (03PS2) 10Ottomata: Upgrade camus and camus-checker jar versions [puppet] - 10https://gerrit.wikimedia.org/r/276164 (owner: 10Joal) [14:38:55] (03CR) 10Ottomata: [C: 032 V: 032] Upgrade camus and camus-checker jar versions [puppet] - 10https://gerrit.wikimedia.org/r/276164 (owner: 10Joal) [14:39:02] (03PS3) 10Ottomata: Make camus overwrite exisiting files [puppet] - 10https://gerrit.wikimedia.org/r/276165 (https://phabricator.wikimedia.org/T128611) (owner: 10Joal) [14:39:11] (03CR) 10Ottomata: [C: 032 V: 032] Make camus overwrite exisiting files [puppet] - 10https://gerrit.wikimedia.org/r/276165 (https://phabricator.wikimedia.org/T128611) (owner: 10Joal) [14:41:34] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:43:21] thanks moritzm [14:44:38] (03PS1) 10Hashar: xvfb: start service at boot time [puppet] - 10https://gerrit.wikimedia.org/r/276169 (https://phabricator.wikimedia.org/T129345) [14:45:52] (03CR) 10Joal: "Estimated size of dataset: +/-1Mo per year" [puppet] - 10https://gerrit.wikimedia.org/r/276158 (https://phabricator.wikimedia.org/T126767) (owner: 10Joal) [14:45:59] (03PS2) 10Ottomata: Remove Django from statistic::compute package list [puppet] - 10https://gerrit.wikimedia.org/r/276160 (owner: 10Muehlenhoff) [14:46:09] (03CR) 10Ottomata: [C: 032 V: 032] Remove Django from statistic::compute package list [puppet] - 10https://gerrit.wikimedia.org/r/276160 (owner: 10Muehlenhoff) [14:46:36] (03CR) 10Jcrespo: [C: 031] "Great work!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/275166 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [14:47:28] (03PS1) 10Muehlenhoff: Enable ferm on eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/276170 [14:47:30] (03CR) 10Volans: [C: 032] Rebalance external storage in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275166 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [14:48:33] (03CR) 10Alexandros Kosiaris: [C: 032] Services: introduce service::packages [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [14:48:57] (03CR) 10Alexandros Kosiaris: Services: introduce service::packages (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [14:49:12] (03PS5) 10Alexandros Kosiaris: Services: introduce service::packages [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [14:49:17] (03CR) 10Alexandros Kosiaris: [V: 032] Services: introduce service::packages [puppet] - 10https://gerrit.wikimedia.org/r/274675 (https://phabricator.wikimedia.org/T128280) (owner: 10Mobrovac) [14:50:15] (03CR) 10Alexandros Kosiaris: [C: 032] xvfb: start service at boot time [puppet] - 10https://gerrit.wikimedia.org/r/276169 (https://phabricator.wikimedia.org/T129345) (owner: 10Hashar) [14:50:20] (03PS2) 10Alexandros Kosiaris: xvfb: start service at boot time [puppet] - 10https://gerrit.wikimedia.org/r/276169 (https://phabricator.wikimedia.org/T129345) (owner: 10Hashar) [14:50:29] (03PS2) 10Joal: Add rsync job for unique_devices dataset. 
[puppet] - 10https://gerrit.wikimedia.org/r/276158 (https://phabricator.wikimedia.org/T126767) [14:50:31] (03Merged) 10jenkins-bot: Rebalance external storage in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275166 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [14:50:49] (03CR) 10Alexandros Kosiaris: [V: 032] xvfb: start service at boot time [puppet] - 10https://gerrit.wikimedia.org/r/276169 (https://phabricator.wikimedia.org/T129345) (owner: 10Hashar) [14:52:46] (03PS1) 10Bmansurov: Change LanguageOverlay bucket rates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276172 (https://phabricator.wikimedia.org/T128917) [14:53:26] (03PS3) 10Joal: Add rsync job for unique_devices dataset. [puppet] - 10https://gerrit.wikimedia.org/r/276158 (https://phabricator.wikimedia.org/T126767) [14:54:12] anyone getting docserver-http: HTTP 404 in VE? [14:54:33] !log volans@tin Synchronized wmf-config/db-codfw.php: Rebalance external storage servers in codfw T127330 (duration: 00m 41s) [14:54:35] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330 [14:54:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:44] (03CR) 10Joal: "Changed execution from minute 51 to minute 31" [puppet] - 10https://gerrit.wikimedia.org/r/276158 (https://phabricator.wikimedia.org/T126767) (owner: 10Joal) [14:55:14] kart_, which wiki? [14:56:23] jynus: enwiki [14:56:35] (03CR) 10Joal: "Pageviews job is a historical uncleaned class. cron::pageviews is not referenced anywhere, pageviews are synchronized using job class." 
[puppet] - 10https://gerrit.wikimedia.org/r/276163 (owner: 10Joal) [14:56:43] (03Abandoned) 10Muehlenhoff: Enable ferm on eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/276170 (owner: 10Muehlenhoff) [14:56:45] (03PS2) 10Joal: Use generic job class to rsync cron datasets [puppet] - 10https://gerrit.wikimedia.org/r/276163 [14:57:04] (will check again after some time) [14:57:19] (03PS4) 10Ottomata: Enable base::firewall on eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/274715 (https://phabricator.wikimedia.org/T113343) (owner: 10Muehlenhoff) [14:57:34] (03PS2) 10ArielGlenn: explicitly set perms on the empty directory for rsync deletes [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/275805 [14:57:47] (03CR) 10Ottomata: [C: 032 V: 032] Enable base::firewall on eventlog1001 [puppet] - 10https://gerrit.wikimedia.org/r/274715 (https://phabricator.wikimedia.org/T113343) (owner: 10Muehlenhoff) [14:58:45] (03CR) 10ArielGlenn: [V: 032] explicitly set perms on the empty directory for rsync deletes [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/275805 (owner: 10ArielGlenn) [14:58:49] (03PS3) 10Joal: Use generic job class to rsync cron datasets [puppet] - 10https://gerrit.wikimedia.org/r/276163 [14:59:35] (03PS3) 10ArielGlenn: make timeout in rsync header a parameter, increase value for stats rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/275787 (https://phabricator.wikimedia.org/T127514) [14:59:36] 6Operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 13Patch-For-Review, 7WorkType-NewFunctionality: Xvfb service does not start on Nodepool instances - https://phabricator.wikimedia.org/T129345#2103172 (10hashar) Now puppet seems to enable xvfb properly: ``` Debug: Exec... 
[14:59:43] (03CR) 10Mobrovac: "FTR, the proper way of deploying is:" [puppet] - 10https://gerrit.wikimedia.org/r/275853 (https://phabricator.wikimedia.org/T128237) (owner: 10Mholloway) [14:59:49] oh, moritzm, i just noticed the if $hostname == eventlog1001 [14:59:53] i think we can just put it on both hosts [14:59:59] eventlog2001 has never really been used [15:00:02] and isn't running anything [15:00:06] so we might as well apply it there now [15:00:30] and, puppet has run on eventlog1001 [15:00:33] lemme know if you see anything [15:00:37] looks good to me from here [15:01:35] (03PS4) 10ArielGlenn: make timeout in rsync header a parameter, increase value for stats rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/275787 (https://phabricator.wikimedia.org/T127514) [15:02:06] ok, I'll just drop the conditional once we're confident 1001 works, I've added logging rules now [15:03:13] 6Operations, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2103178 (10Andrew) I'm happy to switch over, as soon as daniel removes the WIP from https://gerrit.wikimedia.org/r/#/c/276088 [15:03:38] ottomata: so far just the usual PXE/bootp noise [15:03:41] (03CR) 10Andrew Bogott: [C: 031] "This can be merged as soon as Alex says he's ready." [dns] - 10https://gerrit.wikimedia.org/r/276088 (https://phabricator.wikimedia.org/T126385) (owner: 10Dzahn) [15:04:27] k cool [15:04:33] i'm sure its fine [15:05:51] ottomata: one bit: the current rules for the zeromq legacy stream are limited to hafnium, but there's been dropped packets from graphite1001 [15:06:04] (port 8600) [15:06:22] (03CR) 10BBlack: [C: 031] Add basic support for varnishtest [puppet] - 10https://gerrit.wikimedia.org/r/275779 (https://phabricator.wikimedia.org/T128188) (owner: 10Ema) [15:07:04] hm, looking [15:07:43] coal... 
[15:08:40] mobrovac: role::performance [15:08:41] ottomata: also, there's a typo in the mediawiki_exceptions_logging rules, needs to be udp instead of udp. I'm stopping ferm and fix the patch [15:08:42] class ::coal [15:08:47] 6Operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Scaling, 13Patch-For-Review, 7WorkType-NewFunctionality: Xvfb service does not start on Nodepool instances - https://phabricator.wikimedia.org/T129345#2103195 (10hashar) 5Open>3Resolved ``` jenkins@ci-jessie-wikimedia-47306... [15:08:56] endpoint => 'tcp://eventlogging.eqiad.wmnet:8600', [15:09:10] 'eventlogging.eqiad.wmnet' ?! [15:09:35] huh [15:09:35] eventlogging.eqiad.wmnet. 3600 IN CNAME eventlog1001.eqiad.wmnet. [15:09:37] didn't know about that one [15:10:55] moritzm: i guess just add graphite1001 [15:11:03] (03PS3) 10Ema: Add basic support for varnishtest [puppet] - 10https://gerrit.wikimedia.org/r/275779 (https://phabricator.wikimedia.org/T128188) [15:11:14] Krinkle: re the "Domain not configured" page - another thing I've noticed is we seem to serve that as a 200 rather than a 404 currently [15:11:30] (03CR) 10Ema: [C: 032 V: 032] Add basic support for varnishtest [puppet] - 10https://gerrit.wikimedia.org/r/275779 (https://phabricator.wikimedia.org/T128188) (owner: 10Ema) [15:11:47] ottomata: yeah, will do that [15:11:55] (03PS1) 10Muehlenhoff: Fix protocol for ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/276175 [15:12:40] (03PS2) 10Muehlenhoff: Fix protocol for ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/276175 [15:13:10] (03PS3) 10Ottomata: Fix protocol for ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/276175 (owner: 10Muehlenhoff) [15:13:17] (03PS2) 10BBlack: VCL: switch some 403s to 404 or 405 [puppet] - 10https://gerrit.wikimedia.org/r/275916 [15:13:19] (03CR) 10Ottomata: [C: 031] Fix protocol for ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/276175 (owner: 10Muehlenhoff) [15:13:33] (03CR) 10BBlack: VCL: switch some 
403s to 404 or 405 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/275916 (owner: 10BBlack) [15:14:00] (03CR) 10BBlack: [C: 032 V: 032] VCL: switch some 403s to 404 or 405 [puppet] - 10https://gerrit.wikimedia.org/r/275916 (owner: 10BBlack) [15:14:57] (03PS1) 10Joal: Remove refinery-hive.jar from auxpath in hive-site [puppet] - 10https://gerrit.wikimedia.org/r/276176 [15:15:17] (03PS5) 10ArielGlenn: make timeout in rsync header a parameter, increase value for stats rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/275787 (https://phabricator.wikimedia.org/T127514) [15:15:46] (03PS2) 10Ottomata: Remove refinery-hive.jar from auxpath in hive-site [puppet] - 10https://gerrit.wikimedia.org/r/276176 (owner: 10Joal) [15:15:56] (03CR) 10Ottomata: [C: 032 V: 032] Remove refinery-hive.jar from auxpath in hive-site [puppet] - 10https://gerrit.wikimedia.org/r/276176 (owner: 10Joal) [15:17:47] (03PS1) 10Muehlenhoff: Also allow access for eventlogging-zmq-legacy-stream to graphite [puppet] - 10https://gerrit.wikimedia.org/r/276178 [15:18:43] bblack: Yeah, it's apache default docroot [15:18:46] regular html response [15:19:12] (03PS6) 10ArielGlenn: make timeout in rsync header a parameter, increase value for stats rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/275787 (https://phabricator.wikimedia.org/T127514) [15:19:25] bblack: We've started centralising some of them in mediawiki-config/erropages [15:19:29] but it's early days [15:19:32] rebase hell [15:19:36] (03PS4) 10Muehlenhoff: Fix protocol for ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/276175 [15:19:46] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix protocol for ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/276175 (owner: 10Muehlenhoff) [15:20:25] right [15:20:42] * apergos walks away for a little while. 
clearly a bad time to get a puppet changeset merged [15:21:47] (03PS1) 10Dereckson: Wikipedia while at Women of the World Festival throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276179 (https://phabricator.wikimedia.org/T124284) [15:26:26] (03PS2) 10Dereckson: Wikipedia while at Women of the World Festival throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276179 (https://phabricator.wikimedia.org/T129342) [15:28:24] (03PS2) 10Muehlenhoff: Also allow access for eventlogging-zmq-legacy-stream to graphite [puppet] - 10https://gerrit.wikimedia.org/r/276178 [15:28:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Also allow access for eventlogging-zmq-legacy-stream to graphite [puppet] - 10https://gerrit.wikimedia.org/r/276178 (owner: 10Muehlenhoff) [15:29:42] (03PS7) 10ArielGlenn: make timeout in rsync header a parameter, increase value for stats rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/275787 (https://phabricator.wikimedia.org/T127514) [15:31:17] (03CR) 10ArielGlenn: [C: 032] make timeout in rsync header a parameter, increase value for stats rsyncd [puppet] - 10https://gerrit.wikimedia.org/r/275787 (https://phabricator.wikimedia.org/T127514) (owner: 10ArielGlenn) [15:31:57] 6Operations, 13Patch-For-Review: Randomly failing puppetmaster sync to strontium - https://phabricator.wikimedia.org/T128895#2103258 (10Volans) I saw that happened again today from `#wikimedia-operations` IRC channel: ``` Wed 04:45:15 grrrit-wm| (CR) Yuvipanda: [C: 2] tools: Add paws as a separate host [pu... 
[15:34:05] (03PS1) 10BBlack: common VCL: correct text for 405 on non-local PURGE [puppet] - 10https://gerrit.wikimedia.org/r/276180 [15:34:07] (03PS1) 10BBlack: upload VCL: use allowed_methods instead of local check [puppet] - 10https://gerrit.wikimedia.org/r/276181 [15:34:09] (03PS1) 10BBlack: common VCL: fix OPTIONS logic now that upload is fixed [puppet] - 10https://gerrit.wikimedia.org/r/276182 [15:34:27] (03CR) 10BBlack: [C: 032 V: 032] common VCL: correct text for 405 on non-local PURGE [puppet] - 10https://gerrit.wikimedia.org/r/276180 (owner: 10BBlack) [15:38:24] (03CR) 10BBlack: [C: 032 V: 032] upload VCL: use allowed_methods instead of local check [puppet] - 10https://gerrit.wikimedia.org/r/276181 (owner: 10BBlack) [15:38:47] (03PS3) 10Tim Landscheidt: ores: Move role classes to module role [puppet] - 10https://gerrit.wikimedia.org/r/270102 [15:40:20] (03CR) 10ArielGlenn: [C: 031] "looks great." [puppet] - 10https://gerrit.wikimedia.org/r/276158 (https://phabricator.wikimedia.org/T126767) (owner: 10Joal) [15:42:08] (03PS2) 10BBlack: common VCL: fix OPTIONS logic now that upload is fixed [puppet] - 10https://gerrit.wikimedia.org/r/276182 [15:42:38] (03CR) 10BBlack: [C: 032 V: 032] common VCL: fix OPTIONS logic now that upload is fixed [puppet] - 10https://gerrit.wikimedia.org/r/276182 (owner: 10BBlack) [15:43:52] (03PS1) 10Muehlenhoff: Also enable ferm on eventlog2001 [puppet] - 10https://gerrit.wikimedia.org/r/276189 [15:45:59] 6Operations, 7Availability, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: swiftrepl replication pass for thumbnails eqiad -> codfw - https://phabricator.wikimedia.org/T125791#2103307 (10fgiunchedi) the initial copy is still ongoing, at 474M over 685M objects. wrt the codfw switchover:... 
[15:46:33] (03PS2) 10Muehlenhoff: Also enable ferm on eventlog2001 [puppet] - 10https://gerrit.wikimedia.org/r/276189 [15:46:49] (03CR) 10Muehlenhoff: [C: 032 V: 032] Also enable ferm on eventlog2001 [puppet] - 10https://gerrit.wikimedia.org/r/276189 (owner: 10Muehlenhoff) [15:48:11] (03CR) 10ArielGlenn: [C: 031] "Ah, missed that. Good work, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/276163 (owner: 10Joal) [15:55:04] 6Operations, 13Patch-For-Review: Add ferm rules for eventlog hosts - https://phabricator.wikimedia.org/T113343#2103328 (10MoritzMuehlenhoff) 5Open>3Resolved a:3MoritzMuehlenhoff The eventlogging hosts now have ferm applied. [15:55:37] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2103333 (10RobH) a:3Ironholds Ok, so there are a few process driven steps we need to accomplish: * @Ironholds needs to review and sign L3. ** I rea... [15:56:56] (03PS2) 10ArielGlenn: adapt trebuchet-trigger for timeout to restart function [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/269465 (https://phabricator.wikimedia.org/T63882) [15:56:58] (03PS3) 10ArielGlenn: make fetch/checkout report a little clearer [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/219841 (https://phabricator.wikimedia.org/T103013) [15:56:59] 6Operations, 10Ops-Access-Requests: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2103336 (10RobH) I stand corrected, the analytics-search-user is a sudo group, so it needs Operations meeting review. The next meeting is Monday, 201... [16:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160309T1600). 
[16:00:05] bmansurov Dereckson: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [16:00:12] here [16:00:38] I can SWAT. [16:00:47] (03PS1) 10RobH: add bearloga & ironholds to analytics-search-user [puppet] - 10https://gerrit.wikimedia.org/r/276190 (https://phabricator.wikimedia.org/T129260) [16:01:07] (03PS1) 10ArielGlenn: bump version to 0.5.7-1 [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/276191 [16:02:02] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: puppet fail [16:02:14] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276172 (https://phabricator.wikimedia.org/T128917) (owner: 10Bmansurov) [16:02:18] (03PS4) 10Ottomata: Use generic job class to rsync cron datasets [puppet] - 10https://gerrit.wikimedia.org/r/276163 (owner: 10Joal) [16:02:25] (03CR) 10Ottomata: [C: 032 V: 032] Use generic job class to rsync cron datasets [puppet] - 10https://gerrit.wikimedia.org/r/276163 (owner: 10Joal) [16:02:29] (03PS4) 10Ottomata: Add rsync job for unique_devices dataset. [puppet] - 10https://gerrit.wikimedia.org/r/276158 (https://phabricator.wikimedia.org/T126767) (owner: 10Joal) [16:02:40] (03CR) 10Ottomata: [C: 032 V: 032] Add rsync job for unique_devices dataset. [puppet] - 10https://gerrit.wikimedia.org/r/276158 (https://phabricator.wikimedia.org/T126767) (owner: 10Joal) [16:02:50] (03Merged) 10jenkins-bot: Change LanguageOverlay bucket rates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276172 (https://phabricator.wikimedia.org/T128917) (owner: 10Bmansurov) [16:03:04] Hello. 
[16:05:12] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: Change LanguageOverlay bucket rates [[gerrit:276172]] (duration: 00m 52s) [16:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:20] ^ bmansurov sync'd check if possible [16:05:23] Dereckson: hello [16:05:47] thcipriani: thanks, don't see the change yet. will give it a few more [16:07:02] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276179 (https://phabricator.wikimedia.org/T129342) (owner: 10Dereckson) [16:07:34] (03Merged) 10jenkins-bot: Wikipedia while at Women of the World Festival throttle rule [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276179 (https://phabricator.wikimedia.org/T129342) (owner: 10Dereckson) [16:09:30] !log thcipriani@tin Synchronized wmf-config/throttle.php: SWAT: Wikipedia while at Women of the World Festival throttle rule [[gerrit:276179]] (duration: 00m 30s) [16:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:38] ^ Dereckson throttle sync'd! [16:10:24] thcipriani: thanks [16:10:45] thcipriani: all is good now. thanks again [16:11:01] bmansurov: glad to hear it :) thanks for checking! 
[16:11:22] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: puppet fail [16:13:08] seems that the jenkins job beta-mediawiki-config-update-eqiad is failing to update the beta cluster for mediawiki-config [16:17:05] (03PS1) 10Joal: Correct typo in unique_devices rsync cron [puppet] - 10https://gerrit.wikimedia.org/r/276195 [16:18:21] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [16:21:41] (03CR) 10BryanDavis: "Failing to start in beta cluster --" [puppet] - 10https://gerrit.wikimedia.org/r/274696 (https://phabricator.wikimedia.org/T126677) (owner: 10Muehlenhoff) [16:21:51] (03PS1) 10Alexandros Kosiaris: Sync up eqiad/codfw LVS IP assignments for services [dns] - 10https://gerrit.wikimedia.org/r/276196 (https://phabricator.wikimedia.org/T129234) [16:22:45] (03CR) 10Ottomata: [C: 032 V: 032] Correct typo in unique_devices rsync cron [puppet] - 10https://gerrit.wikimedia.org/r/276195 (owner: 10Joal) [16:22:51] heh didn't see that [16:23:02] and I did check the directories but not quite that carefully it seems [16:25:08] (03CR) 10Mobrovac: "Related change-set: Ia48bcb3c07a606e4949dbcb444ccb40de4a08f25" [dns] - 10https://gerrit.wikimedia.org/r/276196 (https://phabricator.wikimedia.org/T129234) (owner: 10Alexandros Kosiaris) [16:25:47] (03PS1) 10RobH: setting up labwebtest-roots and adding krenair to it [puppet] - 10https://gerrit.wikimedia.org/r/276197 (https://phabricator.wikimedia.org/T129097) [16:26:23] wt2html: Exceeded max resource use: wikitextSize. Aborting! for parsoid ongoing [16:26:36] andrewbogott: so which of the labwebtest roles would be best to include the new labwebtest-roots group? [16:27:20] I have the new usergroup in a patchset, but its not implemented on the labwebtest host by anything yet (also not merged, figured we could append it into that patchset) [16:27:33] looking... 
[16:28:27] I think it maybe needs to be its own role and then inserted via hiera [16:28:42] Outside of hiera there’s nothing unique applied to that box [16:29:12] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:29:19] well, or you could put it in the horizon class and then add a realm switch. That seems more obscure though. [16:32:19] hiera seems sensible, i just didn't wanna start doing it in a vacuum [16:34:27] (03PS2) 10Alexandros Kosiaris: Introduce the SC[AB] clusters in codfw [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) (owner: 10Mobrovac) [16:34:29] (03PS1) 10Alexandros Kosiaris: lvs: SC[AB] services lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/276199 (https://phabricator.wikimedia.org/T129234) [16:35:25] 6Operations, 10RESTBase-Cassandra, 6Services, 13Patch-For-Review: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#2103414 (10Eevans) Update: [[https://issues.apache.org/jira/browse/CASSANDRA-8464|Changes]] to Cassandra's [[https://github.com/apache/cassandra/blob/cassa... [16:35:42] 6Operations, 6Services, 3Mobile-Content-Service: Investigate server flapping after 3/7/2016 deploy - https://phabricator.wikimedia.org/T129237#2103415 (10mobrovac) p:5Unbreak!>3High [16:35:51] robh: I’m out for a couple of hours, but — no rush with this. [16:36:23] cool, im happy to have something non procurement between my spreadsheet updates ;D [16:37:03] 6Operations, 6Services, 3Mobile-Content-Service: Investigate server flapping after 3/7/2016 deploy - https://phabricator.wikimedia.org/T129237#2099382 (10mobrovac) So, I've checked out the offending deploy commit on `scb1001` under my local user and ran it without problems. I think it's worth giving the depl... [16:53:07] 6Operations, 10Mail, 10Wikipedia-Store: why is shop@ -> board@ ? 
- https://phabricator.wikimedia.org/T127503#2103475 (10Ppena) Thanks for the question guys. I don't recognize the shop@ email... I sent a test email to try and figure out who is receiving it and if its still active. Do you have a list of people... [16:56:40] (03CR) 10Alexandros Kosiaris: "some extra changes here and there. I 've split from the common hiera the per DC parts" [puppet] - 10https://gerrit.wikimedia.org/r/276153 (https://phabricator.wikimedia.org/T129234) (owner: 10Mobrovac) [16:57:43] beta-mediawiki-config-update-eqiad FAILURE Failed deployment on the EQIAD beta cluster :-/ Please contact a member of the beta project to fixup the working directory on the destination server. in 1s [16:58:01] @ https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/4451/console [17:02:07] 6Operations, 10Ops-Access-Requests: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2103523 (10Dereckson) [17:07:57] 6Operations, 10Ops-Access-Requests: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2103542 (10greg) Thanks @Dereckson. I'm going to assess where we are with the SWAT membership today (I got a lot of private emails from interested people) and where/when... 
[17:15:18] (03PS2) 10RobH: setting up labwebtest-roots and adding krenair to it [puppet] - 10https://gerrit.wikimedia.org/r/276197 (https://phabricator.wikimedia.org/T129097) [17:16:17] (03CR) 10Alex Monk: "Should be called labtestweb rather than labwebtest" [puppet] - 10https://gerrit.wikimedia.org/r/276197 (https://phabricator.wikimedia.org/T129097) (owner: 10RobH) [17:16:34] Krenair: damn i bungled that halfway and i thought i fixed it all, heh [17:17:30] fixing now [17:19:10] (03PS3) 10RobH: setting up labtestweb-roots and adding krenair to it [puppet] - 10https://gerrit.wikimedia.org/r/276197 (https://phabricator.wikimedia.org/T129097) [17:21:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000000.0] [17:21:52] Dereckson: I may have fixed the beta cluster update job. Testing now [17:26:50] (03CR) 10Alex Monk: [C: 031] "I was waiting for you guys to confirm it all appears working as expected" [dns] - 10https://gerrit.wikimedia.org/r/276088 (https://phabricator.wikimedia.org/T126385) (owner: 10Dzahn) [17:28:27] (03CR) 10RobH: [C: 04-1] "Error: Failed to compile catalog for node labtestweb2001.wikimedia.org: Role class role::labtestweb not found at /mnt/jenkins-workspace/pu" [puppet] - 10https://gerrit.wikimedia.org/r/276197 (https://phabricator.wikimedia.org/T129097) (owner: 10RobH) [17:32:41] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2103673 (10RobH) [17:34:44] 6Operations, 10ops-codfw, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: rack/setup/deploy rdb200[5-6] - https://phabricator.wikimedia.org/T129178#2103680 (10RobH) So @joe is working with @elukey for the implementation of these. Once we have the OS installed and keys accepted, this task should get assigned... 
[17:35:05] 6Operations, 10Ops-Access-Requests: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2103681 (10hashar) I endorse @Dereckson . He has a large expertise in reviewing operations/mediawiki-config and I think he is the de facto lead triages for #wikimedia-si... [17:35:31] (03CR) 10GWicke: restbase: make restbase configuration $master_dc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/275536 (https://phabricator.wikimedia.org/T126235) (owner: 10Giuseppe Lavagetto) [17:35:50] (03CR) 10GWicke: [C: 04-1] restbase: make restbase configuration $master_dc [puppet] - 10https://gerrit.wikimedia.org/r/275536 (https://phabricator.wikimedia.org/T126235) (owner: 10Giuseppe Lavagetto) [17:36:20] (03PS7) 10Ottomata: Use xtrabackup/innobackupex to do regular full and incremental backups of analytics-meta MySQL [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) [17:36:42] robh:ok [17:37:22] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 51.72% of data above the critical threshold [5000000.0] [17:37:24] 6Operations, 10Ops-Access-Requests: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2103685 (10Krenair) I've been watching @Dereckson triage #Wikimedia-Site-Requests for a long while, and also endorse this. [17:40:33] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2103701 (10EBernhardson) Do we have enough rack space to rack up all 16 new servers, before taking down the old 16? Just trying to plan out how we will do the switchover bet... 
[17:40:42] !log deployed patch for T122056 [17:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:41:09] (03CR) 10jenkins-bot: [V: 04-1] Use xtrabackup/innobackupex to do regular full and incremental backups of analytics-meta MySQL [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [17:44:25] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2103735 (10RobH) @EBernhardson space is at a premium in eqiad. I don't think we have enough space to evenly distribute all 16 new systems without removing some of the old s... [17:47:52] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [17:49:01] (03PS8) 10Ottomata: Use xtrabackup/innobackupex to do regular full and incremental backups of analytics-meta MySQL [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) [17:49:08] 6Operations, 13Patch-For-Review: Sudden increase in NOTICE events from hhvm while trying to de-pool rdb1003 for maintenance - https://phabricator.wikimedia.org/T128730#2103776 (10elukey) @aaron: today I reimaged rdb1003 using the following procedure: https://etherpad.wikimedia.org/p/redis-rdb-reimaging Every... [17:49:13] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [17:52:33] 6Operations, 6Discovery, 10hardware-requests: Refresh elastic10{01..16}.eqiad.wmnet servers - https://phabricator.wikimedia.org/T128000#2103787 (10RobH) Update from IRC chat: * Get quotes to match the CPU/RAM of the last codfw elastic purchase. * Get quotes to match the existing eqiad elastic spec & compari... 
[17:53:46] !log deployed patch for T110143 to wmf16 [17:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:59:04] Is it usual that it takes 47s before the appservers determine that my password is wrong when trying to login at enwp? [17:59:21] backend-timing: D=47252228 t=1457546138828184 [18:04:40] (03CR) 10Krinkle: [C: 04-1] "Looks fine, just -1 as placeholder since the cookie got renamed." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247970 (https://phabricator.wikimedia.org/T91820) (owner: 10Ori.livneh) [18:05:17] 6Operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#2103878 (10Nemo_bis) p:5Low>3Normal a:3AlexMonk-WMF [18:05:19] bblack: have a moment for a short PM ? [18:05:48] (03PS1) 10Ori.livneh: X-Wikimedia-Debug: profile if 'profiler' attribute set [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276220 [18:05:50] (03PS1) 10Ori.livneh: Remove unused wmf-deployment symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276221 [18:06:06] jynus: Which patches should I prioritise in review? [18:06:32] Krinkle, this is the one: https://gerrit.wikimedia.org/r/#/c/267659/ [18:06:41] blocking me a lot [18:06:44] ori: Curious about https://gerrit.wikimedia.org/r/#/c/222673/ was/is this causing an issue? [18:07:15] 6Operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#2103887 (10AlexMonk-WMF) a:5AlexMonk-WMF>3Krenair [18:07:45] Krinkle: can't look right now, sorry -- on my way out the door. [18:07:49] k [18:17:16] 6Operations, 10ops-eqiad: Rack and Initial setup db1074-79 - https://phabricator.wikimedia.org/T128753#2103933 (10Cmjohnson) db1074,5,6 installed without an issue and are now ssh accessible. 1077 and 1078 did not install correctly and when I access via palladium I am put at this prompt. 
~ # pwd / ~ # [18:20:01] matanya: yes [18:20:22] (03PS1) 10Filippo Giunchedi: varnish: route upload cache backends to codfw [puppet] - 10https://gerrit.wikimedia.org/r/276223 (https://phabricator.wikimedia.org/T129089) [18:20:24] (03PS1) 10Filippo Giunchedi: varnish: route codfw as 'direct' for upload caches [puppet] - 10https://gerrit.wikimedia.org/r/276224 (https://phabricator.wikimedia.org/T129089) [18:20:26] (03PS1) 10Filippo Giunchedi: varnish: route eqiad to codfw for upload caches [puppet] - 10https://gerrit.wikimedia.org/r/276225 (https://phabricator.wikimedia.org/T129089) [18:21:43] (03PS3) 10Chad: Gerrit manifest cleanup [puppet] - 10https://gerrit.wikimedia.org/r/275911 [18:22:05] 6Operations, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: switch upload varnish backends to codfw ahead of full switch - https://phabricator.wikimedia.org/T129089#2103988 (10fgiunchedi) see also related documentation https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Media_storage.2F... [18:24:20] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to to analytics-search-user for Mikhail Popov and Oliver Keyes - https://phabricator.wikimedia.org/T129260#2103989 (10Ironholds) Signed! Okay, that gives us a 4-day window to switch everything over. [18:25:16] 6Operations, 7HHVM, 13Patch-For-Review: Rise in "parent, LightProcess exiting" fatals - https://phabricator.wikimedia.org/T124956#1970973 (10greg) 5Invalid>3Resolved >>! In T124956#2092151, @ori wrote: > These aren't actually fatals. When HHVM is configured to have a nonzero number of LightProcess worker... 
[18:25:17] (03PS1) 10Volans: Change codfw extenal storage topology [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276229 (https://phabricator.wikimedia.org/T127330) [18:25:22] 6Operations, 10Traffic, 6WMF-Communications, 7HTTPS, 7Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2104009 (10Florian) @Jalexander: Is there any news you could share or a status update? :) [18:25:26] 6Operations, 7HHVM, 13Patch-For-Review: Rise in "parent, LightProcess exiting" console spam - https://phabricator.wikimedia.org/T124956#2104010 (10greg) [18:26:25] (03CR) 10Volans: "I'll change the topology with repl.pl and then merge this." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276229 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [18:27:39] (03CR) 10Jcrespo: [C: 031] Change codfw extenal storage topology [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276229 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [18:28:15] ping me when finished so I can setup the master-master [18:28:32] (03PS2) 10Filippo Giunchedi: Set cross-DC swift writes to be sync for originals for switchover testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276071 (https://phabricator.wikimedia.org/T129089) (owner: 10Aaron Schulz) [18:29:08] (03CR) 10Krinkle: [C: 04-1] "Not sure what's up with these host/ip mismatches or whether that affects eqiad currently. To be figured out. Might be okay-ish." (038 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [18:29:26] jynus: That's a first pass. Will look closer now :) [18:29:41] (with a script) [18:30:31] wow [18:30:36] I did not expect that [18:31:13] in fact, that is live [18:31:43] Krinkle: lolz [18:31:51] can you do the same on db-eqiad? 
[18:31:55] :) [18:31:56] OK [18:32:33] regexp match the php file, put in a file on mira, $(host $ip) each and then compare one by one quickly [18:32:34] because AFAIK, that is a copy from our live config [18:32:38] Right [18:33:29] in any case, I will fix that and double check it, but that is technically not my patch [18:33:40] Yeah, I'm just going over the file as a whole [18:33:47] my biggest issue was with the idea [18:33:54] (03CR) 10Aaron Schulz: [C: 031] "Change itself looks fine. The host problems need fixing here or in a follow up (though that's pre-existing)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [18:34:25] leaving it like that, we have the issue of potentially write to the real masters from codfw [18:34:44] despite being in read-only mode [18:35:10] do we trust mediawiki? because your words tell me we maybe shouldn't :-) [18:35:37] (03CR) 10Krinkle: [C: 031] "Can't find usage, though considering past breakage, let's be extra careful and try this on a canary (if you haven't already)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276221 (owner: 10Ori.livneh) [18:36:26] jynus: Functionally it would be the correct thing. Right now any requests local to codfw, for one reason or another when we're not in read-only must write to eqiad (or fail hard) [18:36:52] In Multi-DC we'll prevent that by enforcing master connection to only be on POST and from the edge we'll route POST to the primary DB [18:37:01] ok [18:37:12] But any legacy code or regression or things we missed, should connect to eqiad as fallback (or make it throw an exception) [18:37:17] obviously, on failover [18:37:22] the non-ssl thing bothers me [18:37:42] for our first stage of just testing failover, is there a reason we can't just use local read-only masters, Krinkle ? 
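The host/IP check described above ("regexp match the php file ... $(host $ip) each and then compare") could be sketched roughly as follows. This is a minimal illustration, not the script that was actually run: the sample config layout, the `dbexample*` hostnames, the addresses, and the function names are all placeholders invented for the example, and the reverse lookup is injectable so the sketch runs without network access.

```python
import re
import socket

# Matches PHP array entries such as: 'dbexample1' => '192.0.2.10',
HOST_IP_RE = re.compile(r"'([A-Za-z][\w.-]*)'\s*=>\s*'(\d{1,3}(?:\.\d{1,3}){3})'")


def parse_host_ips(php_text):
    """Return {hostname: ip} for every 'name' => 'a.b.c.d' pair in the text."""
    return dict(HOST_IP_RE.findall(php_text))


def system_reverse_lookup(ip):
    """Reverse-resolve an IP via the system resolver; None if there is no PTR."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def find_mismatches(host_ips, reverse_lookup=system_reverse_lookup):
    """Return [(configured_name, ip, resolved_name)] where the names disagree.

    Only the first DNS label of the resolved name is compared, since the
    config in this sketch is assumed to use short hostnames.
    """
    mismatches = []
    for hostname, ip in sorted(host_ips.items()):
        resolved = reverse_lookup(ip)
        short_name = resolved.split(".")[0] if resolved else None
        if short_name != hostname:
            mismatches.append((hostname, ip, resolved))
    return mismatches


# Made-up sample; real config files will differ in layout and content.
SAMPLE_CONFIG = """
$exampleHosts = array(
    'dbexample1' => '192.0.2.10',
    'dbexample2' => '192.0.2.11',
    'dbexample3' => '192.0.2.12',
);
"""

# Stand-in for reverse DNS so the example runs offline.
FAKE_PTR = {
    "192.0.2.10": "dbexample1.example.org",
    "192.0.2.11": "other7.example.org",
}

if __name__ == "__main__":
    for row in find_mismatches(parse_host_ips(SAMPLE_CONFIG), FAKE_PTR.get):
        print(row)
```

Calling `find_mismatches(parse_host_ips(text))` without the second argument would use the system resolver instead of the fake table. A reused IP, as with the decommissioned hosts mentioned later in the log, would show up as a row whose resolved name belongs to a different machine.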
[18:37:43] connecting to codfw masters (which may be stale) could cause corruption of secondary data [18:37:58] we would change the masters of both eqiad and codfw to codfw masters [18:38:00] I'd be fine if the patch went back to doing it that way [18:38:02] Either we have it connect to eqiad masters or it throws fatal [18:38:04] so in a future iteration [18:38:15] we should have that config in a 3rd file [18:38:23] and explicitly set the masters [18:38:31] (but the config can remain as pointing to eqiad, and ideally would not be in a DC specific fine, we may wanna centralise the master value and dbhost=>IP mapping) [18:38:38] not base them on position of the slave weights [18:38:50] Yeah [18:39:11] AaronSchulz, that was why I had local read-only masters on my first iteration [18:39:18] and then I changed it [18:39:23] there is SSL now [18:39:43] (03CR) 10Aaron Schulz: "I'd prefer the original read-only local master approach for a current phase of multi-DC work since we have no SSL atm, as Jaime remarked." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [18:39:48] however, current live masters do not support SSL at all, we do it through a slave [18:40:12] hence the weird topology: https://dbtree.wikimedia.org/ [18:40:43] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [18:40:44] Krinkle: any objecting to going to back the the way the patch was before then? [18:40:48] but we cannot setup that back, due to mediawiki limitation of no-multi-tier slaves [18:41:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 53.33% of data above the critical threshold [5000000.0] [18:41:12] you are going to get me crazy! 
:-P [18:41:26] AaronSchulz: Only if 1) the mysql server itself is configured as read-only and 2) we make it impossible for MediaWiki to connect in non-readonly mode and thinking it is DB_MASTER [18:41:44] Krinkle, all slaves except the analytics are allways in read only mode [18:41:50] k [18:41:51] and labs [18:42:19] "we make it impossible for MediaWiki to connect in non-readonly mode" not my domain [18:42:23] so cannot say [18:42:38] in fact [18:42:45] (03PS4) 10RobH: setting up labtestweb-roots and adding krenair to it [puppet] - 10https://gerrit.wikimedia.org/r/276197 (https://phabricator.wikimedia.org/T129097) [18:42:46] I know that when deployment was in mira [18:42:54] it tried to write to a local master [18:42:59] not to the real master [18:43:07] no, I lie [18:43:20] it complained about read-only mode on codfw [18:43:36] if we want db-codfw to always list the stand-by master as master when it isn't, we need to make sure mediawiki knows not to connect to it to avoid secondary data corruption with stale reads and such. Writes would fail either way, but we shouldn't rely on it faiing because mysql bounces the write. [18:43:37] but I think it was pointing to the global eqiad master [18:43:56] Krinkle: the readonlyreason fields should handle (2) [18:44:17] the slave lag thing is annoying? can we switch to the heartbeat stuff beforehand? [18:44:23] !log Changing topology of local codfw masters for es2 and es3 before merging https://gerrit.wikimedia.org/r/#/c/276229/1 T127330 [18:44:24] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330 [18:44:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:32] s/?/. hehe [18:44:44] AaronSchulz, not a blocker, really [18:45:06] AaronSchulz: So MediaWiki does not establish a DB_MASTER connection at all if the master has a readonly reason (not even for a select?) 
[18:45:21] well I guess the lag will just be estimated too low sometimes, heh [18:45:23] for maintenance, it complained [18:45:32] my biggest fear would be the queue [18:45:39] (03PS5) 10RobH: setting up labtestweb-roots and adding krenair to it [puppet] - 10https://gerrit.wikimedia.org/r/276197 (https://phabricator.wikimedia.org/T129097) [18:45:40] so not a hard blocker, right [18:45:50] but that doesn't exist on codfw really [18:46:03] jynus: What are the weights based on? [18:46:27] jynus: e.g. 'api' '2.8TB 160GB' has 100 in eqiad, but '3.3TB 160GB, api' in this commit uses 50 [18:46:27] Krinkle, it is a combination of fine-tunning eqiad for months [18:46:37] and knowing what works and what doesn't [18:46:46] a comparison of the hardware [18:46:50] jynus: only thing curious about the weights was high-weight vslow servers [18:46:50] and the roles [18:47:03] the idea is that those are not definitive [18:47:06] * AaronSchulz doesn't see that in eqiad [18:47:24] * jynus points to the patch description [18:47:40] but if we start with something, we can fine-tune [18:48:05] no terbium and no dumps on codfw [18:48:23] which are, respectivelly, vslow and dump, IIRC [18:48:45] (03CR) 10Krinkle: Prepare db-codfw.php for a live deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 (owner: 10Jcrespo) [18:48:47] I had it at 0 on a previous version, changed it later when I learned about that [18:48:54] right, but are you disabling the chron jobs or do they just not run during the switchover window? [18:49:14] (e.g. 
the special page rebuilds) [18:49:27] we can do whatever, but if they run, they will run on a local slave [18:49:52] now, if that is the case, they will probably fail [18:49:57] (03CR) 10RobH: "The latest test compile of this is http://puppet-compiler.wmflabs.org/1997/" [puppet] - 10https://gerrit.wikimedia.org/r/276197 (https://phabricator.wikimedia.org/T129097) (owner: 10RobH) [18:49:58] due to read only on eqiad [18:50:10] there is no terbium equivalen as of now [18:50:17] (03PS1) 10Ema: Add h2_spdy_stats.stp [puppet] - 10https://gerrit.wikimedia.org/r/276233 (https://phabricator.wikimedia.org/T96848) [18:50:21] if we setup it, I will lower it [18:50:45] the question is, I cannot test without that, and I assure you the current config is worse! [18:51:07] I'm thinking about during the switchover test, everything will be codfw and r/w, so special page rebuilds would use the vslow server, which also has 400 load in some shards [18:51:18] I change the config like 4 times a day [18:51:19] * AaronSchulz feels like he is missing something obvious though [18:51:30] I think we will change by that time [18:51:33] apologies in advance, heh ;) [18:51:37] right [18:51:43] but I need a decision about local vs remote masters [18:51:58] performance/security vs. correctness/potential issues [18:52:11] I can see both arguments [18:52:32] (03PS1) 10Cmjohnson: Adding labsdb1008 to db.cfg [puppet] - 10https://gerrit.wikimedia.org/r/276235 [18:52:33] !log anomie@tin Synchronized php-1.27.0-wmf.16/includes/session/SessionManager.php: Add backtrace to log for MW_NO_SESSION warning mode [[gerrit:276232]] (duration: 00m 50s) [18:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:52:47] I think the SSL bit for the current masters is a blocker. Having crappier slave lag estimates due to a local master and SHOW SLAVE STATUS is far more tolerable. 
[18:53:06] * AaronSchulz just needs to get Krinkle to agree :) [18:53:43] to be fair, I was recently surprised that lag is lower than though over wan [18:53:50] (under normal conditions) [18:53:58] AaronSchulz: I'm fine with db-codfw having a 'master' entry for something that is a slave at that point. As long as connections aren't even attempted (not fail hard on the mysql side) [18:54:04] thanks to semi-sync [18:54:08] that's the case right? So it's fine. [18:54:30] maybe I can log that? [18:54:39] actually, that would be seen on the logs [18:54:45] jynus: do codfw masters get replicated from eqiad the same way eqiad slaves get replication? [18:54:54] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:54:58] and codfw slaves replicate from their local master or from the primary? [18:55:06] no, look: [18:55:13] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:55:16] (03PS6) 10RobH: setting up labtestweb-roots and adding krenair to it [puppet] - 10https://gerrit.wikimedia.org/r/276197 (https://phabricator.wikimedia.org/T129097) [18:55:18] (03CR) 10Dzahn: [C: 031] setting up labtestweb-roots and adding krenair to it [puppet] - 10https://gerrit.wikimedia.org/r/276197 (https://phabricator.wikimedia.org/T129097) (owner: 10RobH) [18:55:24] Krinkle, https://dbtree.wikimedia.org/ [18:55:44] Krenair: sorry for all the pings from the patchset, im rebasing for merge now so it'll stop soon! [18:56:04] see s2 for a "final" configuration [18:56:08] jynus: Hm.. so for s1 the codfw master replicates from an eqiad slave and then onward to its slaves. For s2 the codfw master is an eqiad slave. 
[18:56:15] robh, no problem [18:56:22] the others have pending a master failover [18:56:33] which I was going to wait until full db failover [18:56:38] but s2 is the normal state [18:56:40] mutante: when you're done with stuff - do you have a moment to lookover some icinga stuff I tried to and failed to do yesterday? https://phabricator.wikimedia.org/T129209 [18:56:41] k [18:56:50] in fact [18:56:59] there is one link not represented there [18:57:14] which is the db2017 -> db1018 replication [18:57:35] (03CR) 10RobH: [C: 032] setting up labtestweb-roots and adding krenair to it [puppet] - 10https://gerrit.wikimedia.org/r/276197 (https://phabricator.wikimedia.org/T129097) (owner: 10RobH) [18:57:41] which is replicating nothing, and can be deleted if you belive it could create problems [18:57:51] jynus: uh? Interesting [18:57:56] but rember that db2017 is read only [18:57:56] It's a loop? [18:58:16] Krinkle, subscribed to ops- ? [18:58:20] (03CR) 10Volans: [C: 032] "Topology changed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276229 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [18:58:33] see my last message with a graph, that is the intended setup [18:58:33] I am, but not caught up with everything [18:58:38] sounds good [18:59:03] topology doesn't change, only the mediawiky-designed master [18:59:05] (03Merged) 10jenkins-bot: Change codfw extenal storage topology [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276229 (https://phabricator.wikimedia.org/T127330) (owner: 10Volans) [18:59:28] (it actually changes all the time for maintenance, but you get the idea) [18:59:58] so, we leave it as local masters only in read only mode and check for issues? 
[19:00:27] (we can reevaluate at any time) [19:01:00] same as weights, I am tweaking all the time without you knowing [19:01:03] !log volans@tin Synchronized wmf-config/db-codfw.php: Change codfw external storage topology T127330 (duration: 00m 27s) [19:01:07] T127330: Migration from es2001-es2010 to es2011-es2019 - https://phabricator.wikimedia.org/T127330 [19:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:48] but I really need to do some load testing and buffer warmup a week in advance [19:02:58] jynus: Yeah, got it. [19:03:06] jynus: got a meeting, will reconfirm after. [19:03:09] ok [19:03:17] jynus: meanwhile, Id' feel a lot better if those hostnames/ip match :) [19:03:18] will fix the hosts [19:03:39] and think of a time to talk about memcaches this or next week [19:05:14] 6Operations, 10DBA, 13Patch-For-Review, 7Performance, and 2 others: Stress-test mediawiki application servers at codfw (specially to figure out db weights configuration) and basic buffer warming - https://phabricator.wikimedia.org/T124697#2104147 (10jcrespo) [19:05:16] 6Operations, 10DBA, 10MediaWiki-Configuration, 6Release-Engineering-Team, and 3 others: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#2104148 (10jcrespo) [19:08:31] (03PS2) 10Ottomata: Remove clientIp from EventLogging varnishkafka format [puppet] - 10https://gerrit.wikimedia.org/r/275892 (https://phabricator.wikimedia.org/T128407) [19:09:05] (03CR) 10Ottomata: [C: 032 V: 032] Remove clientIp from EventLogging varnishkafka format [puppet] - 10https://gerrit.wikimedia.org/r/275892 (https://phabricator.wikimedia.org/T128407) (owner: 10Ottomata) [19:09:50] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Root on labtestweb for Alex Monk (Krenair) - https://phabricator.wikimedia.org/T129097#2104171 (10RobH) 5Open>3Resolved a:3RobH So this change is now live, and @AlexMonk-WMF now has sudo root permissions on labtestweb2001. 
Please reopen this task... [19:16:41] 6Operations, 10Ops-Access-Requests, 13Patch-For-Review: Root on labtestweb for Alex Monk (Krenair) - https://phabricator.wikimedia.org/T129097#2104204 (10AlexMonk-WMF) It works, thanks @RobH I support this access request btw :) [19:18:16] (03PS2) 10Dzahn: switch wikitech-static to new jessie VM [dns] - 10https://gerrit.wikimedia.org/r/276088 (https://phabricator.wikimedia.org/T126385) [19:19:02] (03CR) 10Dzahn: [C: 032] "everybody be like "when it's ready". risking it :)" [dns] - 10https://gerrit.wikimedia.org/r/276088 (https://phabricator.wikimedia.org/T126385) (owner: 10Dzahn) [19:21:17] 6Operations, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2104213 (10Dzahn) Done, removed the WIP and merged DNS switch. [19:22:10] 6Operations, 6Labs, 10wikitech.wikimedia.org, 13Patch-For-Review: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2104214 (10Dzahn) the sync between wikitech and wikitech-static .. does that need any change because the IP changed? or all just hostname based? 
[19:22:47] 6Operations, 6Labs, 10wikitech.wikimedia.org: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2104215 (10Dzahn) [19:23:21] 6Operations, 6Labs, 10wikitech.wikimedia.org: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2012874 (10Dzahn) a:5Andrew>3Krenair [19:23:46] andrewbogott: ^ switched [19:23:54] 6Operations, 6Labs, 10wikitech.wikimedia.org: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2104219 (10Krenair) I don't think anything needs to be changed - the access restriction was removed recently and the dumps became public [19:24:03] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail [19:24:58] (03PS1) 10Filippo Giunchedi: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) [19:25:16] 6Operations, 13Patch-For-Review, 7Tracking: reduce amount of remaining Ubuntu 12.04 (precise) systems - https://phabricator.wikimedia.org/T123525#2104226 (10Dzahn) [19:25:18] 6Operations, 6Labs, 10wikitech.wikimedia.org: Update wikitech-static OS/PHP version - https://phabricator.wikimedia.org/T126385#2104222 (10Dzahn) 5Open>3Resolved eh, true :) i just did that in T54170 i will just claim you resolved this [19:25:24] (03PS1) 10Ottomata: Remove etcd usage from eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/276244 (https://phabricator.wikimedia.org/T128407) [19:26:02] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CRITICAL: Stopped EventLogging jobs: processor/server-side-0 processor/client-side-11 processor/client-side-10 processor/client-side-09 processor/client-side-08 processor/client-side-07 processor/client-side-06 processor/client-side-05 processor/client-side-04 processor/client-side-03 processor/client-side-02 processor/client-side-01 processor/client-side [19:26:03] Krenair: thanks for the upgrade :) barnstar 
token (i just saw that it exists ) [19:26:06] (03PS2) 10Ottomata: Remove etcd usage from eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/276244 (https://phabricator.wikimedia.org/T128407) [19:27:07] mutante, you're welcome. I'm hoping the notes I left will be useful next time this machine needs changing :) [19:27:28] hehe, i hope so too [19:27:58] (03CR) 10Ottomata: [C: 032] Remove etcd usage from eventlogging processor [puppet] - 10https://gerrit.wikimedia.org/r/276244 (https://phabricator.wikimedia.org/T128407) (owner: 10Ottomata) [19:29:33] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK: OK: All defined EventLogging jobs are runnning. [19:30:17] (03PS2) 10Filippo Giunchedi: prometheus: add node_exporter support [puppet] - 10https://gerrit.wikimedia.org/r/276243 (https://phabricator.wikimedia.org/T92813) [19:30:35] 6Operations, 6Labs, 10wikitech.wikimedia.org: decom old wikitech-static machine - https://phabricator.wikimedia.org/T129391#2104237 (10Dzahn) [19:31:20] how short are you thinking mutante? [19:31:27] 6Operations, 6Labs, 10wikitech.wikimedia.org: decom old wikitech-static machine - https://phabricator.wikimedia.org/T129391#2104237 (10Dzahn) [19:31:34] few days? [19:32:39] Krenair: yea, few days feels right. what do you think [19:32:54] I want to check importing is working fine tomorrow [19:33:03] over the weekend [19:33:05] i'd say [19:33:07] right now I'm a bit suspicious [19:33:30] ok, yes, that should be checked [19:34:53] PROBLEM - dhclient process on mx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:35:23] PROBLEM - puppet last run on mx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:35:42] PROBLEM - RAID on mx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:35:43] PROBLEM - salt-minion processes on mx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[19:35:44] PROBLEM - Check size of conntrack table on mx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:35:53] PROBLEM - configured eth on mx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:02] PROBLEM - DPKG on mx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:03] PROBLEM - spamassassin on mx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:22] PROBLEM - Disk space on mx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:36:32] PROBLEM - Exim SMTP on mx1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:16] that's not good and i cant connect either [19:38:43] hrmm [19:39:13] no bueno... [19:39:33] mgmt , no luck [19:39:50] its a vm no? [19:39:55] its not in racktables [19:43:34] RECOVERY - dhclient process on mx1001 is OK: PROCS OK: 0 processes with command name dhclient [19:44:03] RECOVERY - puppet last run on mx1001 is OK: OK: Puppet is currently enabled, last run 13 minutes ago with 0 failures [19:44:14] RECOVERY - RAID on mx1001 is OK: OK: no RAID installed [19:44:23] RECOVERY - salt-minion processes on mx1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:44:24] RECOVERY - Check size of conntrack table on mx1001 is OK: OK: nf_conntrack is 0 % full [19:44:34] RECOVERY - configured eth on mx1001 is OK: OK - interfaces up [19:44:42] RECOVERY - DPKG on mx1001 is OK: All packages OK [19:44:43] RECOVERY - spamassassin on mx1001 is OK: PROCS OK: 3 processes with args spamd [19:45:02] RECOVERY - Disk space on mx1001 is OK: DISK OK [19:45:12] RECOVERY - Exim SMTP on mx1001 is OK: OK - Certificate will expire on 09/22/2016 18:01. [19:47:51] chasemp: heyyyy yt? 
[19:47:53] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0] [19:48:45] (03PS1) 10Jcrespo: Remove decommisioned hosts whose ips had been reused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276247 [19:50:53] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [19:50:59] (03CR) 10Dzahn: "i do planet.wikimedia.org but this is planet.wikimedia.de. i have nothing against this though or anything :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/275815 (https://phabricator.wikimedia.org/T127176) (owner: 10WMDE-leszek) [19:51:18] Oh that's what happened [19:51:21] Decom + IP reuse [19:51:43] Makes a lot of sense :) I was pretty baffled when I saw Krinkle's comments earlier [19:52:09] removing all of those "#do not remove or comment out" ? :) [19:52:37] I can't wait until we stop using PHP arrays as a primitive way to resolve dns hostnames [19:53:05] Oh wow, db30-69 :O [19:53:09] Those are pmtpa hostnames/IPs [19:55:37] (03CR) 10Krinkle: Remove decommisioned hosts whose ips had been reused (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276247 (owner: 10Jcrespo) [19:56:09] (03PS2) 10Jcrespo: Clean up database configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276247 [19:58:05] ottomata:afk for a minute, i will ping you back in a few [19:58:33] k [19:58:55] 6Operations, 10Mail, 10Wikipedia-Store: why is shop@ -> board@ ? - https://phabricator.wikimedia.org/T127503#2104351 (10Dzahn) Hi Ppena, so it's a bit complicated because the mail setup is split up betwen different teams. Some things are done by Tech Ops and other things are done by Office IT. This task is... [20:00:05] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160309T2000). Please do the needful. 
[20:00:53] RECOVERY - cassandra-b CQL 10.64.0.118:9042 on restbase1011 is OK: TCP OK - 0.006 second response time on port 9042 [20:01:01] jynus: is db1010 also decom? seems to be up [20:01:32] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [20:02:22] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:03:04] (03CR) 10Jcrespo: "It has to die in fire! All hosts <=db1010 should be already decommissioned (only db1001 and db1009 will survive for now because they are i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276247 (owner: 10Jcrespo) [20:03:25] Krinkle, that is a mistake [20:03:39] meaning, the patch is ok, the mistake is it being still up [20:03:46] it has mentions in puppet etc [20:03:57] https://github.com/search?utf8=%E2%9C%93&q=db1010+%40wikimedia&type=Code&ref=searchresults [20:03:58] OK [20:04:14] yeah, it has not been properly decommed [20:04:19] I can add it if you want [20:04:47] technically it is usable [20:04:49] The array key is not referenced from anywhere so it's safe [20:05:03] mw doesn't know what it is for at this point so thre's no way it'll use it [20:05:14] I guess that's the first step in decom. So thats fine [20:05:21] it is not monitored [20:05:25] But would've normally happened in a separate commit I guess [20:05:39] Yeah, it's fine. [20:05:45] I'll stage on mw1017 in a bit if that's okay [20:05:55] this change? [20:05:59] yes [20:05:59] Yeah [20:06:05] (03CR) 10Krinkle: [C: 031] Clean up database configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276247 (owner: 10Jcrespo) [20:06:39] twentyafterfour: If you haven't started deploying yet, can I push this out ^ ? 
[20:06:55] (03PS1) 10Yuvipanda: k8s: Pin a version of docker [puppet] - 10https://gerrit.wikimedia.org/r/276250 [20:06:57] Krinkle: go for it [20:07:03] thx [20:07:05] (03CR) 10Krinkle: [C: 032] Clean up database configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276247 (owner: 10Jcrespo) [20:07:06] there is no rush for it in any case [20:07:20] (03PS2) 10Yuvipanda: k8s: Pin a version of docker [puppet] - 10https://gerrit.wikimedia.org/r/276250 [20:08:13] (03Merged) 10jenkins-bot: Clean up database configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276247 (owner: 10Jcrespo) [20:08:16] 6Operations, 10DBA: Decomission db1010 - https://phabricator.wikimedia.org/T129395#2104373 (10jcrespo) [20:09:15] all <= db1030 will go away in 3 months anyway [20:09:17] (03CR) 10Yuvipanda: [C: 032] k8s: Pin a version of docker [puppet] - 10https://gerrit.wikimedia.org/r/276250 (owner: 10Yuvipanda) [20:10:46] (03PS9) 10Jcrespo: Prepare db-codfw.php for a live deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 [20:11:18] !log krinkle@tin Synchronized wmf-config/db-codfw.php: Clean up - Ibb4bb0b32f5 (duration: 00m 35s) [20:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:11:33] (03PS1) 10Andrew Bogott: Update designate policy rules [puppet] - 10https://gerrit.wikimedia.org/r/276251 [20:11:58] !log krinkle@tin Synchronized wmf-config/db-eqiad.php: Clean up - Ibb4bb0b32f5 (duration: 00m 29s) [20:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:12:52] jynus: I can push out the other change (prepare db-codfw) later tonight if you want. 
[20:13:02] in 2-3 hours [20:13:31] let decide first in which state [20:13:45] I was about to change it to the original proposal [20:14:01] (03PS2) 10Andrew Bogott: Update designate policy rules [puppet] - 10https://gerrit.wikimedia.org/r/276251 [20:15:09] (03PS1) 10BBlack: appservers_debug: switch to codfw for cache->app [puppet] - 10https://gerrit.wikimedia.org/r/276252 (https://phabricator.wikimedia.org/T125510) [20:15:33] (03CR) 10BBlack: [C: 032 V: 032] appservers_debug: switch to codfw for cache->app [puppet] - 10https://gerrit.wikimedia.org/r/276252 (https://phabricator.wikimedia.org/T125510) (owner: 10BBlack) [20:16:07] (03PS10) 10Jcrespo: Prepare db-codfw.php for a live deployment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/267659 [20:16:14] (03PS3) 10Andrew Bogott: Update designate policy rules [puppet] - 10https://gerrit.wikimedia.org/r/276251 [20:17:05] ^that would be my initial proposal (we can modify the actual formatting as you want, description, etc.) [20:17:33] jynus: AaronSchulz: https://phabricator.wikimedia.org/T129399 [20:18:02] (03PS2) 10Ottomata: Set file.encoding=UTF-8 for all java processes in analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/276010 (https://phabricator.wikimedia.org/T128607) [20:18:11] https://gerrit.wikimedia.org/r/#/c/267659/9..10/wmf-config/db-codfw.php [20:18:11] (03CR) 10Andrew Bogott: [C: 032] Update designate policy rules [puppet] - 10https://gerrit.wikimedia.org/r/276251 (owner: 10Andrew Bogott) [20:18:19] (03PS3) 10Ottomata: Set file.encoding=UTF-8 for all java processes in analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/276010 (https://phabricator.wikimedia.org/T128607) [20:18:29] (03CR) 10Ottomata: [C: 032 V: 032] Set file.encoding=UTF-8 for all java processes in analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/276010 (https://phabricator.wikimedia.org/T128607) (owner: 10Ottomata) [20:18:33] 6Operations, 10Mail, 10Wikipedia-Store: why is shop@ -> board@ ? 
- https://phabricator.wikimedia.org/T127503#2104447 (10bbogaert) Hi, Board@wikimedia.org is a Google Group. Hope this helps. -Byron [20:19:25] Krinkle: ok to do the train now? [20:19:31] Yep, done [20:19:45] !log decommissioning restbase1001.eqiad.wmnet : T125842 [20:19:46] T125842: normalize eqiad restbase cluster - replace restbase1001-1006 - https://phabricator.wikimedia.org/T125842 [20:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:20:59] I could explain it (maybe) [20:22:58] (03PS1) 10Ppchelko: Enable varnish caching for related pages. [puppet] - 10https://gerrit.wikimedia.org/r/276254 (https://phabricator.wikimedia.org/T125983) [20:23:07] (03PS1) 1020after4: group1 wikis to 1.27.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276255 [20:24:11] Krinkle I will go away soon- to give you an idea of my blocker- my blocker is having the right weights in place [20:24:38] Krinkle: do we need any of these symlinks anymore? ^ https://gerrit.wikimedia.org/r/276255 [20:24:38] if that is deployed, the rest (masters) I do not care, you can continue discussing [20:24:57] twentyafterfour: only php [20:24:58] but I need the databases in use for proper load testing [20:25:04] twentyafterfour: static/* should not be created for new branches [20:25:20] we need to keep current ones as well as the "static/current" alias [20:25:54] twentyafterfour: Hm.. 
static/current can use /s/m/php [20:26:04] I thought it was using that already [20:26:25] That's what we do for w/extensions and a few other ones [20:26:44] That way you'll only have to update ./php when a new branch is available [20:27:24] (03PS2) 1020after4: group1 wikis to 1.27.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276255 [20:27:29] 6Operations, 10Traffic, 13Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Traffic Infrastructure support for Mar 2016 codfw rollout - https://phabricator.wikimedia.org/T125510#2104514 (10BBlack) Status updates on the 3x things mentioned a couple updates above: 1. (codfw direct): not yet t... [20:27:54] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:28:25] (03Abandoned) 1020after4: group1 wikis to 1.27.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276255 (owner: 1020after4) [20:29:06] (03PS1) 10Ottomata: Add camus job for importing eventbus events [puppet] - 10https://gerrit.wikimedia.org/r/276257 (https://phabricator.wikimedia.org/T125144) [20:29:38] (03PS1) 1020after4: group1 wikis to 1.27.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276258 [20:30:22] (03CR) 10Mobrovac: [C: 031] Enable varnish caching for related pages. 
[puppet] - 10https://gerrit.wikimedia.org/r/276254 (https://phabricator.wikimedia.org/T125983) (owner: 10Ppchelko) [20:30:27] (03CR) 1020after4: [C: 032] group1 wikis to 1.27.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276258 (owner: 1020after4) [20:30:35] (03CR) 10Ottomata: [C: 032] Add camus job for importing eventbus events [puppet] - 10https://gerrit.wikimedia.org/r/276257 (https://phabricator.wikimedia.org/T125144) (owner: 10Ottomata) [20:30:53] hmm [20:31:25] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276258 (owner: 1020after4) [20:32:04] Krinkle: https://gerrit.wikimedia.org/r/276258 looks ok then? [20:32:07] * twentyafterfour should have asked that before pulling the trigger [20:32:12] it's merged but not deployed [20:32:32] twentyafterfour: Depends, what does static/current point to? same as /php? [20:32:39] I should be the same [20:33:14] * Krinkle is confused [20:33:45] Oh, the other one was abandoned [20:34:12] so it does still need to update static/current/* ? [20:34:27] I’m trying to move horizon.wikimedia.org to labdashboard.wikimedia.org. Does the misc-web setup support redirections like that, or should I just write a vhost for horizon.wikimedia.org with a rewrite or redirect or something? [20:34:53] twentyafterfour: yes those are still used. I was just saying that we can make those point to /php so that you don't need to update them every week, but their actual target does need to be updated every week, yes. [20:35:06] ah ok [20:35:17] so I can make them symlinks [20:35:21] Yep :) [20:39:40] 6Operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#2104553 (10Krenair) 5Open>3Resolved ```krenair@terbium:~$ mwscript eval.php enwiki > $rev = Revision::newFromId( 186704908 ); > $content = $rev->getContent(); > var_... 
[20:42:00] packet loss to/from carbon it seems [20:42:09] affecting the ubuntu mirror we have [20:43:22] (03PS14) 10BBlack: cache_app_route(): parser func for cache->app routing [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) [20:43:24] (03PS1) 10BBlack: cache_text: codfw->direct routing [puppet] - 10https://gerrit.wikimedia.org/r/276259 (https://phabricator.wikimedia.org/T125510) [20:43:25] smokeping notification (does it send recovery emails? I forget already) [20:43:26] (03PS1) 10BBlack: appservers_debug: split routing [puppet] - 10https://gerrit.wikimedia.org/r/276260 (https://phabricator.wikimedia.org/T125510) [20:43:28] (03PS1) 1020after4: w/static/current symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276261 [20:43:39] andrewbogott: either would work [20:44:14] 7Puppet: Receiving puppet run failure alert for instance where manual puppet runs complete fine - https://phabricator.wikimedia.org/T129403#2104571 (10dschwen) [20:44:27] andrewbogott: it's a bit nicer to do it in apache, because the varnish configs are shared by multiple services so you always need to exercise more care when touching that [20:44:29] Krinkle: https://gerrit.wikimedia.org/r/276261 look good? [20:44:41] * twentyafterfour is being overly cautious now [20:44:52] ori: ok, thanks. I’ll add you once I have a patch. [20:45:22] twentyafterfour: I'd use absolute links as elsewhere. We have a tendency to move things around (for one, I intend to move /w/static to /static soon) [20:45:53] But looks good yeah [20:46:08] (absolute is also easier to review, counting the steps hurts my brain) [20:46:55] (03PS1) 10Andrew Bogott: Move horizon.wikimedia.org to labsdashboard.wikimedia.org. 
[puppet] - 10https://gerrit.wikimedia.org/r/276262 [20:48:11] (03CR) 10GWicke: restbase: make restbase configuration $master_dc (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/275536 (https://phabricator.wikimedia.org/T126235) (owner: 10Giuseppe Lavagetto) [20:49:04] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [20:52:02] 7Puppet: Receiving puppet run failure alert for instance where manual puppet runs complete fine - https://phabricator.wikimedia.org/T129403#2104612 (10Andrew) Can you tell me when you fixed the puppet issue, and when you received your most recent email nag? [20:55:25] (03PS1) 10Andrew Bogott: Have the site-branding link link back to horizon rather than to wikitech. [puppet] - 10https://gerrit.wikimedia.org/r/276264 [20:57:14] !log Upgrade HHVM on CODFW app servers to 3.12.1+dfsg-1 [20:57:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:58:22] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2104625 (10GWicke) [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160309T2100). Please do the needful. [21:01:26] (03PS2) 1020after4: w/static/current symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276261 [21:01:31] 6Operations, 10ops-codfw: Check bast2001 for hardware problems - https://phabricator.wikimedia.org/T129316#2104645 (10Papaul) I did a full hardware scan, HD0 is bad {F3582945}. The system is out of warranty (HW warranty expiration: 2015-08-29). so i will have to open a task to order a new disk for this system. 
[21:01:57] 6Operations, 10ops-codfw: Check bast2001 for hardware problems - https://phabricator.wikimedia.org/T129316#2104646 (10Papaul) p:5Triage>3Normal [21:02:24] (03CR) 1020after4: [C: 032] "ps2 made the symlinks absolute as requested by krinkle." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276261 (owner: 1020after4) [21:02:57] (03Merged) 10jenkins-bot: w/static/current symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276261 (owner: 1020after4) [21:03:01] 7Puppet: Receiving puppet run failure alert for instance where manual puppet runs complete fine - https://phabricator.wikimedia.org/T129403#2104647 (10dschwen) Last nag: ~13h ago. Fixed three days ago. [21:03:33] twentyafterfour: thx [21:06:40] 7Puppet, 6Labs: Receiving puppet run failure alert for instance where manual puppet runs complete fine - https://phabricator.wikimedia.org/T129403#2104657 (10dschwen) a:5Andrew>3yuvipanda [21:06:45] !log twentyafterfour@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.16 [21:06:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:08:16] 7Puppet, 6Labs: Receiving puppet run failure alert for instance where manual puppet runs complete fine - https://phabricator.wikimedia.org/T129403#2104660 (10yuvipanda) a:5yuvipanda>3None [21:09:05] (03PS2) 10Cmjohnson: Adding labsdb1008 to db.cfg [puppet] - 10https://gerrit.wikimedia.org/r/276235 [21:09:24] Before I do the mobileapps deployment, can someone in ops help me with temporarily depooling scb1001 and removing it from salt? Maybe akosiaris? 
mobrovac suggested this for the next deploy since we had issues on Monday [21:10:21] (03CR) 10Cmjohnson: [C: 032] Adding labsdb1008 to db.cfg [puppet] - 10https://gerrit.wikimedia.org/r/276235 (owner: 10Cmjohnson) [21:10:44] "Message blob for ext.MassMessage.content.js should have been preloaded" [21:11:48] 6Operations, 10ops-codfw: codfw: 500GB SATA disk for bast2001 - https://phabricator.wikimedia.org/T129405#2104662 (10Papaul) [21:12:29] legoktm: ^ re massmessage [21:12:43] uhh [21:12:48] twentyafterfour: where does it say that? [21:13:00] That's a ResourceLoader debugging thing [21:13:04] May not actually be related to MM at all [21:13:23] legoktm: mediawiki-errors dashboard (https://logstash.wikimedia.org/#/dashboard/elasticsearch/mediawiki-errors) if you filter on wmf.16 [21:13:28] 6Operations, 10ops-codfw, 10procurement: codfw: 500GB SATA disk for bast2001 - https://phabricator.wikimedia.org/T129405#2104662 (10Papaul) [21:14:35] there are a lot of those, for a bunch of different modules [21:15:03] 6Operations, 6Commons, 10media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#2104681 (10Tgr) As far as I can see Swift is not relevant here. Having it in beta will help reproducing/early catching of missing file errors and the like, but it does not affect issues with t... 
[21:15:29] i mean removing it from the salt deployment grains [21:15:37] Krinkle will know what that message means, I think he added it [21:15:46] seems to be all on testwiki [21:15:53] if ( !isset( $this->msgBlobs[$lang] ) ) { [21:15:53] $this->getLogger()->warning( 'Message blob for {module} should have been preloaded', [ [21:15:53] 'module' => $this->getName(), [21:15:53] ] ); [21:16:04] in ResourceLoaderModule::getMessageBlob() [21:16:54] * twentyafterfour was just poking for signs of trouble in wmf.16, maybe this isn't anything to worry about [21:17:19] sorry for the false alarm [21:18:35] * greg-g dittos [21:18:46] my first reaction is always to ping and ask forgiveness if false alarm :) [21:21:32] PROBLEM - HHVM rendering on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.036 second response time [21:22:12] oh well, i'm just going to proceed then without it /cc:mdholloway mobrovac [21:22:22] PROBLEM - Apache HTTP on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.024 second response time [21:22:35] !log restarted HHVM on mw1107; lock-up [21:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:24:03] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.110 second response time [21:25:03] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 67860 bytes in 1.194 second response time [21:26:07] RoanKattouw_away: It means something is calling getMessages() on a module that was not previously preloaded by ResourceLoader::respond() [21:26:12] we usually batch query, right? [21:26:23] !log starting mobileapps deploy [21:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:34] RoanKattouw_away: I noticed these warnings in Jenkins recently. Looks like a regression indeed. 
[21:26:47] twentyafterfour: ^ [21:27:37] Krinkle: in RL or the client code (MassMessage or whatever)? [21:27:44] PROBLEM - Apache HTTP on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.008 second response time [21:27:44] greg-g: RL, server-side. [21:27:48] * greg-g nods [21:27:59] 6Operations, 6Research-and-Data, 10The-Wikipedia-Library, 10Traffic, and 6 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#2104704 (10DarTar) John Vanderberg [[ https://meta.wikimedia.org/wiki/Research_talk:Wikimedia_ref... [21:28:03] greg-g: The cause depends. I doubt MassMessage would call it directly, that would explain it but also be worrysome. [21:28:17] it doesn't :) [21:28:19] Krinkle: can you file the task, I'd just be copying IRC lines :) [21:28:20] greg-g: I suspect it may be a larger issue that preloading may've broken [21:28:31] in a meeting, will do in 20min [21:28:32] !log Depooled mw1107 [21:28:33] PROBLEM - HHVM rendering on mw1107 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.012 second response time [21:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:29:13] Krinkle: twentyafterfour legoktm is it worth a revert? do you think it's breaking things? [21:29:32] * greg-g goes into two back to back 1:1s [21:29:37] I really have no idea what this part of the RL code does [21:29:49] If front-end performance, dbload or memcache load is significantly up as a result, then yes we should revert. [21:29:58] I'll check in a bit [21:29:59] twentyafterfour: ^ your call [21:30:08] those messages only seem to be happening on testwiki not in prod [21:34:21] does someone see something that I don't? 
every graph I can find seems pretty stable [21:34:52] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=RCStream%20cluster%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1457559264&g=network_report&z=large has a little spike [21:36:34] greg-g: I think everything is ok unless someone has evidence that I'm missing [21:39:44] PROBLEM - RAID on db2017 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [21:40:12] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 1 failures [21:40:37] twentyafterfour: backend time spent in PHP for saves seems up a fair bit however [21:40:51] which I also use as a proxy for spent time in PHP in general for views [21:41:59] https://graphite.wikimedia.org/render?target=MediaWiki.timing.editResponseTime.p75&target=timeShift(MediaWiki.timing.editResponseTime.p75,%221w%22)&from=-3h&width=900&height=500 [21:42:01] ori: twentyafterfour [21:42:14] 6Operations, 6Research-and-Data, 10The-Wikipedia-Library, 10Traffic, and 6 others: Set an explicit "Origin When Cross-Origin" referer policy via the meta referrer tag - https://phabricator.wikimedia.org/T87276#2104753 (10Tgr) Note that before the change target sites which were linked over HTTPS received th... [21:42:34] Krinkle: I'm upgrading some app servers [21:43:03] i'll watch the graph. 
I don't think it's the train [21:44:33] PROBLEM - HHVM rendering on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [21:45:12] PROBLEM - check_puppetrun on bellatrix is CRITICAL: CRITICAL: Puppet has 1 failures [21:48:05] ori: I think it is, but it happens every week [21:48:12] PROBLEM - puppet last run on mw1122 is CRITICAL: CRITICAL: Puppet has 1 failures [21:48:13] deploy time was different than last week, didn't see it until now [21:48:24] The last week line (refresh the graph) goes up around this time last week [21:48:46] Or, it isn't supposed to happen every week and we had a regression last week that we fixed since [21:48:59] could be a few version specific caches waiting to be warmed [21:49:38] 6Operations, 10ops-codfw: db2018 failed disk (degraded RAID) - https://phabricator.wikimedia.org/T128057#2104825 (10jcrespo) db2017 slot 11 failed completely today. [21:49:58] twentyafterfour: interesting so it happens only on test2wiki [21:50:12] RECOVERY - check_puppetrun on bellatrix is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [21:50:14] PROBLEM - Apache HTTP on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.014 second response time [21:51:02] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:51:52] 6Operations, 10ops-codfw: Check bast2001 for hardware problems - https://phabricator.wikimedia.org/T129316#2104838 (10Papaul) [21:51:54] 6Operations, 10ops-codfw, 10procurement: codfw: 500GB SATA disk for bast2001 - https://phabricator.wikimedia.org/T129405#2104834 (10Papaul) 5Open>3declined [21:52:04] RECOVERY - Apache HTTP on mw1122 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 3.446 second response time [21:53:14] RECOVERY - HHVM rendering on mw1122 is OK: HTTP OK: HTTP/1.1 200 OK - 67858 bytes in 0.281 second response time [21:54:15] !log mobileapps deployed 95a2d76 [21:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:55:30] bearND: mobileapps on scb1001 is critical again [21:55:48] not sure what that health check is checking [21:55:52] greg-g: oh noes [21:57:07] sigh [21:57:44] hmm, when I run nagios checker it tells me all endpoints are healthy [21:57:53] (03PS1) 10Hashar: hiera_lookup: support 'labs' realm [puppet] - 10https://gerrit.wikimedia.org/r/276345 [21:57:55] (03PS1) 10Hashar: hiera_lookup: recognize labs project and site [puppet] - 10https://gerrit.wikimedia.org/r/276346 [21:58:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [21:58:13] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:58:18] for some reason it's flapping ... [21:58:22] no scb1002 [21:58:26] now [21:59:05] mobrovac: bearND are you able to look at useful logs? [21:59:10] mobrovac: ok, what should we do here? [21:59:42] (03CR) 10Hashar: "That is a first step, the labs hiera hierarchy relies on 'labsproject' being set. 
In the interest of reviewers, I have added that in a fol" [puppet] - 10https://gerrit.wikimedia.org/r/276345 (owner: 10Hashar) [21:59:53] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [21:59:59] heh ^ [22:00:08] the one that's most useful seems to be: [22:00:09] {"name":"mobileapps","hostname":"scb1001","pid":109,"level":30,"message":"400: https://mediawiki.org/wiki/HyperSwitch/errors/bad_request","status":400,"type":"https://mediawiki.org/wiki/HyperSwitch/errors/bad_request","detail":"title-invalid-characters","levelPath":"info/400","request_id":"1ae96cd3-e642-11e5-bc5e-137e8e553a6a","msg":"400: [22:00:09] https://mediawiki.org/wiki/HyperSwitch/errors/bad_request","time":"2016-03-09T21:58:50.094Z","v":0} [22:00:20] (03CR) 10jenkins-bot: [V: 04-1] hiera_lookup: recognize labs project and site [puppet] - 10https://gerrit.wikimedia.org/r/276346 (owner: 10Hashar) [22:00:26] ... [22:00:50] bearND: that's probably a title-normalisation issue, we can look into that separately [22:00:54] nothing in the logs relating to workers dying or the like [22:01:42] bearND: mdholloway: let's keep it running like that for 10 mins or so [22:02:31] mobrovac: maybe we should change one of the boxes back? 
[22:03:10] there seemed to have been an initial hyper-load on the master process on both nodes, but now things are looking pretty normal [22:03:54] (03PS2) 10Hashar: hiera_lookup: recognize labs project and site [puppet] - 10https://gerrit.wikimedia.org/r/276346 [22:04:16] ok [22:04:31] (03CR) 10Hashar: "Fixed my lame ruby mistakes:" [puppet] - 10https://gerrit.wikimedia.org/r/276346 (owner: 10Hashar) [22:04:46] lol hashar [22:05:04] mobrovac: yeah I am really not a rubyist :-} [22:05:14] you should become one ;) [22:05:43] PROBLEM - HHVM rendering on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.015 second response time [22:06:13] 7Puppet, 5Continuous-Integration-Scaling, 13Patch-For-Review: Hiera is not properly configured on Nodepool instances - https://phabricator.wikimedia.org/T129092#2104908 (10hashar) `operations/puppet.git` has a hiera utility `/utils/hiera_lookup` but it does not support labs anymore so I had to fix it: https... [22:06:13] PROBLEM - Apache HTTP on mw1122 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50392 bytes in 0.006 second response time [22:07:03] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:07:07] (03PS3) 10Hashar: hiera_lookup: recognize labs project and site [puppet] - 10https://gerrit.wikimedia.org/r/276346 (https://phabricator.wikimedia.org/T129092) [22:07:09] (03PS2) 10Hashar: hiera_lookup: support 'labs' realm [puppet] - 10https://gerrit.wikimedia.org/r/276345 (https://phabricator.wikimedia.org/T129092) [22:07:16] jynus: , yt? [22:07:54] (03CR) 10Hashar: "I have attached this change to T129092 (Hiera is not properly configured on Nodepool instances). Ready for review!" [puppet] - 10https://gerrit.wikimedia.org/r/276345 (https://phabricator.wikimedia.org/T129092) (owner: 10Hashar) [22:07:58] (03CR) 10Hashar: "I have attached this change to T129092 (Hiera is not properly configured on Nodepool instances). 
Ready for review!" [puppet] - 10https://gerrit.wikimedia.org/r/276346 (https://phabricator.wikimedia.org/T129092) (owner: 10Hashar) [22:08:53] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [22:09:07] 7Puppet, 5Continuous-Integration-Scaling, 13Patch-For-Review: Hiera is not properly configured on Nodepool instances - https://phabricator.wikimedia.org/T129092#2094783 (10hashar) [22:09:32] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0] [22:09:39] mobrovac: and that is all part of figuring out how hiera behaves on nodepool instances so we can get the services -dev.deb packages installed ;-} [22:09:43] PROBLEM - HHVM rendering on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:09:50] 6Operations, 10DBA: Investigate/decom db2001-db2008 - https://phabricator.wikimedia.org/T125827#2104918 (10RobH) a:5RobH>3jcrespo >>! In T125827#2098420, @jcrespo wrote: > @RobH @mark I think there is a mistake on the 5-year planing. I made a comment on the spreadsheet. Luckily, most of these do not need r... [22:09:52] yup hashar, i figured :) [22:09:58] mobrovac: scb1002 was flapping again [22:10:02] yup bear [22:10:04] mobrovac: in the end we will be in a more or less good situation -:} [22:10:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:10:45] now 1001 [22:10:58] mobrovac: what now? [22:11:07] 6Operations, 10Mail, 10Wikipedia-Store: why is shop@ -> board@ ? - https://phabricator.wikimedia.org/T127503#2104921 (10Ppena) Thanks guys! So yeah, we can remove board@ from the Shop@ email, and maybe add Pats Pena and Gretchen Holtman to the Shop@ alias? Thanks Pats [22:11:18] bearND: i've just seen a worker on scb1002 die [22:11:22] RECOVERY - HHVM rendering on mw1154 is OK: HTTP OK: HTTP/1.1 200 OK - 67858 bytes in 0.370 second response time [22:12:25] mobrovac: ok, i see it now.
Any ideas why that happened? [22:12:36] trying to figure it out [22:12:59] Could it be node dependencies? [22:14:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [22:14:15] bearND: it seems it committed suicide due to memory exhaustion [22:14:42] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [22:15:15] mobrovac: yep, makes sense. I see the "Heap memory limit temporarily exceeded" [22:15:26] 6Operations, 10Traffic, 7HTTPS: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2104923 (10CCogdill_WMF) I can confirm the site is ready for SNI for our next event. Thanks for your help! [22:15:33] 6Operations, 10Traffic, 7HTTPS: SSL cert needed for benefactorevents.wikimedia.org - https://phabricator.wikimedia.org/T115028#2104924 (10CCogdill_WMF) 5Open>3Resolved [22:15:43] bearND: mdholloway: i would propose to revert https://gerrit.wikimedia.org/r/#/c/274638/ in the source repo and redeploy [22:15:50] mobrovac: +1 [22:15:55] mobrovac: ok, will do [22:16:45] bearND: mdholloway: the redlinks call to mw api is issued for 99% of the reqs, so it's possible that that is the things that tips over the memory [22:17:04] mobrovac: yeah, i was thinking about that. [22:19:22] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:19:34] i'm having to resolve some conflicts in the patch, just a minute [22:20:46] (03PS1) 10Dzahn: Revert "Revert "tools: Add paws as a separate host"" [puppet] - 10https://gerrit.wikimedia.org/r/276355 [22:21:04] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [22:21:26] (03CR) 10Dzahn: "i'd like to re-add just one of the 2 changes, but they have both been reverted in a single change" [puppet] - 10https://gerrit.wikimedia.org/r/276355 (owner: 10Dzahn) [22:21:28] (03PS9) 10Ottomata: Add mysql-backupex script in mysql_wmf module to do regular incremental backups [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) [22:22:08] 6Operations, 10Mail, 10Wikipedia-Store: why is shop@ -> board@ ? - https://phabricator.wikimedia.org/T127503#2104964 (10bbogaert) Hi All, I have created the group in LDAP and Google Groups. It can be taken off exim. Members: - Pats Pena - Gretchen Holtman Thanks, Byron [22:22:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:23:08] am I here? [22:23:17] you are [22:23:32] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [22:23:41] mobrovac: i wonder if moving the redlinks api call to the initial promise for the request (rather than having it part of the subsequent PageContentPromise, as it is in the patch) would help. [22:23:43] (03CR) 10Yuvipanda: [C: 031] Revert "Revert "tools: Add paws as a separate host"" [puppet] - 10https://gerrit.wikimedia.org/r/276355 (owner: 10Dzahn) [22:23:43] nope [22:23:48] heh [22:23:56] mutante: do you have any idea why it didn't work? 
[22:24:00] (03CR) 10jenkins-bot: [V: 04-1] Add mysql-backupex script in mysql_wmf module to do regular incremental backups [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [22:24:02] (03PS10) 10Ottomata: Add mysql-backupex script in mysql_wmf module to do regular incremental backups [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) [22:24:04] cc bearND ^ [22:24:05] mutante: am around for another hour or so [22:24:08] yuvipanda: no [22:24:28] i would have liked to revert one change at a time [22:24:33] (03PS11) 10Ottomata: Add mysql-backupex script in mysql_wmf module to do regular incremental backups [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) [22:24:35] i'm commenting a part of it first [22:24:43] mutante: ah sure [22:25:03] it seems oddly familiar though [22:25:14] (03PS2) 10Dzahn: Revert "Revert "tools: Add paws as a separate host"" [puppet] - 10https://gerrit.wikimedia.org/r/276355 [22:25:54] (03PS3) 10Dzahn: Revert "Revert "tools: Add paws as a separate host"" [puppet] - 10https://gerrit.wikimedia.org/r/276355 [22:25:56] as if we had the issue before [22:25:58] (03CR) 10jenkins-bot: [V: 04-1] Add mysql-backupex script in mysql_wmf module to do regular incremental backups [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) (owner: 10Ottomata) [22:26:39] (03CR) 10Dzahn: [C: 032] "ok, so adding the contact group and the host itself for now.. that should be just fine for sure.. then the second step" [puppet] - 10https://gerrit.wikimedia.org/r/276355 (owner: 10Dzahn) [22:27:20] (03CR) 10jenkins-bot: [V: 04-1] Revert "Revert "tools: Add paws as a separate host"" [puppet] - 10https://gerrit.wikimedia.org/r/276355 (owner: 10Dzahn) [22:27:37] "You are no longer signed in to Gerrit Code Review." [22:27:41] really...
[22:27:53] i logged in 5 minutes ago and am in the middle of using it [22:28:03] jouncebot: next [22:28:04] In 1 hour(s) and 31 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160310T0000) [22:29:01] mutante: what you think you are doing is irrelevant - the system knows better :) [22:31:20] (03PS4) 10Dzahn: Revert "Revert "tools: Add paws as a separate host"" [puppet] - 10https://gerrit.wikimedia.org/r/276355 [22:31:53] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [22:32:42] PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Puppet has 1 failures [22:32:58] alsafi is also me .. groan [22:33:39] (03PS12) 10Ottomata: Add mysql-backupex script in mysql_wmf module to do regular incremental backups [puppet] - 10https://gerrit.wikimedia.org/r/273312 (https://phabricator.wikimedia.org/T127991) [22:35:02] (03PS5) 10Dzahn: Revert "Revert "tools: Add paws as a separate host"" [puppet] - 10https://gerrit.wikimedia.org/r/276355 [22:35:04] note: I'm going to be testing codfw->eqiad direct applayer access in a few. There's no expected impact, but I'm just warning a little ahead in case... [22:35:17] (03CR) 10Dzahn: [C: 032] Revert "Revert "tools: Add paws as a separate host"" [puppet] - 10https://gerrit.wikimedia.org/r/276355 (owner: 10Dzahn) [22:35:23] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:36:10] (03PS6) 10Dereckson: Rename NS_PROJECT_TALK at bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) [22:36:12] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [22:36:25] !log starting mobileapps deploy, second try (without dead link removal patch) [22:36:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:36:40] (03PS15) 10BBlack: cache_app_route(): parser func for cache->app routing [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) [22:37:03] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [22:37:30] (03CR) 10BBlack: [C: 032 V: 032] cache_app_route(): parser func for cache->app routing [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [22:37:37] (03CR) 10Dereckson: "PS6: rebase were needed after T123654 merge" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) [22:38:58] (03PS16) 10BBlack: cache_app_route(): parser func for cache->app routing [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) [22:39:12] (03CR) 10BBlack: [V: 032] cache_app_route(): parser func for cache->app routing [puppet] - 10https://gerrit.wikimedia.org/r/275497 (https://phabricator.wikimedia.org/T127484) (owner: 10BBlack) [22:39:38] mobrovac: mdholloway : restarted services on scb1001, going now to scb1002 [22:42:35] !log mobileapps deployed 26d4031 [22:42:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:44:42] (03PS7) 10Dereckson: Namespace configuration for bs.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) 
[22:46:35] bearND: mdholloway: the load is considerably lower now on both boxes [22:46:39] (03CR) 10Dereckson: [C: 031] "PS7: added legacy namespaces aliases, more context in commit message" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) [22:46:51] mobrovac: excellent :) [22:46:56] bearND: mdholloway: so i'm inclined to say we found the culprit [22:47:05] i agree [22:47:10] yep [22:47:16] 6Operations, 10ops-codfw, 10procurement: codfw: 500GB SATA disk for bast2001 - https://phabricator.wikimedia.org/T129405#2105018 (10RobH) [22:47:18] 6Operations, 10ops-codfw: Check bast2001 for hardware problems - https://phabricator.wikimedia.org/T129316#2105016 (10RobH) [22:47:27] (03PS1) 10Dzahn: ganglia: script to start multiple aggregators [puppet] - 10https://gerrit.wikimedia.org/r/276369 [22:53:59] bearND: mobrovac: could we do more load testing on appservice.wmflabs.org before production deployments to prevent stuff like this in the future? [22:54:25] (03PS2) 10BBlack: cache_text: codfw->direct routing [puppet] - 10https://gerrit.wikimedia.org/r/276259 (https://phabricator.wikimedia.org/T125510) [22:56:33] 6Operations, 10Ops-Access-Requests, 15User-greg: Requesting access to production for SWAT deploy for dereckson - https://phabricator.wikimedia.org/T129365#2105055 (10RobH) p:5Triage>3Normal a:3greg >>! In T129365#2103542, @greg wrote: > Thanks @Dereckson. I'm going to assess where we are with the SWAT... 
[22:58:06] mdholloway: I like that idea [22:58:36] (03PS1) 10Dzahn: paws: move monitoring to icinga module [puppet] - 10https://gerrit.wikimedia.org/r/276372 [22:58:50] (03PS2) 10Dzahn: paws: move monitoring to icinga module [puppet] - 10https://gerrit.wikimedia.org/r/276372 (https://phabricator.wikimedia.org/T129209) [22:59:18] bearND: I'll phab it [22:59:30] (03CR) 10Dzahn: [C: 032] paws: move monitoring to icinga module [puppet] - 10https://gerrit.wikimedia.org/r/276372 (https://phabricator.wikimedia.org/T129209) (owner: 10Dzahn) [22:59:35] mdholloway: thank you [23:01:54] 6Operations, 10RESTBase-Cassandra, 6Services, 13Patch-For-Review: Evaluate Brotli compression for Cassandra - https://phabricator.wikimedia.org/T125906#2105080 (10Eevans) >>! In T125906#2103414, @Eevans wrote: > Update: [[https://issues.apache.org/jira/browse/CASSANDRA-8464|Changes]] to Cassandra's [[https... [23:02:24] yuvipanda: i think it's about where you use @monitoring::host, it works when inside the icinga module, but apparently not when used outside / in the toollabs module. i remember this issue from last time, adding this for ORES. 
i move this to where ORES monitoring is as well [23:02:34] (03PS3) 10BBlack: cache_text: codfw->direct routing [puppet] - 10https://gerrit.wikimedia.org/r/276259 (https://phabricator.wikimedia.org/T125510) [23:02:44] yuvipanda: will confirm that theory in a minute [23:02:50] mutante: but it worked for tools.wmflabs.org [23:02:52] (03CR) 10BBlack: [C: 032 V: 032] cache_text: codfw->direct routing [puppet] - 10https://gerrit.wikimedia.org/r/276259 (https://phabricator.wikimedia.org/T125510) (owner: 10BBlack) [23:02:52] ok [23:03:08] maybe puppet things got changed since tools.wmflabs has been added [23:03:21] and nothing deletes things from icinga config via puppet [23:03:21] !log caches: starting test codfw->direct (codfw caches -> eqiad apps) [23:03:28] so it never had to work again since that change [23:03:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:03:50] PROBLEM - HHVM rendering on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:04:03] morebots: ah, fun [23:04:03] I am a logbot running on tools-exec-1215. [23:04:03] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [23:04:03] To log a message, type !log . [23:04:08] bah [23:04:10] mutante: ah fun [23:04:20] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2105088 (10brion) Quick note from IRC regarding the thumb-URL needs for mobile apps/web: The primary use cas... [23:04:23] I don't think that rendering alert is me. does someone else know it? 
[23:04:32] it hit before I pressed enter on my change, basically [23:05:03] bblack: yeah, that's me [23:05:06] I'll ack / fix [23:05:19] ok thanks, just being paranoid :) [23:05:30] RECOVERY - HHVM rendering on mw1155 is OK: HTTP OK: HTTP/1.1 200 OK - 67850 bytes in 0.554 second response time [23:06:26] (03PS2) 10BBlack: appservers_debug: split routing [puppet] - 10https://gerrit.wikimedia.org/r/276260 (https://phabricator.wikimedia.org/T125510) [23:06:37] (03CR) 10BBlack: [C: 032 V: 032] appservers_debug: split routing [puppet] - 10https://gerrit.wikimedia.org/r/276260 (https://phabricator.wikimedia.org/T125510) (owner: 10BBlack) [23:06:49] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [23:08:21] 6Operations, 6Services, 3Mobile-Content-Service: Investigate server flapping after 3/7/2016 deploy - https://phabricator.wikimedia.org/T129237#2105128 (10Mholloway) Resolved by reverting https://gerrit.wikimedia.org/r/#/c/274638/. 
[23:08:29] 6Operations, 6Services, 3Mobile-Content-Service: Investigate server flapping after 3/7/2016 deploy - https://phabricator.wikimedia.org/T129237#2105130 (10Mholloway) 5Open>3Resolved [23:09:39] yuvipanda: [23:09:41] define host { [23:09:41] + address paws.wmflabs.org [23:09:51] that's strange [23:09:59] so yea, that needs to be in a module [23:10:02] on why tools.wmflabs.org works [23:10:03] that is actually applied on neon [23:10:07] bleh, split is still broken [23:10:12] since that was only applied on labcontrol [23:10:16] Error: Could not retrieve catalog from remote server: Error 400 on SERVER: unexpected return at /etc/puppet/modules/role/manifests/cache/text.pp:76 on node cp2001.codfw.wmnet [23:10:22] thanks puppet, that's very informative [23:10:38] PROBLEM - puppet last run on cp1068 is CRITICAL: CRITICAL: puppet fail [23:10:39] bearND: created T129419 [23:10:39] yuvipanda: it's in icinga: [23:10:40] T129419: Perform load testing before production deployments - https://phabricator.wikimedia.org/T129419 [23:10:47] yuvipanda: modules/icinga/manifests/monitor/certs.pp: @monitoring::host { 'tools.wmflabs.org': [23:10:47] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: puppet fail [23:10:55] oh well, back to testing [23:10:57] mutante: aaahhhhhhhhh [23:10:58] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: puppet fail [23:10:59] PROBLEM - puppet last run on cp1067 is CRITICAL: CRITICAL: puppet fail [23:10:59] mutante: that explains [23:11:01] yea [23:11:02] (03PS1) 10BBlack: Revert "appservers_debug: split routing" [puppet] - 10https://gerrit.wikimedia.org/r/276373 [23:11:07] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: puppet fail [23:11:08] mutante: so the @monitoring::host in the toollabs module is a noop :D [23:11:09] PROBLEM - puppet last run on cp1054 is CRITICAL: CRITICAL: puppet fail [23:11:10] ok [23:11:17] bblack: wait [23:11:17] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: puppet fail 
[23:11:18] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: puppet fail [23:11:21] yes, it existed in 2 locations and that one never did anything [23:11:28] if you introduced a new function, sometimes the puppetmaster server has to be restarted before it picks it up [23:11:31] mutante: fun, yay puppet [23:11:34] (03CR) 10BBlack: [C: 032 V: 032] "Failed compilation in practice, even though compiler worked earlier :P" [puppet] - 10https://gerrit.wikimedia.org/r/276373 (owner: 10BBlack) [23:11:35] i'm enabling the actual monitoring now [23:11:50] yuvipanda: i guess it kind of makes sense, because the resource has to be created on neon [23:11:57] but yea [23:11:58] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2105138 (10Tgr) IMO the two main questions here are: # is this going to be supported by MediaWiki or just by... [23:12:16] ori: hmmm I've seen that before, but I'm not sure if that applies here [23:12:28] ori: I can try JIC, I haven't puppet-merged yet [23:12:43] you can also check the apache error log on puppet master, that should have some detail [23:13:59] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2105146 (10Tgr) Also, a 100% compatible VCL-layer mapping of nice URLs to old URL is just not gonna happen. F... 
[23:14:06] (03PS1) 10Dzahn: paws: enable http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/276374 (https://phabricator.wikimedia.org/T129209) [23:14:19] (03PS2) 10ArielGlenn: bump version to 0.5.7-1 [software/deployment/trebuchet-trigger] - 10https://gerrit.wikimedia.org/r/276191 [23:14:38] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: puppet fail [23:14:45] (03PS2) 10Dzahn: paws: enable http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/276374 (https://phabricator.wikimedia.org/T129209) [23:14:50] PROBLEM - puppet last run on cp1055 is CRITICAL: CRITICAL: puppet fail [23:14:50] PROBLEM - puppet last run on cp1066 is CRITICAL: CRITICAL: puppet fail [23:15:18] PROBLEM - puppet last run on cp2007 is CRITICAL: CRITICAL: puppet fail [23:15:35] !log restarted puppetmaster on palladium [23:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:11] (03CR) 10Dzahn: [C: 032] paws: enable http monitoring [puppet] - 10https://gerrit.wikimedia.org/r/276374 (https://phabricator.wikimedia.org/T129209) (owner: 10Dzahn) [23:17:27] PROBLEM - puppet last run on cp2023 is CRITICAL: CRITICAL: puppet fail [23:17:28] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail [23:17:33] 6Operations, 10Salt, 10Trebuchet, 13Patch-For-Review: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#2105155 (10ArielGlenn) I have built packages that contain a few other changes as well, see https://gerrit.wikimedia.org/r/#/c/276191/ This patch may or may not ge... [23:17:53] bblack: i'll let you merge that revert when you feel it's right. 
my change can wait or just be merged anytime [23:18:07] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: puppet fail [23:18:07] mutante: ok thanks [23:18:20] restart didn't help, merging [23:18:39] PROBLEM - DPKG on nobelium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [23:18:46] ok [23:18:59] PROBLEM - puppet last run on mw2184 is CRITICAL: CRITICAL: puppet fail [23:19:07] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Puppet has 13 failures [23:19:28] PROBLEM - puppet last run on ms-be2004 is CRITICAL: CRITICAL: Puppet has 4 failures [23:19:38] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: puppet fail [23:19:38] PROBLEM - puppet last run on mw1131 is CRITICAL: CRITICAL: puppet fail [23:19:39] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [23:19:47] PROBLEM - puppet last run on sinistra is CRITICAL: CRITICAL: Puppet has 6 failures [23:19:48] PROBLEM - puppet last run on cp2016 is CRITICAL: CRITICAL: puppet fail [23:19:48] PROBLEM - puppet last run on mw2082 is CRITICAL: CRITICAL: Puppet has 8 failures [23:19:57] PROBLEM - puppet last run on lvs1004 is CRITICAL: CRITICAL: Puppet has 2 failures [23:19:57] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 2 failures [23:19:58] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: Puppet has 21 failures [23:20:08] PROBLEM - puppet last run on cp2003 is CRITICAL: CRITICAL: puppet fail [23:20:08] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Puppet has 2 failures [23:20:18] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: puppet fail [23:20:19] PROBLEM - puppet last run on mw2083 is CRITICAL: CRITICAL: Puppet has 1 failures [23:20:19] PROBLEM - puppet last run on mw2134 is CRITICAL: CRITICAL: puppet fail [23:20:27] ^ that's probably just the usual belated fallout of a master restart [23:20:27] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: Puppet has 5 
failures [23:20:28] PROBLEM - puppet last run on mw2113 is CRITICAL: CRITICAL: puppet fail [23:20:28] PROBLEM - puppet last run on db2065 is CRITICAL: CRITICAL: puppet fail [23:20:29] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: Puppet has 6 failures [23:20:29] PROBLEM - puppet last run on mw1107 is CRITICAL: CRITICAL: Puppet has 21 failures [23:20:29] PROBLEM - puppet last run on elastic2004 is CRITICAL: CRITICAL: Puppet has 6 failures [23:20:37] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: Puppet has 30 failures [23:20:37] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: Puppet has 1 failures [23:20:37] PROBLEM - puppet last run on mw2087 is CRITICAL: CRITICAL: Puppet has 7 failures [23:20:38] PROBLEM - puppet last run on mw2003 is CRITICAL: CRITICAL: Puppet has 7 failures [23:20:38] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: puppet fail [23:20:38] PROBLEM - puppet last run on mw2004 is CRITICAL: CRITICAL: Puppet has 5 failures [23:20:47] PROBLEM - puppet last run on mw2019 is CRITICAL: CRITICAL: Puppet has 7 failures [23:20:48] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: puppet fail [23:20:48] PROBLEM - puppet last run on mw2176 is CRITICAL: CRITICAL: Puppet has 10 failures [23:20:48] PROBLEM - puppet last run on mw2093 is CRITICAL: CRITICAL: Puppet has 4 failures [23:20:58] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 11 failures [23:20:58] PROBLEM - puppet last run on mw2143 is CRITICAL: CRITICAL: Puppet has 5 failures [23:20:59] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 7 failures [23:20:59] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail [23:21:08] PROBLEM - puppet last run on mw2067 is CRITICAL: CRITICAL: Puppet has 13 failures [23:21:09] PROBLEM - puppet last run on cp3030 is CRITICAL: CRITICAL: puppet fail [23:21:28] PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: Puppet has 9 failures [23:21:28] 
RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:21:29] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet has 3 failures [23:21:32] although it seems way worse than the nightly usual heh [23:22:24] yep [23:22:37] PROBLEM - puppet last run on mw2212 is CRITICAL: CRITICAL: Puppet has 9 failures [23:22:48] RECOVERY - puppet last run on cp2023 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [23:22:57] PROBLEM - puppet last run on cp4017 is CRITICAL: CRITICAL: puppet fail [23:22:57] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [23:22:57] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [23:23:07] RECOVERY - puppet last run on cp3030 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [23:23:08] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [23:23:27] RECOVERY - puppet last run on cp1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:23:28] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:23:37] RECOVERY - puppet last run on cp1054 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:23:48] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:23:49] RECOVERY - puppet last run on cp1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:24:37] RECOVERY - puppet last run on cp1068 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [23:24:39] RECOVERY - puppet last run on cp4017 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [23:25:18] 
RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [23:25:19] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [23:25:27] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:25:37] RECOVERY - puppet last run on cp1066 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:26:08] RECOVERY - puppet last run on cp2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:26:18] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [23:27:08] RECOVERY - puppet last run on cp2016 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:30:07] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch inactive shards 1311 threshold =0.1% breach: status: red, number_of_nodes: 1, unassigned_shards: 1308, number_of_pending_tasks: 3, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 446, cluster_name: labsearch, relocating_shards: 0, active_shards: 446, initializing_shards: 3, number_of_data_nodes: 1, delayed_unassigne [23:30:14] safe to ignore that one [23:33:24] assuming there's no complaints and I don't find some ugly unexpected fallout on my own, I'm planning to revert the cache_text codfw->direct stuff in about another half hour (after about 1 hour of test time) [23:33:36] I figured one hour is enough we can stare at more graphs and things afterwards and still see it [23:35:24] yuvipanda: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=paws.wmflabs.org&nostatusheader [23:36:19] assuming it holds (so far so good), we're looking at success on dual-direct and applayer switches, but still some puppetization problem for 'split' (which is a stretch anyways, not 
strictly necessary for EOQ) [23:39:42] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2105215 (10brion) >>! In T66214#2105138, @Tgr wrote: > IMO the two main questions here are: > # is this going... [23:40:36] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1757, cluster_name: labsearch, relocating_shards: 0, active_shards: 1757, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [23:44:58] RECOVERY - puppet last run on sinistra is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [23:45:17] RECOVERY - puppet last run on mw1107 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [23:45:18] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [23:45:26] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [23:45:47] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [23:45:56] RECOVERY - puppet last run on ms-be2004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [23:45:58] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [23:46:06] RECOVERY - puppet last run on mw2212 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [23:46:07] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [23:46:17] RECOVERY - puppet last run on mw2093 is OK: OK: Puppet is 
currently enabled, last run 9 seconds ago with 0 failures [23:46:26] RECOVERY - puppet last run on elastic2004 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [23:46:27] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [23:46:27] RECOVERY - puppet last run on mw2004 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [23:46:36] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [23:46:37] RECOVERY - puppet last run on mw2083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:46:47] RECOVERY - puppet last run on mw2082 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [23:47:07] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [23:47:07] RECOVERY - puppet last run on lvs1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:47:07] RECOVERY - puppet last run on mw2003 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [23:47:07] RECOVERY - puppet last run on mw2019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:47:07] RECOVERY - puppet last run on mw2087 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:47:18] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:47:18] RECOVERY - puppet last run on db2065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:47:27] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:47:28] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:47:28] RECOVERY - puppet 
last run on mw2113 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [23:47:29] RECOVERY - puppet last run on cp2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:47:36] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:47:37] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:47:47] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:48:07] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:48:07] RECOVERY - puppet last run on mw2067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:48:08] RECOVERY - puppet last run on mw2134 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:48:08] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:48:10] 6Operations, 10Traffic, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Varnish support for shutting users out of a DC - https://phabricator.wikimedia.org/T129424#2105242 (10BBlack) [23:48:27] RECOVERY - puppet last run on mw2176 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:48:27] RECOVERY - puppet last run on mw2143 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:48:28] RECOVERY - puppet last run on mw2184 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:50:03] (03PS1) 10Krinkle: Move /w/static to /static (keeping symlink) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276377 [23:50:07] (03PS1) 10Dereckson: Configure upload rights on ce.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276378 (https://phabricator.wikimedia.org/T129005) 
[23:51:18] jouncebot: next [23:51:19] In 0 hour(s) and 8 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160310T0000) [23:51:37] (03PS1) 10BBlack: Revert "cache_text: codfw->direct routing" [puppet] - 10https://gerrit.wikimedia.org/r/276379 [23:52:08] (03PS2) 10BBlack: Revert "cache_text: codfw->direct routing" [puppet] - 10https://gerrit.wikimedia.org/r/276379 [23:53:31] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: BlockUse content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2105268 (10Jrtorres432) [23:53:51] (03CR) 10BBlack: [C: 032 V: 032] Revert "cache_text: codfw->direct routing" [puppet] - 10https://gerrit.wikimedia.org/r/276379 (owner: 10BBlack) [23:54:05] !log ending caches codfw->direct testing [23:54:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:55:23] 6Operations, 10ArchCom-RfC, 6Commons, 10MediaWiki-File-management, and 13 others: Use content hash based image / thumb URLs & define an official thumb API - https://phabricator.wikimedia.org/T66214#2105283 (10brion) [23:56:35] twentyafterfour: What does the --all option do in updateBranchPointers? [23:56:41] Not sure I get it [23:57:11] (03PS1) 10BBlack: Revert "cache_app_route(): parser func for cache->app routing" [puppet] - 10https://gerrit.wikimedia.org/r/276380 [23:57:19] Oh, I see. It's not adding new values or changing them. 
Without --all the array just stays empty [23:57:19] (03PS2) 10BBlack: Revert "cache_app_route(): parser func for cache->app routing" [puppet] - 10https://gerrit.wikimedia.org/r/276380 [23:57:26] (03CR) 10BBlack: [C: 032 V: 032] Revert "cache_app_route(): parser func for cache->app routing" [puppet] - 10https://gerrit.wikimedia.org/r/276380 (owner: 10BBlack) [23:57:59] (03PS1) 10Dereckson: Set NS_PROJECT to Vikilüğət on az.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/276382 (https://phabricator.wikimedia.org/T128296) [23:59:46] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: puppet fail