[00:00:05] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151125T0000). Please do the needful.
[00:00:05] Krenair jhobs: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:00:15] I'm here
[00:00:38] hey
[00:02:11] there's an error in this patch set
[00:05:17] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[00:06:09] (03PS4) 10Alex Monk: Enable rollbacker and patroller group at maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254443 (https://phabricator.wikimedia.org/T118934) (owner: 10Luke081515)
[00:07:36] (03PS5) 10Alex Monk: Enable rollbacker and patroller group at maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254443 (https://phabricator.wikimedia.org/T118934) (owner: 10Luke081515)
[00:07:42] (03CR) 10Alex Monk: [C: 032] Enable rollbacker and patroller group at maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254443 (https://phabricator.wikimedia.org/T118934) (owner: 10Luke081515)
[00:08:07] (03Merged) 10jenkins-bot: Enable rollbacker and patroller group at maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254443 (https://phabricator.wikimedia.org/T118934) (owner: 10Luke081515)
[00:09:14] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/254443/ (duration: 00m 28s)
[00:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:52] (03PS2) 10Alex Monk: Increase survey coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255276 (owner: 10Jhobs)
[00:10:56] jhobs, hi
[00:11:03] howdy
[00:11:33] (03CR) 10Alex Monk: [C: 032] Increase survey coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255276 (owner: 10Jhobs)
[00:12:03] (03Merged) 10jenkins-bot: Increase survey coverage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255276 (owner: 10Jhobs)
[00:12:09] Krenair: there's an error in my patch set? or were you referring to something else?
[00:12:30] no
[00:12:32] the other patch set
[00:12:43] there were two on the list, only one of which was from you
[00:12:46] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[00:12:52] ok
[00:14:20] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/255276/ (duration: 00m 28s)
[00:14:22] jhobs, ^
[00:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:15:33] Krenair: looks good, thanks!
[00:16:02] is it increasing the response to the expected levels?
[00:17:46] I actually can't check that, I'll get someone else to
[00:18:19] but we appeared to be off by a factor of 100 earlier so this is the hotfix until we can figure out why
[00:18:26] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[00:19:04] Hm.. &forceprofile=1 don't work no more?
[00:19:08] https://test.wikipedia.org/w/load.php?debug=false&lang=en&modules=startup&only=scripts&skin=vector&forceprofile=1
[00:19:36] jouncebot: next
[00:19:36] In 135 hour(s) and 40 minute(s): Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151130T1600)
[00:19:43] ok, good
[00:20:10] Krenair: yeah looks like it did the trick
[00:20:32] !log rebooting mira
[00:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:21:15] Krinkle, it should work, but that's not the password.
[00:22:00] Look in PrivateSettings.php
[00:23:04] 6operations: apt-get update partial failure lots of places - https://phabricator.wikimedia.org/T119242#1830230 (10Andrew) 5Open>3Resolved a:3Andrew I didn't do anything; it seems to have cleared up on its own.
[00:23:44] 6operations, 6Labs: Untangle labs/production roles from labs/instance roles - https://phabricator.wikimedia.org/T119401#1830233 (10Andrew) Sounds good to me.
[00:27:15] Krenair: what? https://github.com/wikimedia/operations-mediawiki-config/blob/b198426f581ca72ad608d7b78816c18306032739/wmf-config/StartProfiler.php#L11
[00:27:25] PROBLEM - Keyholder SSH agent on mira is CRITICAL: CRITICAL: Keyholder is not armed. Run keyholder arm to arm it.
[00:27:47] Hm.. ok
[00:27:49] Thanks Krenair
[00:28:56] Krinkle, I'll admit that I've only seen the private settings line, not the actual code that uses this
[00:29:06] I don't see any use of it..
[00:29:09] indeed
[00:29:34] it also doesn't work
[00:29:48] I think aude reported a similar issue recenty
[00:29:51] ori: ^^
[00:30:13] mutante: you know how to arm it, right?
[00:30:41] ori: I was meaning the profiler issues
[00:30:43] it's in https://wikitech.wikimedia.org/wiki/Keyholder
[00:30:44] oh
[00:30:52] urgent? if not, in the middle of something
[00:31:03] nah, it's not I don't think
[00:31:11] k, will take a look in a bit
[00:31:48] (03PS1) 10Ori.livneh: Migrate rdb1007 to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/255298
[00:32:10] (03PS2) 10Ori.livneh: Migrate rdb1007 to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/255298
[00:33:33] (03CR) 10Ori.livneh: [C: 032] Migrate rdb1007 to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/255298 (owner: 10Ori.livneh)
[00:36:37] (03PS1) 10Ori.livneh: redisrdb: reintroduce slave-write-only => false, dropped from I7a111e4 [puppet] - 10https://gerrit.wikimedia.org/r/255299
[00:36:55] (03PS2) 10Ori.livneh: redisrdb: reintroduce slave-write-only => false, dropped from I7a111e4 [puppet] - 10https://gerrit.wikimedia.org/r/255299
[00:37:03] (03CR) 10Ori.livneh: [C: 032 V: 032] redisrdb: reintroduce slave-write-only => false, dropped from I7a111e4 [puppet] - 10https://gerrit.wikimedia.org/r/255299 (owner: 10Ori.livneh)
[00:37:49] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1830270 (10Dzahn) I fixed it on mira by editing /etc/passwd and then running @mira:~# **find / -uid 12162 -exec chown 10002:10002 {} \;** , then **run it a second time with chown -h*...
[00:37:57] !log Migrating rdb1007 from redis::legacy to redis::instance; will involve a service restart.
[00:38:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:39:15] PROBLEM - puppet last run on rdb1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[00:39:27] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: puppet fail
[00:40:16] !log tin: fixing l10nupdate UID (997->10002), file ownership
[00:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:41:04] seemed like perfect time when there are no deploys for a while
[00:43:05] RECOVERY - puppet last run on rdb1007 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures
[00:43:06] PROBLEM - Redis on rdb1007 is CRITICAL: Connection refused
[00:45:41] root@rdb1007:~# service redis status
[00:45:41] ● redis.service Loaded: not-found (Reason: No such file or directory) Active: inactive (dead)
[00:46:31] !log starting redis on rdb1007 (it died and was status: dead)
[00:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:46:43] mutante: i'm on it
[00:46:44] see !log above
[00:46:51] (03PS1) 10Ori.livneh: Fix filenames from I7a111e420 [puppet] - 10https://gerrit.wikimedia.org/r/255306
[00:47:01] ori: ooh, i missed that.ok!
[00:47:03] (03PS2) 10Ori.livneh: Fix filenames from I7a111e420 [puppet] - 10https://gerrit.wikimedia.org/r/255306
[00:47:18] (03CR) 10jenkins-bot: [V: 04-1] Fix filenames from I7a111e420 [puppet] - 10https://gerrit.wikimedia.org/r/255306 (owner: 10Ori.livneh)
[00:48:43] (03CR) 10Ori.livneh: [C: 032] Fix filenames from I7a111e420 [puppet] - 10https://gerrit.wikimedia.org/r/255306 (owner: 10Ori.livneh)
[00:49:26] RECOVERY - HTTP 5xx reqs/min anomaly on graphite1001 is OK: OK: No anomaly detected
[01:00:37] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1830298 (10Dzahn) running the fix on tin .. in a screen because it was still ongoing ...
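The uid-mismatch fix described in T119165 above (edit /etc/passwd, then re-own everything the old uid owned with `find / -uid OLD -exec chown NEW:NEW {} \;`, plus a second `chown -h` pass so symlinks themselves get re-owned) can be sketched in Python. This is an illustrative sketch, not the command that was actually run; the uids are examples, and a non-dry run would need root:

```python
import os

def reown(root, old_uid, new_uid, new_gid, dry_run=True):
    """Walk `root` and re-own every file, directory, and symlink that
    belongs to `old_uid`, mirroring the find/chown pass from T119165.
    With dry_run=True it only reports what would change."""
    changed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat: inspect symlinks, don't follow them
            if st.st_uid != old_uid:
                continue
            changed.append(path)
            if not dry_run:
                # lchown re-owns the symlink itself, i.e. the `chown -h` pass
                os.lchown(path, new_uid, new_gid)
    return changed
```

Running with `dry_run=True` first and reviewing the list before re-owning is the safer order of operations.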
[01:01:07] 6operations, 7Performance: forceprofile=1 with X-Wikimedia-Debug: 1 header does not work on non-wikipedias - https://phabricator.wikimedia.org/T118990#1830300 (10Krinkle) It also doesn't work on en.wikipedia.org now (e.g. ). Which makes sense as it follows...
[01:01:35] mutante: many thanks for your work on that l10nupdate uid task
[01:01:38] 6operations, 7Performance, 7Regression: [Regression] forceprofile=1 with X-Wikimedia-Debug: 1 no longer works - https://phabricator.wikimedia.org/T118990#1830301 (10Krinkle) p:5Normal>3High
[01:01:38] PROBLEM - WDQS SPARQL on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 393 bytes in 0.013 second response time
[01:01:39] PROBLEM - WDQS HTTP on wdqs1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 393 bytes in 0.008 second response time
[01:03:21] !log re-armed keyholders on mira
[01:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:03:33] bd808: :) it seemed a good time actually with the deployment break
[01:03:41] i'll check on it later, running in screen
[01:04:58] RECOVERY - Keyholder SSH agent on mira is OK: OK: Keyholder is armed with all configured keys.
[01:06:38] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[01:08:01] i'm not sure it makes sense to require manual arming of keyholder
[01:08:11] we have puppet provision all manner of passwords
[01:08:53] i guess if tin was compromised, you'd want to protect the apaches
[01:10:39] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[01:18:19] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[01:18:38] (03PS1) 10Ori.livneh: Add two additional redis instances on rdb1007 and 1008 [puppet] - 10https://gerrit.wikimedia.org/r/255308
[01:21:08] (03CR) 10Ori.livneh: [C: 032] Add two additional redis instances on rdb1007 and 1008 [puppet] - 10https://gerrit.wikimedia.org/r/255308 (owner: 10Ori.livneh)
[01:24:49] (03PS1) 10Ori.livneh: redisdb: open ports 6380 and 6381 for add'l instances added in Ieb8e2fbc [puppet] - 10https://gerrit.wikimedia.org/r/255313
[01:25:54] mutante: still there?
[01:27:00] (03CR) 10Ori.livneh: [C: 032] redisdb: open ports 6380 and 6381 for add'l instances added in Ieb8e2fbc [puppet] - 10https://gerrit.wikimedia.org/r/255313 (owner: 10Ori.livneh)
[01:30:45] > 01:30:17 /srv/mediawiki-staging/php-1.27.0-wmf.7/StartProfiler.php: did you mean to sync a symbolic link?
[01:30:58] Not sure what scap is asking here
[01:31:04] usually "did you mean" is something I'm not doing
[01:31:23] bd808, ^
[01:31:27] !log krinkle@tin Synchronized php-1.27.0-wmf.7/StartProfiler.php: create symlink (duration: 01m 10s)
[01:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:31:38] (03CR) 10Ori.livneh: scap: Fix mwgrep pep8 warnings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/255279 (owner: 10Yuvipanda)
[01:34:41] Krinkle, Krenair: I'm trying to remember what that warning is about ....
[01:35:23] oh, I know. Sync file syncs the symlink as a symlink and not the target file contents and that caused a problem once so we added the warning
[01:35:47] I think the place where it caused a problem was with PrivateSettings
[01:36:27] as in somebody synced the symlink when they really meant to sync the backing file
[01:38:04] The warning message could be better
[01:40:47] speaking of messages
[01:40:59] (03PS1) 10Ori.livneh: redisdb: use correct ferm::service multi-port instance [puppet] - 10https://gerrit.wikimedia.org/r/255316
[01:41:58] oh, never mind
[01:42:12] i was going to say 'No syntax errors detected in InitialiseSettings.php' should not be printed
[01:42:17] but it's not scap, it's php itself
[01:42:24] and we do want the output to be printed in case of an error
[01:42:25] so meh
[01:42:49] yeah, we only do that on sync-file I think
[01:42:49] (03CR) 10Ori.livneh: [C: 032] redisdb: use correct ferm::service multi-port instance [puppet] - 10https://gerrit.wikimedia.org/r/255316 (owner: 10Ori.livneh)
[01:44:17] Maybe we could replace php -l with a python php linter :)
[01:44:23]
[01:46:14] ostriches: https://phabricator.wikimedia.org/D61
[01:46:30] where does ferm actually run?
[01:46:53] in a gulley
[01:46:57] oh wait, that's ferns
[01:46:59] lots of hosts I think. look for base::firewall in site.pp
[01:47:28] I know it is on for all of the elasticsearch and logstash boxes
[01:48:56] * p858snake gives ostriches a gold star
[01:49:00] rdb1007 and rdb1008 don't even have /etc/ferm
[01:56:16] (03PS1) 10Ori.livneh: redisdb: bind 0.0.0.0, since the default is loopback only [puppet] - 10https://gerrit.wikimedia.org/r/255317
[01:56:31] (03CR) 10Ori.livneh: [C: 032 V: 032] redisdb: bind 0.0.0.0, since the default is loopback only [puppet] - 10https://gerrit.wikimedia.org/r/255317 (owner: 10Ori.livneh)
[02:00:27] !log l10nupdate@tin LocalisationUpdate failed: git pull of core failed
[02:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:09:58] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[02:14:29] not sure why l10nupdate failed
[02:14:40] I tried updating the l10nupdate copy of core and it was fine...
[02:14:52] running again
[02:15:09] it ran into p5ms
[02:15:42] Oh:
[02:15:53] Starting l10nupdate at Wed Nov 25 02:00:01 UTC 2015.
[02:15:53] Updating git clone ...
[02:15:53] fatal: You don't exist. Go away!
[02:16:00] Helpful.
[02:17:28] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[02:37:54] 6operations, 7Performance, 7Regression: [Regression] forceprofile=1 with X-Wikimedia-Debug: 1 no longer works - https://phabricator.wikimedia.org/T118990#1830499 (10Krinkle) 5Open>3Resolved a:3Krinkle Turns out `StartProfiler.php` was missing from 1.27.0-wmf.7. Fixed now.
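The "did you mean to sync a symbolic link?" warning discussed earlier comes from the fact that sync-file copies a symlink as a symlink rather than its target's contents, so syncing the link when you meant the backing file silently ships nothing. A minimal sketch of that kind of pre-sync check (not scap's actual code):

```python
import os

def check_sync_target(path):
    """Return a warning string when `path` is a symlink, since syncing
    it would copy the link itself, not the target file's contents.
    Returns None when the path is a regular file."""
    if os.path.islink(path):
        target = os.path.realpath(path)
        return "did you mean to sync a symbolic link? (-> %s)" % target
    return None
```

A deploy tool would surface this before the sync and let the operator point it at the backing file instead.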
[02:40:03] !log krenair@tin Synchronized php-1.27.0-wmf.7/cache/l10n: l10nupdate for 1.27.0-wmf.7 (duration: 06m 50s)
[02:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:40:34] It actually prints "Failed to sync-dir 'php-1.27.0-wmf.7/cache/l10n'"
[02:40:39] Because of the issues with mira
[02:41:28] and then doesn't complete the rest of the process
[02:41:29] bd808, ^
[02:43:41] https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/scap/files/l10nupdate-1;HEAD$111
[02:44:55] So it's been failing silently since 2015-11-06
[02:45:04] (according to SAL)
[02:50:42] (03PS1) 10Alex Monk: l10nupdate: Log sync-dir failure to SAL [puppet] - 10https://gerrit.wikimedia.org/r/255321
[03:00:00] Krenair: m.utante is working on getting the uids in sync. Hopeful when that is done things will be working correctly again
[03:02:47] (03PS3) 10Krinkle: Remove obsolete "claimTTL" settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255077 (owner: 10Aaron Schulz)
[03:03:29] (03CR) 10Krinkle: [C: 032] Remove obsolete "claimTTL" settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255077 (owner: 10Aaron Schulz)
[03:03:51] (03Merged) 10jenkins-bot: Remove obsolete "claimTTL" settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255077 (owner: 10Aaron Schulz)
[03:05:14] ebernhardson: 'portals' is an untracked directory on tin. Is that created by puppet? If not, should it be a submodule, or otherwise ignored so that deployers don't get an error when dealing with git on the command line?
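The silent-failure problem above (sync-dir fails, yet the script stops short without ever logging the failure, so it goes unnoticed for weeks) boils down to checking a child command's exit status and reporting it loudly. A hypothetical sketch of the behaviour the "Log sync-dir failure to SAL" patch aims for; `log` stands in for any logger, such as a SAL !log call:

```python
import subprocess

def run_step(cmd, log):
    """Run one deployment step; on a non-zero exit, record the failure
    via `log` (any callable taking a message) and return False so the
    caller can abort instead of carrying on silently.
    Illustrative only, not the actual l10nupdate script."""
    result = subprocess.run(cmd)
    if result.returncode != 0:
        log("step failed (exit %d): %s" % (result.returncode, " ".join(cmd)))
        return False
    return True
```

The key point is that every step's exit status feeds a visible log line, so a broken sync shows up in SAL rather than only in the script's own output.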
[03:12:39] RECOVERY - salt-minion processes on cp3022 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[03:12:55] !log krinkle@tin Synchronized wmf-config/CommonSettings.php: I4e21eda0f3 (duration: 01m 06s)
[03:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:26:29] PROBLEM - puppet last run on labstore2001 is CRITICAL: CRITICAL: puppet fail
[03:29:24] Krinkle: it should be a submodule of mediawiki-config
[03:29:27] checking
[03:30:06] Krinkle: yes, it reports as a submodule
[03:30:15] looks to have some local hacks though :S
[03:30:18] Hm.. but git also reports it as untracked
[03:30:22] (03PS1) 10Tim Starling: Add YubiKey for tstarling [puppet] - 10https://gerrit.wikimedia.org/r/255323
[03:30:24] (not dirty, but untracked)
[03:30:40] i see it as modified: portals (modified content, untracked content)
[03:30:46] Ah both
[03:30:54] there is both an untracked file and local modifications
[03:31:07] i remember something about maxsem deploying a security fix to it
[03:31:10] looks like an earlier version of https://gerrit.wikimedia.org/r/255159
[03:31:15] but i don't know how he did that
[03:31:18] which btw, includes a syntax error in the avascript
[03:31:19] (03CR) 10Tim Starling: [C: 032] Add YubiKey for tstarling [puppet] - 10https://gerrit.wikimedia.org/r/255323 (owner: 10Tim Starling)
[03:31:21] :S
[03:31:24] innerHTML() undefined function
[03:31:51] oh come on
[03:31:55] it's actually deployed on prod right now
[03:31:58] * Krinkle undoes local hack
[03:32:41] Krinkle: there was some sort of XSS that maxsem was trying to fix (that was pre-existing from the meta import).
[03:33:04] * ebernhardson wonders if it's documented anywhere...but doesn't see anything
[03:33:09] i'll harass max tomorrow when he's around
[03:33:15] Well, yeah, it's https://gerrit.wikimedia.org/r/255159
[03:33:25] But it's just wrong to do that with dirty files on tin
[03:33:33] yea
[03:33:40] even if someone wants to bypass Gerrit (which they didn't in this case) one should at least commit it locally on tin
[03:33:48] yea
[03:35:51] !log krinkle@tin Synchronized portals/prod/wikipedia.org/index.html: Fix Uncaught TypeError: innerHTML() is not a function (duration: 00m 28s)
[03:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:44:48] !log on palladium ran puppet post-merge hook manually after puppet-merge failed with "error: Ref refs/remotes/origin/production is at 380706c5bd2f47d140ffa183dd22fe920927af7e but expected 897486120d645f1c30dcf68bfb3198e2f39ec639"
[03:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:45:30] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[03:47:28] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[03:48:08] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: puppet fail
[03:53:30] RECOVERY - puppet last run on labstore2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:15:08] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[04:18:54] (03PS13) 10Krinkle: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar)
[04:19:29] (03PS1) 10BBlack: add global misc-web geo IPs [dns] - 10https://gerrit.wikimedia.org/r/255330
[04:19:31] (03PS1) 10BBlack: switch all misc-web to geographic routing [dns] - 10https://gerrit.wikimedia.org/r/255331
[04:19:54] (03PS1) 10BBlack: Remove unused r::c::1layer [puppet] - 10https://gerrit.wikimedia.org/r/255332
[04:24:57] (03PS14) 10Krinkle: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar)
[04:25:04] (03CR) 10Krinkle: "Upgrade from npm v2.7.6 to v2.14.12. Security fixes and various bug fixes around cache handling and race conditions." [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar)
[04:51:58] (03CR) 10Krinkle: contint: install npm/grunt-cli with npm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar)
[04:57:25] (03PS15) 10Krinkle: contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar)
[05:10:30] (03CR) 10Krinkle: "I've nuked /usr/local/bin/npm and /usr/local/lib/node_modules/npm via salt from all slaves, restored the original /usr/bin/npm symlink and" [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar)
[05:10:35] (03CR) 10Krinkle: [C: 031] contint: install npm/grunt-cli with npm [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar)
[05:12:21] (03PS1) 10Aaron Schulz: Update job queue config to use 3 rdb1007 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255333
[05:13:53] (03CR) 10Krinkle: "Deployed on integration-puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar)
[05:22:10] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I think we should start out trying 2 instances on one machine, for the following reasons:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255333 (owner: 10Aaron Schulz)
[05:22:49] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:24:08] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 62 failures
[05:25:33] <_joe_> this ^^ is a real problem, btw
[05:30:49] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds
[05:38:29] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 3 below the confidence bounds
[05:39:09] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[05:40:09] (03CR) 10Giuseppe Lavagetto: "I would've said installing 3 instances at once is not optimal when coming from one, if I had the chance to review this." [puppet] - 10https://gerrit.wikimedia.org/r/255308 (owner: 10Ori.livneh)
[05:48:48] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[05:49:58] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:51:09] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:28:39] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 18 data above and 9 below the confidence bounds
[06:30:59] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:19] PROBLEM - puppet last run on mw2016 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:50] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:50] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:59] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:29] (03PS2) 10Giuseppe Lavagetto: etcd: remove package etcdctl [puppet] - 10https://gerrit.wikimedia.org/r/255088 (https://phabricator.wikimedia.org/T118830)
[06:33:31] (03PS2) 10Giuseppe Lavagetto: etcd: auth puppettization [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972)
[06:35:55] (03CR) 10Nikerabbit: [C: 031] "Yes please." [puppet] - 10https://gerrit.wikimedia.org/r/255321 (owner: 10Alex Monk)
[06:42:49] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[06:45:09] PROBLEM - check_puppetrun on mintaka is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:42] (03PS1) 10Giuseppe Lavagetto: graphite::alerts: remove 5xx anomaly measurement [puppet] - 10https://gerrit.wikimedia.org/r/255337
[06:46:50] PROBLEM - puppet last run on db2070 is CRITICAL: CRITICAL: puppet fail
[06:47:37] (03CR) 10Giuseppe Lavagetto: [C: 032] graphite::alerts: remove 5xx anomaly measurement [puppet] - 10https://gerrit.wikimedia.org/r/255337 (owner: 10Giuseppe Lavagetto)
[06:47:46] awww
[06:48:38] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[06:50:18] RECOVERY - check_puppetrun on mintaka is OK: OK: Puppet is currently enabled, last run 124 seconds ago with 0 failures
[06:55:59] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[06:56:19] RECOVERY - puppet last run on mw2016 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:56:49] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:49] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:50] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:05:39] PROBLEM - HTTP 5xx reqs/min anomaly on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 16 data above and 9 below the confidence bounds
[07:12:40] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 19.23% of data above the critical threshold [100000000.0]
[07:13:59] RECOVERY - puppet last run on db2070 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:22:20] PROBLEM - Apache HTTP on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:28] PROBLEM - HHVM rendering on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:23:18] PROBLEM - DPKG on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:23:18] PROBLEM - Disk space on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:23:19] PROBLEM - salt-minion processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:23:19] PROBLEM - configured eth on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:23:38] PROBLEM - dhclient process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:23:39] PROBLEM - HHVM processes on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:23:49] PROBLEM - nutcracker port on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:24:00] PROBLEM - puppet last run on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:24:18] PROBLEM - nutcracker process on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:24:38] PROBLEM - Check size of conntrack table on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:24:39] PROBLEM - SSH on mw1136 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:24:49] PROBLEM - RAID on mw1136 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:32:31] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0]
[07:33:31] RECOVERY - RAID on mw1136 is OK: OK: no RAID installed
[07:33:50] RECOVERY - configured eth on mw1136 is OK: OK - interfaces up
[07:34:01] RECOVERY - dhclient process on mw1136 is OK: PROCS OK: 0 processes with command name dhclient
[07:34:01] RECOVERY - nutcracker process on mw1136 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[07:34:02] RECOVERY - DPKG on mw1136 is OK: All packages OK
[07:34:20] RECOVERY - nutcracker port on mw1136 is OK: TCP OK - 0.000 second response time on port 11212
[07:34:22] RECOVERY - Disk space on mw1136 is OK: DISK OK
[07:34:30] RECOVERY - salt-minion processes on mw1136 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:34:40] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures
[07:34:50] RECOVERY - HHVM processes on mw1136 is OK: PROCS OK: 6 processes with command name hhvm
[07:34:52] RECOVERY - SSH on mw1136 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0)
[07:34:52] RECOVERY - Check size of conntrack table on mw1136 is OK: OK: nf_conntrack is 0 % full
[07:36:02] RECOVERY - Apache HTTP on mw1136 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.273 second response time
[07:36:21] RECOVERY - HHVM rendering on mw1136 is OK: HTTP OK: HTTP/1.1 200 OK - 65585 bytes in 2.012 second response time
[07:39:10] (03PS2) 10Ori.livneh: Update job queue config to use 2 rdb1007 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255333 (owner: 10Aaron Schulz)
[07:39:49] _joe_: ^ amended to use one more instead of two more
[07:40:10] <_joe_> ori: k, thanks!
[07:40:31] _joe_: mind if i push it?
[07:40:42] i'll be around for a while to keep an eye on it
[07:40:57] <_joe_> ori: one more doubt before you merge
[07:41:10] <_joe_> how do we distribute the keys on the jobqueue?
[07:41:25] <_joe_> if we do consistent hashing changing the labels might have an impact
[07:41:29] <_joe_> or am I wrong?
[07:54:23] _joe_: missed your question, sorry. mediawiki never reads them back
[07:54:41] <_joe_> uh?
[07:54:57] mediawiki doesn't read values from the jobqueue redises, it just writes to them
[07:55:04] so yes, the jobs will shard differently
[07:55:10] but that won't effect anything already in the queue
[07:55:11] <_joe_> ok, and the readers just consumes blindly?
[07:55:15] right
[07:55:19] <_joe_> ok that makes sense
[07:55:33] (03CR) 10Giuseppe Lavagetto: [C: 031] Update job queue config to use 2 rdb1007 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255333 (owner: 10Aaron Schulz)
[07:55:48] <_joe_> I think aaron already explained that to me once, brainfail
[07:57:28] (03CR) 10Ori.livneh: [C: 032] Update job queue config to use 2 rdb1007 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255333 (owner: 10Aaron Schulz)
[07:57:50] (03Merged) 10jenkins-bot: Update job queue config to use 2 rdb1007 instances [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255333 (owner: 10Aaron Schulz)
[08:05:42] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: CRITICAL - Expecting active but unit create-dbusers is failed
[08:06:28] !log update puppet compilers fact database
[08:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:06:33] (03PS8) 10Alexandros Kosiaris: labs openldap role [puppet] - 10https://gerrit.wikimedia.org/r/253347 (owner: 10Muehlenhoff)
[08:08:00] !log ori@tin Synchronized wmf-config/jobqueue-eqiad.php: I9bea66df: Update job queue config to use 2 rdb1007 instances (duration: 00m 28s)
[08:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:08:08] akosiaris: thanks!
[08:09:53] (03CR) 10Alexandros Kosiaris: [C: 032] "I just realized that the master parameter needs to match the SSL cert issued for it so I 've updated it for both servers. The rest already" [puppet] - 10https://gerrit.wikimedia.org/r/253347 (owner: 10Muehlenhoff)
[08:10:01] (03PS9) 10Alexandros Kosiaris: labs openldap role [puppet] - 10https://gerrit.wikimedia.org/r/253347 (owner: 10Muehlenhoff)
[08:10:07] (03CR) 10Alexandros Kosiaris: [V: 032] labs openldap role [puppet] - 10https://gerrit.wikimedia.org/r/253347 (owner: 10Muehlenhoff)
[08:19:11] RECOVERY - Ensure mysql credential creation for tools users is running on labstore1001 is OK: OK - create-dbusers is active
[08:20:51] ori: did you also update puppet for the jobchron/jobrunner services?
[08:21:10] nope. doh.
[08:21:22] doing now.
[08:22:19] AaronSchulz: does the config format take host:port? i see just host being specified now, so wanna make sure.
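The sharding exchange above (writers hash each job onto one of the configured Redis instances; adding an instance changes where new jobs land, while existing jobs stay put and are still drained because the runners consume every configured instance) can be illustrated with a toy hash-mod-N sharder. This is a deliberately simplified stand-in, not MediaWiki's actual queue partitioning, and a plain mod-N scheme remaps more keys on a server-list change than consistent hashing would:

```python
import hashlib

def shard(key, servers):
    """Pick the server a job key is written to by hashing the key.
    Deterministic for a fixed server list; changing the list remaps
    keys, which is why adding an instance only affects NEW writes."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# With one instance every job lands on it; with two, some new jobs land
# on the second, but jobs already queued on the first are unaffected
# since the job runners drain every configured instance regardless.
one = ["rdb1007:6379"]
two = ["rdb1007:6379", "rdb1007:6380"]
```

This matches the point made in the channel: the queue writers never read jobs back, so re-sharding only redirects future enqueues.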
[08:22:20] PROBLEM - HHVM rendering on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:11] PROBLEM - Apache HTTP on mw1128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:23:18] host:port is allowed [08:24:52] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: Puppet has 13 failures [08:25:18] (03PS1) 10Ori.livneh: add rdb1007:6380 to jobrunner config [puppet] - 10https://gerrit.wikimedia.org/r/255341 [08:25:32] AaronSchulz: ^ [08:27:35] (03CR) 10Aaron Schulz: [C: 031] add rdb1007:6380 to jobrunner config [puppet] - 10https://gerrit.wikimedia.org/r/255341 (owner: 10Ori.livneh) [08:27:53] (03CR) 10Ori.livneh: [C: 032] add rdb1007:6380 to jobrunner config [puppet] - 10https://gerrit.wikimedia.org/r/255341 (owner: 10Ori.livneh) [08:37:44] 6operations, 10Flow, 10MediaWiki-Redirects, 3Collaboration-Team-Current, and 2 others: Flow notification links on mobile point to desktop - https://phabricator.wikimedia.org/T107108#1830616 (10QuimGil) In fact, the problem seems to be even simpler: * Clicking the topic link leads to the desktop page. * Ho... [08:38:16] (03PS3) 10Zfilipin: RuboCop: fixed Style/AndOr offense [puppet] - 10https://gerrit.wikimedia.org/r/254841 (https://phabricator.wikimedia.org/T112651) [08:38:50] (03CR) 10Zfilipin: "Patch set 3 fixes merge conflict." [puppet] - 10https://gerrit.wikimedia.org/r/254841 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [08:42:19] (03PS2) 10Zfilipin: RuboCop: fixed Lint/UnusedMethodArgument offense [puppet] - 10https://gerrit.wikimedia.org/r/254838 (https://phabricator.wikimedia.org/T112651) [08:42:46] (03CR) 10Zfilipin: "Patch set 2 fixes merge conflict." 
[puppet] - 10https://gerrit.wikimedia.org/r/254838 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [08:49:01] 6operations, 10hardware-requests: Site: 2 hardware access request for ORES - https://phabricator.wikimedia.org/T119598#1830629 (10akosiaris) 3NEW [08:49:27] 6operations, 10hardware-requests: Site: 2 hardware access request for ORES - https://phabricator.wikimedia.org/T119598#1830637 (10akosiaris) [08:50:46] 6operations, 10hardware-requests: Site: 2 hardware access request for ORES - https://phabricator.wikimedia.org/T119598#1830629 (10akosiaris) Looking into the server spares I see: WMF3149 and WMF3300, 2 Dell PowerEdge R310, Single Intel Xeon X3450, 8GB Memory (2) 500GB 3.5 SATA that look like they fit the bill... [09:00:34] (03PS1) 10Muehlenhoff: Add DNS aliases for ldap-labs.[eqiad|codfw].wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/255342 [09:34:39] <_joe_> Nemo_bis: ping [09:39:10] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1830707 (10jcrespo) I see two accelerations, on the 27 sep and on the 7 nov. There could be many explanations, from long running transactions being executed there, to the schema changes don... 
[10:00:21] (03PS1) 10Giuseppe Lavagetto: alerts: move icinga alerts away from reqstats.* [puppet] - 10https://gerrit.wikimedia.org/r/255347 (https://phabricator.wikimedia.org/T118979) [10:00:43] <_joe_> akosiaris, paravoid ^^ I'd like your feedback [10:00:55] <_joe_> and godog too [10:00:57] <_joe_> :) [10:01:51] <_joe_> there are a couple of typos, btw, [10:02:53] (03PS2) 10Giuseppe Lavagetto: alerts: move icinga alerts away from reqstats.* [puppet] - 10https://gerrit.wikimedia.org/r/255347 (https://phabricator.wikimedia.org/T118979) [10:22:28] so I have this change: https://gerrit.wikimedia.org/r/#/c/253665/ [10:22:39] that is needed [10:22:57] I have tested it and it is backwards compatible [10:23:53] but the alert it generates pages, so any tips on how to roll it without making a mess? [10:29:27] I can think of 2 ways- disabling notification site-wide, or implementing it optionally by puppet/hiera [10:30:20] how long would notifications need to be off for? [10:30:28] if it's not too long I'd say go ahead [10:31:22] well, 5 minutes, assuming everything goes well (only the replication checks) [10:31:52] otherwise, the change will be reverted, so 5 minutes in every case [10:31:54] if it goes poorly would you be able to back off?
[10:31:56] ah that [10:32:09] yes, it is a script change only [10:32:16] so why not go with notifications off, if things get tough revert and if you need to go the slow route do that [10:32:23] it does not change icinga itself [10:32:30] right [10:32:50] thanks, apergos, much appreciated [10:32:54] sure [10:33:14] just announce here when it's going to happen [10:33:18] of course [10:33:40] it will not affect other notifications, only the lag ones [10:33:44] right [10:33:51] seems pretty safe [10:34:15] and I can see those outside of icinga [10:34:26] but I fear an avalanche of sms [10:34:38] no need to suffer that [10:35:13] I realize I have to do more puppet preparation first [10:41:50] (03PS1) 10Mdann52: Namespace config change on de.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255354 (https://phabricator.wikimedia.org/T119420) [10:42:53] (03Abandoned) 10Mdann52: Namespace config change on de.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255354 (https://phabricator.wikimedia.org/T119420) (owner: 10Mdann52) [10:43:51] (03PS1) 10Jcrespo: Depool db1044 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255355 [10:50:19] (03PS1) 10Jcrespo: Changing configuration of all s3 slaves after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/255357 [10:51:17] (03CR) 10Jcrespo: [C: 04-1] "Do not apply until db1044 is depooled" [puppet] - 10https://gerrit.wikimedia.org/r/255357 (owner: 10Jcrespo) [10:51:53] (03CR) 10Jcrespo: [C: 032] Depool db1044 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255355 (owner: 10Jcrespo) [10:54:33] (03PS1) 10Yuvipanda: mattermost: Set default team and server for matterircd [puppet] - 10https://gerrit.wikimedia.org/r/255359 [10:56:42] (03PS1) 10Mdann52: Namespace config change on de.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255361 (https://phabricator.wikimedia.org/T119420) [10:57:40] !log jynus@tin Synchronized
wmf-config/db-eqiad.php: Depool db1044 for a brief maintenance (duration: 00m 28s) [10:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:58:31] (03PS2) 10Mdann52: Namespace config change on de.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255361 (https://phabricator.wikimedia.org/T119420) [10:59:33] (03CR) 10Yuvipanda: [C: 032] mattermost: Set default team and server for matterircd [puppet] - 10https://gerrit.wikimedia.org/r/255359 (owner: 10Yuvipanda) [11:05:32] PROBLEM - HTTP 5xx reqs/min threshold on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0] [11:12:57] (03CR) 10Filippo Giunchedi: [C: 031] Uninstall apport [puppet] - 10https://gerrit.wikimedia.org/r/253593 (owner: 10Muehlenhoff) [11:13:20] RECOVERY - HTTP 5xx reqs/min threshold on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:16:16] (03PS2) 10Jcrespo: Changing configuration of all s3 slaves after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/255357 [11:18:10] (03CR) 10Jcrespo: [C: 032] Changing configuration of all s3 slaves after upgrade [puppet] - 10https://gerrit.wikimedia.org/r/255357 (owner: 10Jcrespo) [11:19:36] !log applying ferm and p_s to db1044 (depooled) [11:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:19:57] is p_s understood? [11:21:40] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, minor nits" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/255347 (https://phabricator.wikimedia.org/T118979) (owner: 10Giuseppe Lavagetto) [11:22:55] (03CR) 10Florianschmidtwelzow: [C: 04-1] Namespace config change on de.wikivoyage.org (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255361 (https://phabricator.wikimedia.org/T119420) (owner: 10Mdann52) [11:31:38] <_joe_> godog: thanks! 
[11:39:04] (03PS1) 10Jcrespo: Repool db1044 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255366 [11:39:06] no problem [11:41:39] s3 is now in 10.0.22, with ferm and with performance_schema enabled, which is an achievement by itself [11:41:57] \o/ \o/ \o/ [11:43:04] expect soon nice latency graphs and real-time query monitoring [11:45:36] (03CR) 10Filippo Giunchedi: [C: 031] RESTBase: Update to new specs & enable summary end point [puppet] - 10https://gerrit.wikimedia.org/r/254372 (owner: 10GWicke) [11:46:12] mobrovac: ^ good to go for me [11:46:54] which leads me to a question, should I reconvert tendril backend into a graphite, wait for an alternative, or implement a whisper or grafana backend reusing the current TokuDB [11:48:42] jynus: would it be easy to "tee" the data to graphite too during the transition? how much data is it? happy to discuss on a ticket too [11:49:19] lots now, more are coming, which I am assuming we do not have the resources now [11:49:30] so I was assuming to contribute to those resources [11:49:36] 6operations, 10ops-esams, 10netops: Set up cr2-esams - https://phabricator.wikimedia.org/T118256#1831004 (10faidon) [11:50:35] jynus: yeah I guess we'll need to have some guesstimation [11:52:25] estimation is easy, 150 hosts, 100 metrics per host [11:53:06] 400 GB compressed right now [11:53:20] for a 7 day retention [11:54:07] I will create a ticket, not something that has to be done now, and has some complexities [11:55:20] also, not all of those will go to graphs, some may go to logs (elastic?)
or stay in the current database (slow query reports) [11:58:06] 6operations, 7Database, 5Patch-For-Review: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#1831007 (10jcrespo) [11:58:59] 6operations, 7Database, 5Patch-For-Review: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#1292599 (10jcrespo) [12:03:21] 6operations, 7Database, 5Patch-For-Review: implement performance_schema for mysql monitoring - https://phabricator.wikimedia.org/T99485#1831018 (10jcrespo) Adding @fgiunchedi so he is in the loop, but not having any actionables for him here. performance_schema has been rolled out to all of s3. This is an ac... [12:13:36] 6operations, 7Database, 5Patch-For-Review: Decide storage backend for performance schema monitoring stats - https://phabricator.wikimedia.org/T119619#1831039 (10jcrespo) 3NEW a:3jcrespo [12:14:30] godog, created a subtask for it, and we will talk in the future (no hurries) [12:14:50] 6operations, 7Database: Decide storage backend for performance schema monitoring stats - https://phabricator.wikimedia.org/T119619#1831050 (10jcrespo) [12:15:11] jynus: sweet, thanks! 
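The back-of-the-envelope estimate quoted above (150 hosts, 100 metrics per host, 400 GB compressed at 7-day retention) works out as follows; this sketch just divides the quoted totals to get a per-series figure.

```python
# Back-of-the-envelope from the figures quoted above.
hosts = 150
metrics_per_host = 100
total_series = hosts * metrics_per_host              # 15,000 series
compressed_gb = 400                                  # quoted, at 7-day retention
per_series_mb = compressed_gb * 1024 / total_series
print(f"{total_series} series, ~{per_series_mb:.1f} MB per series over 7 days")
# → 15000 series, ~27.3 MB per series over 7 days
```

That per-series size is what matters when sizing a graphite/whisper backend, since whisper preallocates a fixed file per series based on the retention schema.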
[12:15:58] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review, 7Ruby: Move RuboCop job from experimental pipeline to the usual pipelines for operations/puppet - https://phabricator.wikimedia.org/T110019#1831056 (10zeljkofilipin) [12:23:21] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add DNS aliases for ldap-labs.[eqiad|codfw].wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/255342 (owner: 10Muehlenhoff) [12:38:20] (03PS2) 10BBlack: Remove unused r::c::1layer [puppet] - 10https://gerrit.wikimedia.org/r/255332 [12:38:29] (03CR) 10BBlack: [C: 032 V: 032] Remove unused r::c::1layer [puppet] - 10https://gerrit.wikimedia.org/r/255332 (owner: 10BBlack) [12:41:15] 6operations, 10Wikimedia-General-or-Unknown, 7user-notice: Content of Special:BrokenRedirects and many others query pages not updated since 2015-09-16 - https://phabricator.wikimedia.org/T113721#1831109 (10IKhitron) Something interesting. Today [[Special:LonelyPages]] wasn't updated on hewiki. The rest 72-ho... [12:50:18] (03PS2) 10BBlack: switch all misc-web to geographic routing [dns] - 10https://gerrit.wikimedia.org/r/255331 [12:50:20] (03PS2) 10BBlack: add global misc-web geo IPs [dns] - 10https://gerrit.wikimedia.org/r/255330 [12:52:58] (03CR) 10Giuseppe Lavagetto: alerts: move icinga alerts away from reqstats.* (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/255347 (https://phabricator.wikimedia.org/T118979) (owner: 10Giuseppe Lavagetto) [12:53:17] (03PS3) 10Giuseppe Lavagetto: alerts: move icinga alerts away from reqstats.* [puppet] - 10https://gerrit.wikimedia.org/r/255347 (https://phabricator.wikimedia.org/T118979) [12:53:23] (03CR) 10BBlack: [C: 032] add global misc-web geo IPs [dns] - 10https://gerrit.wikimedia.org/r/255330 (owner: 10BBlack) [13:04:00] (03PS1) 10Muehlenhoff: Assign openldap::labs role to seaborgium/serpens [puppet] - 10https://gerrit.wikimedia.org/r/255373 [13:05:13] "Request from 10.20.0.165 via cp3012 cp3012 ([10.20.0.112]:3128), Varnish XID 1338238360 [13:05:15] 
Forwarded for: 80.176.129.180, 10.20.0.165, 10.20.0.165 [13:05:16] Error: 503, Service Unavailable at Wed, 25 Nov 2015 13:04:49 GMT " [13:06:46] (03CR) 10Filippo Giunchedi: [C: 031] alerts: move icinga alerts away from reqstats.* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/255347 (https://phabricator.wikimedia.org/T118979) (owner: 10Giuseppe Lavagetto) [13:09:16] !log hashar@tin Synchronized php-1.27.0-wmf.7/Rakefile: Added Rakefile https://gerrit.wikimedia.org/r/#/c/254423/ (duration: 00m 28s) [13:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:12:52] PROBLEM - HTTP 5xx reqs/min threshold on graphite1001 is CRITICAL: CRITICAL: 7.69% of data above the critical threshold [500.0] [13:13:04] ShakespeareFan00: which page? [13:15:09] it's probably not reproducible. we just had a short 503 spike that was mostly isolated to esams [13:15:49] (about 10 minutes back at :05, right when the report is above) [13:16:34] I generally assume when it's isolated to 1x cache DC and brief like that, it's some kind of link/traffic hiccups local to that DC or its connections back to the primaries [13:17:49] K [13:19:41] (03PS1) 10BBlack: cache_misc - Add full set of services to conftool data [puppet] - 10https://gerrit.wikimedia.org/r/255375 (https://phabricator.wikimedia.org/T119394) [13:19:43] (03PS1) 10BBlack: cache_misc - define global IPs in LVS config data [puppet] - 10https://gerrit.wikimedia.org/r/255376 (https://phabricator.wikimedia.org/T119394) [13:19:45] (03PS1) 10BBlack: cache_misc - set up LVS services at all DCs [puppet] - 10https://gerrit.wikimedia.org/r/255377 (https://phabricator.wikimedia.org/T119394) [13:19:47] (03PS1) 10BBlack: cache_misc - switch to conftool dynamic directors [puppet] - 10https://gerrit.wikimedia.org/r/255378 (https://phabricator.wikimedia.org/T119394) [13:20:35] (03CR) 10Hashar: [C: 04-1] "Instead of hiding the unused variables, shouldn't we instead complain about them being unused ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/254838 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:20:43] RECOVERY - HTTP 5xx reqs/min threshold on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:21:15] (03CR) 10BBlack: [C: 032] cache_misc - Add full set of services to conftool data [puppet] - 10https://gerrit.wikimedia.org/r/255375 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [13:22:45] (03PS2) 10BBlack: cache_misc - switch to conftool dynamic directors [puppet] - 10https://gerrit.wikimedia.org/r/255378 (https://phabricator.wikimedia.org/T119394) [13:22:47] (03PS2) 10BBlack: cache_misc - set up LVS services at all DCs [puppet] - 10https://gerrit.wikimedia.org/r/255377 (https://phabricator.wikimedia.org/T119394) [13:22:49] (03PS2) 10BBlack: cache_misc - define global IPs in LVS config data [puppet] - 10https://gerrit.wikimedia.org/r/255376 (https://phabricator.wikimedia.org/T119394) [13:23:43] (03CR) 10BBlack: [C: 032 V: 032] cache_misc - define global IPs in LVS config data [puppet] - 10https://gerrit.wikimedia.org/r/255376 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [13:43:19] (03CR) 10Zfilipin: "Good point. The first step could be marking them as not used. 
I am not familiar with the codebase, so I did not want to do more refactorin" [puppet] - 10https://gerrit.wikimedia.org/r/254838 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:18] (03CR) 10BBlack: [C: 032] cache_misc - set up LVS services at all DCs [puppet] - 10https://gerrit.wikimedia.org/r/255377 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [13:44:27] (03CR) 10Alexandros Kosiaris: [C: 032] RuboCop: fixed Style/AndOr offense [puppet] - 10https://gerrit.wikimedia.org/r/254841 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:34] (03PS4) 10Alexandros Kosiaris: RuboCop: fixed Style/AndOr offense [puppet] - 10https://gerrit.wikimedia.org/r/254841 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:44:47] (03CR) 10Alexandros Kosiaris: [V: 032] RuboCop: fixed Style/AndOr offense [puppet] - 10https://gerrit.wikimedia.org/r/254841 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [13:47:19] (03CR) 10Alexandros Kosiaris: [C: 031] Assign openldap::labs role to seaborgium/serpens [puppet] - 10https://gerrit.wikimedia.org/r/255373 (owner: 10Muehlenhoff) [13:53:34] godog: cool re restbase thnx! [13:53:51] godog: i'll need to amend the change and then i want to deploy, could you assist? [13:58:04] (03PS1) 10BBlack: cache_misc: add LVS service IPs to balancers themselves [puppet] - 10https://gerrit.wikimedia.org/r/255383 (https://phabricator.wikimedia.org/T119394) [13:58:43] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: add LVS service IPs to balancers themselves [puppet] - 10https://gerrit.wikimedia.org/r/255383 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [14:01:32] 6operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1831244 (10Aklapper) >>! In T109810#1572602, @Jalexander wrote: > but let me check with the lawyers first. @JAlexander: Did that happen? Any outcome? [14:10:49] mobrovac: sure [14:11:00] cool thnx! 
[14:11:09] amend coming right up [14:12:40] (03PS1) 10BBlack: cache_misc: fix various icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/255384 [14:13:36] (03PS2) 10BBlack: cache_misc: fix various icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/255384 (https://phabricator.wikimedia.org/T119394) [14:14:07] (03CR) 10BBlack: [C: 032 V: 032] cache_misc: fix various icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/255384 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [14:26:27] (03PS4) 10Mobrovac: RESTBase: Update to new specs & enable summary end point [puppet] - 10https://gerrit.wikimedia.org/r/254372 (owner: 10GWicke) [14:27:04] godog: the amended patch is ^, but cannot go out right away [14:27:17] godog: so just a review for now until i prepare the deploy repo [14:28:03] godog: also, note: this will create 4 new CFs per storage group, but after the deployment i'll drop 2 CFs per storage group [14:30:03] mobrovac: ok! can you add that to the code review too? the comment from gabriel said one cf per group [14:30:26] godog: sure, want it as a comment or in the commit msg? [14:32:45] mobrovac: commit message is better, thanks!
[14:32:54] kk [14:34:08] (03PS5) 10Mobrovac: RESTBase: Update to new specs & enable summary end point [puppet] - 10https://gerrit.wikimedia.org/r/254372 (owner: 10GWicke) [14:37:25] godog: euh, sorry, i was wrong actually, this will create 1 new CF per storage group and 2 extra CFs (for one storage group only), and after that i'll drop 2 old CFs [14:37:35] godog: so the net result will be one new CF per storage group [14:37:43] will amend the commit msg [14:38:34] (03PS3) 10BBlack: cache_misc - switch to conftool dynamic directors [puppet] - 10https://gerrit.wikimedia.org/r/255378 (https://phabricator.wikimedia.org/T119394) [14:38:52] (03CR) 10BBlack: [C: 032 V: 032] cache_misc - switch to conftool dynamic directors [puppet] - 10https://gerrit.wikimedia.org/r/255378 (https://phabricator.wikimedia.org/T119394) (owner: 10BBlack) [14:39:09] (03PS6) 10Mobrovac: RESTBase: Update to new specs & enable summary end point [puppet] - 10https://gerrit.wikimedia.org/r/254372 (owner: 10GWicke) [14:40:25] godog: kk, i'm ready when you are ^ [14:42:23] 6operations, 7Database: [EPIC] Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#1831329 (10jcrespo) 3NEW [14:42:32] (03CR) 10Ottomata: [C: 031] "You could make a define to DRY up a lot of this, but whateeverrrr! :) Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/255347 (https://phabricator.wikimedia.org/T118979) (owner: 10Giuseppe Lavagetto) [14:42:47] ^I will do that when I have 5 minutes [14:43:58] (03PS2) 10Andrew Bogott: Rename holmium to labservices1002. [dns] - 10https://gerrit.wikimedia.org/r/255047 (https://phabricator.wikimedia.org/T106303) [14:45:36] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 2 failures [14:45:56] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Puppet has 2 failures [14:47:12] mobrovac: ack, did you stop puppet already? 
[14:47:36] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:47:55] PROBLEM - Auth DNS for labs pdns on labs-ns3.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:47:56] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [14:47:57] godog: nope, would you mind salting that? [14:48:06] prod only [14:48:15] PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL: CRITICAL - Plugin timed out while executing system call [14:48:22] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:48:36] labs issues? [14:48:38] <_joe_> Coren, andrewbogott ^^ [14:48:41] (03PS3) 10Alexandros Kosiaris: Parsoid LVS codfw records [dns] - 10https://gerrit.wikimedia.org/r/208627 (https://phabricator.wikimedia.org/T90271) [14:48:46] Yeah, I'm looking at it now [14:48:47] hey [14:48:48] <_joe_> what's up with the dns? [14:48:56] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 2 failures [14:49:06] PROBLEM - puppet last run on cp4002 is CRITICAL: CRITICAL: Puppet has 2 failures [14:49:07] the temporary puppetfail->recover on cpX is me obviously [14:49:11] (unrelated to labs) [14:49:13] dammit [14:49:16] PROBLEM - puppet last run on cp4004 is CRITICAL: CRITICAL: Puppet has 2 failures [14:49:17] <_joe_> it's labs recursor [14:49:23] so... [14:49:32] Hm [14:49:34] labs-recursor problems ? [14:49:34] I’m in the process of renaming one of the recursors, but the other recursor was still up [14:49:40] andrewbogott: Looks like DNS went down for a bit [14:49:41] Which I’ve been watching like a hawk [14:49:46] so I don’t know why things were complaining [14:49:46] RECOVERY - Auth DNS for labs pdns on labs-ns3.wikimedia.org is OK: DNS OK: 0.055 seconds response time. nagiostest.eqiad.wmflabs returns [14:49:48] That has been the cause of alerts.
[14:49:59] carp [14:50:00] crap [14:50:07] RECOVERY - Recursive DNS on 208.80.154.20 is OK: DNS OK: 0.035 seconds response time. www.wikipedia.org returns 208.80.154.224 [14:50:07] holey carps? [14:50:09] what’s the point of redundancy if... [14:50:34] dns recursor redundancy isn't awesome when it's two IPs in /etc/resolv.conf [14:50:39] fwiw, things seem to be better all of a sudden. [14:50:44] there's the whole issue with timeout -> failover of every client request, etc [14:50:48] tools.wmflabs.org is still down for me [14:50:50] <_joe_> yeah, what bblack said :) [14:50:54] yes, I restarted labs-recursor0 [14:50:54] it's better than nothing, but it's not ideal [14:50:56] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:51:06] RECOVERY - puppet last run on cp4002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:51:08] mobrovac: kk, how come prod only? [14:51:16] RECOVERY - puppet last run on cp4004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:51:25] jynus: wfm [14:51:34] bblack: ok — so, is there any way I can do this? [14:51:34] :-) [14:51:50] godog: 'cause i'll first deploy in staging [14:52:04] andrewbogott: I really don't know, I haven't been following the details of what you're doing [14:52:07] and can't deploy the code without the config change [14:52:11] Coren, did you just restart nginx or not at all?
[14:52:13] !log stop puppet on restbase1* / restbase2* before https://gerrit.wikimedia.org/r/#/c/254372/ [14:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:52:22] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 936051 bytes in 3.931 second response time [14:52:26] PROBLEM - graphite.wikimedia.org on graphite1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.001 second response time [14:52:30] jynus: Not at all - the 502s were caused by DNS timing out and dns is back. [14:52:35] ok ok [14:52:36] bblack: What I’m doing isn’t super interesting, I just want to re-image the box that contains one of the recursors [14:52:42] https://phabricator.wikimedia.org/T106303 [14:52:44] taking a look at graphite [14:52:55] PROBLEM - puppet last run on cp1056 is CRITICAL: CRITICAL: Puppet has 1 failures [14:53:07] * Coren is pretty sure the graphite thing isn't related though. [14:53:12] (03CR) 10Alexandros Kosiaris: [C: 032] "For some reason, I 've never submitted this. Rebased, updated and submitting now" [dns] - 10https://gerrit.wikimedia.org/r/208627 (https://phabricator.wikimedia.org/T90271) (owner: 10Alexandros Kosiaris) [14:53:22] godog: it could be me, I'm messing with misc-web-lb.... [14:53:22] I can change resolv.conf on labs hosts to exchange the primary/secondary order [14:53:33] that will solve this particular issue, I guess, for instances that are properly puppetized [14:53:37] but phab and gerrit seem fine so far [14:54:28] Or I can schedule a ‘dns will be kind of slow and stupid for 20 minutes’ outage [14:54:31] Coren, thoughts? 
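The `timeout:1` suggestion above would look roughly like this in /etc/resolv.conf. This is a sketch: the resolver IPs are placeholders, not the real labs recursors, and the order of the `nameserver` lines is what the patch discussed here exchanges.

```
# /etc/resolv.conf (sketch; IPs are placeholders)
nameserver 10.68.16.1   # primary recursor, listed first
nameserver 10.68.16.2   # fallback, tried only after the first times out
options timeout:1 attempts:2
```

The `timeout:1` option caps how long each client request waits on the first nameserver before failing over to the second, which is why a short timeout matters when both resolvers are local and one may be down.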
[14:54:47] RECOVERY - puppet last run on cp1056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:54:56] if dns is questionable may as well describe it as a period of total outage, for all users will understand [14:54:57] bblack: ah, nah it is 500 on graphite's apache [14:55:03] ok [14:55:23] guess I’ll do both [14:55:32] andrewbogott: I'd rather switch the resolv.conf entries around if it's planned. But also, given that our resolvers are local, we might want to add a timeout:1 or timeout:2 option [14:55:58] if they're in production subnets, we could put them behind lvs? [14:56:05] (dns recursors) [14:56:12] Especially since we have nscd running so it should notly timeout on cache misses. [14:56:17] 6operations, 10Traffic, 5Patch-For-Review: Convert misc cluster to 2-layer - https://phabricator.wikimedia.org/T119394#1831353 (10BBlack) Ok at this point **everything** is ready and configured correctly, except the final switch hasn't been flipped to send users to the geographic endpoints, which is: https:/... [14:56:22] s/notly/only/ [14:57:23] 6operations, 10Traffic: Expand misc cluster into cache PoPs - https://phabricator.wikimedia.org/T101339#1831354 (10BBlack) [14:57:23] !log bounce uwsgi on graphite1001 [14:57:24] 6operations, 10Traffic, 5Patch-For-Review: Convert misc cluster to 2-layer - https://phabricator.wikimedia.org/T119394#1831355 (10BBlack) [14:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:58:00] misc-web 2 layer, really? ;) [14:58:03] Coren: can you explain the ' if it's planned’ part of what you just said? [14:58:17] godog: did you break something on graphite? :/ [14:58:40] andrewbogott: If you're planning work on one of the recursors, switching the order in resolv.conf in advance would be a good thing. [14:58:47] ah, ok [14:59:15] most calls to render from here are no longer returning with an accept-origin header, so are not loaded in grafana.
Just getting 503s [14:59:19] mark: well mostly it's about misc-web termination at cache pops. but since that's all set up as backend->backend for the other clusters, and we want to simplify varnish config rather than make more special cases.... [14:59:38] mark: so it's easier to just standardize. now every cache cluster is two-layer and two-tier with remote termination at the cache pops [14:59:51] (03PS1) 10Rush: openstack: catch nodepool up with manifest reorg [puppet] - 10https://gerrit.wikimedia.org/r/255392 [15:00:18] addshore: no, read the scrollback [15:01:01] bblack: was that response regarding dns/lvs, or something else? [15:01:06] ahh, got it! didn't spot that! [15:01:42] andrewbogott: not about your dns issue I don't think. what response? :) [15:02:47] bblack: just, mark suggested putting the dns recursors behind lvs, and then you said a bunch of stuff back to him about misc-web termination and I became confused. nevermind :) [15:03:03] mark: the other missing bit in the long term, is once we have the TLS stuff for cache<->cache + cache<->app all set up, I'm going to start configuring things such that pass and hit_for_pass traffic can bypass cache layers. [15:03:33] (as in, jump from esams frontend cache straight to an eqiad applayer backend) [15:03:47] (meeting) [15:11:05] so there's a ton of requests to apache on graphite1001, like 60/s starting at 14:50, I suspect a dashboard hammering it [15:13:34] 6operations, 10ops-ulsfo: ulsfo temperature-related exceptions - https://phabricator.wikimedia.org/T119631#1831410 (10faidon) 3NEW [15:13:49] I have reloaded grafana a couple of times when it didn't work [15:14:19] (03CR) 10Hashar: [C: 031] "Thank you Timo!" [puppet] - 10https://gerrit.wikimedia.org/r/244748 (https://phabricator.wikimedia.org/T113903) (owner: 10Hashar) [15:14:28] but obviously not 60/s [15:15:06] godog: any chance they're all monitoring checks, from me expanding the misc cluster and such?
[15:16:25] (03PS2) 10Aklapper: Add rights to CU+OS groups on en.wikipedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255129 (https://phabricator.wikimedia.org/T119446) (owner: 10Mdann52) [15:19:01] bblack: mmhh unlikely it is check_graphite, I'm looking at https://ganglia.wikimedia.org/latest/?c=Miscellaneous%20eqiad&h=graphite1001.eqiad.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [15:20:09] (03PS9) 10Andrew Bogott: Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 [15:20:11] (03PS1) 10Andrew Bogott: Labs instances: Exchange the order of the primary/secondary dns recursors. [puppet] - 10https://gerrit.wikimedia.org/r/255393 (https://phabricator.wikimedia.org/T106303) [15:20:55] 6operations, 10Wiki-Loves-Monuments-General, 10Wikimedia-DNS, 5Patch-For-Review, 7domains: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1831438 (10MasinAlDujaili) These are the current settings for not only wikilovesmonuments.org but .net and .com. At the moment, we u... [15:21:44] hmmm bytes don't look up much [15:22:03] (03CR) 10coren: [C: 031] "The reduced timeout in particular will help if the primary goes down unplanned." 
[puppet] - 10https://gerrit.wikimedia.org/r/255393 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [15:22:37] but yeah reqs and such are way up [15:24:11] seems like cassandra or kafka [15:24:17] those are the heavy ones in graphite's apache log [15:24:40] well or the wikidata.dispatch thing [15:24:55] I'm going to change https://grafana.wikimedia.org/dashboard/db/labs-monitoring not to reload every 5s [15:25:41] and this is when hilarity ensues, I don't think there's a way to force grafana clients to pick that up [15:27:05] root@graphite1001:/var/log/apache2# tail -10000 other_vhosts_access.log|grep POST|awk '{print $12}'|sort|uniq -c|sort -rn [15:27:08] 2103 "https://grafana.wikimedia.org/dashboard/db/kafka" [15:27:11] 731 "https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch" [15:27:14] 663 "https://grafana.wikimedia.org/dashboard/db/wikidata-query-service" [15:27:31] there's other GET traffic too, but for POST it's the kafka db dashboard as the top referrer [15:28:16] it's defaulting to 1m refresh heh [15:28:40] aye [15:29:34] (03PS2) 10Rush: openstack: catch nodepool up with manifest reorg [puppet] - 10https://gerrit.wikimedia.org/r/255392 [15:31:48] bblack: would it be easier to try banning some referer at the varnish level? looking into doing that from apache too [15:32:33] (03CR) 10Rush: [C: 032] "http://puppet-compiler.wmflabs.org/1363/" [puppet] - 10https://gerrit.wikimedia.org/r/255392 (owner: 10Rush) [15:35:01] don't need no refresh [15:35:06] on that thar dashboard [15:35:17] i wonder who is even looking at it [15:35:19] ... [15:35:42] turned off refresh [15:36:06] but, ja there are a lot of individual metrics that dashboard pulls together [15:36:11] lots of topics, lots of partitions [15:36:20] ha, the dashboard isn't even working [15:36:29] ottomata: is it new? [15:36:50] no [15:42:25] (03CR) 10Andrew Bogott: [C: 032] Labs instances: Exchange the order of the primary/secondary dns recursors.
[puppet] - 10https://gerrit.wikimedia.org/r/255393 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [15:42:45] (03PS2) 10Andrew Bogott: Labs instances: Exchange the order of the primary/secondary dns recursors. [puppet] - 10https://gerrit.wikimedia.org/r/255393 (https://phabricator.wikimedia.org/T106303) [15:42:53] so yeah ATM the only thing I can think of is banning certain referer from apache on graphite [15:46:40] (03PS2) 10JanZerebecki: add wikilovesmonument.org [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) [15:48:49] (03PS3) 10JanZerebecki: add wikilovesmonument.org [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) [15:49:50] RECOVERY - graphite.wikimedia.org on graphite1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1572 bytes in 0.012 second response time [15:49:53] (03CR) 10JanZerebecki: "PS2 is like it is currently, except NS. PS3 fixes MX." [dns] - 10https://gerrit.wikimedia.org/r/252703 (https://phabricator.wikimedia.org/T118468) (owner: 10JanZerebecki) [15:53:27] !log ban grafana kafka dashboard temporarily from graphite [15:53:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:31] ottomata: ^ [15:55:24] (03PS10) 10Andrew Bogott: Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 [15:55:50] waaaaahhhh :( [15:55:53] godog: :( [15:56:02] it used to work, why is it not all of the sudden :( [15:57:27] (03PS11) 10Andrew Bogott: Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 [15:57:44] ottomata: still not 100% it is the root cause, but there were a ton of requests to graphite [15:58:19] (03PS1) 10EBernhardson: Turn on the rest of the top 10 wikis (by size) for ES labs replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255397 [15:58:25] 6operations, 6Phabricator, 10netops: Fix edit permissions of the netops project - 
https://phabricator.wikimedia.org/T119634#1831596 (10greg) 3NEW [15:59:07] 6operations, 10Wiki-Loves-Monuments-General, 10Wikimedia-DNS, 5Patch-For-Review, 7domains: point wikilovesmonument.org ns to wmf - https://phabricator.wikimedia.org/T118468#1831605 (10JanZerebecki) [15:59:30] (03CR) 10EBernhardson: "this will have to wait until after deployment freeze, but is the next step for ES labs replica" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255397 (owner: 10EBernhardson) [16:00:57] Can someone who understands rolematcher.py advise? mutante maybe? [16:06:10] (03CR) 10coren: [C: 031] "With the caveat that while the change to rolematcher.py is obviously intended to remove a special case for holmium, I don't know whether d" [puppet] - 10https://gerrit.wikimedia.org/r/254465 (owner: 10Andrew Bogott) [16:12:42] mobrovac: heh, want to resume? [16:14:20] (03CR) 10Jcrespo: [C: 032] Repool db1044 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255366 (owner: 10Jcrespo) [16:14:28] godog: in 20 mins or so? 
in a meeting now [16:15:16] mobrovac: ack, I will be in a meeting until 17.30, we can do after that [16:15:40] me too godog hehe [16:15:46] (03PS1) 10Alexandros Kosiaris: Bring etherpad-lite configuration up to date with upstream [puppet] - 10https://gerrit.wikimedia.org/r/255402 [16:15:48] (03PS1) 10Alexandros Kosiaris: Set a standard defaultPadText for etherpad [puppet] - 10https://gerrit.wikimedia.org/r/255403 [16:15:50] (03PS1) 10Alexandros Kosiaris: Trust the upstream proxy to have the correct client IP [puppet] - 10https://gerrit.wikimedia.org/r/255404 [16:15:52] (03PS1) 10Alexandros Kosiaris: misc-web: Force HTTPS for etherpad.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/255405 [16:15:54] (03PS1) 10Alexandros Kosiaris: Have misc-web talk directly to etherpad-lite [puppet] - 10https://gerrit.wikimedia.org/r/255406 [16:16:28] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1044 after maintenance (duration: 00m 47s) [16:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:37] (03CR) 10Faidon Liambotis: [C: 031] misc-web: Force HTTPS for etherpad.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/255405 (owner: 10Alexandros Kosiaris) [16:29:26] (03PS3) 10BBlack: switch all misc-web to geographic routing [dns] - 10https://gerrit.wikimedia.org/r/255331 [16:31:23] hm, godog i'm looking at joal's --until change, and I can't find anything wrong with it at all! hm. 
[16:33:11] i'm going to submit a temp patch that includes his stuff, but so that we don't have to apply it on everything [16:33:42] ottomata: yeah it seemed ok on the surface to me too, thanks yeah we can go with python first and then icinga [16:34:06] well, the python stuff all looks fine, but i can't really check icinga well without adding another check command [16:34:25] either i stop puppet on neon and temporarily add it, or I just make a new temp check command patch in puppet that uses it [16:34:29] and we just try it on one or two hosts [16:34:32] i think i'll do that [16:36:53] (03PS4) 10Giuseppe Lavagetto: alerts: move icinga alerts away from reqstats.* [puppet] - 10https://gerrit.wikimedia.org/r/255347 (https://phabricator.wikimedia.org/T118979) [16:39:02] hmmmmmm or huh, i can just put the check command files in place on neon, don't need to puppetize them... [16:39:05] i think.. [16:40:07] * qchris foo [16:47:23] <_joe_> ottomata: WAT? [16:47:28] (03CR) 10Giuseppe Lavagetto: [C: 032] alerts: move icinga alerts away from reqstats.* [puppet] - 10https://gerrit.wikimedia.org/r/255347 (https://phabricator.wikimedia.org/T118979) (owner: 10Giuseppe Lavagetto) [16:47:46] _joe_: :) [16:48:07] _joe_: godog had to revert this change https://gerrit.wikimedia.org/r/#/c/254846/3 [16:48:12] we are not sure why it broke things [16:48:21] as far as we can tell, it should work [16:48:36] so, instead of doing that which changes the check_command for everything [16:48:45] i'm going to make a new temp one, and change a single check command to use it [16:49:27] <_joe_> ottomata: uhm want me to take a look?
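As an aside on the graphite overload triaged earlier in the log (15:27): godog's one-liner works because in the `other_vhosts_access.log` layout the quoted referrer is field 12. A self-contained sketch with made-up sample lines (the hostnames, IPs, and counts below are illustrative, not real traffic):

```shell
# Sample lines in the other_vhosts_access.log layout ($12 = referrer);
# all values here are fabricated for illustration.
cat > /tmp/sample.log <<'EOF'
graphite.wikimedia.org:443 10.0.0.1 - - [25/Nov/2015:15:27:00 +0000] "POST /render HTTP/1.1" 200 512 "https://grafana.wikimedia.org/dashboard/db/kafka" "Mozilla"
graphite.wikimedia.org:443 10.0.0.2 - - [25/Nov/2015:15:27:01 +0000] "POST /render HTTP/1.1" 200 512 "https://grafana.wikimedia.org/dashboard/db/kafka" "Mozilla"
graphite.wikimedia.org:443 10.0.0.3 - - [25/Nov/2015:15:27:02 +0000] "GET /render HTTP/1.1" 200 512 "https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch" "Mozilla"
graphite.wikimedia.org:443 10.0.0.4 - - [25/Nov/2015:15:27:03 +0000] "POST /render HTTP/1.1" 200 512 "https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch" "Mozilla"
EOF
# Same pipeline as godog's: keep POSTs, extract the referrer field,
# count occurrences, sort descending to surface the noisiest dashboard.
tail -10000 /tmp/sample.log | grep POST | awk '{print $12}' | sort | uniq -c | sort -rn
```

With the sample above, the kafka dashboard comes out on top, mirroring the real result from 15:27.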
[16:49:40] <_joe_> not now though, I'm in a meeting [16:50:08] sure, _joe_ if you like, fyi, the python script works A-ok [16:50:25] we think it may have just been a puppet/icinga sync problem, but we aren't sure [16:53:42] !log killing pdns-recursor on holmium [16:53:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:53:57] * andrewbogott watches nervously [16:55:43] godog: i'm ready to continue if you are [16:58:16] (03PS3) 10Mdann52: Namespace config change on de.wikivoyage.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255361 (https://phabricator.wikimedia.org/T119420) [16:59:43] (03PS1) 10BBlack: Set explicit referrer policy for WMF sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255408 (https://phabricator.wikimedia.org/T87276) [17:00:03] (03CR) 10jenkins-bot: [V: 04-1] Set explicit referrer policy for WMF sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255408 (https://phabricator.wikimedia.org/T87276) (owner: 10BBlack) [17:01:24] (03PS2) 10BBlack: Set explicit referrer policy for WMF sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255408 (https://phabricator.wikimedia.org/T87276) [17:02:57] !log restarted pdns-recursor on holmium while I figure out what’s happening [17:03:32] 7Blocked-on-Operations, 6operations, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#1831764 (10ori) 3NEW [17:06:55] 6operations, 6Phabricator, 10netops: Fix edit permissions of the netops project - https://phabricator.wikimedia.org/T119634#1831776 (10faidon) 5Open>3Resolved a:3faidon I hadn't set that edit policy, it's probably a remnant from the default policy from projects coming off from the RT migration. I just... [17:08:55] ottomata: did the 4K config patch for varnishkafka get into a package -> deploy? 
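Ottomata's "drop the check command files on neon and try it on one or two hosts" plan would amount to something like the following one-off definition in `/etc/icinga/commands/`. This is a hypothetical sketch: only the `--until` option comes from the patch under test (Gerrit 255415); the command name, arguments, and everything else are placeholders, not the real WMF config.

```
# Hypothetical temp command for /etc/icinga/commands/ on neon, to be
# wired into one or two service checks by hand. All names and options
# other than --until are assumptions for illustration.
define command{
    command_name    check_graphite_threshold_until
    command_line    $USER1$/check_graphite $ARG1$ --until -5min
}
```

Note that, as observed later in the log (18:28), puppet prunes unmanaged files from `/etc/icinga/commands`, so a hand-placed file like this only survives while puppet is disabled on neon.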
[17:09:03] (03PS12) 10Andrew Bogott: Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 [17:09:05] (03PS1) 10Andrew Bogott: Labs resolv.conf: don't sort the array of resolvers. [puppet] - 10https://gerrit.wikimedia.org/r/255411 (https://phabricator.wikimedia.org/T106303) [17:09:13] bblack: no i haven't done anything with it [17:09:22] ok [17:09:29] i can probably repackage today, i would like to include this change too: [17:09:43] https://gerrit.wikimedia.org/r/#/c/230173/ [17:09:50] 6operations, 10Deployment-Systems, 6Performance-Team, 6Release-Engineering-Team, 7HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1831783 (10ori) Recent releases of HHVM [[ https://github.com/facebook/hhvm/commit/75632c113d3ba8010... [17:09:58] 7Blocked-on-Operations, 6operations, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#1831785 (10ori) [17:10:00] 6operations, 10Deployment-Systems, 6Performance-Team, 6Release-Engineering-Team, 7HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1831784 (10ori) [17:10:01] wouldn't mind a quick review from you if we do that :) [17:10:46] (03CR) 10coren: [C: 031] "I'm... not entirely sure why sorting would have been thought to be a good idea in the first place." [puppet] - 10https://gerrit.wikimedia.org/r/255411 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [17:11:12] if we don't change the config, the behavior will be exactly the same as now [17:11:20] (03CR) 10Andrew Bogott: [C: 032] Labs resolv.conf: don't sort the array of resolvers. 
[puppet] - 10https://gerrit.wikimedia.org/r/255411 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [17:11:36] but this will allow us to produce to variable topics based on some vsl data [17:11:42] instead of just one topic per vk instance [17:13:41] 7Blocked-on-Operations, 6operations, 7Performance: Update HHVM package to recent release - https://phabricator.wikimedia.org/T119637#1831794 (10ori) [17:15:12] 6operations, 6Reading-Admin, 10Reading-Community-Engagement: Improve UX Strategic Test - https://phabricator.wikimedia.org/T117826#1831796 (10Moushira) [17:15:41] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1831799 (10Milimetric) >>! In T119380#1830707, @jcrespo wrote: > I see two accelerations, on the 27 sep and on the 7 nov. There could be many explanations, from long running transactions be... [17:15:49] ottomata: aye I was thinking it could be icinga not liking the command definition to be changed on reload, or that part of all hosts would run with the old command definition until puppet has run everywhere and things converge cc _joe_ [17:16:12] 6operations, 6Phabricator, 10netops: Fix edit permissions of the netops project - https://phabricator.wikimedia.org/T119634#1831802 (10greg) thanks! [17:16:52] ottomata: well I don't know that I can accurately review that, I don't know the vk code that well. 
there is an outstanding comment on magnus's +1 though, at the bottom of https://gerrit.wikimedia.org/r/#/c/230173/6/varnishkafka.c [17:17:30] mobrovac: yup, good to go [17:17:37] cool [17:17:39] go go go godog [17:17:40] :) [17:18:07] ok cool bblack, that's fine then [17:18:14] (03PS7) 10Filippo Giunchedi: RESTBase: Update to new specs & enable summary end point [puppet] - 10https://gerrit.wikimedia.org/r/254372 (owner: 10GWicke) [17:18:19] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] RESTBase: Update to new specs & enable summary end point [puppet] - 10https://gerrit.wikimedia.org/r/254372 (owner: 10GWicke) [17:18:26] i'll just fix up the one comment, merge, and build it with that and yours then [17:18:27] i think its fine [17:18:41] mobrovac: {{done}} it is merged [17:19:04] godog: cool, mind salting puppet agent -tv in staging? [17:20:03] mobrovac: kk [17:20:08] thnx! [17:21:07] ottomata: ok [17:21:49] 6operations, 10Analytics, 10Deployment-Systems, 6Services, 3Scap3: Use Scap3 for deploying AQS - https://phabricator.wikimedia.org/T114999#1831813 (10mobrovac) [17:22:36] mobrovac: {{done}} [17:23:49] (03CR) 10Mdann52: [C: 031] Set explicit referrer policy for WMF sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255408 (https://phabricator.wikimedia.org/T87276) (owner: 10BBlack) [17:24:04] godog: cheers, will start deploy [17:25:27] (03PS3) 10Giuseppe Lavagetto: etcd: auth puppettization [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) [17:25:43] 6operations, 6Performance-Team, 10Traffic, 7Performance: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#1831822 (10Aklapper) @ori: Any news? Or should this have lower priority? 
[17:26:58] (03CR) 10jenkins-bot: [V: 04-1] etcd: auth puppettization [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [17:27:06] 6operations, 6Performance-Team, 10Traffic, 7Performance: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#1831827 (10ori) @Aklapper, no, it's still serious. I'm just swamped; sorry. [17:28:00] (03CR) 10DarTar: [C: 031] Set explicit referrer policy for WMF sites [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255408 (https://phabricator.wikimedia.org/T87276) (owner: 10BBlack) [17:28:22] ah puppet removes stuff from /etc/icinga/commands....ooook :) [17:30:31] PROBLEM - Restbase root url on xenon is CRITICAL: Connection refused [17:30:41] known ^ [17:30:49] PROBLEM - Restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [17:30:55] this too ^ [17:33:00] 6operations, 10Deployment-Systems, 6Performance-Team, 6Release-Engineering-Team, 7HHVM: Translation cache exhaustion caused by changes to PHP code in file scope - https://phabricator.wikimedia.org/T103886#1831842 (10mmodell) >>! In T103886#1831783, @ori wrote: > Recent releases of HHVM [[ https://github.... [17:37:54] 6operations, 6Commons, 10Wikimedia-Media-storage: Cannot delete file: File:JGY5201.jpg - https://phabricator.wikimedia.org/T119301#1831855 (10Steinsplitter) 5Open>3Resolved a:3Steinsplitter Fixed: ``` 17:36, 25 November 2015 Steinsplitter (talk | contribs | block) deleted page File:JGY5201.jpg (Copyr... 
[17:41:53] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1831863 (10ArielGlenn) First results of debugging on neodymium: the salt cli client stops receiving events well before its timeout, and sits idly until the timeout is reached at which point... [17:42:38] (03PS1) 10Mobrovac: RESTBase: Config: specify graphoid's module as a spec [puppet] - 10https://gerrit.wikimedia.org/r/255414 [17:42:50] godog: need a fix ^^ [17:43:19] mobrovac: kk [17:43:23] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] RESTBase: Config: specify graphoid's module as a spec [puppet] - 10https://gerrit.wikimedia.org/r/255414 (owner: 10Mobrovac) [17:43:32] cheers godog! [17:43:38] godog: salt run puppet? :P [17:43:44] mobrovac: use ansible :P [17:43:50] hahahaha [17:43:56] * mobrovac hides [17:44:14] (03CR) 10Ori.livneh: [C: 04-1] "reviewed python script, not the other ones" (038 comments) [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [17:44:25] !log killing pdns-recursor on holmium [17:44:54] mobrovac: doing it, but srsly we should figure out a solution [17:44:54] (03PS1) 10Ottomata: Test addition of --until parameter to check_graphite check [puppet] - 10https://gerrit.wikimedia.org/r/255415 (https://phabricator.wikimedia.org/T116035) [17:45:08] godog: cool, thnx [17:45:12] godog: yup, i agree strongly [17:45:35] godog: for the time being, i can create an ansible task for this [17:45:50] puppet run/disable/enable [17:47:10] mobrovac: yup that might work [17:48:29] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [17:48:37] mobrovac: {{done}} [17:48:50] thnx godog [17:49:27] mobrovac: I have been using ansible for this so far [17:49:50] PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL: CRITICAL - Plugin timed out while executing system call [17:49:58] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL
- Socket timeout after 10 seconds [17:50:20] RECOVERY - Restbase root url on xenon is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.015 second response time [17:50:32] andrewbogott: ^ [17:50:45] Coren: ^ [17:50:50] PROBLEM - Auth DNS for labs pdns on labs-ns3.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [17:51:00] mobrovac: ansible -i staging restbase -m shell -a "sudo puppet agent -tv" [17:51:03] Huh, I scheduled downtime [17:51:08] maybe I missed something [17:51:08] well I guess that paged everyone anyway [17:51:14] touché gwicke [17:51:48] <_joe_> andrewbogott: tools home page is down, seemed related [17:51:54] <_joe_> that's what paged [17:52:13] yeah [17:52:14] andrewbogott: Yeah, homepage is down. Looks like something is still screwy with dns. Checking. [17:52:15] dammit [17:52:18] I restarted [17:52:40] RECOVERY - Auth DNS for labs pdns on labs-ns3.wikimedia.org is OK: DNS OK: 0.016 seconds response time. nagiostest.eqiad.wmflabs returns [17:52:42] I have no pages. [17:52:45] none today at all [17:52:46] andrewbogott: We might be bitten by the web servers caching IPs? [17:52:48] But I don’t understand, /public/ dns is hosted in a totally other place. [17:52:51] mobrovac: actually, you can leave out -m shell: ansible -i staging restbase -a "sudo puppet agent -tv" [17:53:28] also tools.wmflabs.org worked for me the whole time [17:53:33] andrewbogott: Alternately, the pdns has a dependency somewhere we don't see offhand? [17:53:41] RECOVERY - Recursive DNS on 208.80.154.20 is OK: DNS OK: 0.018 seconds response time. www.wikipedia.org returns 208.80.154.224 [17:53:43] andrewbogott: Yeah, it was DNS that was screwy. [17:53:48] ok. [17:53:50] So, two things... [17:53:57] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 935406 bytes in 3.240 second response time [17:53:58] 1) what is monitoring 208.80.154.20? 
[17:54:26] 2) Why did a public dns entry ( tools.wmflabs.org ) break, when that’s served by labcontrol1001? [17:54:37] andrewbogott: Go grab your lunch, we'll sit down and debug when you have a full stomach? [17:54:55] <_joe_> 2) is wrong [17:54:56] hm, yeah, I guess. [17:55:06] <_joe_> it's not the dns entry that breaks [17:55:09] _joe_: how so? [17:55:13] <_joe_> it's the internal resolution from the cluster [17:55:23] <_joe_> I get a 502 from that page, not a dns error [17:55:33] ah, ok [17:55:37] <_joe_> so that VM tries to use that dns and fails [17:55:44] _joe_: Odd, I got a DNS timeout on my end. [17:55:52] 6operations, 6Analytics-Kanban, 7Database: db1046 running out of disk space - https://phabricator.wikimedia.org/T119380#1831889 (10jcrespo) > And we can't really drop old revision_ids for schemas in an automated way Actually, I wasn't suggesting that, just doing regularly that even if the revision_id doesn'... [17:56:04] so, for _joe_ at least, public dns was working properly [17:56:13] but dns on the proxy was failing [17:56:15] ? [17:56:20] (03PS5) 10Mdann52: Tidy robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240065 (https://phabricator.wikimedia.org/T104251) [17:56:27] andrewbogott: Maybe both broke and he had the IP in his local cache? [17:56:29] <_joe_> andrewbogott: that's what I saw earlier [17:56:36] <_joe_> btw the error is [17:56:38] <_joe_> < icinga-wm> PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:48] <_joe_> so not "unknown host" [17:57:00] that... [17:57:26] _joe_: ‘earlier’ like an hour ago? [17:57:28] _joe_: That 10 second timeout, IIRC, is the entire connection including the DNS lookup. [17:57:57] _joe_: But yeah, if you got a 502 it definitely means the internal dns was broken [17:58:12] yeah, and internal dns /was/ broken during the earlier incident. Or, not broken, but slow. [17:58:19] So, that part I think I understand. 
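The resolver-ordering change merged earlier (Gerrit 255411, "Labs resolv.conf: don't sort the array of resolvers") is directly relevant to this incident: the libc resolver tries nameservers strictly in file order, so sorting the array in the template can silently demote the intended primary recursor. A minimal sketch (not WMF code; the IPs are examples, 208.80.154.20 being the recursor paged on above):

```python
# Minimal illustration of why nameserver order in resolv.conf matters:
# the libc resolver queries servers in file order, so a sorted template
# can put the wrong recursor first.

def nameservers(resolv_conf_text):
    """Return nameserver IPs in the order they appear in resolv.conf."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers

# Intended order: primary first, secondary as failover.
conf = "nameserver 208.80.154.20\nnameserver 208.80.154.12\n"
print(nameservers(conf))          # order as written in the file
print(sorted(nameservers(conf)))  # what a sorted template would emit:
                                  # the secondary jumps ahead of the primary
```

Note this only covers ordering; clients that read resolv.conf once and cache the result (the proxy hypothesis discussed below 17:58) won't pick up a reorder until restarted anyway.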
[17:58:28] I don’t understand why it failed more recently, though, when internal dns was /not/ broken [17:58:32] andrewbogott: Hypothesis: the proxy is an idiot and caches the current resolver? [17:58:47] andrewbogott: So it doesn't pick up the change in resolv.conf [17:58:58] (which would suck) [17:59:08] Coren: but did you see public dns failing? [17:59:19] just now, I mean? [17:59:24] <_joe_> andrewbogott: when we got paged earlier [18:00:09] andrewbogott: I thought I did, but maybe it timed out for another reason after all - it never got to the point where I got an error message because you restarted it and my browser "broke through" then. [18:00:19] ok [18:00:40] so, ok, maybe we can blame the proxy for not properly switching over [18:00:44] but it should have a fail-over as well [18:00:49] so maybe there are two bugs in the proxy [18:00:57] !log restbase canary deploy to restbase1001 of 74662c6 [18:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:01:20] YuviPanda, when you arrive, can you have a look at how the labs proxies handle 1) lookup failure on the primary resolver and 2) changes in resolv.conf? [18:03:14] (03CR) 10BBlack: [C: 032] switch all misc-web to geographic routing [dns] - 10https://gerrit.wikimedia.org/r/255331 (owner: 10BBlack) [18:03:39] bblack: \o/ [18:04:03] !log restored pdns-recursor on holmium, again [18:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:05:01] Coren: OK, everything is restored on holmium and I’m setting out to buy my holiday ham. With luck Yuvi will appear before I get back and we won’t have to dissect the labs proxy ourselves. [18:05:16] kk [18:05:17] Sorry for the pages, everyone. 
Past and future :( [18:05:40] PROBLEM - Restbase root url on restbase1001 is CRITICAL: Connection refused [18:06:00] PROBLEM - Restbase endpoints health on restbase1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [18:06:12] ok, I’m sure those aren’t me [18:06:15] :) [18:06:31] rb hosts have been puppet-disabled for hours now, I'm guessing some deploy/change happening [18:06:45] yeah deploy [18:07:01] 6operations, 7Database, 7Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#1831923 (10jcrespo) [18:07:09] it takes a while for restbase to start up if it needs to make schema changes [18:07:24] rb1001 known ^^^ [18:08:01] AaronSchulz, let me scare you a bit with T119626 [18:08:54] (03CR) 10Giuseppe Lavagetto: etcd: auth puppettization [WiP] (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) (owner: 10Giuseppe Lavagetto) [18:09:05] (I've only created it because some of the work is needed for T88445, too) [18:10:24] 6operations, 7Database: Decide storage backend for performance schema monitoring stats - https://phabricator.wikimedia.org/T119619#1831926 (10fgiunchedi) graphite side: each metric takes 309KB on disk with current retention of `1m:7d,5m:14d,15m:30d,1h:1y` so that would be ~40M/host with 100 metrics [18:10:25] (03CR) 10BryanDavis: [C: 031] l10nupdate: Log sync-dir failure to SAL [puppet] - 10https://gerrit.wikimedia.org/r/255321 (owner: 10Alex Monk) [18:11:42] (03PS4) 10Giuseppe Lavagetto: etcd: auth puppettization [WiP] [puppet] - 10https://gerrit.wikimedia.org/r/255155 (https://phabricator.wikimedia.org/T97972) [18:12:45] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1831927 
(10Dzahn) done on tin. now: root@tin:~# id l10nupdate uid=10002(l10nupdate) gid=10002(l10nupdate) groups=10002(l10nupdate) and: root@tin:~# find / -uid 997 -exec chown 1... [18:12:47] 6operations, 7Database: Decide storage backend for performance schema monitoring stats - https://phabricator.wikimedia.org/T119619#1831928 (10jcrespo) What about performance (reads), is there space to grow right now? [18:13:04] mobrovac: I have to go shortly, anything I can assist with in the meantime? in any case let me know what CF stats I should drop from graphite [18:13:36] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1831929 (10Dzahn) 5Open>3Resolved a:3Dzahn root@tin:~# id l10nupdate uid=10002(l10nupdate) gid=10002(l10nupdate) groups=10002(l10nupdate) root@mira:~# id l10nupdate uid=10002(... [18:14:17] godog: the keyspaces i'll drop are local_group_globaldomain_T_mathoid_data and local_group_globaldomain_T_mathoid_request [18:14:24] godog: nope, that it, so far so good [18:14:29] godog: thnx a ton! [18:14:57] bd808: ^ [18:15:08] mobrovac: kk, no problem! [18:17:09] 6operations, 7Database: Create a Master-master topology between datacenters for easier failover - https://phabricator.wikimedia.org/T119642#1831941 (10jcrespo) 3NEW [18:17:36] 6operations, 7Database: Create a Master-master topology between datacenters for easier failover - https://phabricator.wikimedia.org/T119642#1831950 (10jcrespo) [18:17:37] 6operations, 7Database, 7Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#1831949 (10jcrespo) [18:17:39] RECOVERY - Restbase root url on restbase1001 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.014 second response time [18:17:44] mutante: w00t. 
[18:18:00] RECOVERY - Restbase endpoints health on restbase1001 is OK: All endpoints are healthy [18:18:30] 6operations, 7Database: Create a Master-master topology between datacenters for easier failover - https://phabricator.wikimedia.org/T119642#1831941 (10jcrespo) [18:18:35] 6operations, 7Database: Decide storage backend for performance schema monitoring stats - https://phabricator.wikimedia.org/T119619#1831955 (10fgiunchedi) ultimately depends on the graphite queries issued of course, but I'm assuming we wouldn't be reading all metrics from all machines at the same time all the t... [18:19:22] !log bd808@tin Synchronized README: Testing l10nupdate uid fix for T119165 (duration: 00m 28s) [18:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:19:39] 6operations, 10Deployment-Systems: l10nupdate user uid mismatch between tin and mira - https://phabricator.wikimedia.org/T119165#1831956 (10bd808) ``` tin:/srv/mediawiki-staging (git master $) bd808$ sync-file README "Testing l10nupdate uid fix for T119165" ___ ____ ⎛ ⎛ ,---- \... [18:19:59] mutante: it works!111! [18:20:05] bd808: :)) [18:21:34] mutante: so did you change puppet to always use that uid for the l10nupdate user too? [18:22:25] 6operations, 10Deployment-Systems: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1831959 (10Dzahn) T119165 has been resolved, which was identified as previously as the remaning blocker. can we deploy from mira now and resolve this one?:) [18:23:44] bd808: no, so far i just confirmed that puppet does not break it. i think the UID is not specified in puppet (so far) [18:23:54] but yea [18:24:00] should do that too [18:24:50] 6operations: Build Linux 4.3 for jessie-wikimedia - https://phabricator.wikimedia.org/T119519#1831968 (10MoritzMuehlenhoff) I made a backport of linux 4.3-1~wmf1 for jessie-wikimedia moving back to GCC 4.9. 
In addition I built firmware-nonfree 20151018-2~wmf1 so that we get updated, compatible firmware blobs. M... [18:26:26] 6operations: mwdeploy does not have the same user ID on all Apaches - https://phabricator.wikimedia.org/T79786#1831969 (10bd808) > This causes problems sometimes when rsync works by user ID not by user name (not entirely sure when that happens) and other things run inside sudo -u mwdeploy and expect to be able t... [18:27:05] bd808: actually i had looked if it is, in modules/scap/manifests/l10nupdate.pp but while that references the user, it does not create the user, i am not sure yet what creates it. [18:27:34] mutante: https://phabricator.wikimedia.org/diffusion/OPUP/browse/production/modules/mediawiki/manifests/users.pp;48c6a219b2b5048662cef1f8638bbe1e232c751f$41 [18:28:15] bd808: oh, ! interesting, so the gid is specified, that explains that that part was ok on both servers :p [18:28:28] yeah, kind of weird [18:28:33] yea, let's fix that, want me to? [18:29:10] hmm. it might be a problem in labs [18:29:14] 6operations, 7Database, 7Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#1831975 (10aaron) Some notes on Galera that affect MediaWiki (some bits from https://mariadb.com/kb/en/mariadb/mariadb-galera-cluster-known-limitations/): * GET_LOCK() is used in a f... [18:29:49] mutante: in labs ldap we already have uid=10002(l10nupdate) gid=10002(l10nupdate) groups=10002(l10nupdate) [18:29:53] remembers issues where puppet users and labs ldap users conflict [18:29:55] ah, very good [18:30:29] what I don't know is if we use l10nupdate on the general MW fleet [18:30:41] meaning would we need to renumber everywhere [18:30:49] 6operations, 7Database, 7Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#1831981 (10jcrespo) @aaron Forget about that. Galera multimaster is not an option. A single master, that happens to have a "galera" slave, is (maybe).
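The fix mutante uploads shortly after this exchange (Gerrit 255421, "mediawiki: specify uid 10002 for l10nupdate user") presumably boils down to pinning the numeric ids on the user resource in modules/mediawiki/manifests/users.pp. A hedged sketch of the idea, not the actual diff:

```puppet
# Sketch of the approach behind Gerrit 255421 (not the real patch):
# pin uid as well as gid so a rebuilt deploy server recreates the
# l10nupdate user with the same numeric id, matching the 10002 already
# present in labs LDAP. Attributes beyond uid/gid are assumptions.
user { 'l10nupdate':
    ensure => present,
    uid    => 10002,
    gid    => 10002,
}
```

As bd808 points out below, the uid attribute represents concrete state, so applying this where the user already exists with a different uid would make puppet renumber it, which is why the patch needs testing before going fleet-wide.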
[18:31:13] at least it exists on a random appserver [18:31:32] AaronSchulz, I am not that crazy :-) [18:31:47] sorry you had to write all that, I already was aware of that [18:32:01] PROBLEM - Disk space on restbase1009 is CRITICAL: DISK CRITICAL - free space: /var 69895 MB (3% inode=99%) [18:32:14] with the right gid, but yet another uid [18:32:37] mutante: oh... yeah we use it to sync nightly [18:33:16] bd808: the uid on a random appserver does not match what it was on tin or mira. i guess the good part about that is that it means the difference was not a problem in the past either [18:33:47] between tin and an appserver that is [18:33:53] 6operations, 10Traffic, 5Patch-For-Review: Refactor varnish puppet config - https://phabricator.wikimedia.org/T96847#1831990 (10BBlack) [18:33:54] 6operations, 10Traffic, 5Patch-For-Review: Create globally-unique varnish cache cluster port/instancename mappings - https://phabricator.wikimedia.org/T119396#1831988 (10BBlack) [18:33:56] 6operations, 10Traffic, 5Patch-For-Review: Convert misc cluster to 2-layer - https://phabricator.wikimedia.org/T119394#1831986 (10BBlack) 5Open>3Resolved a:3BBlack [18:33:56] yeah. the bug we hit was specific to the deploy servers and trying to keep the mediawiki-staging directory in sync [18:33:58] 6operations, 10Traffic: Expand misc cluster into cache PoPs - https://phabricator.wikimedia.org/T101339#1831989 (10BBlack) [18:34:22] mutante: see line 108 of modules/scap/files/l10nupdate-1 for how it is used on all MW hosts [18:34:27] ok, that hopefully means we dont have to do this on all appservers and it would just be a cosmetic fix [18:35:18] but if we don't pin the uid in puppet it will happen again the next time a deploy server is rebuilt (like when tin is rebuilt to be a 14.04 box) [18:35:42] yes, ack.
let's add it in puppet now [18:35:51] ACKNOWLEDGEMENT - Disk space on restbase1009 is CRITICAL: DISK CRITICAL - free space: /var 68738 MB (3% inode=99%): gwicke Looks tight, but the compaction might just make it in time. [18:36:59] I wonder if ensure=>present forces the uid change or if it would only happen on initial provision? [18:37:13] jynus: is there a description of a similar setup on a website somewhere? [18:37:27] * AaronSchulz is trying to figure out how that avoids SPOF [18:37:39] "This attribute represents concrete state on the target system." -- I think that means Puppet will try to renumber [18:37:51] no, that is taken from my own head [18:38:03] with bits and ideas from other places [18:38:34] I do not think there is anything revolutionary there, anyway [18:38:46] * AaronSchulz must have jumped too soon seeing "no SPOF" and "galera" mentioned in the diagram [18:38:52] but stop thinking too much, [18:38:58] that is a declaration of intentions [18:39:11] I wanted to dump a napkin diagram there [18:39:19] so we have the circular replication soon [18:39:21] (03PS1) 10Dzahn: mediawiki: specify uid 10002 for l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) [18:39:29] needed for your work [18:39:36] if that makes sense [18:39:56] the master eqiad <---> master codfw link [18:40:00] bd808: yea, i think we dont need to try to have puppet fix it, since we already did manually now and it would probably make the puppetmaster slow [18:40:14] to go through a ton of resources recursively [18:40:35] recursive chown in puppet is horrible [18:40:37] let's just add the uid => line and on reinstall it will be fine [18:40:56] 6operations, 7Database, 7Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#1832027 (10jcrespo) In other words, slave1 is a master candidate, that by using GTIDs or cloning the binary logs (binlog server) we can continue replication automatically.
But the ide... [18:41:27] AaronSchulz, http://blog.booking.com/mysql_slave_scaling_and_more.html [18:41:43] booking writes a lot about it, there is some concepts there [18:41:55] bd808: wanna +1 that above? [18:42:05] (03CR) 10Ottomata: [C: 032] Test addition of --until parameter to check_graphite check [puppet] - 10https://gerrit.wikimedia.org/r/255415 (https://phabricator.wikimedia.org/T116035) (owner: 10Ottomata) [18:42:21] PROBLEM - puppet last run on restbase1003 is CRITICAL: CRITICAL: puppet fail [18:42:43] wth? [18:43:30] PROBLEM - puppet last run on restbase1008 is CRITICAL: CRITICAL: puppet fail [18:44:01] PROBLEM - puppet last run on restbase1005 is CRITICAL: CRITICAL: puppet fail [18:44:50] that's a lie ^ [18:45:19] jynus: that reminds me, the MW code for getting pt-heartbeat lag checks the row for Master_Server_Id (from the slave status), which for master => slave1 => slave2, would check for the slave1 id entry (which wouldn't even exist) [18:45:36] (03CR) 10BryanDavis: [C: 04-1] "I'm a bit worried that specifying this will make Puppet try to change the existing uid for the l10nupdate user across all hosts in the MW " (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) (owner: 10Dzahn) [18:45:39] RECOVERY - puppet last run on restbase1008 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [18:45:40] ugh, need to fix that [18:45:49] AaronSchulz, that is an issue [18:46:00] that happens now [18:46:14] if we failovered to codfw master [18:46:20] !log restbase deploy start of 74662c [18:46:22] there is not pt-heartbeat there [18:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:46:46] so that would fail at many levels [18:47:01] (03PS2) 10Dzahn: mediawiki: specify uid 10002 for l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) [18:47:04] !log removing 
cr2-ulsfo:xe-1/2/0, Patch ID 1062 as T118171 cancels that link [18:47:07] I have a similar problem with checks now on puppet [18:47:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:47:16] (03PS1) 10Luke081515: Enable filemover group at ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255422 (https://phabricator.wikimedia.org/T119636) [18:47:27] once you have several datacenters, it is not easy to say what a master is [18:47:27] (03CR) 10Dzahn: mediawiki: specify uid 10002 for l10nupdate user (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) (owner: 10Dzahn) [18:47:34] RECOVERY - puppet last run on restbase1005 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:48:29] (03CR) 10Dzahn: [C: 04-1] "syntax issues fixed. agreed that needs some testing before being merged" [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) (owner: 10Dzahn) [18:48:37] AaronSchulz, if I remember correctly, it does not [18:49:07] it gets the server id from the master, by connecting to the master though mediawiki and caches that [18:49:14] IIRC [18:50:32] in other words, please forget about T119626, we are not yet there, and we have more important problems [18:51:09] 6operations, 7Database, 7Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#1832084 (10aaron) If the master is still passive, I assume that means other config changes are still need to use it when the active one fails, so the title of this task seems a bit st... [18:51:11] PROBLEM - Restbase root url on restbase1003 is CRITICAL: Connection refused [18:51:17] bd808: yes, agree to what you said, needs testing. 
maybe first an "if hostname" around it to do it on test.wp only [18:51:26] 6operations, 10ops-ulsfo: ulsfo temperature-related exceptions - https://phabricator.wikimedia.org/T119631#1832086 (10RobH) We were not alerted to anything by UnitedLayer, however those are indeed spikes. I'll open up a trouble ticket with them. I'm onsite today, and it is quite warm in their DC floor. [18:52:12] jynus: getLagFromPtHeartbeat() gets the ID from SHOW SLAVE STATUS, so it needs some fix [18:52:21] PROBLEM - Restbase endpoints health on restbase1003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [18:52:25] then it is wrong, yes [18:52:31] mutante: *nod* you could maybe just do some testing with the user resource and a self-hosted puppetmaster in labs too. I just don't know what the behavior is [18:52:50] (03PS3) 10Dzahn: mediawiki: specify uid 10002 for l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) [18:53:11] RECOVERY - puppet last run on restbase1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:53:12] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 1, unused: 0BRxe-1/2/0: down - BR [18:54:05] (03PS3) 10Luke081515: Nuke and unblockself only for bureaucrats on en.wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254029 (https://phabricator.wikimedia.org/T113109) [18:54:21] RECOVERY - Restbase endpoints health on restbase1003 is OK: All endpoints are healthy [18:54:27] 6operations, 7Database, 7Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#1832099 (10jcrespo) p:5Triage>3Low Let's forget about this for now and only do the subtask, needed for 
other reasons. @aaron, I also want to put a haproxy on every mediawiki, I j... [18:54:40] RECOVERY - Restbase root url on restbase1003 is OK: HTTP OK: HTTP/1.1 200 - 15171 bytes in 0.030 second response time [18:56:00] (03PS4) 10Luke081515: Add a rollbacker group at wuuwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/253084 (https://phabricator.wikimedia.org/T116270) [18:56:18] (03PS2) 10Dzahn: l10nupdate: Log sync-dir failure to SAL [puppet] - 10https://gerrit.wikimedia.org/r/255321 (owner: 10Alex Monk) [18:56:35] (03CR) 10Dzahn: [C: 032] l10nupdate: Log sync-dir failure to SAL [puppet] - 10https://gerrit.wikimedia.org/r/255321 (owner: 10Alex Monk) [18:56:37] godog: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=cp1054&service=Varnishkafka+Delivery+Errors+per+minute [18:56:39] seems fine... [18:57:33] jynus: at any rate, I filed T119648 [18:57:46] marked as blocking the heartbeat task [18:58:14] i'm going to use this on a few more checks, if that goes fine, then i'll bring back the original change and try to make sure puppet and icinga are all synced up and happy [18:59:20] the "proxies on every mediawiki", I will need it sooner so I do not have to do a commit every time I pool and depool a server [18:59:55] doesn't necessarily have to be outside of mediawiki, though [18:59:59] (03PS4) 10Dzahn: mediawiki: specify uid 10002 for l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) [19:00:02] (03PS1) 10Ottomata: Test --until graphite_check parameter on all varnishkafka drerr reports [puppet] - 10https://gerrit.wikimedia.org/r/255425 [19:00:44] (03PS2) 10Ottomata: Test --until graphite_check parameter on all varnishkafka drerr reports [puppet] - 10https://gerrit.wikimedia.org/r/255425 [19:00:56] jynus: I guess it would need to replace the lag checks and query group logic [19:01:32] it is a lot of things, that is why it is Epic, not a short term thing, and not even a goal 
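The getLagFromPtHeartbeat() problem AaronSchulz describes above — picking the heartbeat row via Master_Server_Id from SHOW SLAVE STATUS — breaks in a chained topology (master => slave1 => slave2): slave2's immediate master is slave1, which never writes heartbeat rows. A minimal sketch of the row selection, with an illustrative table layout and made-up server ids (not MediaWiki's actual code):

```python
# Illustrative heartbeat table: one row per server running pt-heartbeat,
# keyed by server_id. Only the true (topmost) master writes heartbeats.
heartbeat_rows = {
    171966669: 1448486000.25,  # master's server_id -> last heartbeat timestamp
}

def lag_naive(immediate_master_id, now):
    """Buggy: uses Master_Server_Id as reported by SHOW SLAVE STATUS.
    For slave2 in master => slave1 => slave2 that is slave1's id,
    for which no heartbeat row exists at all."""
    ts = heartbeat_rows.get(immediate_master_id)
    if ts is None:
        raise LookupError("no heartbeat row for server_id %d" % immediate_master_id)
    return now - ts

def lag_fixed(topmost_master_id, now):
    """Fixed: look up the row for the topmost master, whose server_id
    has to be configured or discovered rather than taken from the
    immediate master in SHOW SLAVE STATUS."""
    return now - heartbeat_rows[topmost_master_id]
```

For slave2, SHOW SLAVE STATUS would report slave1's server_id, so lag_naive raises; lag_fixed with the real master's id returns the actual lag — which is the fix tracked above as T119648.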
[19:01:49] could be useful for long term scripts that do writes [19:02:17] * AaronSchulz wishes we had bunch of additional staff :( [19:02:19] are we talking pt-heartbeat or the SPOF/psoxy thing? [19:02:23] the proxy thing [19:02:26] (03CR) 10jenkins-bot: [V: 04-1] Test --until graphite_check parameter on all varnishkafka drerr reports [puppet] - 10https://gerrit.wikimedia.org/r/255425 (owner: 10Ottomata) [19:02:39] how would it be useful? [19:02:54] ah [19:02:58] to direct writes to the new master (say if you switched one) [19:03:06] (03PS3) 10Ottomata: Test --until graphite_check parameter on all varnishkafka drerr reports [puppet] - 10https://gerrit.wikimedia.org/r/255425 [19:03:08] probably you mean the stale configuration I mentioned? [19:03:13] yeah [19:03:16] yes [19:03:26] but not only for writes [19:03:30] also for reads [19:03:37] yep [19:03:46] it takes 9 hours to do a long running query [19:03:49] for special pages [19:03:56] and days to perform a dump [19:03:57] it's amazing how far we've gone on just PHP code [19:03:58] (03CR) 10Ottomata: [C: 031] Uninstall apport [puppet] - 10https://gerrit.wikimedia.org/r/253593 (owner: 10Muehlenhoff) [19:04:08] a proxy is inevitable at some point [19:04:10] (03PS5) 10Dzahn: mediawiki: specify uid 10002 for l10nupdate user [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) [19:04:14] well [19:04:24] if we had 5 dedicated programmers [19:04:43] I would say let's implement it on code [19:04:49] or 3 aarons [19:05:03] (03CR) 10Ottomata: [C: 032] Test --until graphite_check parameter on all varnishkafka drerr reports [puppet] - 10https://gerrit.wikimedia.org/r/255425 (owner: 10Ottomata) [19:05:07] proxy is better for other apps using our DBs (which is a pain enough right now) [19:05:15] I am not counting on that, [19:05:28] so it reuses a lot of code done elsewhere [19:05:37] and SoC in general [19:05:43] without needing to touch the current code [19:05:59] I could literally 
put => localhost, and it would work [19:06:13] (03PS2) 10Muehlenhoff: Uninstall apport [puppet] - 10https://gerrit.wikimedia.org/r/253593 [19:06:18] (03CR) 10Dzahn: "http://puppet-compiler.wmflabs.org/1364/mw1033.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/255421 (https://phabricator.wikimedia.org/T119165) (owner: 10Dzahn) [19:07:14] another option could be patch mediawiki to talk to etcd directly [19:07:41] RECOVERY - Disk space on restbase1009 is OK: DISK OK [19:08:20] I'm with AaronSchulz on the proxy [19:08:26] or, you know, setting a multi-master with failover, that would be easy [19:08:46] try the first ip, if not, try the second [19:09:10] oh, I agree with the proxy, and if it has connection pooling integrated, better [19:10:16] if only there was a place and moment in which talking about architecture, like a mediawiki developers conference! [19:12:59] !log installed django security updates on stat* and graphite hosts [19:13:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:13:13] !log restbase deploy end of 74662c [19:13:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:16:40] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [19:17:53] (03PS3) 10Dzahn: Support easy cloning of git repositories from Phabricator [puppet] - 10https://gerrit.wikimedia.org/r/244715 (owner: 10Chad) [19:19:08] Also, to make it clear, the drawing I made is not "this is is, and it has been already decided", not at all. 
I have not even started defining the problem yet there [19:19:11] 6operations, 7Availability: Set $wmfSwiftEqiadConfig in PrivateSettings - https://phabricator.wikimedia.org/T119651#1832183 (10aaron) 3NEW a:3aaron [19:19:34] 6operations, 7Availability: Set $wmfSwiftCodfwConfig in PrivateSettings - https://phabricator.wikimedia.org/T119651#1832183 (10aaron) [19:19:46] godog: I filed https://phabricator.wikimedia.org/T119651 [19:20:43] (03CR) 10Base: [C: 04-1] "it is weird that move-subpages right is requested while suppressredirect is not (the discussion is clear only about movefile). Clarificati" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255422 (https://phabricator.wikimedia.org/T119636) (owner: 10Luke081515) [19:21:39] (03CR) 10Dzahn: [C: 032] "ran in compiler on: puppetmaster, CI zuul server and transparency report role as examples using git::clone." [puppet] - 10https://gerrit.wikimedia.org/r/244715 (owner: 10Chad) [19:24:50] ostriches: ^ happy cloning from phab [19:25:41] (03PS1) 10Aklapper: List workboard column changes in weekly Phabricator changes email [puppet] - 10https://gerrit.wikimedia.org/r/255430 (https://phabricator.wikimedia.org/T119623) [19:26:55] mutante: thx! 
[19:27:16] yw [19:28:27] (03PS7) 10Ottomata: [WIP] Add format.topic configuration [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/230173 (https://phabricator.wikimedia.org/T108379) [19:28:58] (03PS8) 10Ottomata: Add format.topic configuration [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/230173 (https://phabricator.wikimedia.org/T108379) [19:30:05] (03PS9) 10Ottomata: Add format.topic configuration [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/230173 (https://phabricator.wikimedia.org/T108379) [19:30:40] RECOVERY - DPKG on labmon1001 is OK: All packages OK [19:35:54] 6operations, 7Database, 7Epic: Eliminate SPOF at the main database infrastructure - https://phabricator.wikimedia.org/T119626#1832245 (10jcrespo) [19:38:28] andrewbogott: I've been looking at the glibc source and the result is not a surprise after all; resolv.conf is normally read exactly once at (near) startup of an executable and not rechecked afterwards. [19:39:09] andrewbogott: On boxes where nscd runs we might be able to get away with the switch if we restart it and invalidate the hosts cache, though. [19:39:40] andrewbogott: The timeout:2 will help, though, for things started since that change. [19:43:01] (03CR) 10Dzahn: [C: 04-1] "query starts with "SELECT SELECT"" [puppet] - 10https://gerrit.wikimedia.org/r/255430 (https://phabricator.wikimedia.org/T119623) (owner: 10Aklapper) [19:46:25] Coren: ok, so where does that leave us? We can restart the proxies certainly, but... [19:46:37] I guess we’ll still be taunting any existing tools jobs, huh? [19:47:01] andrewbogott: There'd be a bazilllion other things to restart, at best. I think the only realistic scenario at this point is to look at lvs. [19:47:47] 6operations, 10ops-ulsfo: populate spares data for ulsfo - https://phabricator.wikimedia.org/T118207#1832265 (10RobH) 5Open>3Resolved Updated the spares data for ulsfo. 
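On the l10nupdate uid question discussed above (whether uid => 10002 makes Puppet renumber the existing user, and whether a recursive ownership fix belongs in Puppet at all): the manual half of such a migration boils down to "find everything still owned by the old uid, then chown it to the new one". A hedged dry-run sketch of that walk — paths and uids here are examples, not the actual MW hosts' layout:

```python
import os

def files_owned_by(root, old_uid):
    """Dry run: list paths under root still owned by old_uid.
    Enumerating first (rather than letting Puppet manage ownership
    recursively) is why the recursive approach was rejected above:
    Puppet would have to walk a huge tree of resources on every run."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_uid == old_uid:
                    hits.append(path)
            except OSError:
                pass  # raced with a deletion; skip
    return hits

# The actual migration would then chown each hit, e.g.:
#   os.lchown(path, new_uid, -1)   # -1 leaves the group unchanged
```

Since that was already done by hand, the puppet patch only needs the uid => line so a reinstall provisions the right uid from the start.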
[19:47:47] ok [19:48:09] also: dang [19:48:33] we are looking at a manual failover situation right? do we use ucarp (vrrp) anywhere to just float a vip? [19:48:37] (03PS2) 10Aklapper: List workboard column changes in weekly Phabricator changes email [puppet] - 10https://gerrit.wikimedia.org/r/255430 (https://phabricator.wikimedia.org/T119623) [19:48:54] (03PS1) 10Jcrespo: Add SQL for the creation of heartbeat tables [software] - 10https://gerrit.wikimedia.org/r/255434 (https://phabricator.wikimedia.org/T71463) [19:49:11] chasemp: Well, the case at hand /is/ a manual failover, but it'd be nice if we could actually benefit from the redundancy of having two dns servers. [19:49:45] can the secondary function as a complete dns server not as the active? [19:49:49] what function do we fail manually? [19:50:01] fail over that is [19:50:02] (03PS3) 10Dzahn: List workboard column changes in weekly Phabricator changes email [puppet] - 10https://gerrit.wikimedia.org/r/255430 (https://phabricator.wikimedia.org/T119623) (owner: 10Aklapper) [19:50:35] (03CR) 10Dzahn: [C: 032] "tested query, works and takes just 0.03 sec" [puppet] - 10https://gerrit.wikimedia.org/r/255430 (https://phabricator.wikimedia.org/T119623) (owner: 10Aklapper) [19:51:48] (03PS2) 10Jcrespo: Add SQL for the creation of heartbeat tables [software] - 10https://gerrit.wikimedia.org/r/255434 (https://phabricator.wikimedia.org/T71463) [19:52:04] chasemp: right now redundancy is handled mostly by having two nameservers in resolv.conf [19:52:07] that’s it [19:52:39] which, oh now, Coren, now we already have half our services using one IP and half using the other. So even if we set up load balancing I guess we have to do it for… both? 
[19:53:31] (03PS3) 10Jcrespo: Add SQL for the creation of heartbeat tables [software] - 10https://gerrit.wikimedia.org/r/255434 (https://phabricator.wikimedia.org/T71463) [19:53:45] andrewbogott: Even more fun - unless I'm mistaken switching to lvs involves pointing everything at a third, distinct, address. [19:54:13] yes that's true, as would vrrp [19:54:24] yeah, so… there’s basically no way we can do anything without breaking a bunch of services. [19:54:27] (03CR) 10Jcrespo: [C: 032] "I am sorry." [software] - 10https://gerrit.wikimedia.org/r/255434 (https://phabricator.wikimedia.org/T71463) (owner: 10Jcrespo) [19:54:40] (03CR) 10Jcrespo: [V: 032] "I am sorry." [software] - 10https://gerrit.wikimedia.org/r/255434 (https://phabricator.wikimedia.org/T71463) (owner: 10Jcrespo) [19:55:09] Except for creating a third load-balanced IP, changing the resolv.conf in labs, and then waiting a MONTH before I can rename holmium [19:55:11] which is ok I guess [19:55:18] Coren, that may interest you: https://wikitech.wikimedia.org/wiki/Help:Tool_Labs/Database#Identifying_lag [19:55:36] what's the month delay for? [19:55:39] but, I still don’t really understand… if we move everything to a different load-balanced IP, then doesn’t that host become a spof? So isn’t that sort of worse than what we have now? [19:56:00] andrewbogott: A month might be overkill. /most/ things would survive farily well (if slowed down a bit) by timing out on the primary for a while. We'd have to switch, then restart the "important" services. [19:56:07] chasemp: because we can’t change the resolver for running processes. So we just have to wait for them to restart out of attrition [19:56:19] we can't reload? [19:56:40] chasemp: we’re talking about Labs here, ‘all the services’ is an unknown number of volunteer-manager things [19:56:50] chasemp: resolv.conf is never reloaded. It's handled by the nss part of glibc. [19:57:17] jynus: Oooo. Nice! 
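The lag-identification approach jynus links above comes down to: the master periodically writes a timestamp row via pt-heartbeat (here apparently twice a second), and a replica's lag is simply "now minus the last replicated timestamp", which naturally yields sub-second resolution. A minimal sketch — the timestamp format is an assumption, standing in for pt-heartbeat's actual high-precision datetime column:

```python
from datetime import datetime, timezone

def heartbeat_lag(heartbeat_ts, now=None):
    """Replication lag in fractional seconds from a pt-heartbeat row.

    heartbeat_ts: the last replicated heartbeat as a string,
    e.g. '2015-11-25 20:00:00.123456' (assumed UTC).
    """
    last = datetime.strptime(heartbeat_ts, "%Y-%m-%d %H:%M:%S.%f")
    last = last.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - last).total_seconds()
```

Unlike Seconds_Behind_Master, the result is not quantized to whole seconds — hence the "I can give you decimals now" below.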
[19:59:51] jynus: I think just having a second resolution on the lag column is probably good enough -- after all, the heartbeats are only sent twice per second [20:00:11] but it's a really cool way of measuring this! [20:00:20] well, it is a boasting thing [20:00:31] :D [20:00:32] the "I can give you decimals now" [20:00:45] andrewbogott: I swear I did this switcharoo once before w/ resolvconf [20:00:53] it was done for production, so it is very accurate [20:01:10] valhallasw`cloud, can you check your access? [20:01:22] like doing select on that table on one of the servers [20:01:38] just in case I did something wrong before announcing it? [20:01:51] chasemp: daemons who reload by doing a self exec() would reread resolv.conf, if nothing else. [20:01:53] yep, just a sec [20:02:01] Coren: no I mean like https://github.com/masterkorp/openvpn-update-resolv-conf/blob/master/update-resolv-conf.sh [20:02:18] you can change midflight values iirc w/ resolvconf -d [20:02:31] jynus: https://quarry.wmflabs.org/query/6224 [20:02:36] only reports s1 [20:02:43] resolvconf not resolve.conf [20:03:01] yes, that is a know bug, can you test other than s1 ? :-) [20:03:30] _joe_: any objections to utilizing the third instance on rdb1007? [20:04:49] chasemp: aha. But that works iff we used the /run/resolvconf/resolv.conf interface and the /etc/resolvconf/resolv.conf.d/ setup [20:05:04] jynus: ya, works for nlwiki.labsdb [20:05:17] thanks! [20:05:29] Which, for some reason, we do not. [20:05:58] ‘for some reason’ is probably that I broke it [20:05:59] marc@tools-bastion-01:~$ resolvconf [20:05:59] resolvconf disabled. [20:06:10] well, bummer [20:06:12] because getting resolv.conf to work properly with labs dhcp is insanely difficult [20:06:26] or rather resolvconf [20:07:10] prod doesn't use it either, actually. [20:11:36] Coren, chasemp, so there are really two things here, right? 
[20:11:43] 1) Wait a good long while before moving anything [20:11:51] 2) Maybe load-balance somehow [20:12:10] you have two ip's out there in the wild as resolvers? [20:12:44] 2) requires 1) but 1) is already essentially underway since I just moved things a few hours ago [20:13:18] andrewbogott: you have two ip's out there in the wild as resolvers? [20:13:28] chasemp: sort of? [20:13:42] All labs hosts have two resolvers listed: [20:13:45] nameserver 208.80.155.118 [20:13:45] nameserver 208.80.154.20 [20:13:51] (in that order?) [20:14:05] Yes, but swapped the order earlier today. [20:14:22] So services started since the swap are hitting 118 first, services started before are hitting .20 [20:14:25] if you had one primary legacy resolver you could make that IP the real ip for the LVS service [20:14:30] and push out the new IP for the LVS service [20:14:37] but allow teh server to directly reply to queries anyway [20:15:10] that would mean that the recursor would have to respond to two different IPs, right? [20:15:15] (which is fine, just confirming) [20:15:29] well sort of, it would have the LB ip on loopback [20:15:45] but you could (I believe) leave the original resolver IP alone and use that as the real server IP in lvs [20:15:55] then boxes could either hit LVS or the direct box and both work [20:16:02] sure [20:16:05] but the idea is fewer and fewer boxes hit directly [20:16:17] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:16:24] so, my earlier question — LVS introduces a spof where currently we have a primary and a backup, right? [20:16:39] I don't understand the question [20:16:44] lvs boxes are all redundant [20:16:56] ok — I must not know how lvs works [20:17:43] I could help you with this but probably not today? 
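Coren's point above — glibc parses resolv.conf exactly once, near process startup, and never rechecks it — is exactly why swapping the nameserver order only affects processes started after the swap, and why everything else has to age out by attrition. A toy illustration of the same cache-at-init pattern (not glibc itself, just the behaviour):

```python
class StubResolver:
    """Mimics glibc's behaviour: the nameserver list is parsed once,
    when the process (here: the object) starts, and never re-read,
    even if /etc/resolv.conf changes afterwards."""

    def __init__(self, resolv_conf_path):
        self.nameservers = []
        with open(resolv_conf_path) as f:
            for line in f:
                parts = line.split()
                if len(parts) == 2 and parts[0] == "nameserver":
                    self.nameservers.append(parts[1])
```

A long-running service started before the swap keeps querying the old primary first; only a restart (or a self-exec, as Coren notes above) picks up the new order — which is what makes a single stable LVS service IP attractive.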
[20:18:44] (03PS2) 10Luke081515: Enable filemover group at ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255422 (https://phabricator.wikimedia.org/T119636) [20:19:02] that’s fine, either way I’m clearly not moving anything today [20:19:16] since things will keep hitting holmium dns for ages [20:19:48] what would be left to hit it (just curious) if we did a full all VM's restart? [20:20:15] PROBLEM - check_disk on boron is CRITICAL: DISK CRITICAL - free space: / 255028 MB (58% inode=59%): /sys/fs/cgroup 0 MB (100% inode=99%): /dev 3978 MB (99% inode=99%): /run 796 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 3987 MB (100% inode=99%): /run/user 100 MB (100% inode=99%): /boot 222 MB (84% inode=99%): /mnt/squish 0 MB (0% inode=0%) [20:21:37] (03CR) 10Base: [C: 031] "now it looks more sensible, thanks for the patch" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255422 (https://phabricator.wikimedia.org/T119636) (owner: 10Luke081515) [20:24:45] chasemp: I’m sure restarting VMs would reset things. [20:25:03] But… there are lots of ways to solve this problem if we don’t care about causing an outage. [20:25:15] RECOVERY - check_disk on boron is OK: DISK OK - free space: / 254991 MB (58% inode=59%): /sys/fs/cgroup 0 MB (100% inode=99%): /dev 3978 MB (99% inode=99%): /run 796 MB (99% inode=99%): /run/lock 5 MB (100% inode=99%): /run/shm 3987 MB (100% inode=99%): /run/user 100 MB (100% inode=99%): /boot 222 MB (84% inode=99%) [20:25:21] just quantifying teh problem for my own benefit :) [20:26:44] andrewbogott: with LVS-based DNS recursors, you'd have a single IP address, which was handled by a redundant pair of LVS boxes for failover, and which backended that one IPs traffic to both actual DNS resolvers. [20:27:12] (like production does in eqiad, codfw, and esams, but not ulsfo yet) [20:28:19] bblack: ok, that sounds ideal. 
I guess I was imagining that there was something that sat between the two boxes and monitored which was up; clearly incorrect. [20:29:37] well there is, that't the redundant set of LVS machines [20:30:08] there's also better ways to improve DNS resolution resiliency, which are outlined a bit in https://phabricator.wikimedia.org/T103921 [20:30:21] someday I might have the luxury of actually looking at doing something like that heh [20:30:29] wow wall of text [20:30:32] * andrewbogott reads [20:30:55] oh it had its own ticket, for the part that matters here: [20:30:56] https://phabricator.wikimedia.org/T104442 [20:31:07] (that was broken out of the end of the previous task link) [20:31:30] !log running `nodetool cleanup` on restbase1007 to make sure that we don't have extra sstables from the 1001 decommision taking up space [20:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:32:35] bblack: that wouldn’t be a system-wide solution, though, would it? It’s just a way to add dns redundancy for a specific service (pybal)? [20:33:36] no, the original ticket I linked was a pybal problem being triggered by a DNS resolver problem. but the solution is generic, in the second ticket. [20:33:56] (which is basically, build a shared module that plugs into glibc and changes how resolv.conf DNS lookups work) [20:34:43] Ah, I see, so basically patching the entire OS [20:36:19] no, just the resolver part of glibc :) [20:36:52] but yeah it would affect all client software that uses normal interfaces like gethostbyname() [20:38:21] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1832376 (10ArielGlenn) Well, I was misled by the timestamps in the events read and reported by the debugging script and by the command line client. Here's what happens. Each minion auths and... 
[20:40:26] PROBLEM - puppet last run on bast2001 is CRITICAL: CRITICAL: puppet fail [20:48:13] 6operations, 10RESTBase, 10procurement: Get some Samsung 850 Pro 1T spares - https://phabricator.wikimedia.org/T119659#1832400 (10GWicke) p:5Triage>3High [20:48:35] 6operations, 10RESTBase, 10procurement: Get some Samsung 850 Pro 1T spares - https://phabricator.wikimedia.org/T119659#1832405 (10GWicke) a:3RobH [20:50:48] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Set up LVS for labs dns recursors - https://phabricator.wikimedia.org/T119660#1832410 (10Andrew) 3NEW a:3Andrew [20:51:12] 6operations, 6Labs, 10Labs-Infrastructure, 5Patch-For-Review: Set up LVS for labs dns recursors - https://phabricator.wikimedia.org/T119660#1832417 (10chasemp) p:5Triage>3Normal [20:54:56] (03PS1) 10Rush: openstack: more refactor and cleanup for multisite [puppet] - 10https://gerrit.wikimedia.org/r/255442 [20:59:38] gwicke: heh, I have one of those in my personal pc [21:00:24] oooh, so do I [21:00:45] nice, isn't it? [21:00:56] so vary vary nice :) [21:01:02] commodity SSDs are pretty nice, these days [21:01:04] *very, oh dear I should really go home soon... [21:02:15] Ahh, actually I have the Evo not the Pro! [21:02:39] that's the same flash afaik, just a slightly slower controller [21:03:16] we tried the Evo once, but either got a bad batch or have some hardware incompatibility -- they would all return errors after writing a couple TB [21:03:28] yeh, when I bought it I couldnt justify the extra £100! 
[21:03:55] oohhh, now that makes me worry ;) [21:04:10] I think it was a bad batch, tbh [21:04:36] PROBLEM - Outgoing network saturation on labstore1003 is CRITICAL: CRITICAL: 14.29% of data above the critical threshold [100000000.0] [21:04:38] firmware, most likely [21:04:41] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1832441 (10MaxSem) Disregard the above patch :P [21:04:46] it's still going to stay in the back of my mind ;) [21:05:33] sorry ;) [21:06:27] 6operations, 7Database, 5Patch-For-Review: Drop `user_daily_contribs` table from all production wikis - https://phabricator.wikimedia.org/T115711#1832444 (10MaxSem) @jcrespo, the above patches should take care of remaining vestiges. The reappeared table is safe to drop at any moment meanwhile. [21:08:46] RECOVERY - puppet last run on bast2001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [21:09:31] 6operations: mw1123 unresponsive - https://phabricator.wikimedia.org/T119339#1832448 (10Dzahn) 5stalled>3Resolved a:3Dzahn [mw1123:~] $ uptime 21:08:56 up 188 days, top - 21:09:06 up 188 days, 4:04, 1 user, load average: 5.84, 5.54, 5.57 it's responsive now [21:11:20] 6operations: Trigger some sort of alert if the memcache-serious log file is filling up at a greater than usual rate - https://phabricator.wikimedia.org/T95231#1832455 (10Dzahn) @Yuvipanda is "log file filling up at a greater than usual rate" something that can be solved with check_graphite, maybe it has been sol... [21:12:10] an italian wikimedian wrote on the topic recently https://pietrodn.wordpress.com/2015/10/25/how-to-temporarily-regain-write-performance-on-samsung-840-evo-ssd/ [21:12:40] (03PS1) 10Jhobs: Disable QuickSurveys reader segmentation survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255448 (https://phabricator.wikimedia.org/T116433) [21:13:28] (03CR) 10Jhobs: "Scheduled for Monday's deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/255448 (https://phabricator.wikimedia.org/T116433) (owner: 10Jhobs) [21:13:55] (03PS2) 10Rush: openstack: more refactor and cleanup for multisite [puppet] - 10https://gerrit.wikimedia.org/r/255442 [21:20:16] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) [21:22:16] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. [21:24:31] (03PS3) 10Rush: openstack: more refactor and cleanup for multisite [puppet] - 10https://gerrit.wikimedia.org/r/255442 [21:28:37] RECOVERY - Outgoing network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [21:28:42] (03PS1) 10Ottomata: 1.0.7-1 release with fix to allow larger config lines [software/varnish/varnishkafka] (debian) - 10https://gerrit.wikimedia.org/r/255451 [21:33:02] (03PS4) 10Rush: openstack: more refactor and cleanup for multisite [puppet] - 10https://gerrit.wikimedia.org/r/255442 [21:33:58] (03CR) 10Ottomata: "Hm, am now getting a segfault when using format.topic. This worked before...hm." [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/230173 (https://phabricator.wikimedia.org/T108379) (owner: 10Ottomata) [21:34:05] (03CR) 10Ottomata: [C: 04-1] Add format.topic configuration [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/230173 (https://phabricator.wikimedia.org/T108379) (owner: 10Ottomata) [21:34:46] (03CR) 10Rush: [C: 032] openstack: more refactor and cleanup for multisite [puppet] - 10https://gerrit.wikimedia.org/r/255442 (owner: 10Rush) [21:35:17] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[21:36:46] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: puppet fail [21:37:40] ^...running it now [21:37:55] (03CR) 10Brian Wolff: "Is this the sort of thing I can ask to be swat deployed (once the deploy freeze goes away), or does this need some more complicated proces" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254700 (https://phabricator.wikimedia.org/T118887) (owner: 10Brian Wolff) [21:39:15] bblack [21:39:22] varnishkafka 1.0.7-1 avail with your change [21:39:28] i didn't include mine for this [21:39:41] i tested it and had some trouble so .. meh, will have to talk to magnus more [21:40:46] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:45:02] (03PS1) 10Legoktm: Change ExtensionDistributor default to REL1_26 branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255454 [21:45:12] (03PS4) 10MaxSem: WIP: OSM replication for maps [puppet] - 10https://gerrit.wikimedia.org/r/254490 (https://phabricator.wikimedia.org/T110262) [21:46:13] (03CR) 10Chad: [C: 032] Change ExtensionDistributor default to REL1_26 branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255454 (owner: 10Legoktm) [21:46:48] (03Merged) 10jenkins-bot: Change ExtensionDistributor default to REL1_26 branch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/255454 (owner: 10Legoktm) [21:47:18] ostriches: are you going to deploy and scap or should I? 
[21:47:31] I'm already on tin, can do
[21:48:05] thanks :) https://gerrit.wikimedia.org/r/255455 is the cherry-pick
[21:48:23] ^^ that's ok, in case anyone cares :)
[21:48:28] (03PS1) 10MaxSem: Revoke my key while I'm travelling [puppet] - 10https://gerrit.wikimedia.org/r/255456
[21:48:50] hey, can someone please take care of ^^^
[21:49:40] (03PS1) 10Rush: openstack: remove openstack::database-server [puppet] - 10https://gerrit.wikimedia.org/r/255457
[21:49:55] (03PS2) 10Rush: openstack: remove openstack::database-server [puppet] - 10https://gerrit.wikimedia.org/r/255457
[21:50:57] !log demon@tin Started scap: new MW release, swapping extdist config + msgs
[21:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:56:17] (03CR) 10Dzahn: [C: 032] Revoke my key while I'm travelling [puppet] - 10https://gerrit.wikimedia.org/r/255456 (owner: 10MaxSem)
[22:06:17] 6operations: move torrus behind misc-web - https://phabricator.wikimedia.org/T119582#1832584 (10Dzahn) @mark any concerns if we move torrus behind misc-web?
[22:07:57] *dang* sync-common is being slow.
[22:09:51] (03PS1) 10Dzahn: torrus: move behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/255460 (https://phabricator.wikimedia.org/T119582)
[22:10:11] i thought there was no scap until a while
[22:10:54] mutante: greg-g just needs to approve
[22:11:40] Or me :p
[22:11:55] WHO'S WATCHING THE WATCHER?
[22:12:47] ostriches: the new part should be that you do not see a lot of mira errors anymore
[22:12:53] s/ostriches/ozymandias/
[22:12:56] RECOVERY - Apache HTTP on mw1128 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.611 second response time
[22:13:03] ostriches: clearly Katie
[22:13:34] nobody should want K4 mad at them
[22:13:43] K4 loves me :)
[22:14:09] special:watchwatcherslist
[22:14:36] mutante: well we do have a special page for the least watched
[22:14:45] RECOVERY - HHVM rendering on mw1128 is OK: HTTP OK: HTTP/1.1 200 OK - 65449 bytes in 1.259 second response time
[22:14:57] !log demon@tin Finished scap: new MW release, swapping extdist config + msgs (duration: 23m 59s)
[22:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:15:26] p858snake: :) "The 10 least interesting pages on wikipedia"
[22:17:08] 7Puppet, 6operations, 6Labs, 5Patch-For-Review: Self hosted puppetmaster is broken - https://phabricator.wikimedia.org/T119541#1832634 (10MaxSem)
[22:18:00] (03PS1) 10Dzahn: torrus: switch to misc-web [dns] - 10https://gerrit.wikimedia.org/r/255463 (https://phabricator.wikimedia.org/T119582)
[22:19:25] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:21:46] PROBLEM - puppet last run on mw1115 is CRITICAL: CRITICAL: Puppet has 1 failures
[22:23:32] 6operations, 10Traffic: cp4007 crashed - https://phabricator.wikimedia.org/T117746#1832666 (10Dzahn) [cp4007:~] $ uptime 22:22:20 up 21 days, 5:08, 1 user, load average: 3.15, 3.65, 3.52 [cp4007:~] $ uname -a Linux cp4007 3.19.0-1-amd64 #1 SMP Debian 3.19.3-7 (2015-07-20) x86_64 GNU/Linux has been stable...
[22:23:58] 6operations, 10Traffic: cp4007 crashed - https://phabricator.wikimedia.org/T117746#1832667 (10Dzahn) 5Open>3Resolved a:3Dzahn
[22:24:27] 6operations, 10Traffic: cp4007 crashed - https://phabricator.wikimedia.org/T117746#1782679 (10Dzahn) a:5Dzahn>3MoritzMuehlenhoff
[22:40:17] (03PS1) 10Andrew Bogott: Openstack: Removed redundant admin_token settings in a couple of places. [puppet] - 10https://gerrit.wikimedia.org/r/255466
[22:42:36] (03CR) 10Rush: [C: 031] "nice" [puppet] - 10https://gerrit.wikimedia.org/r/255466 (owner: 10Andrew Bogott)
[22:43:56] RECOVERY - puppet last run on mw1115 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[22:46:12] (03PS2) 10Andrew Bogott: Openstack: Removed redundant admin_token settings in a couple of places. [puppet] - 10https://gerrit.wikimedia.org/r/255466
[22:48:53] (03CR) 10Andrew Bogott: [C: 032] Openstack: Removed redundant admin_token settings in a couple of places. [puppet] - 10https://gerrit.wikimedia.org/r/255466 (owner: 10Andrew Bogott)
[22:59:43] (03PS3) 10Rush: openstack: remove openstack::database-server [puppet] - 10https://gerrit.wikimedia.org/r/255457
[23:01:28] (03PS4) 10Rush: openstack: refactor glance role/module [puppet] - 10https://gerrit.wikimedia.org/r/255457
[23:11:35] (03PS1) 10Ori.livneh: Simplify multi-instance redis jobqueue server configuration [puppet] - 10https://gerrit.wikimedia.org/r/255469
[23:12:47] 6operations, 6Labs, 5Patch-For-Review: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1832767 (10BBlack) Some systems already had the jessie backports repo in sources.list, I think it varies depending on the install date of the system. Will probably have to...
[23:13:04] ottomata: I don't see the new vk package, is it for jessie?
[23:13:10] (on carbon I mean)
[23:18:50] (03CR) 10Rush: [C: 032] openstack: refactor glance role/module [puppet] - 10https://gerrit.wikimedia.org/r/255457 (owner: 10Rush)
[23:21:20] (03PS2) 10Ori.livneh: Simplify multi-instance redis jobqueue server configuration [puppet] - 10https://gerrit.wikimedia.org/r/255469
[23:21:52] !log krenair@tin Synchronized private/README_BEFORE_MODIFYING_ANYTHING: 334ca105e92aaf7046e244ff39189f3823d31a7d (duration: 00m 32s)
[23:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:22:38] 6operations, 6Labs, 5Patch-For-Review: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1832776 (10hashar) I very welcome the decision to add jessie-backports. That has hit me when setting up Nodepool, the POC had jessie-backports but when we rebuild it we had...
[23:23:00] (03PS3) 10Ori.livneh: Simplify multi-instance redis jobqueue server configuration [puppet] - 10https://gerrit.wikimedia.org/r/255469
[23:23:06] (03CR) 10Ori.livneh: [C: 032 V: 032] Simplify multi-instance redis jobqueue server configuration [puppet] - 10https://gerrit.wikimedia.org/r/255469 (owner: 10Ori.livneh)
[23:26:43] (03PS1) 10Ori.livneh: Migrate rdb1007 to jobqueue_redis [puppet] - 10https://gerrit.wikimedia.org/r/255470
[23:27:17] (03PS2) 10Ori.livneh: Migrate rdb1007 to jobqueue_redis [puppet] - 10https://gerrit.wikimedia.org/r/255470
[23:28:56] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: puppet fail
[23:32:25] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: puppet fail
[23:43:37] bd808: logstash is fairly snappy currently, did anything change to make it faster?
[23:44:01] gwicke: yeah we stopped sending the job queue logs into it
[23:44:21] it's quite noticeable ;)
[23:45:10] bblack, ja for jessie
[23:45:12] http://apt.wikimedia.org/wikimedia/pool/main/v/varnishkafka/
[23:45:15] http://apt.wikimedia.org/wikimedia/pool/main/v/varnishkafka/varnishkafka_1.0.7-1_amd64.deb
[23:45:16] ja?
[23:46:00] gwicke: it cut the on-disk size of the daily index down from 100+Gb to 25Gb
[23:46:29] bd808: ouch; those logs *are* very verbose
[23:47:11] We had it tuned down to one log event per job but yeah that was still a ton of log activity
[23:47:21] 6operations, 6Release-Engineering-Team: Rewrite http://download.wikimedia.org/mediawiki/ -> https://releases.wikimedia.org/mediawiki in less than 3 redirects - https://phabricator.wikimedia.org/T119679#1832826 (10Reedy) 3NEW
[23:47:57] We do still record error/warning log events from the jobs which is nice
[23:48:24] bd808: for the others, rates are probably more useful anyway
[23:48:33] to check that stuff is being processed
[23:50:59] the next thing on my chatty chatty logging hit list is ocg
[23:51:24] Debug: Doing something
[23:51:26] Debug: Doing something else
[23:51:30] Debug: Doing more things
[23:51:35] Debug: Doing more things done
[23:51:52] pretty much -- https://logstash.wikimedia.org/#dashboard/temp/AVFBD2AEptxhN1XaqE3j
[23:52:33] bd808: those progress reports are especially ridiculous
[23:56:16] it is about 30% of our log volume now which is a bit ridiculous I think
[23:56:21] are there tasks for getting rid of them?
[23:56:48] no, it's just something I've been meaning to look at in the source
[23:57:08] please do!
[23:57:26] I think everything it does right now is logged at info level so it's hard to filter
[23:57:40] but I need to read the source to find out if that's true or not
[23:58:01] (03CR) 10Ori.livneh: [C: 032] Migrate rdb1007 to jobqueue_redis [puppet] - 10https://gerrit.wikimedia.org/r/255470 (owner: 10Ori.livneh)
[23:58:16] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures