[00:00:04] RoanKattouw ostriches Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160127T0000). Please do the needful.
[00:00:04] ebernhardson: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[00:02:24] i suppose it's just me, i'll deploy
[00:02:31] !log enable puppet and codify the 192 thread count for nfsd
[00:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:02:43] (03CR) 10EBernhardson: [C: 032] Put more like query load back on eqiad for codfw load testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266559 (owner: 10EBernhardson)
[00:02:53] PROBLEM - RAID on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:03:04] PROBLEM - puppet last run on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:03:07] (03Merged) 10jenkins-bot: Put more like query load back on eqiad for codfw load testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266559 (owner: 10EBernhardson)
[00:03:23] PROBLEM - nutcracker process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:03:32] PROBLEM - SSH on mw1161 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:03:53] PROBLEM - DPKG on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:04:42] PROBLEM - configured eth on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:05:43] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:05:52] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:06:13] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[00:06:52] PROBLEM - dhclient process on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:07:13] PROBLEM - salt-minion processes on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:07:52] mw1161 is down?
[00:08:03] (I was scared for a second, but its only one host)
[00:09:16] not sure how this will work out...sync-file is stuck at sync-proxies having only synced 11 of 12 proxies
[00:09:22] RECOVERY - Disk space on mw1161 is OK: DISK OK
[00:10:19] greg-g: but annoyingly, mw1161 is one of the 12 proxies used for scap
[00:10:19] ebernhardson: shouldn't matter too much
[00:10:20] !log ebernhardson@mira Synchronized wmf-config/CirrusSearch-production.php: point morelike queries back at the eqiad cluster (duration: 05m 41s)
[00:10:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:10:25] they'll just use other hosts
[00:10:28] oh good
[00:10:31] greg-g, happens to some hosts sometimes
[00:10:33] not hardcoded
[00:10:41] so yeah, scary
[00:10:59] ahh, i thought this was supposed to be 'row aware', and was worried it would stick with the proxy in the same DC row
[00:11:03] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.001 second response time on port 11212
[00:11:27] The proxy selection mechanism is the 2nd coolest thing in the python version of scap. The coolest thing is the ssh dispatch loop. ori came up with both of them. :)
[00:11:28] nah
[00:11:38] ebernhardson: it uses the "best" server
[00:11:48] it should be the same rack/row
[00:11:51] But often not
[00:11:57] lowest tcp hop count
[00:12:02] it chooses the best proxies, and proxies are typically selected to be one per row, right?
[00:12:27] Yeah
[00:13:04] https://github.com/wikimedia/operations-puppet/blob/2015be754e75ee2ceeb6a8aa6449f5a706bb7df0/hieradata/common/scap/dsh.yaml#L3-L16
[00:13:28] well, theres one per rack with mw app servers in
[00:13:38] 10Ops-Access-Requests, 6operations, 6Parsing-Team: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1968165 (10ssastry) @RobH Thanks. Looks good. @tstarling, @arlolra, @cscott are the others besides me that will need...
[00:14:43] PROBLEM - Disk space on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:15:54] !log ebernhardson@mira Synchronized php-1.27.0-wmf.11/extensions/CirrusSearch/: Allow pointing morelike queries at a specific datacenter (duration: 03m 04s)
[00:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:16:10] (03CR) 10Bmansurov: "Yeah, that's what the task says now." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov)
[00:16:23] PROBLEM - nutcracker port on mw1161 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:16:51] 6operations, 6Discovery: Elasticsearch health and capacity planning FY2016-17 - https://phabricator.wikimedia.org/T124626#1968174 (10Tfinc) In my previous life managing search clusters we split our corpus by geographic location, then function (full text, prefix, etc) and then partitioned the data to fit into t...
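[Editor's note] The scap proxy-selection behaviour discussed above (around 00:11–00:13) can be sketched roughly as follows. This is a hypothetical illustration, not scap's actual code: `pick_proxy`, `hop_count`, and `alive` are made-up names. The point of the design is that each sync target is paired with the closest reachable proxy, so a single dead proxy (mw1161 here) just drops out of the pool instead of blocking the sync.

```python
# Hypothetical sketch of scap-style proxy selection; the names here
# (pick_proxy, hop_count, alive) are illustrative, not scap's real API.
def pick_proxy(target, proxies, hop_count, alive):
    """Return the reachable proxy with the lowest hop count to `target`.

    Returns None when no proxy is reachable; a caller could then fall
    back to syncing the target directly from the master.
    """
    candidates = [p for p in proxies if alive(p)]
    if not candidates:
        return None
    # The "best" proxy is the one with the fewest network hops, which
    # usually (but, per the discussion above, not always) means a proxy
    # in the same rack or row as the target.
    return min(candidates, key=lambda p: hop_count(target, p))
```

This also explains why the sync above proceeded with 11 of 12 proxies: targets previously served by the dead proxy are simply reassigned to the next-closest live one.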
[00:23:31] (03PS1) 10RobH: creation of parsoid-rt-admin group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701)
[00:25:15] 6operations, 10vm-requests, 5Patch-For-Review: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1968197 (10Dzahn)
[00:25:17] 6operations, 5Patch-For-Review: Reinstall caesium with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1968198 (10Dzahn)
[00:27:14] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1968200 (10RobH)
[00:27:17] 6operations, 5Patch-For-Review: Reinstall caesium with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1968202 (10Dzahn) instead of reinstalling caesium, we decided to move the only service that was on it, releases.wikimedia.org, over to an exising virtual machine, bromine.eqiad.wmnet wh...
[00:28:06] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1968207 (10RobH) a:5ssastry>3RobH Please note my patchset does NOT include the actual users y...
[00:28:15] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1968209 (10RobH) 5Open>3stalled
[00:28:20] 6operations, 5Patch-For-Review: Reinstall caesium with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1968211 (10Dzahn) 5Open>3Invalid
[00:28:52] 6operations, 5Patch-For-Review: Reinstall caesium (releases.wm.org) with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1968212 (10greg)
[00:29:10] gah, I was confused
[00:29:15] ignore that :)
[00:29:25] 6operations, 5Patch-For-Review: Reinstall caesium (releases.wm.org) with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1968214 (10Dzahn)
[00:30:20] 6operations: Reinstall caesium (releases.wm.org) with jessie (and convert to VM) - https://phabricator.wikimedia.org/T123714#1936383 (10Dzahn)
[00:30:53] 6operations, 10vm-requests: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1968217 (10Dzahn)
[00:33:49] <_joe_> AaronSchulz: I prepared a few patches to mediawiki-config, most still definitely need refining, but I think it's a shot in the right direction to make switching datacenters easier
[00:34:22] RECOVERY - nutcracker port on mw1161 is OK: TCP OK - 0.000 second response time on port 11212
[00:34:23] RECOVERY - Disk space on mw1161 is OK: DISK OK
[00:34:30] 6operations, 10Beta-Cluster-Infrastructure, 6Labs, 10Labs-Infrastructure: beta: Get SSL certificates for *.{projects}.beta.wmflabs.org - https://phabricator.wikimedia.org/T50501#1968235 (10bd808)
[00:35:03] RECOVERY - configured eth on mw1161 is OK: OK - interfaces up
[00:35:12] RECOVERY - RAID on mw1161 is OK: OK: no RAID installed
[00:35:23] RECOVERY - puppet last run on mw1161 is OK: OK: Puppet is currently enabled, last run 54 minutes ago with 0 failures
[00:35:23] RECOVERY - dhclient process on mw1161 is OK: PROCS OK: 0 processes with command name dhclient
[00:35:33] RECOVERY - nutcracker process on mw1161 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[00:35:42] RECOVERY - SSH on mw1161 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.4 (protocol 2.0)
[00:35:43] RECOVERY - salt-minion processes on mw1161 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[00:35:59] 6operations, 10vm-requests: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1968240 (10Dzahn) Hi, if you have been subscribed to this ticket it's because you are a member in one of the "releasers-" admin groups and have shell access. This is fyi...
[00:36:02] RECOVERY - DPKG on mw1161 is OK: All packages OK
[00:38:03] (03CR) 10GWicke: "I think we want to separate ganglia and especially alerts / nagios anyway, and doing so is a lot simpler when using cluster." [puppet] - 10https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi)
[00:39:06] mw1161 is back? who touched it?
[00:40:28] greg-g, mind if I send a couple of site-requests changes through before the end of this window?
[00:41:09] Krenair: i don't think so?
[00:41:10] :)
[00:41:21] (03PS2) 10Alex Monk: Disable NewUserMessage on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266161 (https://phabricator.wikimedia.org/T122441)
[00:41:41] (03CR) 10Alex Monk: [C: 032] Disable NewUserMessage on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266161 (https://phabricator.wikimedia.org/T122441) (owner: 10Alex Monk)
[00:42:04] (03Merged) 10jenkins-bot: Disable NewUserMessage on metawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266161 (https://phabricator.wikimedia.org/T122441) (owner: 10Alex Monk)
[00:43:39] (03PS2) 10Alex Monk: Change ukwikinews logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266497 (https://phabricator.wikimedia.org/T124778)
[00:43:44] (03CR) 10Alex Monk: [C: 032] Change ukwikinews logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266497 (https://phabricator.wikimedia.org/T124778) (owner: 10Alex Monk)
[00:44:21] (03Merged) 10jenkins-bot: Change ukwikinews logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266497 (https://phabricator.wikimedia.org/T124778) (owner: 10Alex Monk)
[00:44:49] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/266161/ (duration: 02m 27s)
[00:44:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:46:39] why is sync-masters taking so long?
[00:48:02] !log krenair@mira Synchronized w/static/images/project-logos/ukwikinews.png: https://gerrit.wikimedia.org/r/#/c/266497/ (duration: 02m 29s)
[00:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:50:30] Hello. After a 'Portal' namespace added to Wuu Wikipedia in T124389, pages originally beginning with 'Portal:' are inaccessible now (like Portal:地理 and its talk page Talk:Portal:地理). I hope someone can fix it.
[00:51:08] oh, yeah
[00:51:11] there's a script for that
[00:52:01] !log krenair@mira Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/266497/ (duration: 02m 26s)
[00:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:52:59] Lantern, try now
[00:53:43] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Traffic, 7Documentation: Automate and/or better-document varnish ban procedure for operations staff, so it can be accomplished with more speed and confidence in outage conditions - https://phabricator.wikimedia.org/T124835#1968281 (10BBlack) I'...
[00:55:17] Talk:Portal:地理 is still inaccessible, https://wuu.wikipedia.org/wiki/Talk:Portal:%E5%9C%B0%E7%90%86
[00:56:06] That hasn't moved namespaces, it's still in Talk:
[00:56:16] huh
[01:00:04] ok, thx
[01:01:02] bd808, any idea what's up with sync-masters?
[01:01:19] Krenair: nope. what are you seeing?
[01:01:34] bd808, 00:51:41 Finished sync-masters (duration: 02m 07s)
[01:02:10] Lantern, can you open a task about this?
[01:02:29] It also affects Talk:Portal:江南古镇
[01:03:06] ok
[01:03:24] i will open a task
[01:03:42] Krenair: hmm... so rsync between mira and tin is super slow. ping times don't look bad and load averages are low
[01:04:10] ssh between them seems fine
[01:07:20] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1968335 (10RobH) a:5RobH>3ssastry @ssastry: Can you have your manager approve the request to...
[01:08:55] Krenair: my next guess in debugging would be to get a root involved and have them run the rsync command from /usr/local/bin/scap-master-sync with --verbose added to see it sheds any light
[01:19:35] Krenair, are you poking at namespaces? cuz I'm about to press a button... :P
[01:19:51] not at the moment MaxSem
[01:19:55] ok
[01:19:57] what button are you about to press?
[01:21:08] namespaceDupes
[01:21:25] !log running mwscript namespaceDupes.php --wiki=wuuwiki --move-talk --fix
[01:21:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:21:43] (I did a dry run first)
[01:22:09] "Database is read-only: Brief Database Maintenance in progress, please try again in 3 minutes"
[01:22:16] uuuuuugh? :P
[01:22:39] did you run it from mira or something?
[01:22:45] yup
[01:22:51] those codfw servers won't let you write to the DB
[01:22:57] kekeke
[01:22:59] should be using terbium
[01:23:05] ?
[01:23:26] this was discovered earlier, I did ask for someone to change the reason given to be useful, but... :/
[01:26:50] !log Fail, trying something else...
[01:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:27:11] 7Puppet, 10MediaWiki-extensions-ORES, 6Revision-Scoring-As-A-Service: Fix puppet webservice name to uwsgi-ores-web - https://phabricator.wikimedia.org/T124621#1968371 (10Ladsgroup) I don't think this is related to the extension or maybe I'm wrong
[01:29:31] !log on terbium: ran mwscript namespaceDupes.php --wiki=wuuwiki --source-pseudo-namespace='' --add-suffix=/renamed --fix
[01:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:31:40] 7Puppet, 6Revision-Scoring-As-A-Service, 10ores: Fix puppet webservice name to uwsgi-ores-web - https://phabricator.wikimedia.org/T124621#1968374 (10Ladsgroup)
[01:35:12] 7Blocked-on-Operations, 6operations, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1968377 (10GWicke)
[01:36:13] gwicke: heh. thanks for reviving that.
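[Editor's note] The breakage Lantern reported above, and the core of what `namespaceDupes.php` repairs, can be shown with a small sketch. This is an illustration, not MediaWiki's actual implementation; `resolve_stranded` is a made-up name, and namespace id 100 for 'Portal' is an assumption. Once 'Portal' becomes a real namespace, a page row stored as (namespace 0, title 'Portal:地理') is unreachable, because the title parser now resolves that name into the Portal namespace; the script rewrites such stranded rows to where the parser looks.

```python
# Illustrative sketch only (not the real namespaceDupes.php logic).
# `namespaces` maps a newly registered prefix to its numeric namespace
# id, e.g. {'Portal': 100} -- the id is a hypothetical example value.
def resolve_stranded(ns, dbkey, namespaces):
    """Map a stranded (namespace, title) row to the location where the
    title parser now resolves it; return the row unchanged when no
    registered namespace prefix matches."""
    if ns == 0 and ':' in dbkey:
        prefix, rest = dbkey.split(':', 1)
        if prefix in namespaces:
            return namespaces[prefix], rest
    return ns, dbkey
```

The `--move-talk` flag used above presumably covers the analogous talk-page case (Talk:Portal:X moving to Portal_talk:X), which is why Talk:Portal:地理 needed a second pass after the first run.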
[01:36:53] it's an ongoing issue for us
[01:41:19] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1968383 (10RobH) a:3RobH I'll hunt someone down to review this tomorrow, it has sat long enough.
[01:50:00] 7Blocked-on-Operations, 6operations, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1968401 (10GWicke) FYI, here are some upcoming changes in Services that will use more disk space for metrics: - We are about to split RESTBase metrics by request type (internal, in...
[01:50:42] (03PS1) 10Ori.livneh: Add a speed experiment which inlines the top stylesheet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266647
[01:51:07] (03CR) 10Ori.livneh: [C: 032] Add a speed experiment which inlines the top stylesheet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266647 (owner: 10Ori.livneh)
[01:51:37] (03Merged) 10jenkins-bot: Add a speed experiment which inlines the top stylesheet [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266647 (owner: 10Ori.livneh)
[01:53:41] (03PS1) 10Ori.livneh: dotfiles: symlink .hosts/mira to .hosts/tin [puppet] - 10https://gerrit.wikimedia.org/r/266648
[01:54:10] (03CR) 10Ori.livneh: [C: 032 V: 032] dotfiles: symlink .hosts/mira to .hosts/tin [puppet] - 10https://gerrit.wikimedia.org/r/266648 (owner: 10Ori.livneh)
[01:59:52] !log ori@mira Synchronized docroot and w: Icc4f6134b0: Add a speed experiment which inlines the top stylesheet (duration: 02m 28s)
[01:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:06:53] YuviPanda: https://gerrit.wikimedia.org/r/#/c/266332/
[02:07:47] (03CR) 10Yuvipanda: [C: 031] Add a service alias for mw1017 (app server debug backend) [dns] - 10https://gerrit.wikimedia.org/r/266332 (owner: 10Ori.livneh)
[02:07:55] (03CR) 10Ori.livneh: [C: 032] Add a service alias for mw1017 (app server debug backend) [dns] - 10https://gerrit.wikimedia.org/r/266332 (owner: 10Ori.livneh)
[02:09:33] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[02:13:04] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[02:23:58] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.10) (duration: 09m 51s)
[02:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:48:08] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.11) (duration: 10m 25s)
[02:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:55:21] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Jan 27 02:55:21 UTC 2016 (duration 7m 13s)
[02:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:13:42] (03PS1) 10EBernhardson: Correct invalid shard configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266655
[03:15:15] (03CR) 10EBernhardson: [C: 032] Correct invalid shard configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266655 (owner: 10EBernhardson)
[03:15:56] (03Merged) 10jenkins-bot: Correct invalid shard configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266655 (owner: 10EBernhardson)
[03:19:29] !log ebernhardson@mira Synchronized wmf-config/CirrusSearch-production.php: Correct invalid cirrus shard configuration (duration: 02m 59s)
[03:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:50:22] PROBLEM - puppet last run on ms-fe2003 is CRITICAL: CRITICAL: puppet fail
[03:52:14] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[03:59:22] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[04:10:03] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [24.0]
[04:13:52] (03PS1) 10EBernhardson: Reduce replica count for commonswiki_file in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266658
[04:14:16] (03CR) 10jenkins-bot: [V: 04-1] Reduce replica count for commonswiki_file in codfw [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266658 (owner: 10EBernhardson)
[04:17:57] (03CR) 10EBernhardson: "If disk space is a concern, i could try and focus this in on more specific metrics to hold onto. Within cirrussearch we want to keep serve" [puppet] - 10https://gerrit.wikimedia.org/r/266567 (owner: 10EBernhardson)
[04:18:32] RECOVERY - puppet last run on ms-fe2003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:21:18] (03PS2) 10Dereckson: Raise file upload limit to 2.5 GB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ)
[04:22:27] (03PS3) 10Dereckson: Raise file upload limit to 2500 MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ)
[04:41:52] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[05:20:21] (03PS1) 10EBernhardson: Allow access to graphite/events/get_data [puppet] - 10https://gerrit.wikimedia.org/r/266663
[05:24:49] (03PS2) 10EBernhardson: Allow access to graphite/events/get_data [puppet] - 10https://gerrit.wikimedia.org/r/266663
[05:29:59] 7Blocked-on-Operations, 6operations, 10Wikimedia-General-or-Unknown: Invalidate all users sessions - https://phabricator.wikimedia.org/T124440#1968538 (10Tgr) >>! In T124440#1966254, @Legoktm wrote: > It's still running :/ Opened T124861 about that.
[05:58:14] RECOVERY - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is OK: OK: No anomaly detected
[06:14:12] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0]
[06:29:53] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[06:30:42] (03PS1) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668
[06:30:42] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:43] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:52] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:53] PROBLEM - puppet last run on wtp2008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:03] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:43] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:02] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:44] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:22] (03CR) 10Florianschmidtwelzow: [C: 031] Raise file upload limit to 2500 MB [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266544 (https://phabricator.wikimedia.org/T116514) (owner: 10TheDJ)
[06:42:33] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[06:47:29] (03PS2) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668
[06:53:42] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:55:43] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:42] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [24.0]
[06:57:04] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:13] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:57:13] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:57:22] RECOVERY - puppet last run on wtp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:23] PROBLEM - puppet last run on mw1134 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:57:32] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:13] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:23] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:08:16] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Deploy WDQS node on codfw - https://phabricator.wikimedia.org/T124862#1968603 (10Smalyshev) 3NEW
[07:08:31] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Adjust balance of WDQS nodes to allow continued operation if eqiad went offline. - https://phabricator.wikimedia.org/T124627#1968613 (10Smalyshev)
[07:08:34] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Deploy WDQS node on codfw - https://phabricator.wikimedia.org/T124862#1968614 (10Smalyshev)
[07:10:06] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Deploy WDQS node on codfw - https://phabricator.wikimedia.org/T124862#1968603 (10Smalyshev)
[07:12:53] <_joe_> SMalyshev: I doubt this can happen this quarter
[07:13:24] _joe_: what is the blocker - hw, time, something else?
[07:13:31] <_joe_> time, mainly
[07:13:43] _joe_: we've got new ops guy, maybe he could help?
[07:13:53] <_joe_> hw, it must come from your budget :)
[07:14:01] <_joe_> SMalyshev: when did guillame joins?
[07:14:12] _joe_: next week I understand
[07:14:22] <_joe_> ok so I got that right :P
[07:14:33] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[07:14:47] <_joe_> SMalyshev: to be honest, I'd like him to help with the switchover of ES to codfw when the time arrives
[07:15:02] <_joe_> given it's a shared goal this quarter
[07:15:08] _joe_: yeah so maybe he could help. It shouldn't be a lot of work. But it's not super-urgent - it's just part of making us less critically dependent on one cluster
[07:15:23] <_joe_> SMalyshev: I agree fully it needs to be done :)
[07:16:22] _joe_: well, while he gets more familiar with eqiad/codfw stuff, that may come as one of the tasks too :) anyway, I just created the task so we know it should be done. We'll see how it works budget/time wise, if we have to wait for a couple of months, no problem, current servers work just fine for now
[07:16:47] * _joe_ nod
[07:20:13] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:24:02] RECOVERY - puppet last run on mw1134 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:32:32] (03CR) 10Amire80: [C: 04-1] Beta: Add cxserver registry to Beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry)
[07:33:05] wdqs in codfw will probably be begining of next FY or so. Or at least it came up during planning for next FY, sounds like i should include a machine in the budget (and note we are giving ops back a machine in eqiad) ?
[07:48:01] ebernhardson: that's a good idea
[07:50:51] (03PS3) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668
[07:51:58] (03CR) 10KartikMistry: Beta: Add cxserver registry to Beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry)
[08:01:35] <_joe_> ebernhardson: include at least two
[08:01:46] <_joe_> we don't want to have to switch datacenters if one machine fails
[08:15:47] (03CR) 10Amire80: [C: 04-1] Beta: Add cxserver registry to Beta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry)
[08:29:05] (03PS4) 10KartikMistry: Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668
[08:30:35] (03CR) 10Amire80: [C: 031] Beta: Add cxserver registry to Beta [puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry)
[08:31:56] Can anyone from Ops merge https://gerrit.wikimedia.org/r/#/c/266668/ ? It will 'unbreak' beta Content Translation.
[08:32:02] akosiaris: godog ^^
[08:35:14] kart_: it can be cherry-picked on the beta puppetmaster
[08:35:52] that is probably a good idea anyway, since i'm not sure adding 1,145 lines of hiera data is the right way to do this
[08:38:17] ori: better as of now :)
[08:38:28] great
[08:42:29] <_joe_> thanks ori
[08:42:44] <_joe_> I wasn't paying attention to this channel early enough :/
[08:43:55] (03CR) 10KartikMistry: "Cherry-picked to Beta, but https://cxserver-beta.wmflabs.org/v1#!/Languages/get_v1_languagepairs is still empty, so I will look into this " [puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry)
[08:44:07] (03CR) 10Giuseppe Lavagetto: Use the logical redis definition for GettingStarted. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266481 (https://phabricator.wikimedia.org/T124671) (owner: 10Giuseppe Lavagetto)
[08:44:55] (03PS3) 10Giuseppe Lavagetto: Use the logical redis definition for GettingStarted. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266481 (https://phabricator.wikimedia.org/T124671)
[08:45:16] good morning
[08:45:28] (03PS2) 10Giuseppe Lavagetto: Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273)
[08:45:43] _joe_: ori: could you confirm production redis servers have been transitioned to Jessie ?
[08:46:07] the redis servers on beta cluster are on Trusty and the redis-server package there doesn't support one of the new option we are using
[08:46:18] (oh and good morning / night)
[08:46:37] which option?
[08:46:38] I can check
[08:46:49] <_joe_> hashar: in eqiad most redises are precises
[08:47:16] <_joe_> so it's still redis 2.6
[08:47:35] <_joe_> hashar: they'll be moved to jessie during this quarter, I guess
[08:47:39] no, I updated those to 2.8 with a backported package
[08:48:07] <_joe_> ori: which ones?
[08:48:12] ori: 'latency-monitor-threshold 100'
[08:48:16] <_joe_> rdb1001 has redis-server 2.6
[08:48:25] our main bug is https://phabricator.wikimedia.org/T124677 (job queue broken)
[08:48:32] <_joe_> ii redis-server 2:2.6.13-1+wmf1 Persistent key-value database with network interface
[08:49:13] not sure; we were seeing latency spikes and i wanted to use the latency monitor
[08:49:15] <_joe_> hashar: which version of redis do you have in beta?
[08:49:22] there is a couple other tasks that got merged in, but all related to deployment-redis01 being dead (because it can't start)
[08:49:36] what ever is shipped by Trusty so 2.8.4-2+wmf1
[08:49:41] i'll add a conditional
[08:49:45] whereas Jessie ships 3.0.6-2~bpo8+1
[08:50:02] hrm
[08:50:04] it's there
[08:50:11] it's in a if os_version('debian >= jessie') { } block
[08:50:13] I am surprised it hasn't impacted production yet , but I guess the redis server services are rather stables
[08:51:07] <_joe_> ori: uhm maybe os_version doesn't behave as it's supposed to be?
[08:51:08] oh
[08:51:24] hashar: https://github.com/wikimedia/operations-puppet/blob/production/modules/redis/manifests/init.pp#L36-L42
[08:51:27] or the redis configuration file got generated before the os_version harness has been enabled
[08:51:35] could be
[08:51:53] simply deleting the line should resolve it, then
[08:51:55] <_joe_> hashar: and you never ran puppet again?
[08:52:04] surely os_version being broken would have been noticed and iirc it is covered by tests (though they could be wrong)
[08:52:06] <_joe_> oh I see there is no ensure => absent
[08:52:09] <_joe_> damn puppet
[08:52:23] I know puppet fails to apply some refresh from time to time
[08:52:37] i don't mean to sneak off, but i'm really tired
[08:52:47] sounds like this can be solved by editing out the line
[08:52:51] <_joe_> go to bed, I think I can figure this out :)
[08:52:53] <_joe_> and yes
[08:52:55] ori: go go to bed :-}
[08:53:07] <_joe_> hashar: I'm on it
[08:53:09] ori: thank you for the confirmation we still have Precise redis on prod.
[08:53:36] redis.conf:#latency-monitor-threshold 100
[08:53:42] looks like it has been monkey patched
[08:54:05] quoting Mukunda "I commented the line from the config file and started redis, I'm going to leave it to ori to decide what to do about a permanent solution."
[08:54:48] <_joe_> yeah but he did that wrong
[08:54:54] <_joe_> let me fix this
[08:56:08] <_joe_> ok it seems allright now
[08:56:48] <_joe_> hashar: so the problem is that file_line wasn't absented on non-jessie hosts after it was already applied
[08:56:50] ?
[08:57:01] hm
[08:57:12] ohhh
[08:57:29] <_joe_> and things like file_line, cron, etc do remain on the system, they're simply unmanaged
[08:57:34] _joe_: that is because we monkey patch the configuration file that is provided by the .deb package isn't it ?
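[Editor's note] The manual cleanup being discussed here ("simply deleting the line") amounts to stripping the stale directive that the no-longer-managed `file_line` left behind in redis.conf. A minimal sketch, with a made-up helper name (`strip_directive`), assuming the fix is simply to drop any line setting the unsupported option on hosts whose redis-server predates it:

```python
# Hypothetical sketch of the manual redis.conf cleanup; strip_directive
# is a made-up name, not a real tool used here.
def strip_directive(conf_text, directive='latency-monitor-threshold'):
    """Return redis.conf contents with any line that sets `directive`
    removed. Commented-out copies (lines starting with '#') are kept,
    matching the '#latency-monitor-threshold 100' seen above."""
    kept = [line for line in conf_text.splitlines()
            if not line.lstrip().startswith(directive)]
    return '\n'.join(kept)
```

Note the underlying puppet issue is separate: `file_line` (like `cron`) leaves its content on disk once the resource stops being applied, unless an `ensure => absent` counterpart is added.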
[08:57:35] <_joe_> a fact we often forget [08:57:39] <_joe_> yes [08:57:57] <_joe_> we patch, monkey-patching is something else :) [08:58:11] so should we manually edit them or is there a change to apply in puppet? [08:58:24] twentyafterfour: hello! basically the latency-monitor-threshold invalid value is a leftover [08:58:27] <_joe_> manually edit it [08:58:40] <_joe_> I reapplied puppet and it didn't come back [08:58:50] twentyafterfour: it is only supposed to be applied on Jessie, and production uses an old redis-server not supporting that setting [08:58:58] _joe_: doing the mass edits :-} [08:59:03] thank you very much [08:59:21] _joe_: would you mind writing a quick summary on https://phabricator.wikimedia.org/T124677 ? [08:59:21] <_joe_> I did nothing :) [08:59:24] for the record [08:59:30] <_joe_> yup [08:59:39] well at least explain how puppet (mis?)behaves [09:00:33] so why was commenting the line the wrong thing to do? It seemed like a valid temporary fix. [09:00:59] <_joe_> twentyafterfour: no it seemed to me that you didn't restart all the redis instances, while you did [09:01:07] <_joe_> twentyafterfour: you did the right thing [09:01:53] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [09:06:33] :) [09:08:25] Notice: /Stage[main]/Sysfs/Service[sysfsutils]/ensure: ensure changed 'stopped' to 'running' [09:08:25] Info: /Stage[main]/Sysfs/Service[sysfsutils]: Unscheduling refresh on Service[sysfsutils] [09:08:26] bah [09:08:33] it is not even a daemon :} [09:19:12] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [09:20:53] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [09:21:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold
[1000.0] [09:23:23] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [09:24:56] Debug: Executing '/etc/init.d/sysfsutils status' [09:24:56] Debug: Executing '/etc/init.d/sysfsutils start' [09:24:56] Notice: /Stage[main]/Sysfs/Service[sysfsutils]/ensure: ensure changed 'stopped' to 'running' [09:24:56] ah [09:25:02] and there is no status .. [09:26:28] <_joe_> hashar: so hasstatus => no should fix that, maybe [09:26:50] i guess [09:27:02] gotta look at what happens on other distributions [09:27:20] <_joe_> I have no time to look into it, sorry [09:27:33] i will [09:27:39] just sharing my thoughts out loud [09:27:45] since I feel lonely in my coworking place [09:32:02] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 58.62% of data above the critical threshold [5000000.0] [09:37:52] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [09:41:05] 5xx reqs/min getting better, it looks like a spike [09:44:53] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:46:14] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0] [09:46:43] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [09:47:40] (03PS1) 10Hashar: sysfs: puppet always restarted the sysfsutils service [puppet] - 10https://gerrit.wikimedia.org/r/266684 [09:54:04] (03PS2) 10Hashar: sysfs: puppet always restarted the sysfsutils service [puppet] - 10https://gerrit.wikimedia.org/r/266684 [10:02:03] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 76 failures [10:03:03] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [24.0] [10:10:50] 7Puppet, 6operations, 10Salt: Make it possible for wmf-reimage to work seamlessly 
with a non-local salt master - https://phabricator.wikimedia.org/T124761#1968960 (10ArielGlenn) We could write a runner for the salt master that accepts a key after checking the puppet accepted cert, and we could configure the... [10:11:35] (03PS3) 10Hashar: sysfs: puppet always restarted the sysfsutils service [puppet] - 10https://gerrit.wikimedia.org/r/266684 [10:12:20] PROBLEM - NTP on mc2009 is CRITICAL: NTP CRITICAL: No response from NTP server [10:12:20] PROBLEM - NTP on mc2012 is CRITICAL: NTP CRITICAL: No response from NTP server [10:13:39] (03CR) 10Hashar: "Else puppet keeps attempting to restart sysfsutils :(" [puppet] - 10https://gerrit.wikimedia.org/r/266684 (owner: 10Hashar) [10:15:18] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:15:39] <_joe_> oh gee, toollabs [10:16:56] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.075 second response time [10:18:05] looking into ntpd on mc2009/2012 [10:20:15] (03PS3) 10Giuseppe Lavagetto: Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) [10:20:53] puppet failures on mw1119 are due to lack of memory [10:21:03] (03CR) 10jenkins-bot: [V: 04-1] Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [10:22:51] (03CR) 10Alex Monk: "New config file, will need to be added to noc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [10:23:12] (03PS4) 10Giuseppe Lavagetto: Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) [10:23:20] <_joe_> Krenair: ah, right [10:23:50] <_joe_> thanks 
[10:25:51] !log restarting apache2 and hhvm on mw1119 [10:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:28:13] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [10:29:02] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [10:30:55] (03PS5) 10Giuseppe Lavagetto: Rationalize definition of service hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266509 (https://phabricator.wikimedia.org/T114273) [10:34:00] 7Puppet, 6operations, 10Salt: Make it possible for wmf-reimage to work seamlessly with a non-local salt master - https://phabricator.wikimedia.org/T124761#1969025 (10Joe) @ArielGlenn it seems like a good idea. [10:34:31] (03PS1) 10Muehlenhoff: Fix random ntp startup failures [puppet] - 10https://gerrit.wikimedia.org/r/266693 [10:35:12] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 765 [10:39:12] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [24.0] [10:40:12] RECOVERY - check_mysql on db1008 is OK: Uptime: 673316 Threads: 2 Questions: 4975629 Slow queries: 4496 Opens: 1802 Flush tables: 2 Open tables: 417 Queries per second avg: 7.389 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [10:42:11] (03PS5) 10Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [10:43:17] (03CR) 10jenkins-bot: [V: 04-1] Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) (owner: 10Giuseppe Lavagetto) [10:47:31] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under 
wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimediafoundation.org - https://phabricator.wikimedia.org/T124804#1969049 (10Aklapper) >>! In T124804#1968040, @TheD... [10:48:33] (03PS2) 10KartikMistry: cxserver: Enable new pairs for Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/265691 (https://phabricator.wikimedia.org/T121053) [10:48:56] (03CR) 10Ema: [C: 031] Fix random ntp startup failures [puppet] - 10https://gerrit.wikimedia.org/r/266693 (owner: 10Muehlenhoff) [10:51:34] RECOVERY - NTP on mc2009 is OK: NTP OK: Offset -0.0001429319382 secs [10:53:20] godog: or akosiaris: around? [10:53:23] RECOVERY - NTP on mc2012 is OK: NTP OK: Offset 0.0004059076309 secs [10:53:27] (03PS6) 10Giuseppe Lavagetto: Define Production service entries for InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266510 (https://phabricator.wikimedia.org/T114273) [10:53:43] kart_: yup [10:53:57] kart_: how can I help ? [10:54:22] akosiaris: deploy the https://gerrit.wikimedia.org/r/#/c/265691/ in around 2-3 hours time? :) [10:54:35] akosiaris: let me know. I need to let other people before it. [10:54:36] sounds like a very good candidate for puppet swat [10:54:40] That's all :) [10:54:46] lemme check and +1 it if it's ok [10:54:48] akosiaris: sadly no puppet SWAT today? [10:55:00] a wednesday [10:55:02] indeed [10:55:05] ok [10:55:10] I will anyway be around indeed [10:55:11] <_joe_> why are we using hiera for such ginormous config? [10:55:31] <_joe_> (this is probably the 20th time I ask) [10:55:36] _joe_: it's the old in cxserver config vs in puppet config issue. cxserver had a regression [10:56:02] it used to be moved into the cxserver repo instead of puppet but with the migration to service-runner there was a regression LE is still investigating [10:58:35] _joe_: I'm working on it. [10:58:45] <_joe_> ok ok :) [10:59:28] akosiaris: thanks. I will ping for 'go ahead'. 
[11:03:21] 6operations, 10DBA, 10MediaWiki-Configuration, 6Release-Engineering-Team, and 2 others: codfw is in read only according to mediawiki - https://phabricator.wikimedia.org/T124795#1969195 (10Joe) jcrespo: how do we make it 100% read-only? is there an easy way to do that? I agree that we should stop using cro... [11:05:43] jynus: so, on Feb 3rd, I will need to create a backup copy of the OTRS database in the fastest possible way. What's your recommendation ? [11:05:47] mydumper ? [11:07:52] guc outage, is it worth notifying? [11:08:07] <_joe_> guc? [11:08:53] global user contributions [11:09:02] labs' tool [11:09:15] <_joe_> oh, sorry, I wasn't thinking about labs :P [11:09:30] usually no, if it is a tool no it is not [11:09:35] <_joe_> Vito: I'll get in #wikimedia-labs, if help is needed [11:09:43] but maybe we can help [11:09:53] <_joe_> yeah, my point too [11:11:41] <_joe_> Vito: I just tried to use it and it seems to work [11:12:32] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [11:13:52] _joe_: seems some istance is gone, so it should happen randomly [11:14:04] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [11:14:58] (03CR) 10Alexandros Kosiaris: [C: 031] "https://puppet-compiler.wmflabs.org/1659/ says OK, this is ready for merge" [puppet] - 10https://gerrit.wikimedia.org/r/265691 (https://phabricator.wikimedia.org/T121053) (owner: 10KartikMistry) [11:16:45] 6operations, 10vm-requests: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1969251 (10akosiaris) >>! In T124261#1967740, @Dzahn wrote: > @akosiaris It does mean that all shell users who are in "releasers-mediawiki" or "releasers-mobile" now get... 
[11:16:59] 6operations: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1969252 (10akosiaris) [11:21:54] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [11:24:03] PROBLEM - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 15 data above and 45 below the confidence bounds [11:31:43] PROBLEM - puppet last run on elastic1009 is CRITICAL: CRITICAL: Puppet has 1 failures [11:36:07] 6operations, 5Patch-For-Review, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Check the memcached and redis (sessions) configuration and functionality in codfw - https://phabricator.wikimedia.org/T124879#1969286 (10Joe) 3NEW [11:36:48] akosiaris stopping replication, probably [11:37:15] revert by failovering to the slave [11:40:06] let me see what else is there on the shard to make it possible [11:43:57] I 'll probably revert within 8 tops if all goes south [11:44:04] 8 hours that is [11:52:35] RECOVERY - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is OK: OK: No anomaly detected [11:54:47] 6operations, 10Traffic: update the multicast purging documentation - https://phabricator.wikimedia.org/T82096#1969329 (10hashar) Great, thank you @BBlack [11:56:27] other than that, maybe creating a snapshot [11:58:14] RECOVERY - puppet last run on elastic1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:59:23] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [12:00:54] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Traffic, 7Documentation: Automate and/or better-document varnish ban procedure for operations staff, so it can be accomplished with more speed and confidence in outage conditions - https://phabricator.wikimedia.org/T124835#1969361 (10BBlack) 5... 
[12:06:42] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [12:09:56] in general, creating a backups is not a problem, recovering it when it is not the only thing on that server is :-/ [12:13:43] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0] [12:16:44] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [12:18:32] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [12:21:02] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [12:22:34] PROBLEM - HHVM rendering on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50430 bytes in 0.044 second response time [12:23:04] PROBLEM - Apache HTTP on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50430 bytes in 0.093 second response time [12:28:23] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.049 second response time [12:29:22] !log rebooting analytics1028 for kernel update [12:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:29:54] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 67653 bytes in 0.126 second response time [12:33:58] jynus :O [12:34:11] Hey [12:39:22] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [24.0] [12:42:06] Bsadowski1, ? 
[12:42:53] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [12:52:03] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Puppet has 1 failures [12:57:00] 6operations, 10DBA: External Storage on codfw (es2005-2010) is consuming 100-90GB of disk space per server and per month and it has 370GB available - https://phabricator.wikimedia.org/T119056#1969500 (10mark) p:5Normal>3High [13:04:00] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Enable new pairs for Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/265691 (https://phabricator.wikimedia.org/T121053) (owner: 10KartikMistry) [13:04:05] (03PS3) 10Alexandros Kosiaris: cxserver: Enable new pairs for Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/265691 (https://phabricator.wikimedia.org/T121053) (owner: 10KartikMistry) [13:04:09] (03CR) 10Alexandros Kosiaris: [V: 032] cxserver: Enable new pairs for Yandex MT [puppet] - 10https://gerrit.wikimedia.org/r/265691 (https://phabricator.wikimedia.org/T121053) (owner: 10KartikMistry) [13:05:30] (03PS2) 10Bene: Use custom generator for mobile search on Wikibase [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) [13:06:09] (03CR) 10Bene: [C: 031] "I think the issues have been resolved in the task and this should be ready to get merged." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254645 (https://phabricator.wikimedia.org/T85368) (owner: 10Bene) [13:06:35] akosiaris: thanks. 
[13:10:21] !log rebooting analytics1029 for kernel upgrade [13:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:15:22] PROBLEM - DPKG on fermium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [13:15:36] !log rebooting fermium for kernel upgrades [13:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:15:39] that's me ^ [13:17:12] RECOVERY - DPKG on fermium is OK: All packages OK [13:19:12] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:19:14] (03PS1) 10Giuseppe Lavagetto: conftool: temporarily remove the appservers that are down. [puppet] - 10https://gerrit.wikimedia.org/r/266711 [13:22:32] (03PS2) 10Giuseppe Lavagetto: conftool: temporarily remove the appservers that are down. [puppet] - 10https://gerrit.wikimedia.org/r/266711 [13:24:01] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool: temporarily remove the appservers that are down. [puppet] - 10https://gerrit.wikimedia.org/r/266711 (owner: 10Giuseppe Lavagetto) [13:29:04] akosiaris: can you check /etc/cxserver/config.yaml? Our change isn't reflected there (yet). [13:29:23] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [13:29:55] akosiaris: on sca1001/1002 [13:31:13] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [13:32:24] akosiaris: any idea how long it will take? It is usually fast. [13:32:25] !log rebooting analytics1030/1031 for kernel upgrade [13:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:32:34] kart_: takes a while for the change to propagate. tops 30 mins [13:32:40] a from what I see it is there now [13:33:54] kart_: so I saw that it is there now, I assume you are ok ? 
[13:34:11] <_joe_> mark: you think you can trick me in doing budget? ;) [13:34:17] 6operations, 10DBA: External Storage on codfw (es2005-2010) is consuming 100-90GB of disk space per server and per month and it has 370GB available - https://phabricator.wikimedia.org/T119056#1969592 (10mark) Let's aim for the same specs as the ES refresh we did for eqiad recently, and get quotes ASAP. [13:34:17] akosiaris: ok. Working. [13:34:19] <_joe_> uh wrong channel [13:34:30] akosiaris: I will keep this time in my mind from next time. [13:34:35] _joe_: well you may want to make sure I have budget for my staff next year ;) [13:34:35] Sorry for noise! [13:34:40] <_joe_> eheh [13:34:43] <_joe_> fair enough [13:50:11] (03PS5) 10Mdann52: Add 2 sites to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) [13:50:28] (03CR) 10jenkins-bot: [V: 04-1] Add 2 sites to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) (owner: 10Mdann52) [13:50:52] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 73.91% of data above the critical threshold [5000000.0] [13:53:50] (03PS1) 10Giuseppe Lavagetto: conftool-data: remove mw1031, decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/266716 [13:54:14] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool-data: remove mw1031, decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/266716 (owner: 10Giuseppe Lavagetto) [13:54:28] (03CR) 10Giuseppe Lavagetto: [V: 032] conftool-data: remove mw1031, decommissioned [puppet] - 10https://gerrit.wikimedia.org/r/266716 (owner: 10Giuseppe Lavagetto) [13:54:51] (03PS1) 10Jcrespo: Repool pc1003 for cloning to pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266717 [13:55:21] (03PS2) 10Jcrespo: Depool pc1003 for cloning to pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266717 [13:57:39] (03CR) 10Jcrespo: [C: 032] 
Depool pc1003 for cloning to pc1006 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266717 (owner: 10Jcrespo) [13:59:38] !log about to going new hardware/OS/mariadb-only for parsercache service [13:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:01:06] (03PS1) 10KartikMistry: cxserver: Add missing ru as source for MT [puppet] - 10https://gerrit.wikimedia.org/r/266721 [14:01:39] (03PS1) 10Muehlenhoff: Grant icinga permissions to ema and elukey [puppet] - 10https://gerrit.wikimedia.org/r/266722 [14:01:43] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0] [14:02:16] (03PS1) 10BBlack: X-Cache: add "int" status for internal responses [puppet] - 10https://gerrit.wikimedia.org/r/266723 [14:02:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] cxserver: Add missing ru as source for MT [puppet] - 10https://gerrit.wikimedia.org/r/266721 (owner: 10KartikMistry) [14:03:05] (03PS2) 10Alexandros Kosiaris: cxserver: Add missing ru as source for MT [puppet] - 10https://gerrit.wikimedia.org/r/266721 (owner: 10KartikMistry) [14:03:11] (03CR) 10Alexandros Kosiaris: [V: 032] cxserver: Add missing ru as source for MT [puppet] - 10https://gerrit.wikimedia.org/r/266721 (owner: 10KartikMistry) [14:03:39] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Depool pc1003 for cloning to pc1006 (duration: 02m 30s) [14:03:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:02] !log rebooting analytics 1032 to 1035 for kernel upgrades [14:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:14] PROBLEM - NTP on seaborgium is CRITICAL: NTP CRITICAL: Offset 26.52398765 secs [14:04:33] kart_: https://gerrit.wikimedia.org/r/266721 merged [14:04:48] cool. Thanks! 
[14:11:02] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [14:11:34] (03PS2) 10Muehlenhoff: Grant icinga permissions to ema and elukey [puppet] - 10https://gerrit.wikimedia.org/r/266722 [14:11:54] (03CR) 10Muehlenhoff: [C: 032 V: 032] Grant icinga permissions to ema and elukey [puppet] - 10https://gerrit.wikimedia.org/r/266722 (owner: 10Muehlenhoff) [14:12:17] (03PS1) 10Giuseppe Lavagetto: [WiP] Allow treating pooled=inactive differently from pooled=no in the etcd driver [debs/pybal] - 10https://gerrit.wikimedia.org/r/266728 [14:12:37] <_joe_> bblack: ^^ this is a sketch of what needs to be done, but I'm not satisfied with it [14:12:42] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [14:13:34] 6operations, 10Graphoid, 6Services, 10Traffic: Remove graphoid from parsoidcache - https://phabricator.wikimedia.org/T110477#1969647 (10BBlack) I'm doing some final validation now (checking request logs for any trailing requests to these hostnames). Will upload the changes to remove this, but not merge ye... [14:14:14] (03PS1) 10BBlack: graphoid(.eqiad).wm.o hostname removal [dns] - 10https://gerrit.wikimedia.org/r/266731 (https://phabricator.wikimedia.org/T110477) [14:14:28] (03PS1) 10BBlack: graphoid(.eqiad).wm.o VCL removal [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) [14:18:32] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969652 (10BBlack) FYI, I'm still seeing live requests to the cxserver public hostnames on cache_parsoid, e.g. ``` 32 RxURL c /v1/dictionary/rec... 
[14:19:05] 7Puppet, 6Revision-Scoring-As-A-Service, 10ores: Fix puppet webservice name to uwsgi-ores-web - https://phabricator.wikimedia.org/T124621#1969653 (10Halfak) Yeah. That's right. My mistake! [14:22:47] 6operations, 10RESTBase, 6Services, 10Traffic: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1969654 (10BBlack) The problem with the redirect is it's complicated, because we still have this conflict between internal and external RB URLs due to the whole `Host:` header vs `/h... [14:30:11] (03PS2) 10BBlack: graphoid(.eqiad).wm.o VCL removal [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) [14:35:51] !log analytics 1035 hasn't been rebooted because it is a Hadoop Journal Node (will be restarted in the end) [14:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:36:56] ooo, hi elukey! what's happening (still checking email) [14:38:23] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [14:40:04] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [14:40:12] ottomata: o/ rebooting all the nodes to update the kernel, nothing big :) [14:42:52] ottomata: with the notable exception of the hadoop master/standby :-) [14:44:26] ah ok [14:44:28] cool [14:45:09] !log rebooting analytics 1036 to 1039 for kernel upgrade [14:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:47:20] (03CR) 10DCausse: "left one comment but the unit test already detected the problem :)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266658 (owner: 10EBernhardson) [14:58:45] !log cloning persercache contents from pc1003 to pc1006 [14:58:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:01:39] 
(03PS1) 10BBlack: cxserver, citoid -> cache_text cluster [dns] - 10https://gerrit.wikimedia.org/r/266739 (https://phabricator.wikimedia.org/T110476) [15:02:01] (03PS3) 10BBlack: graphoid(.eqiad).wm.o VCL removal [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) [15:02:03] (03PS1) 10BBlack: Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) [15:02:05] (03PS1) 10BBlack: cache_parsoid: remove citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) [15:06:31] elukey: ja analytics1026 you can just do anytime, it'll be fine [15:06:34] 1027 hm. [15:06:50] can we coordinate that with this? [15:06:51] https://phabricator.wikimedia.org/T110090 [15:07:08] i am ready to do it, but keep putting it off because i was going to do it after we do the mobile->text changes [15:07:36] 6operations, 10RESTBase, 6Services, 10Traffic: Remove restbase from parsoidcache - https://phabricator.wikimedia.org/T110475#1969780 (10BBlack) Well, the other thing I can do to make this simple is just treat it like the legacy citoid/cxserver entrypoints: if it's one of the legacy restbase hostnames, just... [15:09:43] ottomata: rebooting a journalnode host is fine as long as two others are active in the cluster, right?
[15:11:22] (03PS2) 10Muehlenhoff: Fix random ntp startup failures [puppet] - 10https://gerrit.wikimedia.org/r/266693 [15:11:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Fix random ntp startup failures [puppet] - 10https://gerrit.wikimedia.org/r/266693 (owner: 10Muehlenhoff) [15:11:39] (03PS1) 10BBlack: Text VCL: add support for legacy rb passes [puppet] - 10https://gerrit.wikimedia.org/r/266747 (https://phabricator.wikimedia.org/T110475) [15:11:41] (03PS1) 10BBlack: cache_parsoid: remove restbase pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266748 (https://phabricator.wikimedia.org/T110475) [15:11:45] (03PS1) 10BBlack: restbase legacy hostnames -> cache_text cluster [dns] - 10https://gerrit.wikimedia.org/r/266749 (https://phabricator.wikimedia.org/T110475) [15:12:42] moritzm: correct [15:12:45] one at a time they will be just fine [15:13:02] ok [15:13:29] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969808 (10Nikerabbit) >>! In T110478#1965406, @BBlack wrote: > Does that imply that **nothing** should be using the hostnames `cxserv... [15:14:42] ottomata: yes I'll skip 2017 [15:14:47] *1027 [15:16:30] moritzm: does anything special need to happen to apply this other than a reboot? [15:16:43] PROBLEM - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 12 data above and 47 below the confidence bounds [15:16:49] hm! [15:16:54] probably because I changed metrics, checking.. 
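The JournalNode answer above ("one at a time they will be just fine") is the usual HDFS rule: the NameNode needs a strict majority of the journal quorum to keep committing edits, so with three JournalNodes one may be down at a time. A toy illustration of the arithmetic (function name hypothetical):

```python
def journalnode_quorum_ok(total: int, active: int) -> bool:
    """True if the active JournalNodes still form a strict majority,
    which is what HDFS needs to keep writing its edit log."""
    return active > total // 2

# Three JournalNodes: losing one is fine, losing two is not.
assert journalnode_quorum_ok(3, 2)
assert not journalnode_quorum_ok(3, 1)
```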
[15:17:13] moritzm: if you hold off on 1027, I'm hoping to move some services off of there soon, and i have to schedule some maintenance for it anyway [15:21:27] ottomata: ok for 1027 [15:21:44] ok cool [15:22:04] ottomata: just installing the new kernel and a reboot (but the new kernel has been installed on all analytics hosts already) [15:22:07] i should be able to do that shortly after the mobile->text merge is complete [15:22:19] perfect [15:22:19] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969829 (10BBlack) Ok. I was under the impression that as part of some eventual plan, the CX extensions would switch to using public... [15:29:54] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969853 (10akosiaris) >>! In T110478#1969829, @BBlack wrote: > Ok. I was under the impression that as part of some eventual plan, the... [15:31:52] (03PS5) 10Ottomata: Create new puppet group analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) (owner: 10EBernhardson) [15:32:00] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1969857 (10Ottomata) I got it… [15:32:16] (03CR) 10Ottomata: [C: 032 V: 032] Create new puppet group analytics-search-users [puppet] - 10https://gerrit.wikimedia.org/r/265795 (https://phabricator.wikimedia.org/T122620) (owner: 10EBernhardson) [15:33:33] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969870 (10BBlack) @akosiaris - we're talking about two different parts of the problem. 
Regardless of whether/how cxserver's app code... [15:33:39] (03PS1) 10Subramanya Sastry: ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) [15:37:37] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969876 (10akosiaris) >>! In T110478#1969870, @BBlack wrote: > @akosiaris - we're talking about two different parts of the problem. R... [15:39:43] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [15:39:52] ottomata: Looks like you were involved with the kernel upgrades mentioned in SAL, were they super urgent or something? [15:40:21] (03CR) 10Yurik: [C: 031] "seems like everything points to the restbase's url" [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) (owner: 10BBlack) [15:40:41] (03PS1) 10Ottomata: Include role::elasticsearch::analytics on Hadoop namenodes and stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/266754 (https://phabricator.wikimedia.org/T122620) [15:40:46] MarkTraceur: moritzm knows more [15:40:51] Hmm [15:41:05] I think stat1003 was included in the reboots but I'm not 100% certain [15:41:11] Maybe it was an unrelated downtime [15:41:46] MarkTraceur: that sounds right but I'm not sure [15:41:53] Ah well [15:41:57] elukey: ? 
[15:42:06] MarkTraceur, ottomata: yeah, stat1002/stat1003 needed reboots for a kernel security update [15:42:10] It killed a script I was running, just wondered if I missed coordination of that [15:42:20] Oh, okay, if it was an urgent security thing then fine :) [15:42:27] I sent a heads-up mail to the analytics list yesterday [15:42:34] Oh, yeah, so I just fail [15:43:06] (03CR) 10Ottomata: [C: 032] Include role::elasticsearch::analytics on Hadoop namenodes and stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/266754 (https://phabricator.wikimedia.org/T122620) (owner: 10Ottomata) [15:43:23] (03CR) 10Yurik: [C: 031] "haven't tested, but looks ok. If this is how extensions should be loaded now, i'm fine with it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266433 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [15:43:40] \o/ [15:43:59] Going through email now, I'm going to identify and fix the failings in my communications [15:45:30] hmmmm _joe_, admin::groups are not collected from multiple roles? [15:46:09] ottomata: _joe_ is traveling ATM [15:46:35] ah k [15:47:41] !log rebooting analytics 1026, 1040 -> 1042 due to kernel upgrade. [15:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:48:34] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: puppet fail [15:49:03] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: puppet fail [15:49:49] ^ that's me [15:49:51] am working on it [15:51:45] (03PS2) 10BBlack: X-Cache: add "int" status for internal responses [puppet] - 10https://gerrit.wikimedia.org/r/266723 [15:53:18] ottomata: re ^, do you know where analytics webrequest parsing code lives? I've lost track, but I wanted to double-check that recent changes in X-Cache format don't break their parsing of it for cache_status (and for that matter, I suspect cache_status doesn't report what we really want it to report anyways right now...) [15:53:56] yes think so...
[15:54:37] bblack, it looks like no special parsing is done of cache_status [15:54:46] or x_cache [15:54:53] both are included in the refined webrequest table [15:54:59] directly as they are from varnish [15:55:33] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/refine/refine_webrequest.hql#L64 [15:55:43] oh %{Varnish:handling@cache_status}x [15:55:53] ja [15:56:01] ok [15:56:10] whatever varnishkafka is configured to send is what makes it into those fields [15:56:17] probably not a very relevant field, as it's only reporting a naive interpretation of the frontend cache disposition [15:56:28] aye, huh ok [15:56:30] i.e. cache_status may come up "miss", but it is in fact a cache hit at a deeper layer, etc... [15:56:44] i can't think of any analysis that is using either of those atm. i think ops folks have looked at them before [15:56:49] aye, makes sense [15:56:55] x_cache has the results all the way down? [15:57:08] yes, although interpreting them is non-trivial [15:57:11] aye [15:57:28] feel free to do like you did with client_ip in varnish fanciness if you like [15:57:33] to make it all canonical and stuff :) [15:58:02] yeah I was thinking about (a) leaving X-Cache basically as it is for debugging and analysis we sometimes do on deeper cache-layers-internal stuff [15:58:49] and then also summarizing in a new output header that just applies one of a few overall labels for "all of the cache layers as a black box". probably just "hit|int|miss|pass" [15:59:44] (where hit = real cache object hit, int = internally-generated dynamically by varnish caches, miss|pass -> usual meanings which always result in applayer fetch) [16:00:05] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160127T1600). [16:00:05] tgr: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. 
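bblack's proposed summary header (collapsing the multi-layer X-Cache value into a single overall "hit|int|miss|pass" label) is only described in prose above; a minimal Python sketch of that kind of collapse, where the header format and the precedence rule (hit beats int beats pass beats miss) are illustrative assumptions, not the deployed VCL:

```python
def summarize_x_cache(x_cache: str) -> str:
    """Collapse a multi-layer X-Cache header, e.g. "cp1066 miss, cp3040 hit/4",
    into one label treating all cache layers as a black box.

    Assumed precedence for illustration: any real cache hit wins, then an
    internally generated ("int") response, then pass; otherwise miss.
    """
    statuses = x_cache.lower()
    if "hit" in statuses:
        return "hit"
    if "int" in statuses:
        return "int"
    if "pass" in statuses:
        return "pass"
    return "miss"
```

With this rule, a frontend "miss" in front of a deeper-layer "hit/4" is still reported as a hit overall, which is exactly the mislabeling of the naive frontend-only cache_status field discussed above.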
Please be available during the process. [16:00:18] o/ [16:00:27] going to lurk this SWAT [16:00:47] tgr: I can SWAT if you're around [16:01:01] thcipriani|afk: here [16:01:31] (03PS1) 10Jcrespo: Add testreduce account and grants from ruthenium on m5-master Bug: T124704 [puppet] - 10https://gerrit.wikimedia.org/r/266756 (https://phabricator.wikimedia.org/T124704) [16:03:12] !log rebooting analytics 1043 -> 1050 for kernel upgrade. [16:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:05:04] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969978 (10GWicke) The last time we talked about moving the CXServer API to RB the issue was that some of those APIs are really not re... [16:05:28] (03PS1) 10Ottomata: Fix for analytics-search-user changes [puppet] - 10https://gerrit.wikimedia.org/r/266757 (https://phabricator.wikimedia.org/T122620) [16:06:29] (03CR) 10jenkins-bot: [V: 04-1] Fix for analytics-search-user changes [puppet] - 10https://gerrit.wikimedia.org/r/266757 (https://phabricator.wikimedia.org/T122620) (owner: 10Ottomata) [16:07:25] (03PS2) 10Ottomata: Fix for analytics-search-user changes [puppet] - 10https://gerrit.wikimedia.org/r/266757 (https://phabricator.wikimedia.org/T122620) [16:08:19] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969991 (10BBlack) @gwicke no need for the stopgap, we'll just keep doing traffic pass-through of cxserver.wikimedia.org for now (but... [16:09:31] 6operations, 10ContentTranslation-cxserver, 6Services, 10Traffic, 5Patch-For-Review: Remove cxserver from parsoidcache cluster - https://phabricator.wikimedia.org/T110478#1969995 (10GWicke) @bblack: Okay, thanks! 
[16:09:54] (03CR) 10Ottomata: [C: 032] Fix for analytics-search-user changes [puppet] - 10https://gerrit.wikimedia.org/r/266757 (https://phabricator.wikimedia.org/T122620) (owner: 10Ottomata) [16:11:21] !log thcipriani@mira Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/session/CentralAuthSessionProvider.php: SWAT: Avoid forceHTTPS cookie flapping if core and CA are setting the same cookie [[gerrit:266671]] (duration: 02m 26s) [16:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:11:24] (03PS3) 10BBlack: X-Cache: add "int" status for internal responses [puppet] - 10https://gerrit.wikimedia.org/r/266723 [16:11:24] ^ tgr check please [16:12:08] (03PS4) 10BBlack: X-Cache: add "int" status for internal responses [puppet] - 10https://gerrit.wikimedia.org/r/266723 [16:12:24] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:12:26] (03CR) 10BBlack: [C: 032 V: 032] X-Cache: add "int" status for internal responses [puppet] - 10https://gerrit.wikimedia.org/r/266723 (owner: 10BBlack) [16:12:54] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:13:13] thcipriani: verified, thanks! [16:13:24] tgr: thanks for checking [16:15:20] <_joe_> ottomata: no, hiera data can either be defined in one role only, or be exactly equal across roles [16:15:45] <_joe_> Or, you use a container role [16:16:33] (03PS3) 10BBlack: Undo special-casing of testwiki in Varnish [puppet] - 10https://gerrit.wikimedia.org/r/266414 (owner: 10Ori.livneh) [16:16:53] <_joe_> Anyways, @airport, on mobile. Read the docs and the code :-P [16:17:51] ha, ok, container role? 
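_joe_'s rule above (hiera data may be defined in one role only, or must be exactly equal across roles) can be sketched as a merge constraint. This is a hypothetical Python illustration of the rule, not how puppet/hiera actually implements it:

```python
def resolve_hiera_key(roles: dict, key: str):
    """Resolve `key` across several role data hashes under the stated
    constraint: the key is defined in one role only, or every role that
    defines it carries exactly the same value. Conflicts are an error."""
    values = [cfg[key] for cfg in roles.values() if key in cfg]
    if not values:
        raise KeyError(key)
    first = values[0]
    if any(v != first for v in values[1:]):
        raise ValueError(f"conflicting hiera values for {key!r} across roles")
    return first
```

A "container role" sidesteps the conflict by defining the key once in a single role that includes the others.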
[16:17:58] (03CR) 10BBlack: [C: 032] Undo special-casing of testwiki in Varnish [puppet] - 10https://gerrit.wikimedia.org/r/266414 (owner: 10Ori.livneh) [16:19:51] (03PS2) 10Jcrespo: Add testreduce account and grants from ruthenium on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/266756 (https://phabricator.wikimedia.org/T124704) [16:19:51] is gerrit review a bit broken? [16:20:02] I keep getting "line 1:66 no viable alternative at character '%'" [16:21:29] (03PS2) 10BBlack: Text VCL: add support for legacy rb passes [puppet] - 10https://gerrit.wikimedia.org/r/266747 (https://phabricator.wikimedia.org/T110475) [16:21:31] (03PS4) 10BBlack: graphoid(.eqiad).wm.o VCL removal [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) [16:21:33] (03PS2) 10BBlack: cache_parsoid: remove restbase pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266748 (https://phabricator.wikimedia.org/T110475) [16:21:35] (03PS2) 10BBlack: cache_parsoid: remove citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) [16:21:37] (03PS2) 10BBlack: Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) [16:21:39] (03PS3) 10Jcrespo: Add testreduce account and grants from ruthenium on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/266756 (https://phabricator.wikimedia.org/T124704) [16:21:43] RECOVERY - NTP on seaborgium is OK: NTP OK: Offset -0.0001972913742 secs [16:21:49] (03CR) 10GWicke: [C: 031] Text VCL: add support for legacy rb passes [puppet] - 10https://gerrit.wikimedia.org/r/266747 (https://phabricator.wikimedia.org/T110475) (owner: 10BBlack) [16:22:16] !log thcipriani@mira Synchronized php-1.27.0-wmf.11/extensions/CentralAuth/includes/CentralAuthUtils.php: SWAT: Preserve certain keys when updating central session [[gerrit:266672]] (duration: 02m 28s) [16:22:18] ^ tgr check 
please [16:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:28] (03CR) 10GWicke: [C: 031] cache_parsoid: remove citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [16:23:59] (03CR) 10Subramanya Sastry: [C: 031] Add testreduce account and grants from ruthenium on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/266756 (https://phabricator.wikimedia.org/T124704) (owner: 10Jcrespo) [16:24:55] (03CR) 10GWicke: [C: 031] Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [16:25:54] thcipriani: also verified, thanks again! [16:26:02] tgr: thank you! [16:26:23] PROBLEM - puppet last run on analytics1026 is CRITICAL: CRITICAL: Puppet has 1 failures [16:29:18] 6operations: move releases.wm.org to bromine (was: request VM for releases.wm.org) - https://phabricator.wikimedia.org/T124261#1970042 (10Niedzielski) {icon thumbs-up} @Dzahn, thanks for the heads up and quick summary. bromine works fine for me. [16:37:40] (03PS4) 10Jcrespo: Add testreduce account and grants from ruthenium on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/266756 (https://phabricator.wikimedia.org/T124704) [16:39:24] (03CR) 10Jcrespo: [C: 032] Add testreduce account and grants from ruthenium on m5-master [puppet] - 10https://gerrit.wikimedia.org/r/266756 (https://phabricator.wikimedia.org/T124704) (owner: 10Jcrespo) [16:41:36] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1970065 (10RobH) a:5RobH>3Ottomata It appears that @ottomatta merged all the patches (which is one step better than just reviewing my review). It appears th... 
[16:45:24] ostriches, https://gerrit.wikimedia.org/r/#/q/project:operations/mediawiki-config+status:open+-label:Code-Review%253C%253D-1+-label:Verified-1,n,z is broken :( [16:45:55] broken? [16:46:05] Krenair: double encode -- https://gerrit.wikimedia.org/r/#/q/project:operations/mediawiki-config+status:open+-label:Code-Review%3C%3D-1+-label:Verified-1,n,z [16:46:16] indeed [16:46:21] but it's a URL that gerrit actually generates [16:46:37] I'm not sure exactly when or why the JS in gerrit started doing that but it is really annoying [16:46:46] when you put a '=' in the query [16:51:27] bd808: Must've been that secret upgrade and change to all the apache config I did a few months ago when you started complaining :P [16:51:37] jynus, should I hardcode 'testreduce' as the db user in my patches or is the $db_user variable set to 'testreduce'? [16:51:39] muahaha [16:52:14] ostriches: I just saw you holding a cat and touching your pinkie to your lips [16:52:31] well, a variable is better - I just like the username in the public puppet repo, only the password in the non-public [16:52:41] bd808: s/cat/puppy/ [16:52:44] subbu, ^if that makes sense to you [16:53:20] (as the username is already public in the puppet configuration for the mysql server) [16:54:36] jynus, yes .. once i upload newer version of the patches, could you leave your review comments on the patches in case they need further changes?
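The broken Gerrit link above is a classic double-encoding bug: the `%` of the already-encoded `%3C%3D` (`<=`) gets percent-encoded again into `%253C%253D`. A quick illustration, with Python's `quote` standing in for whatever Gerrit's JS does:

```python
from urllib.parse import quote

frag = "<="          # the operator in -label:Code-Review<=-1
once = quote(frag)   # correctly encoded: '%3C%3D'
twice = quote(once)  # encoded again: '%' -> '%25', giving '%253C%253D'
print(once, twice)
```

Encoding is not idempotent, so a URL must be encoded exactly once; re-encoding an already-encoded query string always mangles `%` this way.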
[16:54:56] yes, I will [16:55:17] let me also confirm access from ruthenium [17:00:25] subbu, I can confirm the right access from ruthenium [17:01:26] (03CR) 10Alex Monk: [C: 04-1] "It's still called wmgBetaFeaturesWhitelist in InitialiseSettings" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266470 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [17:01:34] (03PS2) 10Subramanya Sastry: ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) [17:01:38] (03CR) 10Chad: "I don't think we're going to install any FreeBSD apaches...like ever :p" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265619 (owner: 10Dereckson) [17:02:16] (03CR) 10Alex Monk: [C: 04-1] "non-merged MW core dependency" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266454 (https://phabricator.wikimedia.org/T85538) (owner: 10Cenarium) [17:02:30] jynus, ok .. i updated my patches if you want to take a look. [17:03:12] error: 'files/misc/ubuntu-cloud.key': short read Success [17:03:16] error Success? [17:04:09] subbu, see comment on https://gerrit.wikimedia.org/r/#/c/266752/2 [17:04:16] Krenair, where is that? [17:05:03] from git-grep [17:05:14] while looking through the puppet repo [17:06:19] (03CR) 10Alex Monk: "Where is this actually used? I see where 404.php is used, but not 404.html."
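The puzzling "short read Success" message above is a well-known errno pitfall: a short read is a logical failure but sets no OS error, so code that formats errno on that path looks up errno 0 and prints "Success". The same lookup can be demonstrated from Python (the exact errno-0 string is platform dependent, so "Success" here is the common Linux wording, not a guarantee):

```python
import os

# strerror(0) is what perror()/strerror() would print after a short
# read: the operation "failed" logically, but no OS error was recorded,
# so errno is still 0.
msg = os.strerror(0)
print(msg)  # typically "Success" on Linux
```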
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/264461 (owner: 10Suriyaa Kudo) [17:09:24] (03PS3) 10Subramanya Sastry: ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) [17:13:13] (03CR) 10Alex Monk: [C: 04-1] "I don't think those constants you uncommented will be defined when InitialiseSettings gets run" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264219 (owner: 10Cenarium) [17:13:40] jynus, https://gerrit.wikimedia.org/r/#/c/266753/ is the public puppet part of it. [17:13:59] will submit this, which will allow testing the other [17:14:11] ah, ok. [17:14:25] I do not know if I mentioned this already, the private part was already done [17:14:55] that is why I wanted it without the user, as it had been committed with it [17:16:11] !log rebooting analytics1035.eqiad.wmnet for kernel upgrade [17:16:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:16:18] Anyone have the ssh fingerprints for deployment.eqiad.wmnet? [17:18:41] jynus, ah, that makes sense now .. i am slowly comprehending all the pieces. [17:22:51] is the change something that you can test immediately? [17:23:13] Assume you have someone with sudo helping you [17:23:23] yes. [17:23:42] (03CR) 10Alex Monk: [C: 031] Rename two namespaces at bswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247093 (https://phabricator.wikimedia.org/T115812) (owner: 10Luke081515) [17:23:45] the parsoid-rt and parsoid-vd services would come up. [17:23:57] and i should be able to open http://parsoid-tests.wikimedia.org/ [17:25:03] running puppet-compiler [17:26:03] (03CR) 10BBlack: [C: 032] "Monitored traffic for a while just-in-case, only seeing random (and very rare) crawler hits.
This is easily reverted with a 10 minute neg" [dns] - 10https://gerrit.wikimedia.org/r/266731 (https://phabricator.wikimedia.org/T110477) (owner: 10BBlack) [17:26:16] (03CR) 10Alex Monk: [C: 031] Add namespace aliases for English Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264066 (https://phabricator.wikimedia.org/T123187) (owner: 10Pmlineditor) [17:26:42] jynus, oh .. i think i forgot to set the hostname in the config since the dbs are no longer on ruthenium. [17:26:58] what is the hostname i should use? [17:27:45] (03CR) 10Jcrespo: [C: 04-1] "Role mariadb?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:27:51] m5-master.eqiad.wmnet. got it. [17:27:54] !log rebooting analytics105* hosts to upgrade their kernel [17:27:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:28:13] that is right [17:28:15] (03CR) 10Alex Monk: [C: 031] Add Category to $wgNamespacesWithSubpages on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264460 (https://phabricator.wikimedia.org/T121985) (owner: 10Pmlineditor) [17:28:18] also role mariadb? [17:28:39] so, replace mariadb with mysql-client? [17:28:55] i don't know what you meant by auto-install of mysql-client [17:28:58] (03CR) 10Alex Monk: [C: 04-1] "Seems to be some confusion on the task about this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264937 (https://phabricator.wikimedia.org/T123837) (owner: 10Mdann52) [17:29:08] for now, you can just delete that role [17:29:32] we can later assess the need for a command line client, etc [17:29:51] let's make the patch as small as possible to make the service work [17:29:55] ok. [17:31:02] (03CR) 10Alexandros
[puppet] - 10https://gerrit.wikimedia.org/r/266668 (owner: 10KartikMistry) [17:31:11] (as that would probably require additional operation-access-requests) [17:31:13] (03PS4) 10Subramanya Sastry: ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) [17:31:24] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1970326 (10Ottomata) 5Open>3Resolved Thanks! I think we're done, I had to move things around a little bit. [17:31:40] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Create new puppet group `analytics-search-users` - https://phabricator.wikimedia.org/T122620#1970331 (10Ottomata) Sorry, I thought I replied here that I was taking this. [17:34:11] greg-g, i'm about to deploy new graphoid service - seems like no one is deploying at the moment [17:34:34] (03CR) 10Alex Monk: [C: 031] Namespace configuration on cu.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265885 (https://phabricator.wikimedia.org/T123654) (owner: 10Dereckson) [17:34:36] marxarelli, you haven't started the train yet, right? [17:34:46] not that it should be affected - it's a service [17:36:07] hmm, actually never mind, it seems mira needs to be set up for graphoid deployment first... checking [17:36:10] yurik: two questions: 1) why doesn't it go in the service deploy window at 21:00 UTC? or 2) Why don't you request a window for this so you don't have to kinda-ask-but-not-really-because-you're-just-going-to-do-it-anyway-even-if-greg-is-sick-and-not-watching-irc? [17:36:29] * greg-g is sick and barely watching irc [17:36:40] I do not think it is getting the password right [17:36:45] ^subbu [17:37:12] lol, sorry greg-g - that's how i have been deploying it before and i thought it was ok for a service.
I didn't know we had a service deploy window [17:37:42] it shows empty on puppet compiler but should show the fake one [17:37:53] now that it is getting to be more of a real thing, it needs to follow the process more rigidly, yurik [17:38:00] i guess i am not including it properly or referencing the password variable properly. [17:38:09] let me check how it is used in other files .. unless you know what the problem is. [17:38:24] let me rebuild it again, to be sure [17:38:43] I think you have to reference the full namespace, but I may be wrong [17:39:08] looks like it has to be referenced as $passwords::testreduce::mysql::user [17:39:19] subbu, see https://puppet-compiler.wmflabs.org/1663/ruthenium.eqiad.wmnet/ [17:39:41] greg-g, sure thing. That window doesn't include graphoid though. Plus it's at midnight-1am, so a bit inconvenient. I will add a window to the deployment schedule if that's ok with you? [17:39:41] I expect nosecret there [17:40:21] db_user [17:40:21] (03PS5) 10Subramanya Sastry: ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) [17:40:50] jynus, i updated the patch .. can you check if that does better. [17:40:54] db_pass, yes, you got it right [17:41:00] let me recheck [17:41:29] (03PS1) 10Chad: Also keep /srv/patches in sync between masters [puppet] - 10https://gerrit.wikimedia.org/r/266773 [17:43:04] subbu, that's better :-) , https://puppet-compiler.wmflabs.org/1664/ruthenium.eqiad.wmnet/ [17:43:31] yurik: propose another window time [17:43:35] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [17:43:37] \o/ [17:44:13] greg-g, another stable window?
[17:44:16] (03CR) 10Jcrespo: [C: 031] ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:44:36] deploy and test, subbu ? [17:44:40] because graphoid is usually one-off, and i don't want to move the window for all services if that's convenient for everyone else? [17:44:44] greg-g, % [17:44:54] that was a ^, not % [17:45:02] jynus, works for me. [17:45:17] (03CR) 10Jcrespo: [C: 032] ruthenium services: Use puppetized db credentials for testreduce dbs [puppet] - 10https://gerrit.wikimedia.org/r/266753 (https://phabricator.wikimedia.org/T124704) (owner: 10Subramanya Sastry) [17:45:29] yurik: what I mean is pick a time for graphoid that works for you/your team, I'll see if it makes sense on the calendar [17:46:17] greg-g, right, but as a permanent fixture? If possible, it would be great to simply request a window when nothing else is being deployed [17:46:29] deploying now [17:46:59] !log migrating ruthenium parsoid-test database to m5-master [17:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:47:18] * subbu logs onto ruthenium as well [17:47:25] parsoid-vd has refreshed automatically, is that enough? [17:47:40] yurik: ocg is in that situation. i still find it useful to have a regularly scheduled time (coscheduled with the parsoid deploy) even if I don't do a deploy most weeks. [17:48:22] yurik: you could join the *oids in the parsoid deploy window. ;) [17:48:35] and /etc/testreduce/parsoid-vd.settings.js updated [17:48:36] jynus, yay ... http://parsoid-tests.wikimedia.org/ is now live :) [17:48:38] what cscott said, yurik [17:49:11] and http://parsoid-tests.wikimedia.org/commits looks right.
[17:49:11] yurik: I don't like the one-off requests, if you have a window you have a window and all is good [17:49:11] this has not finished, let me check load [17:49:11] cscott, i was hoping to have an earlier window because it's running a bit late for UTC+3 greg-g [17:49:11] jynus, thanks .. at least the db part of it seems good. [17:49:11] yurik: exactly, so propose one, as I said a while ago [17:49:20] (03PS1) 10ArielGlenn: dumps: stash some current dump run config settings in file and reuse [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/266775 [17:49:27] greg-g, yes yes, i'm writing one now :) [17:49:27] yurik: again, I'm not OK with continued "just ping greg 5 minutes before I want to do something" [17:49:36] understood :) [17:49:43] cool :) [17:49:48] jynus, and http://parsoid-tests.wikimedia.org/vd_testreduce/commits is also up. [17:50:00] so, both testreduce services are operational and are connecting with the right m5-master dbs. [17:50:05] subbu, remember that you are now on a misc production server [17:50:08] greg-g, i will schedule something every day, but will skip it most of the time :P [17:50:16] !log deploy patch for T97157 [17:50:18] that has advantages (more resources) [17:50:19] yurik: sure, i understand. even the parsoid deploy window is a bit late for UTC-5, since it sometimes gets uncomfortably close to when i have to leave to pick up my kids. [17:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:50:25] managed server if it fails [17:50:31] yurik: no, not every day [17:50:32] jynus, you mean wrt m5-master? [17:50:34] but also responsibilities [17:50:44] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: puppet fail [17:50:48] do not bring it down, ok ;-) [17:50:52] cscott, how about two hours before the -oids? you and i can join :) [17:51:07] jynus, do you mean wrt. ruthenium or the database .. m5-master?
greg-g will be happy, and I will only schedule it mon-thursday :) [17:51:20] m5-master [17:51:49] i see .. ah, ok. i guess we need to tune our queries then. [17:52:04] yurik: i suspect greg-g will say that plan has conflicts on t/th but is fine on m/w. ;) [17:52:19] yurik: no [17:52:39] 2hrs before parsoid would be 19:00-20:00 UTC M/W [17:52:46] (03CR) 10Alexandros Kosiaris: [C: 04-1] "looks good, minor nitpick" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266773 (owner: 10Chad) [17:53:02] yurik: two days/week, go on Tues/Thu, pick a time [17:53:55] greg-g, tues thu is good - this way i will either use that window, or join other oids at a later time. cscott - what TZ are you in? [17:54:17] later time on mon-wed [17:54:21] EST, currently UTC-5. [17:54:24] jynus, occasionally the queries that populate parsoid-tests.wikimedia.org tend to be expensive ... so, i'll work to fix those queries soon. [17:54:46] or, i should say, EST is always UTC-5, but i'm currently in EST sometimes in EDT. ;) [17:54:50] look, databases are there to be used [17:55:48] just make sure you do not create 1000 connections and use all io available, and I will be happy [17:55:49] ah, ok. that is not a problem. :) [17:55:49] 6operations, 6Parsing-Team, 10hardware-requests: Dedicated server for running Parsoid's roundtrip tests to get reliable parse latencies and use as perf. benchmarking tests - https://phabricator.wikimedia.org/T116090#1970414 (10ssastry) [17:56:12] thanks again. [17:56:16] cscott, 18-19 UTC, which is 1pm-2pm EST i think [17:56:21] greg-g, ^ ? [17:56:27] on tue thu [17:56:40] between puppet swat & mw train [17:58:32] (03CR) 10Thcipriani: "Inline question." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266773 (owner: 10Chad) [18:00:20] !log deploy patch for T103239 [18:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:01:10] (03CR) 10Daniel Kinzler: [C: 031] "It's what we want, and it works for me."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/263046 (https://phabricator.wikimedia.org/T123112) (owner: 10Thiemo Mättig (WMDE)) [18:01:42] (03CR) 10Jhernandez: [C: 031] Add sampling rates for mobile web language switcher [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265292 (https://phabricator.wikimedia.org/T123932) (owner: 10Bmansurov) [18:06:43] PROBLEM - citoid endpoints health on sca1002 is CRITICAL: /api (bad URL) is CRITICAL: Could not fetch url http://10.64.48.29:1970/api: Timeout on connection while downloading http://10.64.48.29:1970/api [18:06:43] PROBLEM - citoid endpoints health on sca1001 is CRITICAL: /api (bad URL) is CRITICAL: Could not fetch url http://10.64.32.153:1970/api: Timeout on connection while downloading http://10.64.32.153:1970/api [18:07:34] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:08:05] greg-g, cscott, gwicke - i added another earlier service deployment window at 09:00 PST on TUE and THU - this should make it easier for European and East Coast based services to be deployed :) [18:08:14] RECOVERY - citoid endpoints health on sca1001 is OK: All endpoints are healthy [18:08:20] and i didn't do this ^ [18:10:04] RECOVERY - citoid endpoints health on sca1002 is OK: All endpoints are healthy [18:11:03] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [18:11:16] (03CR) 10Mobrovac: [C: 04-1] cxserver, citoid -> cache_text cluster (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/266739 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [18:13:07] (03CR) 10Mobrovac: [C: 031] Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [18:13:23] (03CR) 10Chad: Also keep /srv/patches in sync between masters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266773 (owner: 10Chad) [18:13:26] (03CR) 10Alexandros 
Kosiaris: Also keep /srv/patches in sync between masters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266773 (owner: 10Chad) [18:13:28] (03PS6) 10Jean-Frédéric: Add 2 sites to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) (owner: 10Mdann52) [18:13:57] (03CR) 10Jean-Frédéric: "Rebased against master." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/262893 (https://phabricator.wikimedia.org/T122995) (owner: 10Mdann52) [18:14:03] (03CR) 10BBlack: cxserver, citoid -> cache_text cluster (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/266739 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [18:14:19] (03PS2) 10BBlack: restbase legacy hostnames -> cache_text cluster [dns] - 10https://gerrit.wikimedia.org/r/266749 (https://phabricator.wikimedia.org/T110475) [18:14:21] (03PS2) 10BBlack: cxserver, citoid -> cache_text cluster [dns] - 10https://gerrit.wikimedia.org/r/266739 (https://phabricator.wikimedia.org/T110476) [18:15:33] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [18:16:49] (03CR) 10Mobrovac: [C: 031] cache_parsoid: remove citoid+cxserver pass-through [puppet] - 10https://gerrit.wikimedia.org/r/266741 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [18:17:14] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [18:18:58] jynus or any other root, can you run netstat -ltpn on ruthenium (i don't have root to do that) to see what is running on port 58805 .. since we are getting some mysterious failures on some tests? [18:19:02] ex. http://parsoid-tests.wikimedia.org/resultFlagNew/b410e18e3e9b25ed487f92d24995502dc2782bc9/f1ddfb884e32715c8b16d5149ee9b5119fc7de74/enwiki/2015%20NASCAR%20Xfinity%20Series [18:20:32] nodejs [18:21:20] ah, no more info besides that? [18:21:35] testred+ 27506 0.0 0.3 964968 50304 ? 
Sl Jan26 0:10 /usr/bin/nodejs /usr/lib/parsoid/src/tests/../bin/server.js --num-workers 1 --config /usr/lib/parsoid/src/tests/testreduce/parsoid-rt-client.rttest.localsettings.js [18:22:07] thanks. [18:24:24] PROBLEM - puppet last run on mw2094 is CRITICAL: CRITICAL: Puppet has 1 failures [18:29:21] subbu: do you have a link to a test failure? [18:30:10] bblack, http://parsoid-tests.wikimedia.org/resultFlagNew/b410e18e3e9b25ed487f92d24995502dc2782bc9/f1ddfb884e32715c8b16d5149ee9b5119fc7de74/enwiki/2015%20NASCAR%20Xfinity%20Series .. looks like it is one of the parsoid workers that communicates with a test client. [18:30:44] with the ruthenium reimage .. we also got upgraded from node 0.10 to node 4.2 [18:30:44] this is tests btw, not production. [18:31:15] oh sorry didn't notice link above :) [18:31:21] the funny thing is that all the failures reported on http://parsoid-tests.wikimedia.org/regressions/between/b410e18e3e9b25ed487f92d24995502dc2782bc9/f1ddfb884e32715c8b16d5149ee9b5119fc7de74 .. (with a 1 in the error column) are from the same worker. [18:31:53] there are 8 separate test clients running and all the other 7 aren't reporting it .. [18:31:58] the test output says port 58580, you asked jynus 58805 [18:32:13] oh ... good catch. :) [18:32:43] 7Blocked-on-Operations, 6operations, 6Services, 7Graphite: scale graphite deployment (tracking) - https://phabricator.wikimedia.org/T85451#1970579 (10GWicke) [18:32:58] * subbu remembers not to trust his short term memory [18:33:31] nothing listening on 58580 at the moment [18:33:33] (03CR) 10Mobrovac: [C: 031] graphoid(.eqiad).wm.o VCL removal [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) (owner: 10BBlack) [18:34:08] 6operations, 10Parsoid, 6Services, 10service-template-node, 7service-runner: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1970580 (10GWicke) p:5High>3Normal [18:34:09] hmm .. interesting. 
[18:34:45] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1970581 (10GWicke) p:5High>3Normal [18:35:10] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1524920 (10GWicke) Lowered priority as the main multi-DC goal is reached, and the main remaining bit is adding encryption for c... [18:35:12] subbu: "nothing listening on that port" explains the ECONNREFUSED [18:36:27] yup .. the error message is not helpful .. i don't know if it is parsoid or if it is the test client code .. maybe i should add more error logs. [18:39:39] (03PS1) 10Jcrespo: Repool pc1006 after cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266787 (https://phabricator.wikimedia.org/T121888) [18:40:20] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1970606 (10Dzahn) [18:41:34] subbu: if you have the ability to re-test older revs, you could figure out whether it's a test setup problem or a real code regression [18:41:44] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 60.71% of data above the critical threshold [5000000.0] [18:42:39] bblack, i have that ability .. but the only change is that we upgrade from node 0.10 to node 4.2 .. so, i suspect it is exposing something. [18:42:55] ah [18:43:27] so, we'll have to figure this out before we consider upgrading production to node 4.2 :) [18:44:21] nodejs 4.2 changelog: added default feature to randomly refuse connections to reduce server load for the performance win! 
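The ECONNREFUSED diagnosis above (nothing listening on the port, so the kernel sends an immediate refusal) can be sketched with a small TCP probe. This is an illustrative sketch only, not the test client's actual code; the host and port below are placeholders. It also distinguishes the other failure mode that comes up later in this log, a silent timeout, which typically points at a firewall or router ACL dropping packets rather than a missing listener.

```python
import socket

def probe(host, port, timeout=2.0):
    """Return 'open', 'refused', or 'timeout' for a TCP connect attempt."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        # Immediate RST: host is up, but nothing listens on the port
        # (the ECONNREFUSED case discussed above).
        return "refused"
    except socket.timeout:
        # No reply at all: packets silently dropped, e.g. by an ACL.
        return "timeout"
    finally:
        s.close()

# Placeholder target; 58580 is the port from the failing test above.
print(probe("127.0.0.1", 58580))
```

A "refused" result here would match what the test client saw: the worker process simply wasn't listening.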
[18:44:47] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1970623 (10ssastry) I've asked @trevorparscal to approve. But, one other sudo permission require... [18:45:52] bblack, :) [18:46:34] (03CR) 10Jcrespo: [C: 032] Repool pc1006 after cloning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266787 (https://phabricator.wikimedia.org/T121888) (owner: 10Jcrespo) [18:48:26] !log HHVM on mw1019 still dying on a regular basis with "Lost parent, LightProcess exiting" [18:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:20] bd808, is it time to make a ticket? [18:49:58] !log jynus@mira Synchronized wmf-config/db-eqiad.php: Repool pc1006 after cloning (duration: 02m 25s) [18:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:39] Krenair: J.oe said mysteriously 2 days ago that he knew what the problem was and that it was a "red herring". Something about it having not been restarted in a year. Maybe that server is depooled and just puking due to health checks? 
[18:50:53] RECOVERY - puppet last run on mw2094 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:50:58] it is cluttering the fatalmonitor logs for sure [18:53:02] (03PS1) 10Subramanya Sastry: ruthenium: Rename parsoid systemd file + start up a single worker [puppet] - 10https://gerrit.wikimedia.org/r/266788 [18:55:44] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0] [18:57:43] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [18:58:26] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1970654 (10Dzahn) In the private puppet repository, on palladium, in `/root/private/modules/secret/secrets/nagios/contacts.cfg` , i added: ``` define contact{ contact_name... [18:59:20] tgr, anomie: group0 seems in pretty good shape from what i can see. any concerns about group1 promotion today? [19:00:05] marxarelli: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160127T1900). Please do the needful. [19:01:14] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [19:01:18] (03CR) 10Mobrovac: [C: 031] ruthenium: Rename parsoid systemd file + start up a single worker [puppet] - 10https://gerrit.wikimedia.org/r/266788 (owner: 10Subramanya Sastry) [19:01:25] marxarelli: I have no concerns [19:01:27] marxarelli: haven't seen any new issues since wmf11->group0 [19:01:42] great [19:02:40] i do see loads of "parent, LightProcess exiting" on fluorine but jynus (or someone), this is a known issue, right? [19:03:17] Krenair: ^ ? [19:03:29] is it from mw1019 marxarelli? [19:03:36] no, it happening on mira is known [19:03:43] the other is 19 or something else [19:04:15] ah, yes.
it's just 19 [19:04:39] yes, known [19:04:44] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1970675 (10Dzahn) In the public repo in `nagios_common/files/contactgroups.cfg` there is a contact_group called "sms". This is the critical one for paging. The newly created contacts would be a... [19:05:00] Krenair: jynus: known and OK I presume? :) Also, is there a task for it? [19:05:08] 7Puppet, 6operations, 6Release-Engineering-Team, 6Services, 7service-runner: Create a standard puppet module for service-runner services - https://phabricator.wikimedia.org/T89901#1970677 (10GWicke) @mobrovac, should we resolve this task? [19:05:11] alright then, will promote group1 shortly [19:05:55] Krenair, not ok, but not causing issues [19:06:00] ^greg [19:06:02] greg-g, I asked the same thing earlier [19:06:08] well [19:06:09] sort of [19:06:13] 31<Krenair>30 bd808, is it time to make a ticket? [19:06:22] but I am talking about mira, not the other [19:06:32] 21 Krenair: J.oe said mysteriously 2 days ago that he knew what the problem was and that it was a "red herring". Something about it having not been restarted in a year. Maybe that server is depooled and just puking due to health checks? [19:06:46] that should be checked [19:06:48] let's get a task so we have more than irc logs [19:06:50] (03PS5) 10BBlack: graphoid(.eqiad).wm.o VCL removal [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) [19:08:12] (03CR) 10BBlack: [C: 032] "Should be safe! 
If revert is necessary, also revert https://gerrit.wikimedia.org/r/#/c/266731/" [puppet] - 10https://gerrit.wikimedia.org/r/266732 (https://phabricator.wikimedia.org/T110477) (owner: 10BBlack) [19:08:24] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [19:08:25] one day i'll try to make a bot where you just say !IRC2PHAB 10 or something and it creates a task and copies the last couple lines over there [19:08:34] (03CR) 10BBlack: "If it looks necessary to revert this, also revert https://gerrit.wikimedia.org/r/#/c/266732/" [dns] - 10https://gerrit.wikimedia.org/r/266731 (https://phabricator.wikimedia.org/T110477) (owner: 10BBlack) [19:08:47] (03CR) 10Dereckson: "It's not really the point. The point is more to have a correct handling of fatal errors and die nicely instead of have a cascading of erro" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/265619 (owner: 10Dereckson) [19:10:02] 6operations, 10Analytics, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1970744 (10Ottomata) 3NEW [19:10:11] 6operations, 10Analytics, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1970755 (10Ottomata) [19:11:06] robh, do you know when T124701 will be approved (i.e. when is the ops meeting)? 
[19:11:09] (03PS3) 10BBlack: Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) [19:11:20] (03CR) 10BBlack: [C: 032] Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [19:11:34] (03PS1) 10Dduvall: group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266795 [19:11:39] (03CR) 10BBlack: [V: 032] Text VCL: Add support for citoid+cxserver passes [puppet] - 10https://gerrit.wikimedia.org/r/266740 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [19:11:43] subbu: yep, so we need to append in journalctl service access? [19:12:02] yep = yes the patch addition has to have ops meeting to approve everyone getthing the sudo rights [19:12:05] yes please so i can look at logs. [19:12:19] so you need to sudo as the user, not a service? [19:12:54] to look at logs use this: [19:12:56] 'ALL = NOPASSWD: /bin/journalctl *'] [19:13:01] 6operations, 6Analytics-Kanban, 10hardware-requests: 8 x 3 SSDs per for AQS nodes. - https://phabricator.wikimedia.org/T124947#1970805 (10Ottomata) 3NEW a:3JAllemandou [19:13:02] (03CR) 10Dduvall: [C: 032] group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266795 (owner: 10Dduvall) [19:13:14] 6operations, 6Analytics-Kanban, 10hardware-requests: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1970814 (10Ottomata) [19:13:21] i don't understand the distinction .. but right now all the services are logging to 'journal' in the systemd files .. so i / parsoid-rt-admin members need to be able to view them. [19:13:25] (03Merged) 10jenkins-bot: group1 wikis to 1.27.0-wmf.11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266795 (owner: 10Dduvall) [19:13:26] hurray for well-formatted json. 
so much easier to verify the wikiversions diff [19:13:27] 6operations, 6Analytics-Kanban, 10hardware-requests: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1970805 (10Ottomata) [19:13:57] subbu: cool, mutante gave the answer. So yea, the other rights I gave are for services, now you need to read that file as that user so it should be what mutante put [19:14:02] !log dduvall@mira rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.27.0-wmf.11 [19:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:13] subbu: robh: journalctl * is what we do, it will let users read all logs, trying to limit that per service doesnt really work [19:14:17] I'll append it into the patchset and on Monday we can get the meeting review to allow it [19:14:29] mutante: duly noted, thank you! [19:14:48] robh, monday. ok. [19:14:50] at least not with "journalctl -u service *" [19:14:59] (03PS2) 10Dereckson: Get rid of $wg = $wmg for BetaFeatures [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266470 (https://phabricator.wikimedia.org/T119117) [19:16:30] (03CR) 10Dereckson: "PS2: addressed PS1 comment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266470 (https://phabricator.wikimedia.org/T119117) (owner: 10Dereckson) [19:17:27] (03PS2) 10RobH: creation of parsoid-rt-admin group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) [19:17:28] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1970836 (10Dzahn) @elukey could you logout of Icinga, log back in with "elukey" (non-capitalized) and the normal LDAP/wikitech password, then execute a command, like send a "custom notification"... 
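Put together, the sudo rights discussed above (the service-management rights from the patch plus the `journalctl` line mutante quoted) would end up as sudoers-style entries roughly like the following. This is an illustrative sketch: the group name comes from the patch under review, but the exact command paths and how the puppet admin module renders them into sudoers are assumptions.

```
# Sketch of parsoid-rt-admin privileges (paths and layout are assumptions):
%parsoid-rt-admin ALL = NOPASSWD: /usr/sbin/service parsoid *
%parsoid-rt-admin ALL = NOPASSWD: /bin/journalctl *
```

As noted in the channel, the wildcard on `journalctl` grants read access to all units' logs; trying to restrict it per service (e.g. `journalctl -u service *`) doesn't hold up, which is why the blanket form is used.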
[19:18:35] (03CR) 10jenkins-bot: [V: 04-1] creation of parsoid-rt-admin group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [19:18:44] (03PS3) 10RobH: creation of parsoid-rt-admin group [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) [19:20:21] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1970860 (10RobH) I've updated the patchset to include: 'ALL = NOPASSWD: /bin/journalctl *' which... [19:21:05] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#1970876 (10Ottomata) 3NEW [19:22:36] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#1970895 (10Ottomata) [19:22:56] 6operations, 6Analytics-Kanban, 10hardware-requests: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1970805 (10Ottomata) [19:22:58] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: Hadoop Node expansion for end of FY - https://phabricator.wikimedia.org/T124951#1970876 (10Ottomata) [19:23:01] 6operations, 10Analytics, 10Analytics-Cluster, 6Analytics-Kanban, 10hardware-requests: New Hive / Oozie server node in eqiad Analytics VLAN - https://phabricator.wikimedia.org/T124945#1970744 (10Ottomata) [19:25:36] (03CR) 10Subramanya Sastry: creation of parsoid-rt-admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [19:27:37] marxarelli: while you watch the fatalmonitor, can you report the hhvm light process issue in phab, cc'ing jynu.s and _joe._ ? 
kthx (if it hasn't already been, I may have missed it) [19:27:38] 6operations, 7Icinga: icinga contacts and permissions for ema and elukey - https://phabricator.wikimedia.org/T124941#1970923 (10Dzahn) The meta check "Check correctness of the icinga configuration" ([[ https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=neon&service=Check+correctness+of+the+ici... [19:28:11] greg-g: sure thing [19:30:15] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [19:30:16] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [19:31:35] !log stat1002 - running puppet, was reported as last run about 4 hours ago but not deactivated [19:31:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:44] aaah: [19:31:49] redis.exceptions.ConnectionError: Error connecting to mira.codfw.wmnet:6379. timed out. [19:31:59] !log stat1002 - redis.exceptions.ConnectionError: Error connecting to mira.codfw.wmnet:6379. timed out. [19:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:33:29] 6operations, 5Patch-For-Review: rack/setup pc1004-1006 - https://phabricator.wikimedia.org/T121888#1970936 (10jcrespo) [19:34:23] 6operations, 5Patch-For-Review: rack/setup pc1004-1006 - https://phabricator.wikimedia.org/T121888#1970937 (10jcrespo) 5Open>3Resolved pc100[456] are in production and pc100[123] are depooled: https://grafana.wikimedia.org/dashboard/db/server-board?from=1453318327796&to=1453922887796&var-server=pc1*&var-n... 
[19:34:31] 6operations, 10Traffic: Decrease max object TTL in varnishes - https://phabricator.wikimedia.org/T124954#1970940 (10BBlack) 3NEW [19:34:54] PROBLEM - puppet last run on cp1060 is CRITICAL: CRITICAL: Puppet has 1 failures [19:35:20] 6operations, 10Analytics-Cluster: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1970956 (10Dzahn) [19:35:53] ottomata: https://phabricator.wikimedia.org/T124955 [19:36:05] PROBLEM - puppet last run on cp1059 is CRITICAL: CRITICAL: Puppet has 1 failures [19:36:31] mutante, do you know what is happening? [19:36:40] jynus: i am guessing this server used to connect to tin.eqiad.wmnet in the past [19:37:02] jynus: and now we switched deployment servers to mira, so it tries to use that.. but there are missing ACLs or firewall rules [19:37:09] yes, an error, but nothing ongoing, right? [19:37:14] letting a server from analytics connect to mira [19:37:27] i dont really know what is broken if the redis on stat1002 cant connect [19:37:30] some stats i assume [19:38:08] something for discovery analytics, maybe numbers are wrong, but nothing like downtime [19:38:09] 6operations, 5WMF-deploy-2016-01-19_(1.27.0-wmf.11): Rise in "parent, LightProcess exiting" fatals on mw1019 since 1.27.0-wmf.11 deploy - https://phabricator.wikimedia.org/T124956#1970973 (10dduvall) 3NEW [19:38:38] PROBLEM - puppet last run on cp1047 is CRITICAL: CRITICAL: Puppet has 1 failures [19:39:20] there are varnish puppet errors, is that you bblack ? [19:40:16] possibly! looking [19:41:17] sorry, I was seeing too many errors and got nervous [19:41:39] yeah it's me, somehow [19:41:42] not for this thing, in general [19:42:13] PROBLEM - puppet last run on cp1046 is CRITICAL: CRITICAL: Puppet has 1 failures [19:42:19] everyting seems fine [19:42:26] well yeah I meant the puppet fails on cp10xx are me. they're not causing problems. 
[19:42:34] ACKNOWLEDGEMENT - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures daniel_zahn https://phabricator.wikimedia.org/T124955 [19:43:28] ah, so re: stat1002, what it's doing is trying to deploy this: [19:43:31] Error: /Stage[main]/Role::Elasticsearch::Analytics/Package[wikimedia/discovery/analytics]/ [19:43:40] but it cant deploy it, because cant talk to mira [19:43:45] it was the combination of cps and stat what got me nervous, ignore me [19:43:54] i dont think it's an issue besides "no new deploys" [19:44:05] yeah my brain started assuming cache_mobile was no longer relevant, but of course it still (barely) is :P [19:44:45] 7Blocked-on-Operations, 10Deployment-Systems, 10RESTBase, 6Services: RESTBase deployment process - https://phabricator.wikimedia.org/T103344#1971004 (10GWicke) [19:44:53] mutante: hmm, sorry i didn't realize we were in a no new deploys ATM [19:45:14] ebernhardson: no, i'm just saying it's broken [19:45:17] ebernhardson, he means that it is technically impossible now [19:45:19] :-) [19:45:23] oh :) [19:45:34] mutante: but yes that is us, and it was just deployed to puppet this morning [19:45:43] i think the issue is: analytics network needs to be allowed to talk to deployment server in codfw [19:45:59] i think it already can, because this is the same deployment method used by analytics for their refinery repository [19:46:06] but maybe only tin, and not mira? [19:46:07] ebernhardson: the problem is redis.exceptions.ConnectionError: Error connecting to mira.codfw.wmnet:6379. timed out. 
[19:46:13] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [24.0] [19:46:25] akosiaris mentioned something about not being a fan of cross-DC ACL's so it might make sense that only tin can talk to analytics [19:46:29] ebernhardson: yes, but tin is eqiad and mira is codfw, and i think there are only ACLs for eqiad [19:46:35] there are now some cross-datacenter issues [19:46:39] ebernhardson: yea [19:46:54] like the one we found yesterday about db writes from codfw [19:47:10] (03PS1) 10BBlack: Add cxserver/citoid to cache_mobile [puppet] - 10https://gerrit.wikimedia.org/r/266799 (https://phabricator.wikimedia.org/T110476) [19:47:14] I agree with that, we do not necessarily want that [19:47:24] (03CR) 10BBlack: [C: 032 V: 032] Add cxserver/citoid to cache_mobile [puppet] - 10https://gerrit.wikimedia.org/r/266799 (https://phabricator.wikimedia.org/T110476) (owner: 10BBlack) [19:48:01] mutante: i suppose the question is what to do, i can put together a patch to back out the repository until tin is back in service. Unless the plan is for tin to become the backup and mira to stay primary [19:48:15] !log started nfs-exports daemon on labstore1001, had been dead for a few days [19:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:48:38] 6operations, 10Analytics-Cluster: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971034 (10Dzahn) what it's doing is trying to deploy wikimedia/discovery/analytics and it can't deploy it because of the redis connection timeout. Error: Execution of '/usr/bin/sal... [19:49:04] RECOVERY - puppet last run on cp1060 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:49:05] ebernhardson: i made this https://phabricator.wikimedia.org/T124955 maybe you can link that patch there?
[19:49:25] (03PS2) 10Cenarium: Autopromotion: remove deprecated onView event, fix INGROUPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264219 [19:49:25] RECOVERY - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is OK: OK - nfs-exports is active [19:49:44] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [19:49:47] 6operations, 10Analytics-Cluster, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971050 (10jcrespo) [19:49:49] ebernhardson: afaik, we want to switch to mira for at least 48 hours but then back, but i also have to ask [19:50:05] 6operations, 10Analytics-Cluster, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971054 (10EBernhardson) https://gerrit.wikimedia.org/r/#/c/265795/ is the patch that added this, it adds a new user to analytics mac... [19:50:56] (03CR) 10Cenarium: "So that's why they were commented out, OK I've fixed that." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264219 (owner: 10Cenarium) [19:51:30] 6operations, 10Analytics-Cluster, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971066 (10EBernhardson) I would also note that this means analytics can't deploy new versions of refinery as long as mira is master... [19:52:22] 6operations, 10Analytics-Cluster, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971075 (10Dzahn) It might need #netops because ACLs on network hardware might have to be adjusted, since the analytics VLAN is separ... 
[19:52:44] RECOVERY - puppet last run on cp1046 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [19:53:44] RECOVERY - puppet last run on cp1059 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:54:04] RECOVERY - puppet last run on cp1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:54:37] 6operations, 10ops-eqiad, 10DBA: Decommission pc1001-1003 - https://phabricator.wikimedia.org/T124962#1971085 (10jcrespo) 3NEW [19:57:17] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are redirecting to wikimedia.org - https://phabricator.wikimedia.org/T124804#1971095 (10Krenair) [19:58:46] Krenair: they were redirecting to wikimediafoundation.org, though. [19:59:09] (03CR) 10Alex Monk: [C: 031] Autopromotion: remove deprecated onView event, fix INGROUPS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/264219 (owner: 10Cenarium) [19:59:15] PROBLEM - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 46 below the confidence bounds [19:59:20] MatmaRex, I thought some were showing the portal page from wikimedia.org? [19:59:46] hmm. maybe? the ones i've seen were doing a HTTP redirect to wikimediafoundation.org, though. [20:00:26] twentyafterfour: "There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). " [20:00:29] for 45 minutes now [20:00:31] ebernhardson: i checked the iptables rules, they exist on tin and mira, and allow that connection from that IP on that port, it must be on the network hardware [20:00:32] (on tin) [20:01:28] ottomata: eventlog1001 has puppet disabled with no reason specified [20:01:32] paravoid: would you have time to look at a router ACL maybe? 
stat1002 in analytics can't talk to mira, but it can talk to tin, i believe we are missing one to allow the codfw part [20:01:52] oh woops, thanks paravoid, that is leftover from yesterday's wikimediafoundation outage [20:02:30] fixed. [20:02:31] thanks [20:03:14] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [20:03:23] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1971105 (1... [20:03:33] 6operations, 10Analytics-Cluster, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971106 (10Dzahn) i checked ferm/iptables rules on tin and mira. they are the same and allow connections to 6379 (the redis port) fro... [20:03:54] 6operations, 10Analytics-Cluster, 10netops, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971110 (10Dzahn) [20:05:03] RECOVERY - puppet last run on eventlog1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:05:07] 6operations, 10Analytics-Cluster, 10netops, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971121 (10Ottomata) Yeah, makes sense! stat1002 is in the Analytics VLAN, so a rule will need to be opened up in the VL... [20:05:55] 6operations, 6Analytics-Kanban, 10hardware-requests: 8 x 3 SSDs for AQS nodes. 
- https://phabricator.wikimedia.org/T124947#1971123 (10Ottomata) [20:06:55] PROBLEM - Disk space on ms-be2003 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdg1 is not accessible: Input/output error [20:07:34] PROBLEM - RAID on ms-be2003 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [20:08:49] 6operations, 10Parsoid, 6Services, 10service-template-node, 7service-runner: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1971133 (10mobrovac) [20:08:52] 7Puppet, 6operations, 6Release-Engineering-Team, 6Services, 7service-runner: Create a standard puppet module for service-runner services - https://phabricator.wikimedia.org/T89901#1971131 (10mobrovac) 5Open>3Resolved a:5GWicke>3mobrovac [20:09:00] 7Puppet, 6operations, 6Release-Engineering-Team, 6Services, 7service-runner: Create a standard puppet module for service-runner services - https://phabricator.wikimedia.org/T89901#1048322 (10mobrovac) Indeed @Gwicke :) Done. [20:10:11] mutante: should be fixed I think [20:10:17] 6operations, 6Analytics-Kanban, 10hardware-requests: 8 x 3 SSDs for AQS nodes. - https://phabricator.wikimedia.org/T124947#1971137 (10JAllemandou) When discussing about cassandra response time issues with @Gwicke, he told me the Services Team had used SSDs to mitigate that issue. They use Samsung 850 Pro 1Tb... [20:10:34] paravoid: thank you :) [20:13:04] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:13:05] ebernhardson: paravoid: ottomata: confirmed fixed Package[wikimedia/discovery/analytics]/ensure: ensure changed 'purged' to 'present' [20:13:08] ^ [20:14:01] mutante: thanks! [20:15:43] 6operations, 10Analytics-Cluster, 10netops, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971162 (10Dzahn) fixed by @Faidon , thanks! 
-- confirmed working now: Package[wikimedia/discovery/analytics]/ensure:... [20:15:48] danke! [20:16:01] 6operations, 10Analytics-Cluster, 10netops, 5codfw-rollout, 3codfw-rollout-Jan-Mar-2016: stat1002 - redis can't connect to mira.codfw.wmnet - https://phabricator.wikimedia.org/T124955#1971166 (10Dzahn) 5Open>3Resolved a:3Dzahn [20:16:20] greg-g: heading to lunch. things looks fine according to fatalmonitor, so a tentative \o/ [20:16:41] anomie, tgr: thanks for fixing all the things :) [20:17:53] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: Puppet has 1 failures [20:18:15] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 60.71% of data above the critical threshold [5000000.0] [20:19:03] 7Blocked-on-Operations, 6operations, 6Discovery, 3Discovery-Search-Sprint: Make elasticsearch cluster accessible from analytics hadoop workers - https://phabricator.wikimedia.org/T120281#1971185 (10EBernhardson) Another option for analytics<->codfw that me and @SMalyshev just talked about would be using an... 
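The Icinga alerts throughout this log report figures like "60.71% of data above the critical threshold [5000000.0]". A minimal sketch of that style of check follows; it is illustrative only, since the real Graphite check plugin also fetches the series over HTTP and handles time windows and null-heavy series differently.

```python
def percent_over(datapoints, threshold):
    """Percentage of non-null datapoints strictly above threshold."""
    pts = [p for p in datapoints if p is not None]  # Graphite can return nulls
    if not pts:
        return 0.0
    return 100.0 * sum(1 for p in pts if p > threshold) / len(pts)

# Hypothetical replica-lag samples; the alert fires when the percentage
# exceeds the configured critical fraction (e.g. 50%).
lag = [6_200_000, 4_900_000, 5_600_000, 7_100_000]
print(percent_over(lag, 5_000_000))
```

With these made-up samples, 3 of 4 points exceed the 5,000,000 threshold, which is the kind of majority-over-threshold condition that flips the check to CRITICAL.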
[20:20:27] marxarelli|afk: sweet [20:21:34] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail [20:23:54] (03PS2) 10Ori.livneh: ruthenium: Rename parsoid systemd file + start up a single worker [puppet] - 10https://gerrit.wikimedia.org/r/266788 (owner: 10Subramanya Sastry) [20:25:19] (03CR) 10Ori.livneh: [V: 032] ruthenium: Rename parsoid systemd file + start up a single worker [puppet] - 10https://gerrit.wikimedia.org/r/266788 (owner: 10Subramanya Sastry) [20:25:31] (03CR) 10Ori.livneh: [C: 032] ruthenium: Rename parsoid systemd file + start up a single worker [puppet] - 10https://gerrit.wikimedia.org/r/266788 (owner: 10Subramanya Sastry) [20:31:23] PROBLEM - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 45 below the confidence bounds [20:34:00] man why anomaly detection gotta be all weird [20:36:03] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0] [20:46:37] how's the train doing? [20:46:57] any blockers for the parsoid/ocg deploy window in 15 min? [20:48:13] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:48:14] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1971299 (1... [20:51:08] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops, 5Patch-For-Review: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafounda... 
- https://phabricator.wikimedia.org/T124804#1971323 [20:51:16] 6operations, 10Incident-20160126-WikimediaDomainRedirection, 10Wikimedia-Apache-configuration, 10netops: All wikis under wikimedia.org (Meta, Commons, loginwiki, others) are using wikimedia.org VHost, including /wiki/ -> wikimediafoundation.org redirect - https://phabricator.wikimedia.org/T124804#1971326 (1... [20:53:01] (03CR) 10Dzahn: creation of parsoid-rt-admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [20:54:30] (03CR) 10Dzahn: creation of parsoid-rt-admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [20:55:32] (03CR) 10Dzahn: [C: 031] make default log rotation for apache be 30 days [puppet] - 10https://gerrit.wikimedia.org/r/266480 (owner: 10ArielGlenn) [20:56:03] RECOVERY - Kafka Cluster analytics-eqiad Broker Messages In Per Second on graphite1001 is OK: OK: No anomaly detected [20:56:51] (03CR) 10Dzahn: [C: 04-1] "sorry, -1 unless we get the SSL cert issue resolved with letsencrypt some time later" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [20:57:12] GOOD better NOT have an anomaly when you don't [20:57:14] better stay like that! [20:59:31] warning: abnormal anomalies detected :p [21:00:04] gwicke cscott arlolra subbu bearND mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160127T2100). 
[21:00:15] (03PS2) 10Dzahn: phabricator: don't use communitymetrics@, use wikitech [puppet] - 10https://gerrit.wikimedia.org/r/266316 (https://phabricator.wikimedia.org/T123581) [21:00:24] (03CR) 10Dzahn: [C: 032] phabricator: don't use communitymetrics@, use wikitech [puppet] - 10https://gerrit.wikimedia.org/r/266316 (https://phabricator.wikimedia.org/T123581) (owner: 10Dzahn) [21:01:10] robh, greg-g, marxarelli|afk: any update on the train deploy? i'm assuming it has completed successfully and we are not currently in an outage and i'm clear to deploy ocg? [21:01:43] (03CR) 10Odder: "Hugely disappointing as the redirection doesn't work any longer." [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [21:01:54] RECOVERY - Disk space on ms-be2003 is OK: DISK OK [21:03:55] cscott: yeah I think you are all good [21:07:21] cscott, are you deploying? I need to deploy something too, pls ping me when done [21:10:17] yurik: yup, on it. i'll ping you when done. shouldn't be long (assuming the world doesn't break) [21:14:56] * yurik thinks the world shouldn't break more than twice in one day... or was it yesterday? [21:15:59] 6operations, 7Mail, 7Mobile, 5Patch-For-Review: consolidate mailman redirects in exim aliases file - https://phabricator.wikimedia.org/T123581#1971422 (10Dzahn) @Jkrauska can you do this kind of thing on your side even if both addresses are external on lists? Or should that stay in exim? Do you happen to k... 
[21:20:22] (03CR) 10Dzahn: "I understand that must be very disappointing after all that time WMF let you wait on this just to donate a domain and i'm sorry for the wa" [dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [21:22:14] (03CR) 10Subramanya Sastry: creation of parsoid-rt-admin group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/266632 (https://phabricator.wikimedia.org/T124701) (owner: 10RobH) [21:22:41] (03PS1) 10Ori.livneh: Speed trials: fix-up for inlined CSS variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266918 [21:22:55] (03CR) 10Ori.livneh: [C: 032] Speed trials: fix-up for inlined CSS variant [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266918 (owner: 10Ori.livneh) [21:23:12] hola mutante [21:23:40] letsencrypt looks really nice, any chance WMF might actually sponsor them? [21:24:05] odder: yes, it has been discussed, we are just not there yet [21:24:33] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [21:26:07] (hi odder!) [21:26:14] odder: the way this went since back in RT days is really unfortunate, sorry in the name of WMF, don't abandon that yet [21:26:22] !log ori@mira Synchronized docroot and w: (no message) (duration: 02m 26s) [21:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:26:31] !log updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf [21:26:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:27:25] hi ori! long time no see [21:27:36] cscott: gah! sorry, i neglected to update the roadmap [21:28:04] just to confirm, yeah, group1 is on wmf.11 [21:28:14] odder: yeah, how have you been? 
[21:28:21] odder: some more info on letsencrypt and related ticket https://phabricator.wikimedia.org/T101048 [21:29:10] !log mobileapps deployed 6f35859 [21:29:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:29:26] yurik: ok, i'm done. [21:29:55] cscott, thx. I saw that we switched off from tin. What do i need to do to set up git deploy on the new host? [21:30:40] ori: Been alright! Donated a domain the other day to the WMF and trying to unsquat a few others [21:31:03] dem f^%$s keep renewing them though, and probably not worth to get the lawyers involved, I don't think [21:31:25] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1971467 (10TrevorParscal) Approved. [21:31:58] yurik: nothing, as far as I could tell. i just logged into the new host and everything was there already. [21:32:15] yurik: https://wikitech.wikimedia.org/w/index.php?title=OCG&type=revision&diff=274824&oldid=270998 [21:32:46] cscott, it complains on git deploy start about missing user.name & user.email. Will see if i need anything else [21:33:01] (03CR) 10Odder: "I'd say let's wait for letsencrypt and make sure to dig this patch up when it's all ready and shiny." 
[dns] - 10https://gerrit.wikimedia.org/r/254294 (owner: 10Odder) [21:35:52] 10Ops-Access-Requests, 6operations, 6Parsing-Team, 5Patch-For-Review: Getting parsing-team members sudo access to manage (start, stop, restart) services on ruthenium - https://phabricator.wikimedia.org/T124701#1971477 (10ssastry) a:5ssastry>3RobH [21:44:44] PROBLEM - graphoid endpoints health on sca1001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:45:13] PROBLEM - graphoid endpoints health on sca1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) [21:45:39] !log updated graphoid on scb* [21:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:45:54] checking why sca is showing issues ^ [21:46:15] mobrovac, is this test still pointing to sca^, or is it really on scb? [21:52:16] yurik: no, those are the tests running on sca100x, we need to stop graphoid there [21:52:49] mobrovac, is git deploy still deploys there? 
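The `git deploy start` complaint about missing `user.name` and `user.email` a few messages up is git's standard identity check; a minimal sketch of the usual fix (the name and address below are placeholders, not values from this log):

```shell
# git refuses to create commits or tags until an identity is configured;
# `git deploy start` (Trebuchet) tags the repo, which is presumably why it
# trips over this. Placeholder identity values:
git config --global user.name "Your Name"
git config --global user.email "you@example.org"
git config --get user.name    # verify the setting took effect
```

With those two values in `~/.gitconfig` the deploy tooling should stop complaining on the new host.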
[21:53:00] i thought sca100x was removed [21:53:12] i just did a full graphoid deployment + restart [21:53:41] yurik: you should have seen in the output of trebuchet that 2/4 minions succeeded [21:55:06] (03PS1) 10Papaul: admin: add dc-ops to install-server, allow to read log files [puppet] - 10https://gerrit.wikimedia.org/r/266919 [22:00:25] (03CR) 10Papaul: "I am able now to run puppet agent -t -v from carbon but i able not able to view syslog to troubleshoot MAC address issues when a new syst" [puppet] - 10https://gerrit.wikimedia.org/r/266919 (owner: 10Papaul) [22:01:41] 6operations, 10ops-codfw, 10hardware-requests: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#1971588 (10Papaul) 5Open>3Resolved Closing this since the system is back up [22:03:05] 6operations, 10ops-codfw, 5Patch-For-Review: mw2039 fails to reboot, mgmt interface unreachable - https://phabricator.wikimedia.org/T124282#1971594 (10Papaul) 5Open>3Resolved Closing, system is back up in service. [22:04:01] 6operations, 10ops-codfw, 10hardware-requests: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#1971597 (10Papaul) 5Resolved>3Open [22:05:01] 6operations, 10ops-codfw, 10hardware-requests: mw2173 has probably a broken disk, needs substitution and reimaging - https://phabricator.wikimedia.org/T124408#1955110 (10Papaul) Was en error closed this ticket by mistake. [22:12:34] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 58.33% of data above the critical threshold [5000000.0] [22:16:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 51.85% of data above the critical threshold [5000000.0] [22:21:05] robh: would it be possible for you dump the last 1000 lines of parsoid-rt and parsoid-rt-client logs (from ruthenium) to /tmp/ that I can take a look at? 
There is the mysterious error in testing and I want to take a look at the logs to see if it reveals something. [22:22:15] i think i can pipe into your home directory, should be ok and you can just rm it when you finish [22:22:23] that ok? [22:22:32] if you can read tmp thats cool doo [22:22:32] too [22:22:41] sounds good. [22:22:41] thanks. [22:23:05] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:23:05] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0] [22:23:31] i can read /tmp [22:23:33] subbu: Where do these log files you want live? just checking and /var/log doesnt seem to have those [22:24:06] robh, journalctl -fu .. but might need to be sudo / root. [22:24:35] journalctl -fu parsoid-rt and journalctl -fu parsoid-rt-client [22:25:14] if they are not there either .. then, i need to fix the logging setup with systemd next. :) [22:25:55] so that gives me the realtime output, not a look backward [22:26:13] trying to review how to do a historical tail [22:26:28] mobrovac, do you know ^^ .. 
[22:27:13] robh: just omit the "f" for a historical tail [22:27:24] journalctl -u parsoid-rt-client [22:27:55] yea but that still is a more|less type review, i just want it to grab the last 1k lines and shove into a file [22:28:46] and it starts at the start of the log file, where i want the end of it in a snapshot of just the last x lines (in this case 1k) [22:28:58] robh: journalctl -n 1000 -u service_name > /tmp/blah.log [22:29:29] !log starting mysqldump of MobileWebSectionUsage_14321266 from db1047 into m4-master [22:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:29:51] mobrovac: you rock [22:30:00] haha [22:30:14] i thought it wouldnt like that since when not piped it shows the more type fashion but nope its cool it likes it [22:30:49] yeah, it tests for tty before starting the output [22:31:44] subbu: they are in tmp [22:31:48] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1971690 (10mobrovac) 3NEW [22:31:58] subbu: im going to sudo you ownership so you can rm when done [22:32:04] great. thanks. [22:32:05] and remove other read rights [22:32:13] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1971709 (10mobrovac) [22:32:17] 6operations, 6Services: Migrate SCA cluster to SCB (Jessie and Node 4.2) - https://phabricator.wikimedia.org/T96017#1971710 (10mobrovac) [22:33:00] they are all yours [22:33:17] hope it helps, lemme know if you need more of them after you fix the issue =] [22:33:25] (or more in general) [22:36:14] robh .. can you restart parsoid-rt-client service? i am curious if it was just some bad state one of the test clients was stuck in .. once the services came up after successful puppet run.
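The journalctl recipes worked out in the exchange above, collected into one sketch (the unit names are the ones discussed; the `/tmp` path is illustrative, and the whole thing is guarded so it is a no-op on a host without systemd):

```shell
# Guard: only run where journalctl exists (i.e. on a systemd host).
if command -v journalctl >/dev/null 2>&1; then
    # Historical view from the start of the unit's journal
    # (the pager would kick in on a tty; --no-pager suppresses it):
    journalctl --no-pager -u parsoid-rt-client | head -n 5
    # Grab only the last 1000 entries and dump them to a file; journalctl
    # tests for a tty before starting output, so when redirected it emits
    # plain text and no pager:
    journalctl -n 1000 -u parsoid-rt > /tmp/parsoid-rt.log
fi
# Realtime follow (omit -f for the historical look above):
#   journalctl -f -u parsoid-rt-client
```

The tty check is what made the `> /tmp/blah.log` redirection "just work" in the exchange, with no extra `cat` or `--no-pager` needed.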
[22:36:46] !log restarting parsoid-rt-client service on ruthenium [22:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:36:52] doing now [22:36:55] done [22:37:51] thanks. [22:43:52] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1971808 (10mobrovac) [22:58:23] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [23:05:35] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [23:05:48] ^ we know yuvi is going to disable a certain tool and we will try to talk w/ the user [23:12:43] 6operations, 10RESTBase-Cassandra: replace default Cassandra superuser - https://phabricator.wikimedia.org/T113622#1971939 (10GWicke) p:5Triage>3Normal We are not using the default "admin" user for any ongoing operational tasks. Additionally, the credentials for the default admin user have been recently re... 
[23:20:09] (03PS1) 10Tim Landscheidt: Tools: Fix double file resource for jlocal [puppet] - 10https://gerrit.wikimedia.org/r/266934 [23:22:25] (03PS1) 10Tim Landscheidt: Tools: Fix argument quoting in jlocal [puppet] - 10https://gerrit.wikimedia.org/r/266935 [23:25:47] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1971985 (10mobrovac) [23:28:41] 6operations, 10Beta-Cluster-Infrastructure, 6Services: Move Node.JS services to Jessie and Node 4.2 - https://phabricator.wikimedia.org/T124989#1972000 (10mobrovac) [23:30:20] (03CR) 10Yuvipanda: "I don't think the comment is not pertinent (but hey, I'm biased, I wrote it)" [puppet] - 10https://gerrit.wikimedia.org/r/266935 (owner: 10Tim Landscheidt) [23:37:23] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [24.0] [23:39:46] greg-g: can I quickly deploy a SessionManager patch before the SWAT? [23:40:02] https://phabricator.wikimedia.org/T124971 [23:41:18] (03PS1) 10Mobrovac: RESTBase: Start using deployment-restbase02 for clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266945 (https://phabricator.wikimedia.org/T125003) [23:41:20] although given that CI takes 10 min per patch I probably wouldn't finish [23:41:25] after the SWAT, then [23:42:26] tgr: sure, or during [23:42:54] ah, it's full [23:44:23] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [23:56:56] (03PS1) 10Mattflaschen: Have Beta job queue settings shadow production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/266949