[00:00:05] RoanKattouw, ^d, Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150219T0000). [00:00:46] * legoktm just added another patch to swat [00:02:03] <^d> JohnLewis: You're first [00:02:16] ^d: joyful [00:02:27] (03CR) 10Chad: [C: 032] beta: don't rate limit office IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191489 (https://phabricator.wikimedia.org/T87841) (owner: 10John F. Lewis) [00:02:48] <^d> RoanKattouw: Ping for yer swat [00:04:23] James_F, ^ [00:04:28] <^d> legoktm: And started the merge for your core change [00:04:28] !log aaron Synchronized php-1.25wmf16/includes/db/LoadBalancer.php: 9dc01855bca9ba322f6cb15092b29c654d74cecc (duration: 00m 05s) [00:04:34] Logged the message, Master [00:04:42] woot [00:04:55] <^d> AaronS: We're in swat. Please just add to the list next time so we don't step on one another :) [00:08:21] (03Merged) 10jenkins-bot: beta: don't rate limit office IPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191489 (https://phabricator.wikimedia.org/T87841) (owner: 10John F. 
Lewis) [00:08:49] (03PS1) 10Chad: Start tracking ResourceLoaderImage debug logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191506 [00:09:48] !log demon Synchronized wmf-config/InitialiseSettings-labs.php: no-op, for completeness (duration: 00m 05s) [00:09:53] <^d> JohnLewis: You're live [00:09:53] Logged the message, Master [00:10:09] sweet [00:10:32] thankfully you didn't ask me to test as I don't feel like flying to SFO and using the WiFi :p [00:12:16] !log demon Synchronized php-1.25wmf17/includes/resourceloader/ResourceLoaderImage.php: fix up svg handling in RL (duration: 00m 07s) [00:12:18] Logged the message, Master [00:12:43] !log demon Synchronized php-1.25wmf18/includes/resourceloader/ResourceLoaderImage.php: fix up svg handling in RL (duration: 00m 07s) [00:12:45] Logged the message, Master [00:14:04] !log demon Synchronized php-1.25wmf17/includes/skins/Skin.php: (no message) (duration: 00m 05s) [00:14:07] Logged the message, Master [00:14:13] !log demon Synchronized php-1.25wmf17/includes/skins/SkinTemplate.php: (no message) (duration: 00m 05s) [00:14:14] <^d> legoktm: Your core stuff is live [00:14:15] Logged the message, Master [00:14:28] thanks, lgtm [00:15:10] <^d> Our biggest error in prod right now is OOM in specialpage :\ [00:15:20] which specialpage? [00:15:54] (03CR) 10Chad: [C: 032] Start tracking ResourceLoaderImage debug logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191506 (owner: 10Chad) [00:16:00] (03Merged) 10jenkins-bot: Start tracking ResourceLoaderImage debug logs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191506 (owner: 10Chad) [00:16:24] !log demon Synchronized wmf-config/InitialiseSettings.php: RL image debug logs (duration: 00m 07s) [00:16:27] Logged the message, Master [00:16:33] ^d: Here, sorry [00:16:42] ^d: I turned out to have a meeting during the SWAT [00:16:48] <^d> It happens :) [00:17:01] does anyone know how the job queue is set up for labswiki? 
[00:17:04] ^d: Need me to do a submodule update for you? [00:17:26] <^d> That'd be nice. Your extension change is nowai'd by Jenkins at the moment [00:18:53] Yeah because Jenkins is stupid [00:19:49] <^d> MatmaRex: We over-fixed our bug [00:20:03] <^d> Because we properly do the rsvg code branch now, we never hit the failure mode :p [00:20:08] <^d> So no log on what file is busted! [00:20:29] jamesofur: replied to the ticket [00:20:31] (03PS1) 10Dzahn: don't include restbase-roots in restbase.yaml [puppet] - 10https://gerrit.wikimedia.org/r/191508 (https://phabricator.wikimedia.org/T89366) [00:21:12] ^d: yeah, but shouldn't the rsvg branch also log? [00:21:24] <^d> Long as it returns false [00:21:26] since you put the log() somewhere up the call stack [00:21:32] or down [00:21:33] whatever [00:21:34] <^d> Yeah [00:22:51] legoktm: is GUP supposed to be making labswiki jobs? [00:23:00] 3Ops-Access-Requests, operations, RESTBase: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1048613 (10Dzahn) ``` Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Conflicting value for admin::groups found in role cassandra Warning: Not using cache on fa... [00:23:01] AaronS: um, no...... [00:24:25] ^d: https://gerrit.wikimedia.org/r/191509 [00:24:30] > var_dump(in_array('labswiki', GlobalUserPage::getEnabledWikis())); [00:24:30] bool(true) [00:24:32] ugh what [00:24:40] (03CR) 10GWicke: "@Dzahn, do you think adding the group in cassandra.yaml instead could work?" 
[puppet] - 10https://gerrit.wikimedia.org/r/191508 (https://phabricator.wikimedia.org/T89366) (owner: 10Dzahn) [00:26:30] !log demon Synchronized php-1.25wmf17/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.Target.js: (no message) (duration: 00m 06s) [00:26:34] Logged the message, Master [00:26:41] (03PS1) 10Legoktm: Return false from wmfCentralAuthWikiList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191510 [00:26:57] AaronS: ^ should fix [00:27:11] ^d: one more patch for swat ^ :) [00:27:32] ... [00:27:40] <^d> RoanKattouw: You're done [00:27:55] Cool, thanks [00:28:07] AaronS: the code does if ( wfRunHooks( 'GlobalUserPageWikis', array( &$list ) ) ) { $list = $wgLocalDatabases; } [00:28:33] yes, I see [00:29:04] I'll delete the jobs when chad deploys that [00:32:04] (03CR) 10Aaron Schulz: [C: 031] Return false from wmfCentralAuthWikiList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191510 (owner: 10Legoktm) [00:32:57] (03CR) 10Chad: [C: 032] Return false from wmfCentralAuthWikiList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191510 (owner: 10Legoktm) [00:33:05] (03Merged) 10jenkins-bot: Return false from wmfCentralAuthWikiList [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191510 (owner: 10Legoktm) [00:33:25] !log demon Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 06s) [00:33:55] Logged the message, Master [00:36:30] <^d> Ok, we're all done other than legoktm's last patch [00:37:12] and scap! [00:37:30] do you want me to do the scap so you're not stuck for another 30-40min? 
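The hook behaviour discussed above (a handler that fills the list but lets the hook run "succeed", after which the caller overwrites the list with every local database) can be modelled roughly as follows. This is a Python sketch of the logic in the quoted PHP line, not the extension's code; the handler names and wiki lists are made up for illustration.

```python
def run_hooks(handlers, wiki_list):
    """Run handlers in order; return False as soon as one handler returns False."""
    for handler in handlers:
        if handler(wiki_list) is False:
            return False
    return True


def get_enabled_wikis(handlers, local_databases):
    """Mirror of the quoted check: a 'successful' hook run means 'use the default'."""
    wiki_list = []
    if run_hooks(handlers, wiki_list):
        # No handler claimed the list, so fall back to every local database.
        wiki_list = list(local_databases)
    return wiki_list


# Illustrative data only (assumption, not the production configuration).
LOCAL_DATABASES = ["enwiki", "dewiki", "labswiki"]
SHARED_ACCOUNT_WIKIS = ["enwiki", "dewiki"]


def handler_returning_true(wiki_list):
    wiki_list.extend(SHARED_ACCOUNT_WIKIS)
    return True  # the caller then discards this list in favour of the default


def handler_returning_false(wiki_list):
    wiki_list.extend(SHARED_ACCOUNT_WIKIS)
    return False  # tells the caller the list is final


before_fix = get_enabled_wikis([handler_returning_true], LOCAL_DATABASES)
after_fix = get_enabled_wikis([handler_returning_false], LOCAL_DATABASES)
print(before_fix)
print(after_fix)
```

Under these assumptions, the handler that returns true still ends up with labswiki in the result, which matches the `bool(true)` seen above; returning false keeps the handler's own list.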
[00:37:46] (CR) Chad: [C: 2] Put GlobalUserPage in the normal extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/191501 (owner: Legoktm) [00:37:52] (Merged) jenkins-bot: Put GlobalUserPage in the normal extension-list [mediawiki-config] - https://gerrit.wikimedia.org/r/191501 (owner: Legoktm) [00:38:22] (CR) Dzahn: [C: 2] don't include restbase-roots in restbase.yaml [puppet] - https://gerrit.wikimedia.org/r/191508 (https://phabricator.wikimedia.org/T89366) (owner: Dzahn) [00:38:38] !log demon Started scap: global user page extension-list fix + l10n rebuild [00:38:44] Logged the message, Master [00:39:03] Scrum-of-Scrums, Services, operations, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1048675 (Dzahn) [00:39:05] Ops-Access-Requests, operations, RESTBase: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1048673 (Dzahn) Open>Resolved [00:39:12] !log Deleted labswiki redis jobs (labswiki uses the db queue) for GlobalUserPage and flushed the queue aggregator [00:39:14] Logged the message, Master [00:39:19] wikibugs: wut? [00:39:44] Dzahn closed this task as "Resolved" by committing .. eh no, i didn't [00:39:52] that's a bit too auto :) [00:39:59] <^d> "Fixes T1234" does that [00:40:16] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [00:40:17] <^d> Or "resolves" [00:40:23] <^d> Or some other black magic I'm not aware of [00:40:27] haha, that's fun [00:40:32] i said " This was supposed to resolve T89366." 
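The accidental close above is what a keyword scan over the commit message would produce. A minimal Python sketch follows, assuming a hypothetical keyword list (fix/resolve/close and their inflections); the real matching rules are not shown in this log and may differ.

```python
import re

# Hypothetical keyword set; the actual rules used by the task tracker are an assumption here.
CLOSING_REF = re.compile(
    r"\b(?:fix(?:es|ed)?|resolve[sd]?|close[sd]?)\s+(T\d+)\b",
    re.IGNORECASE,
)


def tasks_closed_by(message):
    """Return the task IDs that a closing keyword directly precedes in the message."""
    return CLOSING_REF.findall(message)


print(tasks_closed_by("This was supposed to resolve T89366."))
print(tasks_closed_by("Fixes T1234"))
print(tasks_closed_by("See T89366 for context"))
```

With a pattern like this, a sentence that merely mentions "resolve T89366" in passing is indistinguishable from an intentional close, which is the surprise described in the conversation.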
[00:40:40] on a partial revert [00:40:43] <^d> Hehehe [00:41:11] Ops-Access-Requests, operations, RESTBase: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1048677 (Dzahn) Resolved>Open [00:41:12] Scrum-of-Scrums, Services, operations, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1048679 (Dzahn) [00:41:22] dbperformance log is quiet again [00:42:22] 51.79% 1597.947 2 - SimpleCaptcha::findLinks [00:42:28] arrgg...so bad [00:43:26] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [00:43:28] TimStarling: does https://phabricator.wikimedia.org/T88661 sound like fine? [00:43:34] s/fine/fun [00:45:14] original links as in links in the previous version of the page? [00:45:20] (PS1) Dzahn: remove cassandra-test-roots [puppet] - https://gerrit.wikimedia.org/r/191512 (https://phabricator.wikimedia.org/T89366) [00:45:36] could just use the externallinks table for that [00:46:43] yeah, "original" as in pre-edit...which could probably just use the tables [00:47:14] there is a jumble messy code that in theory tries to use the parser cache [00:47:46] RECOVERY - puppet last run on restbase1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [00:52:09] <^d> legoktm: At scap-rebuild-cdbs now [00:52:14] :D [00:54:00] !log demon Finished scap: global user page extension-list fix + l10n rebuild (duration: 15m 21s) [00:54:03] Logged the message, Master [00:54:18] that was faster than I expected? [00:54:39] l10n was just done in the last scap [00:54:55] https://en.wikipedia.org/wiki/Special:Version looks good! 
[00:55:09] thanks ^d [00:55:33] <^d> bd808: Yeah, had to do it again [00:55:36] <^d> extension-list fix [00:55:49] *nod* but that's why it was fast [00:55:52] samll delta [00:55:59] *small [00:59:36] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [01:02:06] ^d: soo <<$db = wfGetLB( 'labswiki' )->getConnection( DB_SLAVE, array(), 'labswiki' );>> fails yet labswiki is in $wgLocalDatabases [01:02:28] why is it in that global? [01:03:37] <^d> i dunno [01:04:00] it really shouldn't be [01:04:19] (03CR) 10BryanDavis: "1.25wmf18 is live on group0 so this could go out as soon as it is approved by Ori and Paravoid" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [01:04:20] it breaks any foreach() loops on that var...if any exist [01:04:24] (03PS6) 10Ejegg: Ugly URLs to override mobile redirect for CentralNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182078 (owner: 10AndyRussG) [01:04:31] unless it's needed by $wgConf? [01:04:49] (03CR) 10Ejegg: [C: 032] Ugly URLs to override mobile redirect for CentralNotice [mediawiki-config] - 10https://gerrit.wikimedia.org/r/182078 (owner: 10AndyRussG) [01:04:55] 3Labs, Wikimedia-Labs-Infrastructure, operations: Make labs/private really private - https://phabricator.wikimedia.org/T89642#1048724 (10Andrew) First, to clarify: the labs-private repo, although poorly named, is just as (un) private as we want it. It will most likely be replaced by something using Hiera eventu... [01:05:27] 3Ops-Access-Requests, operations, RESTBase: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1048725 (10Dzahn) >>! In T89366#1047097, @faidon wrote: > On the puppet front, I wasn't talking about disabling puppet for some limited time but more of... `CRITICAL: Puppet last ran 6 da... 
[01:05:46] 3Wikimedia-Labs-wikitech-interface, operations: wikitech instances list is blank - https://phabricator.wikimedia.org/T89808#1048726 (10Andrew) I just re-ran the smw rebuild, so that might have fixed half of this. [01:08:44] 3Ops-Access-Requests, operations: Requesting deployment access for milimetric - https://phabricator.wikimedia.org/T88769#1048737 (10Milimetric) 5Resolved>3Open @akosiaris, I don't have access to see the event logging log files on vanadium yet. I'm re-opening this as that was a part of the request. [01:12:42] !log ejegg Synchronized wmf-config/CommonSettings.php: Use URLs without mobile redirects for CentralNotice (duration: 00m 07s) [01:12:47] Logged the message, Master [01:13:01] AFComputedVariable::compute 10.64.32.27 2013 Lost connection to MySQL server during query (10.64.32.27) SELECT DISTINCT rev_user_text FROM `revision` WHERE rev_page = '8972734' AND (rev_timestamp<'20150219011106') ORDER BY rev_timestamp DESC LIMIT 10 [01:13:05] that's odd query [01:22:01] (03PS1) 10Tim Landscheidt: Tools: Install at [puppet] - 10https://gerrit.wikimedia.org/r/191521 (https://phabricator.wikimedia.org/T72324) [01:24:56] heh, reminds me of the feature request I made against OrientDB [02:06:56] 3RESTBase, operations: restbase - some nodes missing systemd unit for service - https://phabricator.wikimedia.org/T89922#1048846 (10Dzahn) [02:09:46] RECOVERY - puppet last run on restbase1005 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [02:11:45] !log restbase1004/1005 systemctl daemon-reload to run systemd-sysv-generator to make it create missing unit for restbase and unbreak puppet running the service [02:11:52] Logged the message, Master [02:18:24] !log restbase1004 - starting restbase service, running puppet [02:18:27] RECOVERY - puppet last run on restbase1004 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [02:18:30] Logged the message, Master [02:19:11] 3RESTBase, operations: 
restbase - some nodes missing systemd unit for service - https://phabricator.wikimedia.org/T89922#1048893 (10Dzahn) Fixed it. Just needed some patience after running systemctl daemon-reload. It did in fact recreate the files and the file for restbase appeared. After that i could start the... [02:19:55] !log l10nupdate Synchronized php-1.25wmf17/cache/l10n: (no message) (duration: 00m 01s) [02:19:59] Logged the message, Master [02:20:06] 3RESTBase, operations: restbase - some nodes missing systemd unit for service - https://phabricator.wikimedia.org/T89922#1048895 (10Dzahn) a:3Dzahn [02:20:38] 3RESTBase, operations: restbase - some nodes missing systemd unit for service - https://phabricator.wikimedia.org/T89922#1048896 (10Dzahn) 5Open>3Resolved [02:21:05] !log LocalisationUpdate completed (1.25wmf17) at 2015-02-19 02:20:01+00:00 [02:21:08] Logged the message, Master [02:23:09] 3Ops-Access-Requests, operations, RESTBase: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1048898 (10Dzahn) I fixed the puppet runs on 1004 and 1005 and got the restbase service to start again. details on T89922. not sure though how to prevent it from needing manual command. [02:26:00] 3Labs, Wikimedia-Labs-Infrastructure, operations: Make labs/private really private - https://phabricator.wikimedia.org/T89642#1048901 (10Dzahn) @kartik usually how it works is that you ask ops to add the private thing into the (really private) ops/private repo (how we do it with passwords as well), and then you... 
[02:35:37] RECOVERY - HHVM rendering on mw1141 is OK: HTTP OK: HTTP/1.1 200 OK - 66271 bytes in 1.060 second response time [02:35:38] !log l10nupdate Synchronized php-1.25wmf18/cache/l10n: (no message) (duration: 00m 01s) [02:35:43] Logged the message, Master [02:36:08] RECOVERY - Apache HTTP on mw1141 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.078 second response time [02:36:45] !log LocalisationUpdate completed (1.25wmf18) at 2015-02-19 02:35:41+00:00 [02:36:48] Logged the message, Master [02:41:49] !log restarted hhvm on mw1141 (locked up, T89912?) [02:41:55] Logged the message, Master [02:42:08] RECOVERY - HHVM busy threads on mw1141 is OK: OK: Less than 30.00% above the threshold [57.6] [02:42:16] mutante, eh - TimStarling was investigating:P [02:42:39] not really [02:42:57] RECOVERY - HHVM queue size on mw1141 is OK: OK: Less than 30.00% above the threshold [10.0] [02:43:27] oh, ok, i just found T89912 after making a new ticket [02:47:34] (03CR) 10Dzahn: "Error: left operand of * is not a number at /opt/wmf/software/compare-puppet-catalogs/external/change/190689/puppet/modules/vm/manifests/m" [puppet] - 10https://gerrit.wikimedia.org/r/190689 (owner: 10Matanya) [02:48:50] (03CR) 10Dzahn: "limitation of compiler or error in change? http://puppet-compiler.wmflabs.org/586/change/190689/html/lvs1001.wikimedia.org.html" [puppet] - 10https://gerrit.wikimedia.org/r/190689 (owner: 10Matanya) [02:53:04] 3RESTBase, operations: restbase - some nodes missing systemd unit for service - https://phabricator.wikimedia.org/T89922#1048954 (10GWicke) Thanks a lot for investigating & fixing this, @dzahn! [04:02:16] PROBLEM - Cassandra database on cerium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (cassandra), command name java, args CassandraDaemon [04:03:17] RECOVERY - Cassandra database on cerium is OK: PROCS OK: 1 process with UID = 109 (cassandra), command name java, args CassandraDaemon [04:06:25] Cassandra weirdness? 
[04:06:53] yeah dunno, never seen this alert trigger before [04:07:12] might be the alert's fault [04:07:39] ps shows that the process was restarted 5 mins ago [04:07:45] ah, hm [04:08:26] don't have shell on those boxes yet, so dunno [04:08:51] generally if you want to learn about what cassandra is up to check /var/log/cassandra/system.log [04:08:59] very detailed [04:09:13] on cerium? 'last' shows you logged in yesterday. [04:09:24] oh, cerium [04:09:36] sorry, didn't look carefully [04:09:51] that might be oom actually [04:10:30] there isn't much buffer on those test nodes, they only have 16g [04:11:08] cassandra is configured to use close to 10g of those [04:11:58] on ganglia i see cached memory trending toward 0 leading up to the event [04:12:26] one of the restbase processes used close to 3g [04:12:32] normal is ~200m [04:12:59] it sometimes happens under heavy write load when the cassandra nodes are overloaded [04:13:25] I'm currently hammering the cluster with ~2k req/s while doing about 20 writes/s [04:13:33] all the writes on cerium [04:14:35] http://bit.ly/1znKPjp [04:15:26] what happened at 1:30? 
[04:15:27] reqs are currently underreported as txstatsd is overloaded & drops packets [04:16:00] jgage: the one process using a lot of memory I think [04:16:13] hmm [04:16:48] I should add a small heap limit check that restarts the worker cleanly when memory reaches some limit [04:17:35] the basic issue is that sometimes, when the cassandra nodes are overloaded, the connection to cassandra hangs [04:18:03] the driver doesn't have app-level heartbeats yet (datastax are working on it), so doesn't detect the situation well [04:18:14] which then causes write requests to pile up, which uses memory [04:18:28] that will be nice to have (heartbeats) [04:18:36] yup [04:18:46] I added tcp keep-alive support earlier [04:18:58] cool [04:22:29] https://datastax-oss.atlassian.net/browse/NODEJS-29 [04:23:53] !log restarting txstatsd on graphite1001 with --profile; will disable profiling in a few minutes. [04:23:56] Logged the message, Master [04:24:11] jgage: was implemented a week ago [04:25:24] yay [04:32:33] !log txstatsd on graphite1001: disabled profiling and returned service to normal state [04:32:36] Logged the message, Master [04:44:50] ori: the etsy statsd daemon indeed sanitizes on ingress by default [04:45:07] so the dropping only applies to txstatsd [04:47:17] gwicke: related: the aggregation function applied by graphite is configurable; the default is to average. [04:47:40] that makes sense [04:47:40] but: https://github.com/etsy/statsd/blob/master/docs/graphite.md [04:47:52] "n the case of the above example, what would happen if you flush from statsd any faster than every 10 seconds? in that case, multiple values for the same metric may reach Graphite at any given 10-second timespan, and only the last value would take hold and be persisted - so your data would immediately be partially lost." 
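The quoted warning about flushing faster than the storage interval can be illustrated with a toy model. This Python sketch is not Graphite's or statsd's code; the function names, the 10-second step, and the sample values are assumptions chosen to match the quoted example.

```python
def store_last_write_wins(points, step=10):
    """Keep one value per `step`-second bucket; later points overwrite earlier ones."""
    buckets = {}
    for timestamp, value in points:
        buckets[timestamp - timestamp % step] = value
    return buckets


def aggregate_then_store(points, step=10):
    """Sum all points that fall into the same bucket before storing them."""
    buckets = {}
    for timestamp, value in points:
        key = timestamp - timestamp % step
        buckets[key] = buckets.get(key, 0) + value
    return buckets


# A counter flushed every 5 s into a store that keeps one value per 10 s.
flushes = [(0, 4), (5, 6), (10, 3), (15, 7)]

lossy = store_last_write_wins(flushes)
summed = aggregate_then_store(flushes)
print(lossy)
print(summed)
```

In the first function half of the counts disappear, which is the partial data loss the quote describes; the second shows why matching the flush interval to the storage interval, or aggregating before the write, avoids it.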
[04:50:57] if not configured correctly, yes [04:52:02] my understanding is that with the config they provide there several flushes per graphite sampling interval should work correctly [04:53:25] I'm surprised that statsd also seems to be a simgle-process app [04:54:27] Request: POST http://cs.wikinews.org/w/index.php?title=11._z%C3%A1%C5%99%C3%AD_2008&action=submit, from 91.198.174.65 via amssq38 amssq38 ([10.20.0.138]:3128), Varnish XID 1710177164 [04:54:31] Forwarded for: 88.101.17.127, 91.198.174.65, 91.198.174.65 [04:54:33] Error: 503, Service Unavailable at Thu, 19 Feb 2015 04:53:33 GMT [05:07:32] Danny_B: I see one narrow spike in https://gdash.wikimedia.org/dashboards/reqerror/ [05:08:07] nothing dramatic though [05:18:30] gwicke: i'm just reporting, up to you guys to decide what to do... ;-) [05:37:48] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail [05:49:43] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Feb 19 05:48:40 UTC 2015 (duration 48m 39s) [05:49:51] Logged the message, Master [05:55:09] !log etherpad-lite restart to pick up m1-master CNAME [05:55:13] Logged the message, Master [05:57:09] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:05:32] (03PS1) 10KartikMistry: comment: Fix instance name of $wmgParsoidURL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191540 [06:05:50] Krenair: ^^ [06:05:57] PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Puppet has 1 failures [06:06:49] (03Abandoned) 10KartikMistry: WIP: Give apertium-admins access to kartik [puppet] - 10https://gerrit.wikimedia.org/r/189915 (owner: 10KartikMistry) [06:07:30] !log ran RT update-rt-siteconfig + apache restart to pick up m1-master CNAME [06:07:33] Logged the message, Master [06:11:36] !log bacula-director restart to pick up m1-master CNAME [06:11:39] Logged the message, Master [06:11:53] akosiaris: ^ [06:14:12] (03PS1) 10KartikMistry: 
Beta: CX: Enable en-uz language pair [puppet] - 10https://gerrit.wikimedia.org/r/191542 (https://phabricator.wikimedia.org/T88037) [06:16:02] (03PS2) 10KartikMistry: Beta: CX: Enable en-uz and ru-uz language pairs [puppet] - 10https://gerrit.wikimedia.org/r/191542 (https://phabricator.wikimedia.org/T88037) [06:23:07] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:28:27] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Puppet has 3 failures [06:28:47] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:52] (03PS1) 10KartikMistry: Beta: CX: Add Minangkabau (min) in target language [puppet] - 10https://gerrit.wikimedia.org/r/191544 (https://phabricator.wikimedia.org/T87163) [06:29:37] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:47] PROBLEM - puppet last run on db1046 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:47] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:47:07] RECOVERY - puppet last run on db1046 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:47:17] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:47:27] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:48:57] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:49:07] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:50:07] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [07:24:28] (03Abandoned) 10Giuseppe Lavagetto: 
Puppetize a few symlinks that are hotfixed on silver [puppet] - 10https://gerrit.wikimedia.org/r/189774 (owner: 10Andrew Bogott) [07:25:23] (03PS2) 10Giuseppe Lavagetto: puppetmaster: rename reimage.sh to wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/190226 (owner: 10Filippo Giunchedi) [07:25:37] (03CR) 10Giuseppe Lavagetto: [C: 032] puppetmaster: rename reimage.sh to wmf-reimage [puppet] - 10https://gerrit.wikimedia.org/r/190226 (owner: 10Filippo Giunchedi) [07:28:07] (03PS2) 10Giuseppe Lavagetto: wmf-reimage: ask for ipmi password if possible [puppet] - 10https://gerrit.wikimedia.org/r/190227 (owner: 10Filippo Giunchedi) [07:29:10] (03CR) 10Giuseppe Lavagetto: [C: 032] wmf-reimage: ask for ipmi password if possible [puppet] - 10https://gerrit.wikimedia.org/r/190227 (owner: 10Filippo Giunchedi) [07:50:16] _joe_: can you please have a look at a patch ? https://gerrit.wikimedia.org/r/#/c/190689/2 [07:50:43] <_joe_> matanya: not really :P [07:50:50] ok :) [07:50:52] <_joe_> not now I mean, it's pretty long [07:51:09] <_joe_> I know it's just linting, but it's on lvs [07:51:57] _joe_: i was refering to dzahn's comment [07:53:10] <_joe_> matanya: disregard the compiler for the moment [07:53:24] <_joe_> I'm going to fix it once and for all soon-ish [07:53:38] (03CR) 10Giuseppe Lavagetto: [C: 04-1] lvs: init.pp lint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/190689 (owner: 10Matanya) [07:54:40] thanks for your help ! [08:01:06] PROBLEM - HTTP on uranium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:04:07] RECOVERY - HTTP on uranium is OK: HTTP OK: HTTP/1.1 302 Found - 426 bytes in 0.005 second response time [08:24:02] 3Datasets-General-or-Unknown, operations: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1049171 (10Nemo_bis) > Nemo, what sort of variance are you seeing on file downloads? When jhs reported the Wikidata XML dump was very slow, I tried downloading that and other XML d... 
[08:26:38] 3operations: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1049178 (10Joe) I still need the mac addresses for these hosts; I'll pull out the one for mc2001 anyways as I need to test installation at this point. [08:35:09] 3Datasets-General-or-Unknown, operations: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1049205 (10ArielGlenn) Forgot to give an ETA. But the rsync just finished. We're back to regular rsyncs out of cron now. Waiting for bond of the ethernet interfaces on the host,... [08:40:41] 3Ops-Access-Requests, operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1049211 (10ArielGlenn) The problem for me is that we had some folks on the general ops queues in RT and suddenly they are being told now, "No... [08:44:44] (03PS1) 10Alexandros Kosiaris: Reduce backups retention period to 60 days [puppet] - 10https://gerrit.wikimedia.org/r/191549 [08:48:51] greetings [08:52:27] <_joe_> hey godog [08:52:36] <_joe_> I merged two changes of yours [08:54:57] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: CX: Enable en-uz and ru-uz language pairs [puppet] - 10https://gerrit.wikimedia.org/r/191542 (https://phabricator.wikimedia.org/T88037) (owner: 10KartikMistry) [08:55:54] (03PS2) 10Alexandros Kosiaris: Beta: CX: Add Minangkabau (min) in target language [puppet] - 10https://gerrit.wikimedia.org/r/191544 (https://phabricator.wikimedia.org/T87163) (owner: 10KartikMistry) [08:56:11] _joe_: cool, thanks! the wmf-reimage ones I suppose? [08:56:42] (03CR) 10KartikMistry: [C: 031] "Go ahead!" 
[puppet] - 10https://gerrit.wikimedia.org/r/191544 (https://phabricator.wikimedia.org/T87163) (owner: 10KartikMistry) [08:56:55] <_joe_> godog: yep [08:59:23] nice [09:02:21] (03CR) 10Alexandros Kosiaris: [C: 032] Reduce backups retention period to 60 days [puppet] - 10https://gerrit.wikimedia.org/r/191549 (owner: 10Alexandros Kosiaris) [09:07:34] (03PS2) 10Filippo Giunchedi: Add a statsd_port parameter to the restbase class [puppet] - 10https://gerrit.wikimedia.org/r/191350 (owner: 10GWicke) [09:07:52] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Add a statsd_port parameter to the restbase class [puppet] - 10https://gerrit.wikimedia.org/r/191350 (owner: 10GWicke) [09:08:03] (03PS2) 10Filippo Giunchedi: es-tool: Also show unassigned shards during restart [puppet] - 10https://gerrit.wikimedia.org/r/191356 (owner: 10Chad) [09:08:10] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] es-tool: Also show unassigned shards during restart [puppet] - 10https://gerrit.wikimedia.org/r/191356 (owner: 10Chad) [09:09:31] (03CR) 10Alexandros Kosiaris: [C: 032] "How did this even get merged in the first place ? Thanks for the fix matanya!" [puppet] - 10https://gerrit.wikimedia.org/r/191386 (owner: 10Matanya) [09:20:27] akosiaris: Coren has very strong feeling about the 4 digit thingy [09:20:56] had a long disscussion with him around the matter in -labs [09:21:08] matanya: as in he wants 4 digits ? or not ? [09:21:15] not [09:21:21] and that is wrong [09:21:28] 3operations, Wikimedia-Git-or-Gerrit: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1049252 (10Prtksxna) > a local repo (as in the owner's computer) would work fine as well @MSyed and I need to collaborate on this so a local copy won't suffice.... [09:21:47] btw 5 is completely wrong btw [09:21:59] s/btw// [09:22:25] so 5 is absolutely wrong, 3 is an omission, 4 is the correct way to do it. 
Thanks for fixing it [09:23:45] akosiaris: if you have 3 minutes: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/20150214.txt and http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-labs/20150215.txt [09:30:02] (03CR) 10Faidon Liambotis: "No objections in concept -- can't comment on the MW side of things." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [09:31:25] (03CR) 10Filippo Giunchedi: [C: 031] "I'm +1 on this and let T89366 move forward with restbase-roots" [puppet] - 10https://gerrit.wikimedia.org/r/191512 (https://phabricator.wikimedia.org/T89366) (owner: 10Dzahn) [09:36:09] (03PS5) 10KartikMistry: Use compact registry format for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/190990 [09:36:13] 3operations, Wikimedia-Git-or-Gerrit: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1049261 (10akosiaris) >>! In T89640#1049252, @Prtksxna wrote: >> a local repo (as in the owner's computer) would work fine as well > > @MSyed and I need to collab... [09:43:08] PROBLEM - puppet last run on cp3013 is CRITICAL: CRITICAL: puppet fail [09:45:42] (03CR) 10Alexandros Kosiaris: [C: 032] "LGTM, but not merging until the blocking cxserver change is deployed" [puppet] - 10https://gerrit.wikimedia.org/r/190990 (owner: 10KartikMistry) [09:50:23] (03Abandoned) 10Nemo bis: Graph User::pingLimiter() actions in gdash [puppet] - 10https://gerrit.wikimedia.org/r/166511 (https://bugzilla.wikimedia.org/65478) (owner: 10Nemo bis) [09:55:41] akosiaris: how +2 won't merge? Is it different in Puppet repo than usual repo? (just curious question) [09:56:04] I actually have to hit a submit button [09:56:09] there is no automatic merging [09:57:04] No jenkins magic there then? 
[09:58:31] !log Deleted more bogus GlobalUserPage purge job queues [09:58:35] Logged the message, Master [10:02:47] RECOVERY - puppet last run on cp3013 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [10:07:42] (PS1) ArielGlenn: certs: remove leading blank from dumps.wm.o pem file [puppet] - https://gerrit.wikimedia.org/r/191560 [10:10:40] (CR) ArielGlenn: [C: 2] certs: remove leading blank from dumps.wm.o pem file [puppet] - https://gerrit.wikimedia.org/r/191560 (owner: ArielGlenn) [10:17:38] (PS1) Giuseppe Lavagetto: memcached: install one host in codfw for testing [puppet] - https://gerrit.wikimedia.org/r/191562 [10:21:30] operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1049354 (ArielGlenn) [10:21:33] operations: replace dumps.wikimedia.org sha1 cert with sha256 cert - https://phabricator.wikimedia.org/T88497#1049352 (ArielGlenn) Open>Resolved Fixed, we had the old 'leading blank before ---BEGIN line in the cert' bug. Deployed and serving. Closing ticket. 
[10:22:31] (03PS2) 10Giuseppe Lavagetto: memcached: install one host in codfw for testing [puppet] - 10https://gerrit.wikimedia.org/r/191562 [10:22:45] (03CR) 10Giuseppe Lavagetto: [C: 032] memcached: install one host in codfw for testing [puppet] - 10https://gerrit.wikimedia.org/r/191562 (owner: 10Giuseppe Lavagetto) [10:23:38] <_joe_> I should count the time I spend waiting for jenkins [10:31:27] PROBLEM - etherpad.wikimedia.org HTTPS on zirconium is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - string titleEtherpad not found on https://etherpad.wikimedia.org:443//p/Etherpad - 605 bytes in 0.055 second response time [10:33:37] RECOVERY - etherpad.wikimedia.org HTTPS on zirconium is OK: HTTP OK: HTTP/1.1 200 OK - 28844 bytes in 0.075 second response time [10:39:16] PROBLEM - Apache HTTP on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.018 second response time [10:39:37] PROBLEM - HHVM processes on mw1114 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [10:40:08] PROBLEM - HHVM rendering on mw1114 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.030 second response time [10:45:10] <_joe_> mh this is strange [10:46:47] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.136 second response time [10:47:16] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 25 processes with command name hhvm [10:47:37] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 66345 bytes in 0.977 second response time [10:59:08] 3operations, Wikimedia-Git-or-Gerrit: TransparencyReport repository master in Gerrit silently made private - https://phabricator.wikimedia.org/T89640#1049408 (10Prtksxna) >>! In T89640#1049261, @akosiaris wrote: > A (different and newly created) private gerrit repo then. In fact a clone of the current one. Then... 
[11:09:03] (03Draft1) 10Filippo Giunchedi: Configuring git-fat to work with Archiva [software/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/191573 [11:09:05] (03Draft1) 10Filippo Giunchedi: add README.md [software/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/191574 [11:09:07] (03Draft1) 10Filippo Giunchedi: add metrics-ganglia and metrics-graphite via git-fat [software/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/191575 [11:24:25] (03CR) 10Alexandros Kosiaris: [C: 04-2] "I have talked with Filippo on IRC and discussed this. Putting binary .jar files straight in a .deb file is not a very good approach on thi" [debs/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/188385 (https://phabricator.wikimedia.org/T78514) (owner: 10Filippo Giunchedi) [11:26:31] (03CR) 10Alexandros Kosiaris: [C: 031] Configuring git-fat to work with Archiva [software/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/191573 (owner: 10Filippo Giunchedi) [11:27:02] (03CR) 10Alexandros Kosiaris: [C: 032] add README.md [software/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/191574 (owner: 10Filippo Giunchedi) [11:27:25] (03CR) 10Alexandros Kosiaris: [C: 031] add metrics-ganglia and metrics-graphite via git-fat [software/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/191575 (owner: 10Filippo Giunchedi) [11:28:25] 3Ops-Access-Requests, operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1049511 (10Aklapper) > So I'd like to see how we can _at least_ preserve folks' access from RT Technically, past: I added folks which have/ha... 
[11:32:29] (03PS4) 10Giuseppe Lavagetto: cassandra: set rack/dc/cluster name [puppet] - 10https://gerrit.wikimedia.org/r/191339 (https://phabricator.wikimedia.org/T89657) (owner: 10Filippo Giunchedi) [11:33:36] <_joe_> godog: ^^ [11:35:09] (03CR) 10Giuseppe Lavagetto: [C: 031] "The alternative would be including both groups on all machines, but cassandra-test-roots was created specifically for testing and should b" [puppet] - 10https://gerrit.wikimedia.org/r/191512 (https://phabricator.wikimedia.org/T89366) (owner: 10Dzahn) [11:35:33] _joe_: heh I commented against using ${::site} in PS2 [11:35:44] <_joe_> godog: why? [11:36:05] "I'm a little wary of using variables for things that can potentially affect data distribution TBH [11:36:08] " [11:36:19] <_joe_> well, that is a "fact" if you want [11:36:23] <_joe_> not a real variable [11:36:36] <_joe_> and ok, it is a variable, but it's fixed at top-scope [11:36:48] <_joe_> it will be a fact in a not-too-distant future btw [11:37:07] s/variable/something that changes without version control/ [11:37:23] potentially, but you get what I mean [11:37:45] <_joe_> godog: well, it won't [11:39:00] <_joe_> godog: the alternative is using 'eqiad' there and let role/codfw/restbase.yaml do the overriding later [11:39:10] <_joe_> so, your patch, you decide [11:39:19] I stand by what I said :) [11:40:30] <_joe_> (my point is that sooner than later we'll use the $::environment variable instead of $::site, and $::environment is defined in puppet in a very clear way) [11:40:59] <_joe_> godog: yeah, np, just put that in role/common/restbase.yaml and not in the regexes, like I showed [11:42:10] I see what you mean though, by using role/$::site/restbase.yaml in practice that doesn't change things [11:42:24] "things" being my concern [11:43:53] <_joe_> godog: we do base all our config on that, basically [11:44:05] <_joe_> $::site is something we set in realm.pp [11:44:20] <_joe_> if we start thinking of it as non-trustable, we should 
rewrite all of our puppet code [11:44:25] <_joe_> literally, all [11:45:22] <_joe_> and no, it won't change without code review [11:45:35] <_joe_> while the hostname of a server may be changed without that [11:46:04] <_joe_> (using regexes you are using the hostname as a variable to choose on) [11:47:44] <_joe_> I really don't see your point, sorry. [11:52:54] perhaps it doesn't apply with $::site but with other variables/fact it would [11:53:06] but yeah given our $::site usage and the fact that in role/$::site/ is tied to that effectively it doesn't change much [11:53:21] <_joe_> that was my point all along :) [11:55:04] (03CR) 10Filippo Giunchedi: [C: 031] cassandra: set rack/dc/cluster name [puppet] - 10https://gerrit.wikimedia.org/r/191339 (https://phabricator.wikimedia.org/T89657) (owner: 10Filippo Giunchedi) [11:55:07] alright, +1 [11:58:25] _joe_: good to go? [12:00:14] <_joe_> yep [12:06:19] etherpad.wikimedia.org just went down for couple of seconds [12:09:43] Just got an EEXIST, symlink '../esprima/bin/esparse.js' on jenkins for npm install [12:10:15] 3Datasets-General-or-Unknown, operations: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1049580 (10ArielGlenn) Nemo, that's pretty weird and I can't think of a good reason we'd see that behavior now and not in the past, as it's always been the case that when certain fi... 
[12:13:55] (03PS2) 10Alexandros Kosiaris: Use network::constants to populate url_downloader ACLs [puppet] - 10https://gerrit.wikimedia.org/r/191385 [12:20:33] (03PS3) 10Alexandros Kosiaris: Use network::constants to populate url_downloader ACLs [puppet] - 10https://gerrit.wikimedia.org/r/191385 [12:20:48] (03CR) 10Alexandros Kosiaris: [C: 032] Beta: CX: Add Minangkabau (min) in target language [puppet] - 10https://gerrit.wikimedia.org/r/191544 (https://phabricator.wikimedia.org/T87163) (owner: 10KartikMistry) [12:26:17] PROBLEM - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1472 bytes in 0.380 second response time [12:26:57] (03PS4) 10Alexandros Kosiaris: Use network::constants to populate url_downloader ACLs [puppet] - 10https://gerrit.wikimedia.org/r/191385 [12:32:07] (03PS5) 10Alexandros Kosiaris: Use network::constants to populate url_downloader ACLs [puppet] - 10https://gerrit.wikimedia.org/r/191385 [12:33:16] (03CR) 10Alexandros Kosiaris: [C: 04-1] "This needs a bit more thinking as the class is instantiated from site.pp with an argument that is used as well for setting up the url-down" [puppet] - 10https://gerrit.wikimedia.org/r/191379 (owner: 10Alexandros Kosiaris) [12:37:27] 3operations: Upload python-gear 0.5.5-1 to Debian project - https://phabricator.wikimedia.org/T89952#1049604 (10hashar) 3NEW a:3akosiaris [12:37:56] 3operations: Upload python-gear 0.5.5-1 to Debian project - https://phabricator.wikimedia.org/T89952#1049604 (10hashar) [12:38:25] do Debian packaging stuff have any specific tag in Phabricator or #operations is enough? [12:39:23] (03PS6) 10Alexandros Kosiaris: Use network::constants to populate url_downloader ACLs [puppet] - 10https://gerrit.wikimedia.org/r/191385 [12:40:06] hashar: there has been a discussion on this without any actual result. 
I'd say put in operations for now [12:40:11] and quite possibly ever [12:40:16] did just that :] [12:59:02] akosiaris: i need your help a sec please. what is the logic in modules/base/manifests/init.pp line 60 ? [12:59:19] that class looks weird to me [13:02:48] PROBLEM - Apache HTTP on mw1027 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50925 bytes in 0.057 second response time [13:03:27] PROBLEM - HHVM rendering on mw1027 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50944 bytes in 0.165 second response time [13:13:56] matanya: to include the class with the corresponding parameters depending on whether the realm is labs or production [13:14:26] so that labs machines have virt100 as their puppetmaster and everything else has puppet(.site.wmnet implied) [13:14:36] wouldn't it make more sense not to use two cases? [13:14:47] a simple if should do, i would guess [13:15:25] or a case outside the class declaration [13:15:56] it is kind of weird due to the selector being in the class declaration [13:16:00] that can be fixed [13:16:09] otherwise the class is fine [13:16:17] ok, i'll try to revise that logic [13:16:37] the logic doesn't need revision, it's more of a lint change [13:16:47] my logic is lint :) [13:16:50] ahaha [13:16:51] ok [13:17:12] as for the 503 fix we spoke about yesterday, where should that be fixed ? 
[13:17:56] that is a good question [13:18:10] it is emited by varnish and I think there is already a ticket for that [13:26:18] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1445 bytes in 0.186 second response time [13:27:16] (03PS1) 10Matanya: base: move selector outside resource block [puppet] - 10https://gerrit.wikimedia.org/r/191589 [13:28:01] (03CR) 10jenkins-bot: [V: 04-1] base: move selector outside resource block [puppet] - 10https://gerrit.wikimedia.org/r/191589 (owner: 10Matanya) [13:31:37] PROBLEM - HTTP on uranium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:34:37] RECOVERY - HTTP on uranium is OK: HTTP OK: HTTP/1.1 302 Found - 426 bytes in 0.017 second response time [13:34:48] oh, according to docs, i can't do that [13:42:23] (03PS2) 10Matanya: base: move selector outside resource block [puppet] - 10https://gerrit.wikimedia.org/r/191589 [14:03:56] (03PS5) 10Filippo Giunchedi: cassandra: set rack/dc/cluster name [puppet] - 10https://gerrit.wikimedia.org/r/191339 (https://phabricator.wikimedia.org/T89657) [14:04:03] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: set rack/dc/cluster name [puppet] - 10https://gerrit.wikimedia.org/r/191339 (https://phabricator.wikimedia.org/T89657) (owner: 10Filippo Giunchedi) [14:07:02] (03PS2) 10Filippo Giunchedi: remove cassandra-test-roots [puppet] - 10https://gerrit.wikimedia.org/r/191512 (https://phabricator.wikimedia.org/T89366) (owner: 10Dzahn) [14:07:13] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] remove cassandra-test-roots [puppet] - 10https://gerrit.wikimedia.org/r/191512 (https://phabricator.wikimedia.org/T89366) (owner: 10Dzahn) [14:09:14] (03PS1) 10Filippo Giunchedi: use restbase-roots with role restbase [puppet] - 10https://gerrit.wikimedia.org/r/191591 (https://phabricator.wikimedia.org/T89366) [14:10:18] PROBLEM - puppet last run on restbase1004 is CRITICAL: CRITICAL: puppet fail [14:11:27] PROBLEM - puppet last 
run on restbase1005 is CRITICAL: CRITICAL: puppet fail [14:11:33] sigh [14:17:48] PROBLEM - puppet last run on restbase1003 is CRITICAL: CRITICAL: puppet fail [14:18:13] mhh Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Error from DataBinding 'hiera' while looking up 'restbase::logging_level': undefined method `empty?' for nil:NilClass on node restbase1004.eqiad.wmnet [14:21:08] PROBLEM - puppet last run on praseodymium is CRITICAL: CRITICAL: puppet fail [14:24:37] PROBLEM - puppet last run on cerium is CRITICAL: CRITICAL: puppet fail [14:25:36] PROBLEM - puppet last run on xenon is CRITICAL: CRITICAL: puppet fail [14:25:44] trying to figure out what that error means [14:29:33] 3ops-codfw, operations: take a look at fdb2001 (in fundraising rack) and see whether it actually has a bad hdd - https://phabricator.wikimedia.org/T89407#1049720 (10Jgreen) 5Open>3Resolved OK: checkHPSA [P420i/slot0: OK, log_1: 3.3TB,RAID1+0 OK] Looks good now, thanks! [14:31:03] akosiaris: I still don't agree with you. It's meant to be octal, and without a leading zero the base is ambiguous. :-) [14:35:17] PROBLEM - Apache HTTP on mw1032 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50925 bytes in 0.053 second response time [14:36:07] PROBLEM - HHVM rendering on mw1032 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50944 bytes in 0.141 second response time [14:54:05] (03PS1) 10Filippo Giunchedi: Revert "remove cassandra-test-roots" [puppet] - 10https://gerrit.wikimedia.org/r/191607 [14:54:20] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "remove cassandra-test-roots" [puppet] - 10https://gerrit.wikimedia.org/r/191607 (owner: 10Filippo Giunchedi) [14:55:47] RECOVERY - puppet last run on restbase1003 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [14:57:47] <_joe_> godog: why the revert? 
[14:58:28] I just realized empty files make hiera choke (before I reverted) [14:58:50] _joe_ lmk when/if you wanna do the rest today [14:59:01] <_joe_> cmjohnson1: hey! [14:59:07] RECOVERY - puppet last run on praseodymium is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [14:59:22] <_joe_> I'd say let's start now [14:59:32] sounds good to me [14:59:51] mc1007 is next on the list [15:01:20] (03PS1) 10Filippo Giunchedi: remove cassandra-test-roots group from cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/191617 (https://phabricator.wikimedia.org/T89366) [15:01:38] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] remove cassandra-test-roots group from cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/191617 (https://phabricator.wikimedia.org/T89366) (owner: 10Filippo Giunchedi) [15:02:38] RECOVERY - puppet last run on cerium is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [15:03:46] RECOVERY - puppet last run on xenon is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [15:05:27] (03PS1) 10Filippo Giunchedi: restbase: grant access to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/191618 (https://phabricator.wikimedia.org/T89366) [15:05:37] PROBLEM - HHVM rendering on mw1087 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.011 second response time [15:05:47] PROBLEM - Apache HTTP on mw1087 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.017 second response time [15:05:52] (03Abandoned) 10Filippo Giunchedi: use restbase-roots with role restbase [puppet] - 10https://gerrit.wikimedia.org/r/191591 (https://phabricator.wikimedia.org/T89366) (owner: 10Filippo Giunchedi) [15:06:19] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: grant access to restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/191618 (https://phabricator.wikimedia.org/T89366) (owner: 10Filippo Giunchedi) [15:08:23] 
(03PS2) 10Giuseppe Lavagetto: move mc1007 to a new rack/row [dns] - 10https://gerrit.wikimedia.org/r/191274 [15:08:48] RECOVERY - puppet last run on restbase1004 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [15:09:56] RECOVERY - puppet last run on restbase1005 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:10:51] 3operations: empty hiera yaml file makes lookup fail - https://phabricator.wikimedia.org/T89957#1049780 (10fgiunchedi) 3NEW [15:10:58] <_joe_> !log powering down mc1007,mc1008 [15:11:02] _joe_: this bug ^ [15:11:06] Logged the message, Master [15:11:07] <_joe_> godog: thanks I was about to ask that [15:11:13] <_joe_> because that's not supposed to happen [15:11:22] <_joe_> I inherited the bug from puppetlabs :P [15:13:47] PROBLEM - Apache HTTP on mw1103 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.011 second response time [15:13:58] PROBLEM - HHVM rendering on mw1103 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.012 second response time [15:15:56] PROBLEM - Apache HTTP on mw1078 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.020 second response time [15:16:07] PROBLEM - HHVM rendering on mw1078 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.008 second response time [15:16:08] RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 67154 bytes in 0.410 second response time [15:17:03] <_joe_> !log restarting hhvm on mw1027, TC full [15:17:06] Logged the message, Master [15:17:46] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.148 second response time [15:19:16] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [15:19:27] RECOVERY - HHVM rendering on mw1103 is OK: HTTP OK: HTTP/1.1 200 OK - 67153 bytes in 0.149 
second response time [15:20:17] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.090 second response time [15:20:37] RECOVERY - HHVM rendering on mw1078 is OK: HTTP OK: HTTP/1.1 200 OK - 67154 bytes in 0.224 second response time [15:21:48] <_joe_> !log restarting hhvm on mw1103, mw1078,mw1032 - TC full as well. [15:21:53] Logged the message, Master [15:21:57] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 2.469 second response time [15:22:47] RECOVERY - HHVM rendering on mw1032 is OK: HTTP OK: HTTP/1.1 200 OK - 67167 bytes in 0.143 second response time [15:23:58] RECOVERY - HHVM rendering on mw1087 is OK: HTTP OK: HTTP/1.1 200 OK - 67167 bytes in 0.257 second response time [15:24:06] RECOVERY - Apache HTTP on mw1087 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [15:24:44] (03CR) 10Giuseppe Lavagetto: [C: 032] move mc1007 to a new rack/row [dns] - 10https://gerrit.wikimedia.org/r/191274 (owner: 10Giuseppe Lavagetto) [15:25:06] (03PS2) 10Giuseppe Lavagetto: move mc1008 to a new rack/row [dns] - 10https://gerrit.wikimedia.org/r/191275 [15:26:22] (03CR) 10Giuseppe Lavagetto: [C: 032] move mc1008 to a new rack/row [dns] - 10https://gerrit.wikimedia.org/r/191275 (owner: 10Giuseppe Lavagetto) [15:31:23] (03PS1) 10Giuseppe Lavagetto: hhvm: raise the amount of cold TC cache available [puppet] - 10https://gerrit.wikimedia.org/r/191620 [15:31:50] <^d> _joe_: Will that help with what I'm talking about in #-core? [15:32:08] <^d> (T89958) [15:32:25] <^d> Yes, looks like it [15:32:50] <_joe_> ^d: wait, what wil help with what? 
[15:32:57] <_joe_> :P [15:33:11] <^d> raising cold TC cache [15:33:39] <_joe_> ^d: for sure [15:34:36] 3MediaWiki-Core-Team, operations: Unexpected N4HPHP13DataBlockFullE - https://phabricator.wikimedia.org/T89958#1049847 (10Chad) p:5Triage>3Normal [15:34:55] 3MediaWiki-Core-Team, operations: Unexpected N4HPHP13DataBlockFullE - https://phabricator.wikimedia.org/T89958#1049853 (10Joe) From my analysis, we have exhausted our cold cache on a few appservers. I have proposed https://gerrit.wikimedia.org/r/191620 as a solution. [15:35:55] (03CR) 10Chad: [C: 031] hhvm: raise the amount of cold TC cache available [puppet] - 10https://gerrit.wikimedia.org/r/191620 (owner: 10Giuseppe Lavagetto) [15:35:59] ottomata: around in next 1 hour? [15:36:14] 3operations: removing admin::groups from hiera doesn't revoke permissions - https://phabricator.wikimedia.org/T89961#1049862 (10fgiunchedi) 3NEW [15:36:22] ottomata: I'm going to deploy cxserver update, git deploy :) [15:36:41] or whoever with git deploy experiences is fine. [15:37:25] (03PS3) 10Giuseppe Lavagetto: nutcracker: move and label mc1007 [puppet] - 10https://gerrit.wikimedia.org/r/191292 [15:38:32] 3Labs, Wikimedia-Labs-Infrastructure, operations: Make labs/private really private - https://phabricator.wikimedia.org/T89642#1049875 (10BBlack) I suspect the issue here is that there's a 3rd class of data privacy in play: API keys and such that aren't as private as our production-private stuff, but which we'd r... 
[15:38:53] (03Abandoned) 10Filippo Giunchedi: initial debian/ directory [debs/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/188384 (https://phabricator.wikimedia.org/T78514) (owner: 10Filippo Giunchedi) [15:39:10] (03CR) 10Alexandros Kosiaris: [C: 04-1] base: move selector outside resource block (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/191589 (owner: 10Matanya) [15:39:12] (03Abandoned) 10Filippo Giunchedi: import LICENSE and initial jar files [debs/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/188385 (https://phabricator.wikimedia.org/T78514) (owner: 10Filippo Giunchedi) [15:39:19] (03Abandoned) 10Filippo Giunchedi: include gitreview [debs/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/188386 (https://phabricator.wikimedia.org/T78514) (owner: 10Filippo Giunchedi) [15:40:34] (03PS3) 10Matanya: base: move selector outside resource block [puppet] - 10https://gerrit.wikimedia.org/r/191589 [15:41:01] (03CR) 10Giuseppe Lavagetto: [C: 032] nutcracker: move and label mc1007 [puppet] - 10https://gerrit.wikimedia.org/r/191292 (owner: 10Giuseppe Lavagetto) [15:41:16] (03PS3) 10Giuseppe Lavagetto: nutcracker: move and label mc1008 [puppet] - 10https://gerrit.wikimedia.org/r/191293 [15:41:45] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] nutcracker: move and label mc1008 [puppet] - 10https://gerrit.wikimedia.org/r/191293 (owner: 10Giuseppe Lavagetto) [15:42:39] (03PS2) 10Giuseppe Lavagetto: move mc1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191309 [15:44:08] (03CR) 10Giuseppe Lavagetto: [C: 032] move mc1007 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191309 (owner: 10Giuseppe Lavagetto) [15:44:21] (03PS2) 10Giuseppe Lavagetto: move mc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191310 [15:45:27] (03CR) 10Giuseppe Lavagetto: [C: 032] move mc1008 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191310 (owner: 10Giuseppe Lavagetto) [15:48:55] 3Services, RESTBase, operations: /dev/sdc offline in 
restbase1006, recurring mpt2sas message in dmesg - https://phabricator.wikimedia.org/T89639#1049898 (10fgiunchedi) a:5fgiunchedi>3Cmjohnson moving to @cmjohnson for reseat/diagnosis [15:49:12] !log oblivian Synchronized wmf-config/session.php: mc1007-8 IP change (duration: 00m 06s) [15:49:16] Logged the message, Master [15:50:27] * anomie sees nothing for SWAT this morning [15:50:57] 3Ops-Access-Requests, RESTBase, operations: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1049901 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi this should be fixed, access granted [15:50:58] 3Scrum-of-Scrums, RESTBase, operations, Services: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1049904 (10fgiunchedi) [15:52:26] 3RESTBase-Cassandra, operations: use correct datacenter/rack for cassandra nodes - https://phabricator.wikimedia.org/T89657#1049906 (10fgiunchedi) 5Open>3Resolved now rack and dc are set according to hiera [15:54:03] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Configuring git-fat to work with Archiva [software/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/191573 (owner: 10Filippo Giunchedi) [15:54:10] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] add README.md [software/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/191574 (owner: 10Filippo Giunchedi) [15:54:15] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] add metrics-ganglia and metrics-graphite via git-fat [software/dropwizard-metrics] - 10https://gerrit.wikimedia.org/r/191575 (owner: 10Filippo Giunchedi) [15:56:29] (03CR) 10BBlack: [C: 04-2] "I'm also not thrilled about where this .wiki plan is going. 
If for the short term we need to turn these domains on so that they're not la" [dns] - 10https://gerrit.wikimedia.org/r/191104 (https://phabricator.wikimedia.org/T88873) (owner: 10Dzahn) [15:57:12] 3Ops-Access-Requests, operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1049924 (10chasemp) >>! In T89053#1049211, @ArielGlenn wrote: > The problem for me is that we had some folks on the general ops queues in RT a... [15:58:18] could do https://gerrit.wikimedia.org/r/#/c/191540/ :P [15:59:38] <^d> Maybe https://gerrit.wikimedia.org/r/#/c/190106/ [15:59:42] <^d> It's still noisy [16:00:05] manybubbles, anomie, ^d, marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150219T1600). [16:00:26] Haven't we taught jouncebot to count the number of patches yet [16:00:30] (03CR) 10Chad: [C: 032] comment: Fix instance name of $wmgParsoidURL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191540 (owner: 10KartikMistry) [16:00:37] (03Merged) 10jenkins-bot: comment: Fix instance name of $wmgParsoidURL [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191540 (owner: 10KartikMistry) [16:00:47] yeah [16:00:54] (03PS2) 10Giuseppe Lavagetto: move mc1009 to a new rack/row [dns] - 10https://gerrit.wikimedia.org/r/191276 [16:01:18] !log demon Synchronized wmf-config/CommonSettings-labs.php: (no message) (duration: 00m 07s) [16:01:23] Logged the message, Master [16:01:25] <^d> That wasn't even worth a swat [16:01:33] <_joe_> hey deployers, I'm moving servers around. I may need to do a sync file in ~ 20 mins [16:01:45] <_joe_> do you want to swat it? 
:P [16:01:57] I wonder what's going on in https://gerrit.wikimedia.org/r/#/c/190406/1 [16:02:15] (03CR) 10Giuseppe Lavagetto: [C: 032] move mc1009 to a new rack/row [dns] - 10https://gerrit.wikimedia.org/r/191276 (owner: 10Giuseppe Lavagetto) [16:02:41] (03PS2) 10Giuseppe Lavagetto: move mc1010 to a new rack/row [dns] - 10https://gerrit.wikimedia.org/r/191277 [16:03:17] (03CR) 10Giuseppe Lavagetto: [C: 032] move mc1010 to a new rack/row [dns] - 10https://gerrit.wikimedia.org/r/191277 (owner: 10Giuseppe Lavagetto) [16:04:02] ^d: Thanks ;) [16:04:06] <^d> yw [16:04:54] (03CR) 10BBlack: [C: 04-2] "Same comments as the other dotwiki patch, except with these the problem grows from x518 to x546. Don't even ask me how we'd manage adding" [dns] - 10https://gerrit.wikimedia.org/r/191109 (https://phabricator.wikimedia.org/T88873) (owner: 10Dzahn) [16:05:32] chasemp: one note on the operations tag ticket: the subject is exactly about ease of use, and not the lost access to stuff. thanks for your effort to explain the currect status. [16:05:53] ^d: no more patches for SWAT right? [16:06:01] <^d> Nothing's on the list [16:06:04] matanya: sure I was mainly responding to apergos and clarifying in general [16:06:11] I can start cx stuff early if you're fine. [16:07:54] (03PS3) 10Giuseppe Lavagetto: nutcracker: move and label mc1009 [puppet] - 10https://gerrit.wikimedia.org/r/191294 [16:13:31] ^d: can we start then? Or should be on time? :) [16:13:57] <^d> No objections from me [16:15:07] ^d: ok. will start cxserver deployment first. I need to do that first before CX. [16:15:20] and then Puppet patch to merge. [16:15:39] Will need someone from Ops with +2 :) [16:16:46] !log started cxserver deployment [16:16:50] Logged the message, Master [16:19:00] blah [16:19:04] ottomata: around? [16:19:37] hiya yup [16:20:06] ottomata: The repository is dirty. Please commit or revert any uncommitted changes. - why? 
[16:20:20] just did git pull and git deploy sync [16:20:32] submodule problem again? [16:20:43] <^d> What's git status say? [16:21:08] modified: src [16:21:10] <^d> `git submodule update` then? A pull doesn't update submodules [16:21:16] which is fine. [16:21:58] right! [16:22:03] manybubbles, are you swating? [16:22:10] kart, where? [16:22:15] (03CR) 10Giuseppe Lavagetto: [C: 032] nutcracker: move and label mc1009 [puppet] - 10https://gerrit.wikimedia.org/r/191294 (owner: 10Giuseppe Lavagetto) [16:22:19] are you seeing that "repository is dirty"...? [16:22:21] kart_: ^ [16:22:29] (03PS3) 10Giuseppe Lavagetto: nutcracker: move and label mc1010 [puppet] - 10https://gerrit.wikimedia.org/r/191295 [16:22:36] <^d> I was swatting, but we declared swat over. No patches. [16:22:38] ottomata: ys. started by submodule update now. [16:22:40] (03CR) 10Legoktm: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188393 (https://phabricator.wikimedia.org/T85182) (owner: 10Legoktm) [16:22:47] anomie, ^d, marktraceur, are you swating? [16:22:49] ? [16:22:56] (03CR) 10Giuseppe Lavagetto: [C: 032] nutcracker: move and label mc1010 [puppet] - 10https://gerrit.wikimedia.org/r/191295 (owner: 10Giuseppe Lavagetto) [16:23:04] (03CR) 10Giuseppe Lavagetto: [V: 032] nutcracker: move and label mc1010 [puppet] - 10https://gerrit.wikimedia.org/r/191295 (owner: 10Giuseppe Lavagetto) [16:23:18] yurikR: I'm not, there's nothing requested this morning. [16:23:28] anomie, could i ask you to push out https://gerrit.wikimedia.org/r/#/c/191541/ [16:23:35] !log Updated cxserver to 395be27 [16:23:41] Logged the message, Master [16:23:44] <^d> I let the next deployment window start early since swat was empty [16:23:56] yurikR: What ^d said. 
[16:24:00] ottomata: done :) [16:24:22] ottomata: can you merge, https://gerrit.wikimedia.org/r/#/c/190990/ [16:24:32] 3Ops-Access-Requests, operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1049974 (10faidon) Thanks Chase, that's useful. Where do we use #operations as an ACL object? [16:24:34] ah, np, thx. kart_ let me know when you are done with depl ) [16:24:36] ottomata: alex already has +2 on it. [16:24:44] yurikR: sorry about that. [16:24:51] <^d> Also, it's not even merged or reviewed on master, out of scope for swat unless we're unbreaking something major imho [16:24:53] (03PS2) 10Legoktm: composer.json: Set classmap-authoritative: true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188393 (https://phabricator.wikimedia.org/T85182) [16:24:56] kart_, absolutely no worries ) [16:24:59] yurikR: but now I have to finish and this is first time deployment :) [16:25:08] (out of mw train) [16:25:15] enjoy ))) [16:25:19] kart_ yes, alex told me you might want that :) [16:25:23] (03PS6) 10Ottomata: Use compact registry format for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/190990 (owner: 10KartikMistry) [16:25:41] (03CR) 10Ottomata: [C: 032 V: 032] Use compact registry format for cxserver [puppet] - 10https://gerrit.wikimedia.org/r/190990 (owner: 10KartikMistry) [16:26:13] oops, _joe_, i just merged two commits of yours [16:27:27] PROBLEM - Cassandra database on restbase1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [16:27:31] kart_: merged [16:27:38] ottomata: thanks. [16:27:48] you want me to force a puppet run anywhere? [16:27:54] cassandra database page? [16:27:57] PROBLEM - puppet last run on restbase1006 is CRITICAL: CRITICAL: Puppet has 1 failures [16:27:58] why the hell is that paging? 
[16:28:08] yurikR: Punctuality is part of the SWAT rules - not always followed, but good to keep in mind :) [16:28:08] by that I mean, why is it a paging alert [16:28:28] ottomata: me? [16:28:45] marktraceur, punctuality?? what's that? never heard of it ( [16:28:53] <_joe_> ottomata: why did you merge my changes???? [16:29:04] <_joe_> shit just read [16:29:19] didn't expect that alarm on restbase1006, apologies [16:29:23] <_joe_> well, luckily it was correct, or now we'd be down [16:29:57] paravoid: we got that page last night too (my night) but it resolved itself shortly [16:30:02] haha, _joe_, don't merge in gerrit if you don't want them merged! [16:30:09] and no one seemed to be around, not sure who it paged but yeah, it's been happening [16:30:10] * yurikR remembers the "you haven't worked at WMF until you brought down the servers" motto... [16:30:18] ottomata: how much time it will take to take into effect? [16:30:28] <_joe_> ottomata: well, I was about to merge them [16:30:36] <_joe_> we do have a 10 minutes of tolerance right? [16:30:43] kart_ 30ish minutes max? [16:30:44] (PS1) Faidon Liambotis: Remove critical => true from Cassandra alert [puppet] - https://gerrit.wikimedia.org/r/191631 [16:30:52] _joe_? 
[16:31:12] <_joe_> usually I merge and prepare pre-flight checks, then puppet-merge it [16:31:18] (CR) Filippo Giunchedi: [C: 1] Remove critical => true from Cassandra alert [puppet] - https://gerrit.wikimedia.org/r/191631 (owner: Faidon Liambotis) [16:31:37] (CR) Faidon Liambotis: [C: 2 V: 2] Remove critical => true from Cassandra alert [puppet] - https://gerrit.wikimedia.org/r/191631 (owner: Faidon Liambotis) [16:31:45] <_joe_> but all's well, so don't worry :) [16:31:57] i should have asked you before I said yes, but somehow was on autopilot right then [16:32:18] but, ja, if you merge in gerrit, and don't want merge on puppet master, then you are blocking others from merging [16:32:28] soooo, i think we shouldn't merge in gerrit unless we are really ready to have stuff live. [16:32:39] but, still, we should avoid merging others' stuff unexpectedly if we can [16:35:29] <_joe_> ottomata: the net conclusion is we're working too much [16:37:00] (PS2) Giuseppe Lavagetto: move mc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/191311 [16:37:13] ^d: we're good to go. [16:37:31] (CR) Giuseppe Lavagetto: [C: 2] move mc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/191311 (owner: Giuseppe Lavagetto) [16:37:47] * ^d hands kart_ the deployment baton [16:37:53] <^d> Go forth and do great things [16:38:04] <_joe_> hey, I'm deploying a change right now [16:38:10] <_joe_> can you hold on for a sec? 
[16:38:13] _joe_: sure [16:38:20] _joe_: let me know when done :) [16:38:24] (CR) Giuseppe Lavagetto: [V: 2] move mc1009 [mediawiki-config] - https://gerrit.wikimedia.org/r/191311 (owner: Giuseppe Lavagetto) [16:38:47] (PS2) Giuseppe Lavagetto: move mc1010 [mediawiki-config] - https://gerrit.wikimedia.org/r/191312 [16:39:02] (CR) Giuseppe Lavagetto: [C: 2 V: 2] move mc1010 [mediawiki-config] - https://gerrit.wikimedia.org/r/191312 (owner: Giuseppe Lavagetto) [16:39:57] !log oblivian Synchronized wmf-config/session.php: mc1009-10 IP change (duration: 00m 05s) [16:40:00] Logged the message, Master [16:40:06] <_joe_> kart_: all yours [16:41:27] * kart_ is on tin now [16:42:45] ^d: you need to merge those two patches, I guess. [16:42:52] <^d> links? [16:42:53] but correct me. [16:43:06] https://gerrit.wikimedia.org/r/#/c/191564/ [16:43:07] and [16:43:18] https://gerrit.wikimedia.org/r/#/c/191568/ [16:43:30] can't you merge them? [16:43:52] Krenair: no +2 on core [16:44:04] someday :) [16:44:05] I can merge, but I think we ought to fix that [16:44:12] <^d> Deployers should have magic rights to wmf/* branches [16:44:22] you should be able to merge them if they set up your deployment permissions correctly. I had to get mine fixed separately [16:44:46] Krenair: enlighten me. [16:44:48] <^d> Gerrit perms are separate from getting tin, etc. [16:44:57] right [16:44:57] <^d> And nobody tells me when we add a deployer [16:45:02] <^d> So this conversation happens :) [16:45:06] <^d> (you have the rights now) [16:45:07] heh [16:45:12] godog: good morning [16:45:17] hey gwicke [16:45:26] hey, just tried the login, but still no dice [16:45:31] kart_: I +2ed the other one, you can try https://gerrit.wikimedia.org/r/#/c/191568/ [16:45:49] ^d: huh, we should fix that (nobody telling you/the process breaking down there)... 
/me thinks [16:45:58] (still a little early for me thinking though) [16:45:59] gwicke: I'll take a look [16:46:10] <^d> greg-g: If I get added to the phab ticket it can happen easily enough [16:46:15] <^d> (or any gerrit admin, really) [16:46:15] * greg-g nods [16:46:35] Is jenkins dead? [16:46:41] <^d> being restarted [16:46:50] Ah ok [16:47:04] gwicke: on which host? [16:47:41] godog: ohhh, just got into restbase1003 [16:47:44] nm ;) [16:47:59] let me try the others [16:48:22] Nikerabbit: +2ed. Dead Jenkins? [16:48:25] <^d> kart_, Nikerabbit: Went ahead and merged both since jenkins is mid-restart [16:48:35] <^d> You can start your tin work now [16:48:44] works. [16:49:50] ok. Now I need you ^d :) [16:49:56] Permission denied. [16:49:59] godog: only 1001 and 1002 seem to be missing [16:49:59] git fetch [16:50:18] godog: symptom is a hang of the ssh connection [16:50:24] <^d> kart_: Which branch? [16:50:47] godog: but we have access to enough nodes to get started, so are unblocked; thanks! [16:51:07] gwicke: the diagnosis is simple, those two are the ones waiting on memcache move [16:51:13] ^d: wmf17 [16:51:23] godog: ah, okay ;) [16:51:36] I'm in: /srv/mediawiki-staging/php-1.25wmf17 [16:51:40] ^d: ^ [16:52:06] godog: I think from a testing perspective it would be great to make sure that the rack assignments are correct per node [16:52:07] <^d> fetch wfm... [16:52:13] gwicke: they are [16:52:27] sigh [16:52:28] oh [16:52:43] godog: nm then, you were ahead of us there [16:52:56] gwicke: as in rack == row if I remember correctly ? [16:52:58] do you mean in the cassandra config? [16:53:28] ^d: any solution/clue? [16:53:29] yeah, we treat rows as logical racks for replica placement purposes [16:53:33] <_joe_> !log shutting down mc1009 and mc1010 [16:53:35] yeah, cassandra-rackdc.properties gwicke [16:53:36] Logged the message, Master [16:53:40] <^d> kart_: Offhand, no [16:53:55] Nikerabbit: can you try? 
[16:54:03] godog: indeed [16:54:09] kart_: sure [16:54:32] <^d> I already fetched, but haven't rebased yet [16:54:46] kart_: though, did you use ssh -A tin? [16:55:06] gwicke: I noticed cassandra gets a little bit confused on the first puppet run, however removing its data dir and restarting does the trick since the cluster is empty anyway [16:55:07] Yep [16:55:27] <^d> Same key for prod & gerrit, or both keys in chain? [16:55:50] git fetch works for me [16:56:00] ^d: non-mw deployment works so far. [16:56:08] godog: *nod*, perhaps initialized before the config is updated [16:56:13] <^d> hmm [16:56:50] ^d: we just need git submodule update, right? [16:57:04] gwicke: I think so yeah, it seems racy when initializing the internal databases and then fails to autodetect it should do auto_bootstrap or some such [16:57:24] <^d> kart_: Need to rebase in your commit, then submodule update [16:57:27] gwicke: oh also because it fails on first puppet run because cassandra-env.sh isn't there yet [16:57:28] PROBLEM - Host mc1012 is DOWN: PING CRITICAL - Packet loss = 100% [16:57:50] kart_: want me to do it? [16:58:22] godog: makes sense [16:58:39] Nikerabbit: yes please. [16:59:00] PROBLEM - Host mc1011 is DOWN: PING CRITICAL - Packet loss = 100% [16:59:04] I doubt not working git fetch will make me move forward. [16:59:29] kart_: well git fetch has been done, rebase should not need connection [16:59:55] (PS1) Ottomata: Refinery related tweaks for otto's bash profile [puppet] - https://gerrit.wikimedia.org/r/191637 [17:00:09] but I'll do wmf17 [17:00:15] Nikerabbit: ok. [17:00:35] Nikerabbit: can you list command you run here for my knowledge/or in PM? 
[17:00:47] (PS2) Ottomata: Refinery related tweaks for otto's bash profile [puppet] - https://gerrit.wikimedia.org/r/191637 [17:02:39] (CR) Ottomata: [C: 2 V: 2] Refinery related tweaks for otto's bash profile [puppet] - https://gerrit.wikimedia.org/r/191637 (owner: Ottomata) [17:05:10] Ops-Access-Requests, operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1050108 (chasemp) >>! In T89053#1049974, @faidon wrote: > Thanks Chase, that's useful. Where do we use #operations used as an ACL object? S... [17:08:16] Ops-Access-Requests, operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1050110 (chasemp) The other solution is to make an open project always follow #operations via herald which is ugly and hate it. But technic... [17:10:10] !log kartik Started scap: ContentTranslation update [17:10:12] Logged the message, Master [17:10:36] Nikerabbit: ^d ^^ [17:10:59] * kart_ won't forget to use tmux next time... [17:11:14] good idea [17:12:08] <^d> You guys updated wmf17 & 18 before scap, right? [17:12:12] * ^d only saw 17 in channel [17:13:27] ^d: yes kart_ did 18 [17:13:38] <^d> okie dokie just making sure [17:14:28] (PS2) Giuseppe Lavagetto: move mc1011 to a new rack/row [dns] - https://gerrit.wikimedia.org/r/191278 [17:15:07] Ops-Access-Requests, operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1050115 (faidon) Well, first of all: - I'm not sure if there's much value for //every// opsen to be in #Security. There are a few people in... 
[17:15:08] (CR) Giuseppe Lavagetto: [C: 2] move mc1011 to a new rack/row [dns] - https://gerrit.wikimedia.org/r/191278 (owner: Giuseppe Lavagetto) [17:16:11] (CR) Giuseppe Lavagetto: [V: 2] move mc1011 to a new rack/row [dns] - https://gerrit.wikimedia.org/r/191278 (owner: Giuseppe Lavagetto) [17:17:00] (PS2) Giuseppe Lavagetto: move mc1012 to a new rack/row [dns] - https://gerrit.wikimedia.org/r/191279 [17:17:30] (CR) Giuseppe Lavagetto: [C: 2 V: 2] move mc1012 to a new rack/row [dns] - https://gerrit.wikimedia.org/r/191279 (owner: Giuseppe Lavagetto) [17:18:23] <^d> kart_: Did you also prepare a patch for the branch making tool like I said in e-mail? If not, your work today will disappear on Wednesday for wmf19 [17:20:21] RECOVERY - Host mc1012 is UP: PING OK - Packet loss = 0%, RTA = 1.02 ms [17:20:21] RECOVERY - Host mc1011 is UP: PING OK - Packet loss = 0%, RTA = 1.69 ms [17:21:02] back [17:21:15] hey kart_ is everything OK with cx [17:21:17] ? [17:21:22] ottomata: thanks for merging ! [17:22:11] (PS3) Giuseppe Lavagetto: nutcracker: move and label mc1011 [puppet] - https://gerrit.wikimedia.org/r/191296 [17:22:35] (CR) Giuseppe Lavagetto: [C: 2 V: 2] nutcracker: move and label mc1011 [puppet] - https://gerrit.wikimedia.org/r/191296 (owner: Giuseppe Lavagetto) [17:22:42] ^d: Is jenkins recovering? [17:22:50] <^d> bd808: ^? 
[17:22:53] <^d> You were poking that [17:22:54] (PS3) Giuseppe Lavagetto: nutcracker: move and label mc1012 [puppet] - https://gerrit.wikimedia.org/r/191297 [17:23:19] (CR) Giuseppe Lavagetto: [C: 2 V: 2] nutcracker: move and label mc1012 [puppet] - https://gerrit.wikimedia.org/r/191297 (owner: Giuseppe Lavagetto) [17:23:24] ^d: in meeting :( [17:25:31] <^d> !log jenkins stuck communicating to beta, restarting [17:25:34] Logged the message, Master [17:25:41] Ops-Access-Requests, operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1050131 (chasemp) * No real opinion on all ops folks in #security or not, the idea here is that #security and #operations give policy manage... [17:28:01] akosiaris: yep. Thanks! [17:28:39] ^d: Which patch we've to stick then? The current update? [17:28:51] ^d: that sounds scary :) [17:29:02] <^d> It's pretty easy, especially if you have a tag :) [17:29:56] ^d: for example I recreated wmf17/wmf18 for CX. Will that work? [17:30:02] or need new tag? [17:30:36] <^d> That works, but what you want to do is keep wmf19 from being done from master [17:30:49] <^d> (which will happen without configuration in the make-wmf-branch script) [17:31:35] ^d: I think master is fine actually, this was the only known thing to require lockstep updates with cxserver [17:31:54] <^d> Oh? I thought this was going to be a continuing thing. [17:32:51] ^d: yes for having to do frequent updates [17:33:11] we introduced API versioning to cxserver, that should avoid these kinds of lockstep requirements in the future [17:33:23] <^d> Ah, so then my advice was unneeded :p [17:33:33] kart_: you agree? [17:33:39] Nikerabbit: yep. 
[17:33:44] <^d> You guys just do a deploy window and update to whatever whenever then [17:34:05] ^d: thanks anyway, it was good to know [17:34:22] (PS2) Giuseppe Lavagetto: move mc1011 [mediawiki-config] - https://gerrit.wikimedia.org/r/191313 [17:34:24] <^d> np. It's there if you need it [17:34:46] I agree with Nikerabbit! [17:35:41] !log kartik Finished scap: ContentTranslation update (duration: 25m 31s) [17:35:45] Logged the message, Master [17:36:02] ok. Now we're done! [17:36:18] Nikerabbit: ^d and hopefully nothing breaks :) [17:36:28] I verified es.wikipedia already, it works [17:37:40] <^d> hoo, bd808: jenkins seems back, beta-scap jobs seem to be working again [17:38:36] Ops-Access-Requests, operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1050175 (Krenair) Maybe members of a project should be a subset of the watchers, rather than the other way around. [17:38:53] (CR) Giuseppe Lavagetto: [C: 2 V: 2] move mc1011 [mediawiki-config] - https://gerrit.wikimedia.org/r/191313 (owner: Giuseppe Lavagetto) [17:39:08] (PS2) Giuseppe Lavagetto: move mc1012 [mediawiki-config] - https://gerrit.wikimedia.org/r/191314 [17:39:43] operations, MediaWiki-Core-Team: Unexpected N4HPHP13DataBlockFullE - https://phabricator.wikimedia.org/T89958#1050178 (Chad) Errors seem to have dissipated from hhvm.log. Temporarily because this can still happen again. Gerrit patch seems like a good idea to me. [17:40:10] (CR) Giuseppe Lavagetto: [C: 2 V: 2] move mc1012 [mediawiki-config] - https://gerrit.wikimedia.org/r/191314 (owner: Giuseppe Lavagetto) [17:40:29] Thank you ^d ottomata Nikerabbit and akosiaris for this deployment :) [17:40:38] I will poke you all again :D [17:40:48] * Nikerabbit hides [17:41:00] and where is mighty greg-g :) [17:41:07] Thanks ^^ [17:41:19] * kart_ off to bed now. [17:42:08] ? 
[17:42:09] :) [17:42:42] !log oblivian Synchronized wmf-config/session.php: mc1011-12 IP change (duration: 00m 05s) [17:42:48] Logged the message, Master [17:44:50] ops-eqiad, operations, Incident-20150205-SiteOutage: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#1050192 (Joe) All servers have been moved/brought back online. [17:45:01] ops-eqiad, operations: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1050196 (Joe) [17:45:02] ops-eqiad, operations, Incident-20150205-SiteOutage: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#1050195 (Joe) Open>Resolved [17:45:37] <_joe_> and that would be all, folks [17:46:15] Labs, Wikimedia-Labs-Infrastructure, operations: Make labs/private really private - https://phabricator.wikimedia.org/T89642#1050200 (coren) >>! In T89642#1049875, @BBlack wrote: > API keys and such that aren't as private as our production-private stuff, but which we'd rather not blast out to the entire plane... [17:46:15] !log demon Synchronized php-1.25wmf17/extensions/DoubleWiki/DoubleWiki_body.php: shut up warnings finally (duration: 00m 05s) [17:46:17] Logged the message, Master [17:46:27] <^d> damn those were getting annoyin [17:48:36] kart_, done deploying? [17:49:45] Labs, Wikimedia-Labs-Infrastructure, operations: Make labs/private really private - https://phabricator.wikimedia.org/T89642#1050214 (KartikMistry) I would like to point that this also applies to Beta Cluster. [17:50:49] greg-g, still deploying stuff? [17:53:42] I'm not, ask the people who were [17:54:59] ^d, are you done with the depl? 
[17:55:08] kart_ seems to be AFK [17:55:12] <^d> Yes, we're done [17:55:40] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet last ran 6 days ago [17:57:50] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [17:59:27] (PS2) BBlack: Add normalize_path to mobile vcl_recv [puppet] - https://gerrit.wikimedia.org/r/191136 [17:59:29] (PS3) BBlack: varnish+jessie filesystem stuff [puppet] - https://gerrit.wikimedia.org/r/190610 [18:00:00] Labs, Wikimedia-Labs-Infrastructure, operations: Make labs/private really private - https://phabricator.wikimedia.org/T89642#1050241 (Krenair) I'm assuming you have a particular secret in mind that you want to put in deployment-prep somewhere. What instances would need to be able to access that exactly? Some... [18:00:15] (CR) BBlack: [C: 2 V: 2] Add normalize_path to mobile vcl_recv [puppet] - https://gerrit.wikimedia.org/r/191136 (owner: BBlack) [18:00:35] (PS4) BBlack: varnish: fix GeoIP's get_relevant_ip function [puppet] - https://gerrit.wikimedia.org/r/190964 (owner: Faidon Liambotis) [18:00:52] (CR) BBlack: [C: 2 V: 2] varnish: fix GeoIP's get_relevant_ip function [puppet] - https://gerrit.wikimedia.org/r/190964 (owner: Faidon Liambotis) [18:02:06] godog: thanks for the fix to restbase access [18:02:08] (PS1) Filippo Giunchedi: set permissions on cassandra files [puppet/cassandra] - https://gerrit.wikimedia.org/r/191651 [18:02:10] (PS1) Filippo Giunchedi: report cassandra metrics with metrics-graphite [puppet/cassandra] - https://gerrit.wikimedia.org/r/191652 (https://phabricator.wikimedia.org/T78514) [18:02:26] mutante: np, uncovered a couple of bugs in the process too [18:02:58] :) [18:04:31] godog: there was also T89922 systemd-sysv-generator did not generate files for some reason [18:05:00] (but then did after systemctl daemon-reload ) [18:06:14] Ops-Access-Requests, operations: access 
request for researcher to analytics-users in Hadoop - https://phabricator.wikimedia.org/T89264#1050289 (Ottomata) You'll need this in your .ssh/config ``` ForwardAgent no Host !bast1001.wikimedia.org *.wikimedia.org *.wmnet ProxyCommand ssh -a -W %h:%p bast1001... [18:06:24] mutante: ah, good to know, _joe_ might be interested too [18:09:05] (PS1) Filippo Giunchedi: cassandra: add cassandra::metrics class and deps [puppet] - https://gerrit.wikimedia.org/r/191654 (https://phabricator.wikimedia.org/T78514) [18:10:10] operations, Incident-20150205-SiteOutage: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1050302 (faidon) p:Unbreak!>High [18:11:20] PROBLEM - puppet last run on amssq56 is CRITICAL: CRITICAL: Puppet has 1 failures [18:14:49] Ops-Access-Requests, operations: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1050316 (Dzahn) I don't think Operations should (automatically) be project creators. There has been quite some discussion about (how much)... [18:19:01] RECOVERY - puppet last run on amssq56 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [18:20:22] operations, RESTBase: Detailed cassandra monitoring - https://phabricator.wikimedia.org/T78514#1050347 (Dzahn) Do we also want Icinga monitoring for things like running processes or is that overkill when we already monitor higher level metrics here? [18:20:49] godog: ping? [18:20:49] mobrovac: ping detected, please leave a message! 
[18:21:27] godog: i'm getting pubkey denial when trying to log into restbase100[3-6] [18:21:36] ([1,2] are timing out) [18:21:46] mobrovac: I'm out of the door sorry :( but yeah 1-2 are not provisioned [18:21:53] operations, RESTBase: Detailed cassandra monitoring - https://phabricator.wikimedia.org/T78514#1050356 (GWicke) @dzahn, there is already icinga monitoring for cassandra running. I think the detailed metrics should cover a lot of that too though, as a dead cassandra won't be sending any of those. [18:22:00] ah ok, [18:22:13] mobrovac: our onduty might be able to help with that tho! see topic [18:22:15] godog: i'll ping you tomorrow morning then [18:22:29] i'll be out the door soon too [18:22:42] cheers [18:23:05] operations, RESTBase: Detailed cassandra monitoring - https://phabricator.wikimedia.org/T78514#1050361 (GWicke) @fgiunchedi, that's a lot of detail. Should be useful to learn more about the inner workings that way. [18:23:18] mobrovac: wanna try one more time on 1003 ? [18:23:55] mutante: still pubkey denied [18:24:39] Services, RESTBase, operations: /dev/sdc offline in restbase1006, recurring mpt2sas message in dmesg - https://phabricator.wikimedia.org/T89639#1050366 (Cmjohnson) reseated /dev/sdc [18:25:24] Ops-Access-Requests, Services, operations, Citoid: Give mvolz access to sha machine i.e. http://citoid.wikimedia.org/ - https://phabricator.wikimedia.org/T89057#1050372 (Jdforrester-WMF) Approved. [18:27:18] mobrovac: so i see your key is accepted on bast1001 and it also exists on rest1003 but in the authlog over there i dont see your username [18:27:29] something with the ProxyCommand config ? [18:27:43] let me do a ssh -vvvvvvvv [18:31:15] mobrovac: try this in your .ssh/config https://phabricator.wikimedia.org/P312 and then "ssh restbase1003" [18:32:47] mutante: yey, now it works! [18:32:49] cheers [18:32:52] session opened for user mobrovac :) [18:32:53] yw [18:33:05] but, still, how do I log in from bast1001? 
[18:33:18] obviously doing ssh -A from my laptop is not enough [18:33:34] we prefer it if you don't use -A but just proxycommand [18:33:49] so after using the config above you don't need to -A anymore [18:34:07] ah ok [18:34:09] good to know [18:34:13] thnx for the tip mutante [18:34:28] that way you dont have to forward your agent [18:35:26] so i'd say either connect to restbase hosts like that directly "through" the bastion or just to the bastion itself if you need things there [18:39:01] yep [18:39:09] i prefer it that way too [18:39:23] (so I can directly copy my vimrc :P) [18:39:42] (PS1) Ottomata: Puppetizing spark (in YARN) [puppet/cdh] - https://gerrit.wikimedia.org/r/191665 [18:48:10] (PS2) Ottomata: Puppetizing spark (in YARN) [puppet/cdh] - https://gerrit.wikimedia.org/r/191665 [18:57:21] PROBLEM - NTP on mc1009 is CRITICAL: NTP CRITICAL: Offset unknown [19:11:54] Ops-Access-Requests, operations: access request for researcher to analytics-users in Hadoop - https://phabricator.wikimedia.org/T89264#1050639 (leila) @Ottomata, I'm sorry to resurrect this task. Can you use Ashwin's new public key here: https://wikitech.wikimedia.org/wiki/User:Ashwinpp instead of the old o... [19:17:37] Mediawiki experts: Should importing a dump as big as a whole wiki cause an outage during the import? 
I get an outage notice every night during the wikitech-static sync, I’m wondering if that’s a normal effect of importDump.php/rebuildrecentchanges.php or if something interesting is happening [19:17:42] (PS1) Ottomata: Change SSH key for Ashwinpp [puppet] - https://gerrit.wikimedia.org/r/191668 (https://phabricator.wikimedia.org/T89264) [19:18:01] PROBLEM - Apache HTTP on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.010 second response time [19:18:01] PROBLEM - HHVM rendering on mw1207 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.012 second response time [19:18:08] (CR) Ottomata: [C: 2 V: 2] Change SSH key for Ashwinpp [puppet] - https://gerrit.wikimedia.org/r/191668 (https://phabricator.wikimedia.org/T89264) (owner: Ottomata) [19:19:22] (PS1) MaxSem: Enable WikiGrok in repo mode on wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/191670 [19:19:38] <^d> hhvm needs kicking on mw1207 [19:19:47] <^d> Same cache exhaustion as other hosts earlier [19:19:56] akosiaris: yt? [19:19:58] actually [19:22:30] <^d> ottomata: Can you bounce the hhvm service on mw1207? 
[19:22:55] ok [19:23:03] <^d> Thanks [19:23:06] done [19:23:30] <^d> Much better, thanks [19:23:31] RECOVERY - Apache HTTP on mw1207 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.071 second response time [19:23:41] RECOVERY - HHVM rendering on mw1207 is OK: HTTP OK: HTTP/1.1 200 OK - 67154 bytes in 0.234 second response time [19:24:37] ops-eqiad, operations: cr1-eqiad power supply fan failure - https://phabricator.wikimedia.org/T89224#1050728 (Cmjohnson) a:Cmjohnson [19:26:31] RECOVERY - puppet last run on ruthenium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:28:16] !log ran puppet on ruthenium [19:28:22] Logged the message, Master [19:29:07] (PS3) Ottomata: create shell user for Marielle Volz [puppet] - https://gerrit.wikimedia.org/r/190405 (https://phabricator.wikimedia.org/T89057) (owner: Dzahn) [19:29:19] (CR) Chad: [C: 1] "When do we want this to go out?" [mediawiki-config] - https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: BryanDavis) [19:30:15] (PS4) Ottomata: create shell user for Marielle Volz [puppet] - https://gerrit.wikimedia.org/r/190405 (https://phabricator.wikimedia.org/T89057) (owner: Dzahn) [19:31:11] (CR) Dzahn: [C: 1] "ah:) you added the role to yaml. thanks" [puppet] - https://gerrit.wikimedia.org/r/190405 (https://phabricator.wikimedia.org/T89057) (owner: Dzahn) [19:31:19] ^d: I'd like at least grudging approval of it from ori but for a group0 only deploy you and faidon not hating it may be enough? 
[19:32:42] PROBLEM - HHVM rendering on mw1193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.033 second response time [19:33:11] PROBLEM - Apache HTTP on mw1193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.012 second response time [19:36:48] (CR) Ottomata: [C: 2] create shell user for Marielle Volz [puppet] - https://gerrit.wikimedia.org/r/190405 (https://phabricator.wikimedia.org/T89057) (owner: Dzahn) [19:38:29] Ops-Access-Requests, Services, operations, Citoid: Give mvolz access to sha machine i.e. http://citoid.wikimedia.org/ - https://phabricator.wikimedia.org/T89057#1050816 (Ottomata) Open>Resolved Thanks, @Mvolz, you should be good to go. Add this to your .ssh/config: ``` ForwardAgent no Host !bast1001... [19:39:24] <^d> bd808: Grudging? Am I missing something? [19:39:31] * ^d read T88732 [19:40:21] He just has seemed cold to the whole logstash thing since the outage. I may be projecting [19:41:33] operations: boron passive checks aren't being collected - https://phabricator.wikimedia.org/T89983#1050832 (Jgreen) NEW a:Jgreen [19:42:51] PROBLEM - HHVM rendering on mw1190 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.009 second response time [19:43:41] PROBLEM - Apache HTTP on mw1190 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.021 second response time [19:44:07] operations: boron passive checks aren't being collected - https://phabricator.wikimedia.org/T89983#1050852 (Jgreen) probably relevant -- we recently upgraded boron from precise to trusty, and someone mentioned that nsca may be broken for trusty? 
[19:45:11] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [19:46:04] bd808: sorry, reviewing now [19:46:25] ops-eqiad, operations: mw1062 needs a disk replacement - https://phabricator.wikimedia.org/T86542#1050861 (Dzahn) a:Dzahn [19:48:04] mobrovac: ^ is that testing or do you want us to check on cassandra there ^ [19:48:32] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [19:49:18] ok:) [19:54:47] !log reinstalling mw1062 after disk has been replaced [19:54:50] Logged the message, Master [20:13:24] (PS1) Hashar: contint: +libdistro-info-perl [puppet] - https://gerrit.wikimedia.org/r/191677 [20:14:56] (CR) Ori.livneh: [C: 2] hhvm: raise the amount of cold TC cache available [puppet] - https://gerrit.wikimedia.org/r/191620 (owner: Giuseppe Lavagetto) [20:15:08] _joe_: +2'd but did not submit [20:15:23] (CR) Ottomata: [C: 2] "Tested in vagrant and labs, woot!" 
[puppet/cdh] - https://gerrit.wikimedia.org/r/191665 (owner: Ottomata) [20:15:25] !log readding mw1062 to puppet, signing new cert and salt-key [20:15:28] Logged the message, Master [20:16:06] Wikimedia-Fundraising, operations: Need access to PHP error logs on lutetium - https://phabricator.wikimedia.org/T89992#1050935 (Jgreen) [20:16:09] <_joe_> ori: thanks [20:16:30] <_joe_> ori: that should solve the problem for now [20:17:07] (PS1) Ottomata: Install spark on analytics client nodes [puppet] - https://gerrit.wikimedia.org/r/191678 [20:17:15] (PS1) Dzahn: Revert "remove mw1062 from dsh groups - read-only fs" [puppet] - https://gerrit.wikimedia.org/r/191679 [20:17:17] (PS2) Ottomata: Install spark on analytics client nodes [puppet] - https://gerrit.wikimedia.org/r/191678 [20:17:26] _joe_: thank you [20:17:55] (PS2) Dzahn: Revert "remove mw1062 from dsh groups - read-only fs" [puppet] - https://gerrit.wikimedia.org/r/191679 (https://phabricator.wikimedia.org/T86542) [20:18:12] (PS3) Ottomata: Install spark on analytics client nodes [puppet] - https://gerrit.wikimedia.org/r/191678 [20:20:44] RECOVERY - Apache HTTP on mw1062 is OK: HTTP OK: HTTP/1.1 200 OK - 11783 bytes in 0.013 second response time [20:20:46] (CR) Ottomata: [C: 2] Install spark on analytics client nodes [puppet] - https://gerrit.wikimedia.org/r/191678 (owner: Ottomata) [20:20:54] RECOVERY - Apache HTTP on mw1190 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.063 second response time [20:21:15] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 67154 bytes in 0.187 second response time [20:21:23] mw1193 needs kicking [20:21:53] <_joe_> TomDaley: we know [20:22:08] <_joe_> TomDaley: chasemp is on it, I think [20:22:09] * TomDaley is fighting bouncer troubles, losing scrollback, sorry [20:22:29] mutante: nope, that was prod [20:22:36] having problems with data center names [20:22:41] but should be ok now [20:23:10] (CR) 
Hashar: "Cherry picked on integration puppetmaster." [puppet] - https://gerrit.wikimedia.org/r/191677 (owner: Hashar) [20:23:13] PROBLEM - puppet last run on cp3006 is CRITICAL: CRITICAL: puppet fail [20:23:54] RECOVERY - NTP on mw1062 is OK: NTP OK: Offset -0.0215164423 secs [20:24:57] mobrovac: alright [20:25:16] (CR) Hashar: "recheck" [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/167759 (owner: Alexandros Kosiaris) [20:26:17] (CR) Hashar: "recheck" [debs/contenttranslation/apertium] - https://gerrit.wikimedia.org/r/167759 (owner: Alexandros Kosiaris) [20:26:52] (CR) Dzahn: [C: 1] hhvm: raise the amount of cold TC cache available [puppet] - https://gerrit.wikimedia.org/r/191620 (owner: Giuseppe Lavagetto) [20:27:30] cmjohnson1 and Coren, are we all together and ready to send an outage notice for Tuesday? [20:27:33] (CR) Hashar: "recheck" [debs/contenttranslation/cg3] - https://gerrit.wikimedia.org/r/163579 (owner: KartikMistry) [20:27:43] andrewbogott: works for me [20:27:43] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 67154 bytes in 0.210 second response time [20:27:48] andrewbogott: I believe so as well. [20:27:54] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [20:27:58] ok, I will write [20:28:07] two hours? [20:28:10] (PS1) Ottomata: Fix spark log4j.properties name [puppet/cdh] - https://gerrit.wikimedia.org/r/191683 [20:28:29] andrewbogott: Plan for no less than two reboots while I fiddle with the raid settings and do some tests. I'd feel safer to schedule 3. [20:28:34] Labs, ops-eqiad, operations: virt1002 broken disk? 
- https://phabricator.wikimedia.org/T88923#1051044 (10Cmjohnson) a:3Cmjohnson [20:28:53] ok [20:29:07] (03CR) 10Ottomata: [C: 032] Fix spark log4j.properties name [puppet/cdh] - 10https://gerrit.wikimedia.org/r/191683 (owner: 10Ottomata) [20:29:13] coren virt1002 will be done tomorrow. These ciscos are tricky when it comes to disk replacements [20:29:30] Define "tricky" in context? [20:29:33] andrwbogott: yeah 2 hours, hopefully less [20:29:41] (03PS1) 10Ottomata: Update cdh module with spark log4j.properties fix [puppet] - 10https://gerrit.wikimedia.org/r/191684 [20:29:51] (03PS2) 10Ottomata: Update cdh module with spark log4j.properties fix [puppet] - 10https://gerrit.wikimedia.org/r/191684 [20:29:55] determining the actual disk that needs to be replaced is the main problem [20:30:01] "Fun". [20:30:12] (03CR) 10Hashar: "recheck" [debs/contenttranslation/hfst] - 10https://gerrit.wikimedia.org/r/179153 (https://phabricator.wikimedia.org/T76984) (owner: 10KartikMistry) [20:30:19] It's ID 0 on the scsi bus if that helps any. [20:30:20] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module with spark log4j.properties fix [puppet] - 10https://gerrit.wikimedia.org/r/191684 (owner: 10Ottomata) [20:30:20] godog spent some time on it and I forgot what he did...so I need to wait for him [20:30:26] that doesn't help [20:30:35] (03CR) 10Hashar: "recheck" [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/167763 (owner: 10Alexandros Kosiaris) [20:30:38] will do first thing [20:30:58] cmjohnson1: Another thing that may help is that the disk has been dropped of the arrays and so should be the only one that is completely idle. [20:31:33] you would think that would change the LED display but it doesn't they all just randomly blink...very frustrating. 
[20:31:56] virt1002# eject /dev/sda [20:31:57] :-) [20:33:24] RECOVERY - DPKG on mw1062 is OK: All packages OK [20:33:34] RECOVERY - nutcracker process on mw1062 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [20:33:34] RECOVERY - Disk space on mw1062 is OK: DISK OK [20:33:34] RECOVERY - RAID on mw1062 is OK: OK: no RAID installed [20:33:54] RECOVERY - dhclient process on mw1062 is OK: PROCS OK: 0 processes with command name dhclient [20:33:54] RECOVERY - configured eth on mw1062 is OK: NRPE: Unable to read output [20:34:15] RECOVERY - salt-minion processes on mw1062 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [20:34:24] RECOVERY - HHVM processes on mw1062 is OK: PROCS OK: 1 process with command name hhvm [20:34:24] RECOVERY - nutcracker port on mw1062 is OK: TCP OK - 0.000 second response time on port 11212 [20:35:30] (03PS1) 10GWicke: Set the cassandra localDc parameter [puppet] - 10https://gerrit.wikimedia.org/r/191686 [20:36:41] Coren: can I get a quick summary of the benefits of said outage? Moar Storage, and… ??? [20:36:50] (03PS2) 10GWicke: Set the cassandra localDc parameter [puppet] - 10https://gerrit.wikimedia.org/r/191686 [20:37:27] andrewbogott: Moar storage is the primary one; this is necessary to install the new shelf. But also, we'll be using the outage to do the overdue reboot because security. [20:37:53] 3ops-eqiad, operations: cr1-eqiad Control Board error - https://phabricator.wikimedia.org/T89999#1051077 (10Cmjohnson) 3NEW a:3Cmjohnson [20:37:57] Coren: ok. But this doesn’t get us any closer to having crash-resistant NFS, right? [20:38:50] 3ops-eqiad, operations: cr1-eqiad power supply fan failure - https://phabricator.wikimedia.org/T89224#1051091 (10Cmjohnson) 5Open>3Resolved Reseating PEM1 cleared the alarm. However a new alarm was triggered on CB1. A new task has been created. [20:38:55] ... no. 
[20:39:32] Besides, I don't think that's /possible/ - with NFS the aim is "fast recovery" not HA. [20:41:16] (03PS1) 10Dzahn: revoke demon's key [puppet] - 10https://gerrit.wikimedia.org/r/191688 [20:41:56] cmjohnson1: which task? [20:42:20] 3ops-eqiad, operations: dysprosium failed idrac - https://phabricator.wikimedia.org/T88129#1051106 (10Cmjohnson) the service tag on the server after the system board swap does not match the correct service tag. Dell tech support sent me an iso to boot from that should allow me to change the service tag to the... [20:42:39] paravoid: task for what? [20:42:44] (03PS2) 10Dzahn: revoke demon's key [puppet] - 10https://gerrit.wikimedia.org/r/191688 [20:42:59] juniper cb1 alarm? https://phabricator.wikimedia.org/T89999 [20:43:15] RECOVERY - puppet last run on cp3006 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:44:23] 3ops-esams, operations: esams power capacity issues - https://phabricator.wikimedia.org/T90000#1051108 (10BBlack) 3NEW [20:44:23] (03CR) 10Dzahn: [C: 032] revoke demon's key [puppet] - 10https://gerrit.wikimedia.org/r/191688 (owner: 10Dzahn) [20:45:07] (03CR) 10Hashar: "Not sure what is the debian-glue failure is, unrelated to this change imho." [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/167759 (owner: 10Alexandros Kosiaris) [20:45:14] 3ops-esams, operations: esams power capacity issues - https://phabricator.wikimedia.org/T90000#1051127 (10BBlack) [20:45:15] 3operations: Expand HTTP frontend clusters with new hardware - https://phabricator.wikimedia.org/T86663#973608 (10BBlack) [20:45:15] (03PS1) 10Ottomata: Use spark.yarn.jar instead of SPARK_JAR env var, SPARK_JAR is deprecated [puppet/cdh] - 10https://gerrit.wikimedia.org/r/191691 [20:45:47] andrewbogott: In the meantime, the LDAP issue is getting increasingly worse. 
[20:45:54] (03PS2) 10Ottomata: Use spark.yarn.jar instead of SPARK_JAR env var, SPARK_JAR is deprecated [puppet/cdh] - 10https://gerrit.wikimedia.org/r/191691 [20:45:56] (03CR) 10Hashar: [C: 031] Ignore .gitreview when building source [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/167759 (owner: 10Alexandros Kosiaris) [20:46:08] Coren: LDAP? [20:46:11] (03CR) 10Ottomata: [C: 032] Use spark.yarn.jar instead of SPARK_JAR env var, SPARK_JAR is deprecated [puppet/cdh] - 10https://gerrit.wikimedia.org/r/191691 (owner: 10Ottomata) [20:46:23] Oh, you mean, the new instances not getting ldap entries? [20:46:35] Dang, I haven’t see that for any of my instances and thought it had stopped happening :( [20:46:38] (03PS1) 10Ottomata: Update cdh with spark.yarn.jar fix [puppet] - 10https://gerrit.wikimedia.org/r/191692 [20:46:40] 3ops-eqiad, operations: cr1-eqiad Control Board error - https://phabricator.wikimedia.org/T89999#1051131 (10faidon) That's likely this: http://kb.juniper.net/InfoCenter/index?page=content&id=KB26731. We'll need to either do a: - a RE graceful switchover (traffic affecting, hopefully not much); or - `restart cha... [20:46:42] andrewbogott: I've been able to hand-fix most of what has been messed up, but either it's getting worse or now that I know what to look for I notice more problems. [20:46:45] (03CR) 10jenkins-bot: [V: 04-1] Update cdh with spark.yarn.jar fix [puppet] - 10https://gerrit.wikimedia.org/r/191692 (owner: 10Ottomata) [20:46:47] (03PS2) 10Ottomata: Update cdh with spark.yarn.jar fix [puppet] - 10https://gerrit.wikimedia.org/r/191692 [20:46:50] 3ops-eqiad, operations: cr1-eqiad Control Board error - https://phabricator.wikimedia.org/T89999#1051132 (10faidon) [20:46:59] andrewbogott: Not just instances, new users also end up missing some LDAP entries apparently. 
[20:47:00] (03CR) 10Ottomata: [C: 032] Update cdh with spark.yarn.jar fix [puppet] - 10https://gerrit.wikimedia.org/r/191692 (owner: 10Ottomata) [20:47:01] Coren: so you think if I create five instances, two of them won’t get arecs? [20:47:02] 3ops-eqiad, operations: cr1-eqiad Control Board error - https://phabricator.wikimedia.org/T89999#1051135 (10faidon) p:5Unbreak!>3High [20:47:06] cmjohnson1: updated [20:47:07] (03CR) 10Ottomata: [V: 032] Update cdh with spark.yarn.jar fix [puppet] - 10https://gerrit.wikimedia.org/r/191692 (owner: 10Ottomata) [20:47:25] Coren: ok, if we regard this as two problems rather than one... [20:47:36] ori, if you have a moment for a small restbase tweak: https://gerrit.wikimedia.org/r/#/c/191686/ [20:47:36] andrewbogott: I haven't found a pattern yet - sometimes you can create many instances with no issue, and sometimes 4-5 in a row will fail [20:47:40] Is there a chance that the instance creation issue is fixed? Or have you still been seeing that one this week as well? [20:47:42] ok [20:47:55] I’ll look in OSM and see what’s going on with error handling [20:48:02] 3ops-esams, operations: esams power capacity issues - https://phabricator.wikimedia.org/T90000#1051137 (10BBlack) [20:48:38] andrewbogott: I'm increasingly convinced that your initial hunch/diagnosis was correct and that there's something funky going on with jobs. [20:48:58] I don’t think that the jobqueue is involved with user creation though [20:49:01] !log restart ntp on mw1009 [20:49:04] That would also explain why SMW has been increasingly crappy at updating itself. [20:49:05] Logged the message, Master [20:49:08] paravoid: thanks for the update, yep anything that requires restarting will be up to you :-) no pressure! [20:49:32] andrewbogott: Is setting email synchronous? [20:49:47] Coren: Not sure… I will look after I finish this outage notice [20:49:55] andrewbogott: kk. 
[20:50:28] (03PS3) 10GWicke: Set the cassandra localDc parameter [puppet] - 10https://gerrit.wikimedia.org/r/191686 [20:50:42] andrewbogott: Other things that have been funky is project group membership. I actually had a user granted shell and tool labs access that ended up /not/ being added to project-bastion [20:50:44] PROBLEM - puppet last run on ms-be2011 is CRITICAL: CRITICAL: Puppet has 1 failures [20:51:03] But does have +shell [20:51:30] Hm… I suspect that ldap requests are erroring out and we just aren’t handling the failures properly. [20:51:38] maybe [20:52:11] That'd be two distinct issues then; the error handling is bad but the fails are no better. [20:52:38] I hadn't considered digging into the opendj logs yet though - I was presuming the jobs didn't run (as opposed to quietly failed). I shall do that [20:54:16] 3operations: icinga-admin certificate expires 2015-02-26 - replace or depreciate? - https://phabricator.wikimedia.org/T90002#1051153 (10RobH) 3NEW [20:54:32] Coren: sent you a draft of the outage notice, please review [20:56:02] Coren: you can also turn on logging in OSM to see what wikitech thinks. https://wikitech.wikimedia.org/wiki/Wikitech [20:56:04] RECOVERY - NTP on mc1009 is OK: NTP OK: Offset -0.006601214409 secs [20:56:13] Unfortunately things in the jobqueue do not log [20:56:35] andrewbogott: Yeah, that annoyed me as much as it did you. [20:56:53] I tried to fix it by passing in the auth rather than using it as a global, but… no dice for some reason [20:57:40] andrewbogott: The notice is perfectly cromulent. [20:57:50] ori, godog, akosiaris, paravoid: https://gerrit.wikimedia.org/r/191686 [20:57:51] ok, I will send [20:58:13] /cc ottomata [20:58:19] um… *ensenden [20:58:57] 3operations: icinga-admin certificate expires 2015-02-26 - replace or depreciate? 
- https://phabricator.wikimedia.org/T90002#1051163 (10RobH) a:3RobH I'm going to link in the patchsets to disable this, but I'd like to get feedback from other opsen before implementation. [20:59:40] (03PS1) 10RobH: remove support for icinga-admin.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/191699 [21:01:26] (03PS2) 10RobH: remove support for icinga-admin.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/191699 [21:01:40] (03PS1) 10RobH: remove support for icinga-admin.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/191700 [21:02:23] (03CR) 10RobH: "Please note that when this is merged and run on neon, the existing ssl certificate and key must be shredded manually." [puppet] - 10https://gerrit.wikimedia.org/r/191699 (owner: 10RobH) [21:03:17] 3operations: icinga-admin certificate expires 2015-02-26 - replace or depreciate? - https://phabricator.wikimedia.org/T90002#1051198 (10RobH) Patchsets: https://gerrit.wikimedia.org/r/#/c/191699/ https://gerrit.wikimedia.org/r/#/c/191700/ [21:03:40] (03CR) 10RobH: "Also note the private key must be git rm'd from the private repo." [puppet] - 10https://gerrit.wikimedia.org/r/191699 (owner: 10RobH) [21:07:11] 3ops-eqiad, operations: dysprosium failed idrac - https://phabricator.wikimedia.org/T88129#1051205 (10Cmjohnson) I loaded the utility given my Dell, changed the service tag to the correct one and restarted the server. When i check the racadm getsysinfo the asset tag did not change. A new system board is probab... [21:07:35] RECOVERY - puppet last run on ms-be2011 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [21:08:56] greg-g: Are you around? I need to deploy a Wikidata update [21:11:06] hoo: what is it? [21:12:32] (03CR) 10Ori.livneh: [C: 04-2] "There is needless duplication of handlers. 
If two log groups have the same level requirement or sampling factor, this would create a handl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191259 (https://phabricator.wikimedia.org/T88732) (owner: 10BryanDavis) [21:13:21] greg-g: https://phabricator.wikimedia.org/T89903 then a fix for dispatching of deleted pages (https://gerrit.wikimedia.org/r/191644) and https://gerrit.wikimedia.org/r/191642 [21:13:31] I still need a few minutes to prepare the build etc. [21:14:49] The last of those patches actually just sets a class on an input, the change looks a bit bloated because people decided to fix style issues as well [21:15:01] 3operations: icinga-admin certificate expires 2015-02-26 - replace or depreciate? - https://phabricator.wikimedia.org/T90002#1051225 (10faidon) Yeah, I think that's fine. We might move away from Icinga within the next year entirely, anyway. [21:19:25] (03CR) 10Ori.livneh: [C: 031] Set the cassandra localDc parameter [puppet] - 10https://gerrit.wikimedia.org/r/191686 (owner: 10GWicke) [21:19:35] Coren: I don’t see any evidence that user creation is async. User creation might get logged to some other logfile though, I’m still digging in the code [21:20:02] andrewbogott: Perhaps not creation, but is project membership? [21:20:08] 3ops-eqiad, operations: db1054 MCE errors logged for CPU temperature - https://phabricator.wikimedia.org/T89801#1051250 (10Cmjohnson) a:3Cmjohnson This appears to be hardware related [21:20:33] Oh, that’s in a totally different extension, hang on... [21:20:48] !log cleanly re-initialized prod cassandra cluster after puppet run; picked up local dc from property file [21:20:53] Logged the message, Master [21:21:03] Coren: specifically addition to bastion? 
[21:21:22] Yeah, when doing the +shell thang [21:22:13] hoo: ok, do the needful [21:22:30] Thank you :) [21:23:13] Coren: it’s done via a UserRights hook [21:23:35] I’m not sure when that gets called, maybe there’s a grace period between when shell is checked and that hook gets called [21:23:37] * andrewbogott looks [21:24:31] Hm, no. 'UserRights': After a user's group memberships are changed [21:25:40] meh, it’s all synchronous [21:26:14] Should happen as part of the click on ‘Save user groups’ [21:26:36] So that pretty much guarantees that the issue is LDAP failing and not some flaw in the jobqueue. [21:27:05] yeah [21:27:53] Error: /Stage[main]/Mediawiki::Php/File[/etc/php5/conf.d/fss.ini]/ensure: change from absent to file failed: Could not set 'file' on ensure: No such file or directory [21:27:54] (03PS4) 10Ottomata: Set the cassandra localDc parameter [puppet] - 10https://gerrit.wikimedia.org/r/191686 (owner: 10GWicke) [21:28:15] do we know that one yet (fresh install) [21:28:22] !log restbase now up on all live (3 of 6) prod nodes [21:28:27] Logged the message, Master [21:28:28] _joe_: is it because php5/ is still a package? [21:28:34] *still in the package? [21:28:44] <_joe_> ori: what? [21:28:51] re: mutante's thing above [21:29:05] <_joe_> oh shit, yes [21:29:25] <_joe_> we should move that to /etc/php5/mods-available... [21:29:41] Coren: not all that long ago I moved ldap off of virt1000 and on to a dedicated server. That’s probably the trigger for this issue. [21:30:08] (03CR) 10Ottomata: [C: 032] Set the cassandra localDc parameter [puppet] - 10https://gerrit.wikimedia.org/r/191686 (owner: 10GWicke) [21:30:59] ottomata: one last thing.. I'm wondering about the status of the internal lvs added in https://gerrit.wikimedia.org/r/#/c/190786/5/modules/lvs/manifests/configuration.pp [21:31:44] ah, nm- it uses port 7231 as well [21:31:47] it's working [21:32:40] 24k req/s on three boxes [21:32:49] cool [21:33:14] k, gotta run, ttyl! 
[21:33:54] grr, and of course the default timeout value is undocumented [21:34:35] ottomata: have fun with the monsters! [21:35:46] fyi, if people start complaining about not being able to connect to gerrit/phab/whatever, ask if they're on comcast, a few of us in the bay area are having stupid routing issues right now [21:37:15] greg-g: what kind of issues? [21:37:43] unable to connect to hosts [21:37:50] paravoid: I can't connect to phabricator (and a few other websites) but gerrit and ssh to terbium works fine [21:37:59] * greg-g is vpn'ing into the office now [21:38:27] legoktm: try "ls -lR /" while logged in on terbium [21:38:38] does it scroll fine? [21:38:57] or does the session hang? [21:39:48] paravoid: scrolls fine [21:41:17] (sec, meeting) [21:42:32] 3operations: move mediawiki php config files to /etc/php5/mods-available - https://phabricator.wikimedia.org/T90005#1051361 (10Dzahn) [21:44:37] ok, so it sounds like ecmp failure [21:44:43] can you run traceroutes to both good/bad? [21:46:10] sure [21:47:34] (03CR) 10Dzahn: [C: 04-1] "has been reinstalled but we should fix T90005 before reactivating it" [puppet] - 10https://gerrit.wikimedia.org/r/191679 (https://phabricator.wikimedia.org/T86542) (owner: 10Dzahn) [21:48:18] 3ops-eqiad, operations: mw1062 needs a disk replacement - https://phabricator.wikimedia.org/T86542#1051403 (10Dzahn) >>! In T86542#980487, @Cmjohnson wrote: > Disk replaced...need to reinstall I reinstalled it, signed new puppet cert , salt-key, ran puppet but then ran into T90005 [21:50:14] PROBLEM - Cassandra database on restbase1003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [21:52:00] ori: wtf [21:52:06] you realized I'm about to deploy? 
[21:52:24] RECOVERY - Cassandra database on restbase1003 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [21:52:40] hoo: i'm not deploying yet; these are ext repo updates, not core submodule updates [21:52:50] oh, I'm stupid [21:52:55] ye [21:52:56] s [21:52:56] (03PS1) 10GWicke: Update jamm dependency path for cassandra 2.1.3 [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/191778 [21:54:01] paravoid: http://fpaste.org/187852/24382821/raw/ [21:54:30] ori: ^^ [21:56:21] legoktm: comcast internal crap, I think [21:56:26] :/ [21:56:37] hard to say, possibly could be GTT [21:57:20] !log hoo Synchronized php-1.25wmf17/extensions/Wikidata/: Update Wikibase to fix langlink updates in the client API et al (duration: 00m 14s) [21:57:26] Logged the message, Master [21:57:43] !log hoo Synchronized php-1.25wmf18/extensions/Wikidata/: Update Wikibase to fix langlink updates in the client API et al (duration: 00m 12s) [21:57:45] do you have mtr installed? [21:57:46] Logged the message, Master [21:57:53] if so, try running mtr to those two hosts [21:58:03] and leave it a bit until all possible paths appear and it stabilizes [21:58:06] for both [21:59:22] no, installing it now... [21:59:35] if it's debian/ubuntu, install mtr-tiny [21:59:37] easier :) [22:00:52] I'm on fedora/osx :P [22:01:37] (03PS1) 10Chad: New pubkey for myself [puppet] - 10https://gerrit.wikimedia.org/r/191782 [22:02:02] he-1-5-0-0-cr01.sanjose.ca.ibone.comcast.net has 99.1% loss [22:02:34] that's probably it, but i can double check if you want [22:02:40] That could explain my various issues today too [22:02:58] http://www.reddit.com/r/IAmA/comments/103a3s/i_work_for_comcast_and_it_is_ruining_my_life_ama/ [22:03:23] Ok, looks like we're all good :) [22:03:26] * hoo is done [22:03:36] do the main sites work for you? are you getting ulsfo from DNS? does it work for ulsfo? 
[22:04:34] (03PS4) 10BBlack: varnish+jessie filesystem stuff [puppet] - 10https://gerrit.wikimedia.org/r/190610 [22:04:58] paravoid: I'm not having trouble between myself and WMF, just myself and Linode. [22:05:00] But I'm still pretty sure it's Comcast shenanigans [22:05:09] it's shenanigans with them all the way down [22:06:02] enwp, mw.o work, I can't get to meta or wikidata (logged in on all of them) [22:06:20] I also can't get to the comcast website to figure out who to complain to [22:06:35] legoktm: traceroute to enwp, does it go to ulsfo or eqiad? [22:06:49] (03PS2) 10Rush: Update jamm dependency path for cassandra 2.1.3 [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/191778 (owner: 10GWicke) [22:08:26] legoktm, that's great as that means they can't disregard you with "if other pages work then that website must be down" [22:10:04] PROBLEM - Cassandra database on restbase1004 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [22:10:21] paravoid: traceroute isn't making it that far :/, mtr says ulsfo [22:10:34] RECOVERY - puppet last run on mw1062 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [22:10:53] via GTT? [22:11:15] yup [22:12:23] RECOVERY - Cassandra database on restbase1004 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [22:14:53] (03CR) 10Dzahn: [C: 032] "https://office.wikimedia.org/w/index.php?title=User:^demon/new-public-key&oldid=131948" [puppet] - 10https://gerrit.wikimedia.org/r/191782 (owner: 10Chad) [22:15:15] woot, got my AT&T connection to work [22:17:55] 3operations: move mediawiki php config files to /etc/php5/mods-available - https://phabricator.wikimedia.org/T90005#1051551 (10Dzahn) tmp. fix on mw1062 was just to "mkdir /etc/php5/conf.d" and then run puppet again. [22:21:19] deployers, how do you deploy to just a single appserver [22:21:51] change the dsh group? 
[22:24:26] !next [22:25:24] mutante, https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Trying_tin.27s_code_on_testwiki [22:25:34] (03CR) 10Rush: [C: 032] Update jamm dependency path for cassandra 2.1.3 [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/191778 (owner: 10GWicke) [22:26:20] Krenair: thanks. checking [22:26:22] (03CR) 10Rush: [V: 032] Update jamm dependency path for cassandra 2.1.3 [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/191778 (owner: 10GWicke) [22:27:33] Krenair: it's slightly different, i don't want to test a new thing, i just need to sync an empty server to what everybody else has. sync-common: command not found so far [22:28:32] mutante, nothing at /usr/local/bin/sync-common ? [22:29:13] Krenair: no [22:29:26] apache-status apt2xml cgroup-mediawiki-clean check-raid.py furl hhvmadm phaste [22:29:29] just that [22:29:55] weird [22:30:13] it doesn't look like that file is referenced in puppet.git [22:30:28] but it definitely exists on mw1017 [22:30:59] and a different place on mw1016. interesting. [22:31:09] (03PS1) 10GWicke: Update cassandra submodule [puppet] - 10https://gerrit.wikimedia.org/r/191790 [22:31:32] mutante, what about /srv/deployment/scap/scap/bin/sync-common ? [22:32:06] Krenair: yes :) and that worked, it was just not in path [22:32:08] (03CR) 10Rush: [C: 032 V: 032] Update cassandra submodule [puppet] - 10https://gerrit.wikimedia.org/r/191790 (owner: 10GWicke) [22:32:16] wo "which sync-common" wouldn't find [22:32:24] 22:31:49 Copying to mw1062.eqiad.wmnet from tin.eqiad.wmnet [22:32:52] I wonder why I can't find it in puppet, it's in a different place on mw1017, and it's not in the path on a new app server [22:35:09] hoo: I got the go-ahead from greg-g to go after you, so let me know when you're done. [22:35:44] ori: I finished half an hour ago ;) [22:36:01] KK, thanks. 
[22:37:18] bd808, please see above [22:39:24] comcast is going to call me in 24 hours...yippee [22:40:17] legoktm, that's an expensive timer [22:40:35] :P [22:41:35] (03PS3) 10Dzahn: Revert "remove mw1062 from dsh groups - read-only fs" [puppet] - 10https://gerrit.wikimedia.org/r/191679 (https://phabricator.wikimedia.org/T86542) [22:42:57] (03CR) 10Dzahn: [C: 032] "work-around is mkdir /etc/php5/conf.d and running puppet again. ran sync-common to get it up-to-date" [puppet] - 10https://gerrit.wikimedia.org/r/191679 (https://phabricator.wikimedia.org/T86542) (owner: 10Dzahn) [22:46:29] 3Labs, ops-eqiad, operations: virt1002 broken disk? - https://phabricator.wikimedia.org/T88923#1051614 (10Jgreen) As has been discussed elsewhere, check-raid.py only checks the first RAID variant it finds, in this case it's reporting mdadm status. However even if it were making it throught to the mpt check, it... [23:05:12] !log ori Synchronized php-1.25wmf18/extensions/VisualEditor: (no message) (duration: 00m 06s) [23:05:19] !log ori Synchronized php-1.25wmf18/extensions/WikimediaEvents: (no message) (duration: 00m 07s) [23:05:19] Logged the message, Master [23:05:21] Logged the message, Master [23:06:05] !log ori Synchronized php-1.25wmf17/extensions/VisualEditor: (no message) (duration: 00m 06s) [23:06:09] Logged the message, Master [23:06:12] !log ori Synchronized php-1.25wmf17/extensions/WikimediaEvents: (no message) (duration: 00m 06s) [23:06:14] ori: what if i have an appserver where HHVM rendering is HTTP WARNING: HTTP/1.1 404, should i better not activate it then? it is "just" a WARN though? [23:06:14] Logged the message, Master [23:10:46] mutante: just a sec [23:12:26] <_joe_> mutante: restart apache [23:16:34] RECOVERY - HHVM rendering on mw1062 is OK: HTTP OK: HTTP/1.1 200 OK - 67079 bytes in 0.497 second response time [23:19:20] _joe_: worked, :) [23:20:19] <_joe_> mutante: did you had puppet run at least twice? 
[23:20:31] <_joe_> open a bug for the cron.d thing [23:20:37] <_joe_> conf.d sorry [23:20:43] _joe_: yea, i opened a bug for that already [23:20:51] <_joe_> assign it to me [23:20:56] <_joe_> I'll do it tomorrow [23:20:56] worked around it with mkdir /etc/php5/conf.d/ [23:21:03] ran puppet twice until it was fine [23:21:07] added back to dsh groups [23:21:10] synced [23:21:24] _joe_: ok, assigning [23:21:24] <_joe_> that was rough :) [23:21:32] <_joe_> ok, night! [23:21:40] good night [23:22:07] 3operations: move mediawiki php config files to /etc/php5/mods-available - https://phabricator.wikimedia.org/T90005#1051726 (10Dzahn) a:3Joe [23:25:43] PROBLEM - puppet last run on ruthenium is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [23:34:49] ori: can you connect to bast1001? [23:35:05] RECOVERY - mediawiki-installation DSH group on mw1062 is OK: OK [23:35:29] (03PS1) 10Legoktm: Temporarily remove 'm' from metawiki's $wgLocalInterwikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/191812 (https://phabricator.wikimedia.org/T89916) [23:35:30] AaronS: from the cluster or from elsewhere? I use iron, the roots bastion, so I'm not supposed to log in to bast1001 from outside the cluster. [23:35:59] ah, nvm then [23:36:32] AaronS: comcast is having connection issues [23:36:53] I can connect to gerrit, all sorts of stuff, but not that [23:37:11] though technically it's not the connect phase, but further along the protocol [23:37:25] can you get to phabricator? I can't from comcast [23:37:26] t'was working last night [23:37:44] internet has been fine for me [23:39:10] debug1: expecting SSH2_MSG_KEX_ECDH_REPLY [23:39:11] Connection closed by 208.80.154.149 [23:40:33] o.O [23:43:06] putty doesn't work either [23:45:48] _joe_: is there anything obvious going on with ssh on bast1001? 
[23:46:19] there are some suggestions here: https://bugs.launchpad.net/ubuntu/+source/openssh/+bug/1254085 [23:48:04] heh, I already tried mtu 1500=>1200 [23:48:20] and it worked last night at 3AM, odd that it would break now [23:49:01] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1051809 (10Chmarkine) [23:53:34] (03PS1) 10BryanDavis: Add missing colon on syslogtag for hhvm-fatal logging [puppet] - 10https://gerrit.wikimedia.org/r/191815 [23:53:41] ori: ^ [23:55:26] (03PS2) 10Ori.livneh: Add missing colon on syslogtag for hhvm-fatal logging [puppet] - 10https://gerrit.wikimedia.org/r/191815 (owner: 10BryanDavis) [23:55:39] (03CR) 10Ori.livneh: [C: 032 V: 032] Add missing colon on syslogtag for hhvm-fatal logging [puppet] - 10https://gerrit.wikimedia.org/r/191815 (owner: 10BryanDavis) [23:55:51] bd808: thanks [23:59:01] AaronS: for me (using ATT DSL) ssh via bastion is often *really* slow [23:59:09] as in 10s latency