[00:30:22] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: puppet fail
[00:49:32] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[01:20:02] I wonder what's causing those "Lost connection to MySQL server during query" exceptions
[01:40:37] (CR) Alex Monk: "Is this supposed to be scheduled for the SWAT window on the 17th?" [mediawiki-config] - https://gerrit.wikimedia.org/r/190450 (https://phabricator.wikimedia.org/T89450) (owner: KartikMistry)
[01:48:28] (CR) Alex Monk: "When shall we do this? 1 week after the notification to Project:Current_issues would be Wednesday morning SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/187183 (https://phabricator.wikimedia.org/T87797) (owner: Florianschmidtwelzow)
[01:58:12] (PS3) Alex Monk: Set $wgBabelCategoryNames true at outreachwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/190686 (https://phabricator.wikimedia.org/T89484) (owner: Gerardduenas)
[02:02:09] (PS3) Alex Monk: Create 'autopatrolled' user group on maiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/190721 (https://phabricator.wikimedia.org/T89346) (owner: Gerardduenas)
[02:03:55] (PS2) Alex Monk: Enable UploadWizard on idwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/190744 (https://phabricator.wikimedia.org/T88918) (owner: Gerardduenas)
[02:13:10] PROBLEM - haproxy failover on dbproxy1002 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[02:14:25] !log l10nupdate Synchronized php-1.25wmf16/cache/l10n: (no message) (duration: 00m 02s)
[02:14:34] Logged the message, Master
[02:15:32] !log LocalisationUpdate completed (1.25wmf16) at 2015-02-16 02:14:29+00:00
[02:15:36] Logged the message, Master
[02:16:20] RECOVERY - haproxy failover on dbproxy1002 is OK: OK check_failover servers up 2 down 0
[02:17:05] !log db1046 restart, table maintenance
[02:17:08] Logged the message, Master
[02:18:21] (PS4)
Alex Monk: Set $wgBabelCategoryNames true at outreachwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/190686 (https://phabricator.wikimedia.org/T89484) (owner: Gerardduenas)
[02:28:35] !log l10nupdate Synchronized php-1.25wmf17/cache/l10n: (no message) (duration: 00m 01s)
[02:28:41] Logged the message, Master
[02:29:42] !log LocalisationUpdate completed (1.25wmf17) at 2015-02-16 02:28:39+00:00
[02:29:45] Logged the message, Master
[02:47:16] (PS1) Springle: Switch db1046 to a dedicated eventlogging role, soon to be a master. [puppet] - https://gerrit.wikimedia.org/r/190763
[02:49:28] (PS2) Springle: Switch db1046 to a dedicated eventlogging role, soon to be a master. [puppet] - https://gerrit.wikimedia.org/r/190763
[02:52:04] (PS3) Springle: Switch db1046 to a dedicated eventlogging role, soon to be a master. [puppet] - https://gerrit.wikimedia.org/r/190763
[02:52:54] (PS4) Springle: Switch db1046 to a dedicated eventlogging role, soon to be a master. [puppet] - https://gerrit.wikimedia.org/r/190763
[02:54:02] (PS5) Springle: Switch db1046 to a dedicated eventlogging role, soon to be a master. [puppet] - https://gerrit.wikimedia.org/r/190763
[02:55:01] (PS6) Springle: Switch db1046 to a dedicated eventlogging role, soon to be a master. [puppet] - https://gerrit.wikimedia.org/r/190763
[02:55:33] (PS7) Springle: Switch db1046 to a dedicated eventlogging role, soon to be a master. [puppet] - https://gerrit.wikimedia.org/r/190763
[02:56:25] (CR) Springle: [C: 2] Switch db1046 to a dedicated eventlogging role, soon to be a master.
[puppet] - https://gerrit.wikimedia.org/r/190763 (owner: Springle)
[03:00:16] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00333333333333
[03:05:16] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[03:20:06] PROBLEM - puppet last run on amssq45 is CRITICAL: CRITICAL: puppet fail
[03:28:28] (CR) KartikMistry: "Yes. Added in SWAT list." [mediawiki-config] - https://gerrit.wikimedia.org/r/190450 (https://phabricator.wikimedia.org/T89450) (owner: KartikMistry)
[03:29:32] (CR) Santhosh: [C: 1] CX: Publishing to Main namespace for idwiki and ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/190450 (https://phabricator.wikimedia.org/T89450) (owner: KartikMistry)
[03:39:18] RECOVERY - puppet last run on amssq45 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[04:19:06] (PS1) Springle: depool db1065 T88084 [mediawiki-config] - https://gerrit.wikimedia.org/r/190768
[04:19:33] (CR) Springle: [C: 2] depool db1065 T88084 [mediawiki-config] - https://gerrit.wikimedia.org/r/190768 (owner: Springle)
[04:19:38] (Merged) jenkins-bot: depool db1065 T88084 [mediawiki-config] - https://gerrit.wikimedia.org/r/190768 (owner: Springle)
[04:20:01] !log springle Synchronized wmf-config/db-eqiad.php: depool db1065 (duration: 00m 06s)
[04:20:08] Logged the message, Master
[04:26:21] Blocked-on-Operations, Ops-Access-Requests, RESTBase, and 1 other: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1040747 (GWicke)
[04:28:05] legoktm: ^ unusable yellow is unusable
[04:28:43] looks great on black
[04:28:59] white on black terminals for life
[04:29:44] :P
[04:30:52] I like it because it's kind of the "yo dudes, pay attention to this" project/tag and yellow on black is good for that
[04:42:41] Yuvi|Vacation: you just need a theme that maps the mIRC color codes
to something that doesn't suck -- https://github.com/bd808/Textual-Theme-bd808/blob/master/src/styles/design.less#L307-L341
[04:45:24] Yuvi|Vacation: also, you need to care less about your irc colors while *on vacation* ;)
[04:45:39] I went for a week without using my computer!
[04:45:47] brave lad
[04:45:54] but I have one more week left...
[04:46:02] and I won’t have much internet for this week.
[04:51:14] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Feb 16 04:50:10 UTC 2015 (duration 50m 9s)
[04:51:21] Logged the message, Master
[05:22:55] (PS1) Springle: Eventlogging backfilling consistency checks. [software] - https://gerrit.wikimedia.org/r/190772
[05:23:31] (CR) Springle: [C: 2] Eventlogging backfilling consistency checks. [software] - https://gerrit.wikimedia.org/r/190772 (owner: Springle)
[05:26:11] (PS1) KartikMistry: Content Translation: Use Beta Parsoid for Beta [puppet] - https://gerrit.wikimedia.org/r/190773
[05:34:43] springle: working today?
[05:35:03] oh, wait, nevermind, this is a much simpler problem than I thought :)
[05:35:54] :)
[05:35:55] (thought I was seeing db error but it’s just a full disk)
[05:36:18] heh.. well DB hates those. if it's silver or virt1000 i'll run away now
[05:36:29] wikitech-static
[05:37:00] Worst case I can just start over since it’s just an image of silver.
[05:38:53] springle: every day wikitech-static does an import of a dump that’s 99% identical to the one from the day before. I am… hoping… that it doesn’t store a daily duplicate of every wiki page.
[05:40:52] andrewbogott: it dumps the DB and reloads it daily?
[05:41:18] let me make sure that’s actually true...
[05:42:00] silver does: /usr/local/bin/mwscript maintenance/dumpBackup.php labswiki --full --uploads
[05:42:02] And -static does:
[05:42:27] php maintenance/importDump.php
[05:42:30] So, that looks like ‘yes’
[05:43:15] # ls -ltrah /var/lib/mysql/ibdata1
[05:43:16] -rw-rw---- 1 mysql mysql 14G Feb 16 05:43 /var/lib/mysql/ibdata1
[05:43:19] (from -static)
[05:43:45] making it replicate 24h behind would be easier i guess
[05:44:08] Looks like it isn’t gobbling disk space, so the import must do something sensible about duplicates.
[05:44:36] Replicating would be nice, although I don’t have any intuition about if that’s more fragile than a big wget.
[05:44:43] Also keep in mind that -static is… elsehwere.
[05:44:47] *elsewhere
[05:44:51] hosted by rackspace
[05:45:29] we replicated production over an ssh tunnel to toolserver for years :) wikitech is tiny in comparison
[05:45:59] but just $0.02
[05:46:22] The other question is how resilient that is if silver goes haywire. I guess if replication breaks it won’t break mysql entirely...
[05:46:27] shall I make you a phab task? :)
[05:46:51] if you think it will solve something, sure
[05:47:39] replication breaking wouldn't affect silver, providing the tunnel was configured to reattempt connection
[05:47:43] (CR) KartikMistry: [C: -1] "Not really. CX can't load articles while using Parsoid from Beta." [puppet] - https://gerrit.wikimedia.org/r/190773 (owner: KartikMistry)
[05:48:05] ok, I’ll assess the disk space situation once I fix this… obvious mistake I made that filled up the disk.
[05:48:06] and if silver went haywire we would be dumping/reloading anyway, which is no different to now :)
[05:48:13] sounds good
[05:50:18] The dump file is 3g, and -static has 8g available. Not a ton of room but the status quo will keep working for a while yet…
[05:53:18] oh, hm, I didn’t count images.
This is going to get very tight
[06:28:47] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:28:57] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:27] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:47] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:38] springle: I'm around, if you need me to do anything with EL
[06:34:27] !log messing with phab boolean fulltext syntax T89274
[06:34:30] ori: tnx
[06:34:33] Logged the message, Master
[06:45:56] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:46:37] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[06:46:44] (PS1) Springle: phabricator using mysql fulltext T89274, tweaked for mariadb/aria [puppet] - https://gerrit.wikimedia.org/r/190775
[06:46:47] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures
[06:47:06] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:11:18] <_joe_> morning
[07:12:07] <_joe_> ori: did you ever took a look at what's broken in our grafana installation?
[07:12:55] s/took/take. I think the ElasticSearch backend (which is the Logstash cluster) may be down?
[07:12:59] Why, what are you seeing?
[07:13:51] <_joe_> ori: no just the fact that I can't add graphs to new dashboards from scratch
[07:14:17] I never really figured that out, I always started by modifying the default dashboard
[07:14:22] <_joe_> "Error: a.datasource.query is not a function" and all that
[07:14:27] but I thought that was just me being dumb
[07:14:31] <_joe_> ori: k thanks
[07:14:43] <_joe_> oh I never really looked into it as well :)
[07:15:00] could be a function of the commit we're on?
I had us track HEAD for a while because the latest release wasn't usable
[07:16:18] <_joe_> maybe
[07:26:24] (PS2) Springle: phabricator using mysql fulltext T89274, tweaked for mariadb/aria [puppet] - https://gerrit.wikimedia.org/r/190775
[07:29:37] (PS3) Springle: phabricator using mysql fulltext T89274, tweaked for mariadb/aria [puppet] - https://gerrit.wikimedia.org/r/190775
[07:33:28] operations, Phabricator, and 1 other: Mysql search issues flagged by Phabricator setup - https://phabricator.wikimedia.org/T89274#1040891 (Springle) The ft_boolean_syntax fix for default AND behavior has been applied as it doesn't technically need a DB restart. The ft_min_word_len, stopwords, and table...
[07:34:33] * _joe_ is having a slow start of the morning
[07:38:27] (PS5) Giuseppe Lavagetto: base: add the service_unit init wrapper [puppet] - https://gerrit.wikimedia.org/r/189753
[07:40:55] operations, Phabricator, and 1 other: Mysql search issues flagged by Phabricator setup - https://phabricator.wikimedia.org/T89274#1040892 (Springle) >>! In T89274#1034896, @thiemowmde wrote: > Question: Is the fact that the search ignores my attempts to type `AND` a bug I should report? The technical...
[07:41:54] _joe_: sounds like a normal morning to me :)
[07:42:37] 1/ wake up. 2/ connect coffee drip. 3/ ... 4/ be useful
[07:44:43] <_joe_> springle: I'm at 3/
[07:45:01] :D
[07:45:50] <_joe_> usually some hard-paced punk-rock helps me wake up. I'll try that
[07:47:31] _joe_: to fulfill a request from translatewiki folks, who want to know the most popular message keys, i'm going to run tcpdump on two app servers (one regular, one api), grepping outgoing tcp dst port 11211 traffic for gets and redirecting the output into a file. An hour of this should be enough, and I'll !log the hostnames I run it on. Is that cool with you?
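The post-processing step for the tcpdump idea above could look roughly like the sketch below. It is not what was run on the app servers. It assumes the capture has already been reduced to raw payload lines and that the clients speak the memcached text protocol, where a retrieval is a line of the form `get <key> [<key> ...]`. The function name `count_get_keys` and the sample keys are hypothetical.

```python
from collections import Counter


def count_get_keys(lines):
    """Count how often each key appears in memcached text-protocol get/gets commands.

    `lines` is an iterable of bytes, one captured payload line each (assumption:
    the tcpdump output was already reduced to payload lines).
    """
    counts = Counter()
    for raw in lines:
        parts = raw.strip().split()
        # A retrieval command is "get" or "gets" followed by one or more keys.
        if len(parts) >= 2 and parts[0] in (b"get", b"gets"):
            for key in parts[1:]:
                counts[key.decode("utf-8", "replace")] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample traffic; real key names will differ.
    sample = [
        b"get examplewiki:messages:en\r\n",
        b"get examplewiki:messages:en examplewiki:messages:de\r\n",
        b"set foo 0 0 3\r\n",
        b"bar\r\n",
    ]
    for key, hits in count_get_keys(sample).most_common(10):
        print(hits, key)
```

One limitation of this approach: a stored value that happens to contain a line starting with "get " would be miscounted, so the result is only an approximation of key popularity.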
[07:48:11] <_joe_> ori: it's cool, it seems a hard way to find this out though :)
[07:48:20] <_joe_> just !log it btw
[07:48:46] maybe i'll just locally hack a wfDebug() statement into two machines
[07:49:07] how would you do it?
[07:49:58] <_joe_> lemme check which version of memcached are we running
[07:51:28] <_joe_> mmmmh my memory is failing me
[07:52:00] I think I know how to do this
[07:52:20] <_joe_> yeah tcpdump still seems like a lot of work for you in post-processing; I'd hack a way to keep counters of keys
[07:52:21] if ( hostname === 'A' or hostname === 'B' ) $wgHooks['MessageCache::get'][] = function ( &$key ) { log_the_key(); }
[07:52:31] <_joe_> yeah that's a good alternative
[07:52:33] i can commit that to wmf-config and then revert
[07:52:41] that way it's really clean
[07:53:01] <_joe_> I think it's going to make your life easier
[07:53:27] nod
[07:54:39] https://github.com/etsy/mctop/ seems interesting
[07:54:49] (never seen it before .. just noting)
[07:56:14] <_joe_> springle: ok, but it doesn't report hot keys
[07:56:37] <_joe_> (that's what I looked up, that and memcached-top, I kinda remembered they had such a function)
[07:57:17] _joe_: mmm yeah wondering about that. the original etsy blog report was specifically about looking for hot keys https://codeascraft.com/2012/12/13/mctop-a-tool-for-analyzing-memcache-get-traffic/
[07:57:27] <_joe_> springle: yeah, you're right
[07:57:31] but maybe it got off track in the excitement of OSS-ing a cool new tool
[07:57:31] <_joe_> d'oh
[07:57:41] <_joe_> springle: no no reading the docs now again
[07:57:44] <_joe_> so, ori
[07:57:59] do we use text or binary MC protocol?
[07:58:14] text iirc
[07:58:21] i don't think twemproxy supports the binary protocol
[07:58:38] <_joe_> and, mctop is not in ubuntu
[07:58:52] _joe_: morning, who's on duty this week ?
[07:59:27] <_joe_> but if this is not just a one off and we may need it again, it could be a good idea to pack it
[08:00:04] Dear anthropoid, the time has come. Please deploy US Holiday (Presidents' Day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150216T0800).
[08:00:11] <_joe_> ??
[08:00:21] <_joe_> oh it's midnight in the US
[08:00:25] nice! :)
[08:00:39] <_joe_> matanya: I don't know off-hand, is it urgent?
[08:01:09] not at all, just interested in getting some reviews for my patches
[08:01:26] <_joe_> mh I don't think that counts as oncall duty
[08:01:52] so who takes that responsibility ?
[08:01:56] <_joe_> !log repooling mw1018
[08:02:03] Logged the message, Master
[08:09:22] (PS1) KartikMistry: CX: Do not use $wmgParsoidURL for Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/190776 (https://phabricator.wikimedia.org/T89558)
[08:23:18] (CR) Nikerabbit: [C: -1] CX: Do not use $wmgParsoidURL for Beta (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/190776 (https://phabricator.wikimedia.org/T89558) (owner: KartikMistry)
[08:37:46] (PS2) KartikMistry: CX: Do not use $wmgParsoidURL from Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/190776 (https://phabricator.wikimedia.org/T89558)
[08:38:20] (PS3) KartikMistry: CX: Do not use $wmgParsoidURL from Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/190776 (https://phabricator.wikimedia.org/T89558)
[08:41:17] (PS1) Ori.livneh: Temporarily log message key lookups on four app servers [mediawiki-config] - https://gerrit.wikimedia.org/r/190777
[08:47:26] (Abandoned) KartikMistry: Content Translation: Use Beta Parsoid for Beta [puppet] - https://gerrit.wikimedia.org/r/190773 (owner: KartikMistry)
[08:47:41] greetings
[08:52:40] (CR) Filippo Giunchedi: trebuchet: use salt to check on salt-minion (1 comment) [puppet] - https://gerrit.wikimedia.org/r/190501 (owner: Ori.livneh)
[08:54:16] (CR) Nikerabbit:
[C: 1] "Code looks good to me. I am unable to say about the volume of data this logs, but probably quite a lot." [mediawiki-config] - https://gerrit.wikimedia.org/r/190777 (owner: Ori.livneh)
[08:56:14] (PS2) Nikerabbit: Temporarily log message key lookups on four app servers [mediawiki-config] - https://gerrit.wikimedia.org/r/190777 (https://phabricator.wikimedia.org/T65416) (owner: Ori.livneh)
[09:04:59] (PS3) Nemo bis: Temporarily log message key lookups on four app servers [mediawiki-config] - https://gerrit.wikimedia.org/r/190777 (https://phabricator.wikimedia.org/T65416) (owner: Ori.livneh)
[09:05:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0]
[09:06:02] (PS4) Nemo bis: Temporarily log message key lookups on four app servers [mediawiki-config] - https://gerrit.wikimedia.org/r/190777 (https://phabricator.wikimedia.org/T65416) (owner: Ori.livneh)
[09:17:23] operations, Ops-Access-Requests, RESTBase, and 2 others: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1040965 (fgiunchedi) I think it makes sense for the services team to be able to debug this as root, yet we need auditing/logging/etc for system-level operations....
[09:18:17] new colors?
[09:18:24] <_joe_> seems so
[09:18:42] <_joe_> much less readable with my color scheme
[09:18:43] my eyes! my eyes!
[09:18:53] same here
[09:19:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:19:09] yep, i already had to disable colors on my side
[09:19:21] <_joe_> ouch trying to run rspec on precise is revealing to be hard
[09:19:22] yeah blue on dark background is no go
[09:20:03] who could have guessed that anyone still uses text mode?
[09:20:36] paravoid, _joe_, MaxSem: FYI https://phabricator.wikimedia.org/T89632
[09:27:01] thanks Nikerabbit
[09:34:17] operations, RESTBase, and 2 others: Detailed cassandra monitoring - https://phabricator.wikimedia.org/T78514#1040978 (fgiunchedi) yep I will give it a try on the test cluster today or tomorrow at the latest
[09:40:23] operations: Rolling restart for Elasticsearch to pick up new version of wikimedia-extra plugin - https://phabricator.wikimedia.org/T86602#1040991 (fgiunchedi) a:fgiunchedi sorry for the lack of updates, this is on me to do the rolling restart (I was looking at T88354 instead) we are 1/3 in, the last fe...
[09:55:36] (PS6) Giuseppe Lavagetto: base: add the service_unit init wrapper [puppet] - https://gerrit.wikimedia.org/r/189753
[09:57:10] (CR) Giuseppe Lavagetto: [C: 2] base: add the service_unit init wrapper [puppet] - https://gerrit.wikimedia.org/r/189753 (owner: Giuseppe Lavagetto)
[10:00:16] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667
[10:02:37] operations, and 1 other: Setup memcached cluster in codfw - https://phabricator.wikimedia.org/T86888#1041001 (Joe)
[10:02:39] operations, and 1 other: Create a service_unit custom type for puppet that supports systemd - https://phabricator.wikimedia.org/T89086#1041000 (Joe) Open>Resolved
[10:03:20] !log resume elasticsearch rolling restart - elastic1012 -> elastic1022 in turn
[10:03:27] Logged the message, Master
[10:06:44] (CR) Alexandros Kosiaris: [C: 2] interface: fix optional parameter listed before required parameter [puppet] - https://gerrit.wikimedia.org/r/190669 (owner: Matanya)
[10:07:13] (CR) Alexandros Kosiaris: [C: 2] eventlogging: fix file mode [puppet] - https://gerrit.wikimedia.org/r/190671 (owner: Matanya)
[10:07:54] (CR) Alexandros Kosiaris: [C: 2] webserver: fix string containing only a variable [puppet] -
https://gerrit.wikimedia.org/r/190667 (owner: Matanya)
[10:08:16] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:08:33] (CR) Alexandros Kosiaris: [C: 2] localssl: fix optional parameter listed before required parameter [puppet] - https://gerrit.wikimedia.org/r/190666 (owner: Matanya)
[10:08:58] (CR) Alexandros Kosiaris: [C: 2] wikimetrics: 4 digit file mode [puppet] - https://gerrit.wikimedia.org/r/190193 (owner: Matanya)
[10:12:28] (Restored) Alexandros Kosiaris: fix all 'variable not enclosed by {}' [puppet] - https://gerrit.wikimedia.org/r/189898 (owner: Dzahn)
[10:12:34] (PS4) Alexandros Kosiaris: fix all 'variable not enclosed by {}' [puppet] - https://gerrit.wikimedia.org/r/189898 (owner: Dzahn)
[10:12:40] operations, RESTBase, Services: setup LVS for restbase eqiad servers - https://phabricator.wikimedia.org/T89636#1041017 (fgiunchedi) NEW a:fgiunchedi
[10:13:53] operations, RESTBase, and 2 others: Public entry point for RESTBase - https://phabricator.wikimedia.org/T78194#1041025 (fgiunchedi) we'd need to point parsoid varnishes to `restbase.svc.eqiad.wmnet` too, part of this task I think
[10:14:03] operations, RESTBase, Services: setup LVS for restbase eqiad servers - https://phabricator.wikimedia.org/T89636#1041026 (fgiunchedi)
[10:15:26] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0
[10:17:17] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 10 unmerged changes in puppet (dir /var/lib/git/operations/puppet).
[10:17:29] <_joe_> wat?
[10:17:34] <_joe_> There are 10 unmerged changes
[10:17:43] lol
[10:17:50] 5 to be precise
[10:18:02] 5 are the merges and 5 the actual ones
[10:18:05] (PS1) Nemo bis: $wgTranslateBlacklist: "en" conditional to the wiki being in English [mediawiki-config] - https://gerrit.wikimedia.org/r/190783
[10:18:07] <_joe_> ohhh ok
[10:18:11] (CR) jenkins-bot: [V: -1] $wgTranslateBlacklist: "en" conditional to the wiki being in English [mediawiki-config] - https://gerrit.wikimedia.org/r/190783 (owner: Nemo bis)
[10:18:14] it was all the lints I merged from matanya
[10:18:17] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[10:18:18] matanya: thanks btw!
[10:18:28] and of course strontium and puppet-merge :-(
[10:18:37] thank you for the review and merge! :)
[10:20:11] <_joe_> too much niceness
[10:20:55] (CR) Alexandros Kosiaris: [C: 2] "In general, I support what Antoine says. Small commits, per module that can be reviewed/debugged/reverted way more easily than one huge co" [puppet] - https://gerrit.wikimedia.org/r/189898 (owner: Dzahn)
[10:21:44] let's see what that might break
[10:21:50] <_joe_> akosiaris: I wouldn't have merged that
[10:21:54] <_joe_> even if it's correct
[10:22:02] (PS2) Nemo bis: $wgTranslateBlacklist: "en" conditional to the wiki being in English [mediawiki-config] - https://gerrit.wikimedia.org/r/190783
[10:22:07] <_joe_> we should /never/ do such changes if not strictly needed
[10:22:14] linting ?
[10:22:20] of the scope of it ?
[10:22:22] or*
[10:22:25] <_joe_> the scope
[10:22:34] yeah I agree
[10:22:56] I just decided that Daniel's time can probably be better spent that splitting that into like 20 commits
[10:23:02] than*
[10:23:09] <_joe_> I agree
[10:23:11] that is very considerate
[10:23:25] but as I've said in the commit time, it's one time off
[10:23:37] I won't be merging another one like that :-)
[10:26:37] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[10:30:12] (PS1) Filippo Giunchedi: restbase: allocate LVS service ip [dns] - https://gerrit.wikimedia.org/r/190784 (https://phabricator.wikimedia.org/T89636)
[10:31:05] (PS3) Filippo Giunchedi: es-tool: output cluster status during fast-restart [puppet] - https://gerrit.wikimedia.org/r/190475
[10:31:14] (CR) Filippo Giunchedi: [C: 2 V: 2] es-tool: output cluster status during fast-restart [puppet] - https://gerrit.wikimedia.org/r/190475 (owner: Filippo Giunchedi)
[10:34:45] operations, Phabricator, and 1 other: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#1041070 (Aklapper) >>! In T87467#992259, @mmodell wrote: > https://gerrit.wikimedia.org/r/#/c/186533/ Patch (3 lines changed) sti...
[10:39:12] (PS1) Filippo Giunchedi: restbase: add LVS configuration [puppet] - https://gerrit.wikimedia.org/r/190786 (https://phabricator.wikimedia.org/T89636)
[10:39:32] operations, and 1 other: Scap error on mw1111: "Error reading response length from authentication socket." - https://phabricator.wikimedia.org/T86545#1041078 (Aklapper) > Please update this task if you continue seeing this error. @anomie: Can this task be closed as fixed? Or still seeing this issue?
[10:39:59] (CR) jenkins-bot: [V: -1] restbase: add LVS configuration [puppet] - https://gerrit.wikimedia.org/r/190786 (https://phabricator.wikimedia.org/T89636) (owner: Filippo Giunchedi)
[10:42:48] my eeeys
[10:45:20] (PS1) Matanya: annualreport: move apache site from template to file + minor lint [puppet] - https://gerrit.wikimedia.org/r/190787
[10:45:32] (PS2) Filippo Giunchedi: restbase: add LVS configuration [puppet] - https://gerrit.wikimedia.org/r/190786 (https://phabricator.wikimedia.org/T89636)
[10:49:38] operations, Datasets-General-or-Unknown: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1041102 (ArielGlenn) can you go with three connections all downloading and see how that is? Or is that with multiple connections?
[10:51:16] akosiaris: hey
[10:51:26] hey
[10:51:48] morning
[10:52:28] operations, Datasets-General-or-Unknown: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503#1041107 (ArielGlenn) This was held up due to directory permissions but is running now.
[10:53:04] akosiaris: What's wrong with https://gerrit.wikimedia.org/r/#/c/186538/ ?
[10:53:22] Must pass yandex_api_key to Class[Role::Cxserver]
[10:53:24] operations, RESTBase, and 2 others: Public entry point for RESTBase - https://phabricator.wikimedia.org/T78194#1041110 (mobrovac) @fgiunchedi yup, just a simple FWD for port 7231 should be enough.
[10:53:27] we're doing it.
[10:54:32] kart_: not following. Care to explain ?
[10:56:00] operations, Datasets-General-or-Unknown: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503#1041113 (Kelson) New mirrored files are now available in the mirror manager, see for example: http://download.kiwix.org/zim/wikipedia/wikipedia_ar_all_2015-01.zim.mirrorlist
[11:00:24] (PS1) ArielGlenn: datasets: add link to kiwix files from index html pages [puppet] - https://gerrit.wikimedia.org/r/190790
[11:01:47] PROBLEM - puppet last run on zirconium is CRITICAL: CRITICAL: Puppet has 1 failures
[11:03:09] (CR) ArielGlenn: [C: 2] datasets: add link to kiwix files from index html pages [puppet] - https://gerrit.wikimedia.org/r/190790 (owner: ArielGlenn)
[11:04:07] akosiaris: Yandex support patch for CX.
[11:04:07] PROBLEM - RAID on restbase1006 is CRITICAL: CRITICAL: Active: 8, Working: 8, Failed: 1, Spare: 0
[11:05:22] kart_: I got that part, what is the problem though ?
[11:05:43] akosiaris: as I said, puppet fails with that error.
[11:05:59] akosiaris: Must pass yandex_api_key to Class[Role::Cxserver]
[11:06:00] kart_: no, you haven't actually said that
[11:06:09] now it is more clear to me
[11:06:10] oh :)
[11:06:34] well, that's encouraging (restbase1006)
[11:06:42] godog: yeah, I noticed
[11:07:06] taking a look
[11:07:07] kart_: commenting on the change now
[11:08:01] akosiaris: thanks!
[11:08:03] (CR) Alexandros Kosiaris: [C: -1] cxserver: Add Yandex support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) (owner: KartikMistry)
[11:10:34] (PS1) ArielGlenn: datasets: fix typo in html directory name [puppet] - https://gerrit.wikimedia.org/r/190792
[11:11:42] (CR) ArielGlenn: [C: 2] datasets: fix typo in html directory name [puppet] - https://gerrit.wikimedia.org/r/190792 (owner: ArielGlenn)
[11:12:38] (CR) KartikMistry: cxserver: Add Yandex support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) (owner: KartikMistry)
[11:13:28] slow gerrit.
[11:13:55] (PS16) KartikMistry: cxserver: Add Yandex support [puppet] - https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512)
[11:16:32] operations, RESTBase, Services: /dev/sdc offline in restbase1006, recurring mpt2sas message in dmesg - https://phabricator.wikimedia.org/T89639#1041147 (fgiunchedi) NEW a:fgiunchedi
[11:16:57] wait, mpt2sas?
[11:17:06] PROBLEM - puppet last run on cp3011 is CRITICAL: CRITICAL: puppet fail
[11:18:17] paravoid: yep I think that's what the hp servers ship with
[11:18:23] sigh
[11:19:29] 2308, ok
[11:19:58] Blocked-on-Operations: Puppet failing on zirconium due to inability to git pull Transparency Report - https://phabricator.wikimedia.org/T89640#1041158 (akosiaris) NEW a:Dzahn
[11:20:19] http://serverfault.com/questions/407703/deciphering-continuing-mpt2sas-syslog-messages some leads here too, perhaps the ssd
[11:20:30] operations: Puppet failing on zirconium due to inability to git pull Transparency Report - https://phabricator.wikimedia.org/T89640#1041168 (akosiaris)
[11:22:04] well the disk is broken, that's for sure
[11:22:24] akosiaris: have you created ::password::cxserver?
[11:23:00] paravoid: could be cabling/seating too no?
[11:23:15] (CR) ArielGlenn: [C: 2] datasets: rsync from kiwix once an hour instead of every 15 minutes [puppet] - https://gerrit.wikimedia.org/r/190795 (owner: ArielGlenn)
[11:23:22] (CR) Alexandros Kosiaris: [C: 2] annualreport: move apache site from template to file + minor lint [puppet] - https://gerrit.wikimedia.org/r/190787 (owner: Matanya)
[11:23:41] maybe
[11:24:05] database boxes without a BBU
[11:24:10] this is going to be interesting
[11:24:11] go ahead and merge me akosiaris
[11:24:23] apergos: done, thanks!
[11:24:27] thank you
[11:26:21] akosiaris: because now we've: Error 400 on SERVER: Could not find class ::passwords::cxserver for i-000006a2.eqiad.wmflabs on node i-000006a2.eqiad.wmflabs
[11:26:41] need to add in Private puppet repo?
[11:27:03] kart_: in production I have added it already
[11:27:18] in labs, no
[11:27:29] in labs you can checkout the labs/private repo
[11:27:35] and add a dummy one yourself
[11:28:12] this is a public repo despite the name btw. Don't go around adding actually confidential/private data
[11:29:03] akosiaris: ah. Thanks.
[11:29:05] paravoid: I'm going to launch a smart long test, but I think we're better off with DOA disk
[11:29:45] operations, RESTBase, Services: /dev/sdc offline in restbase1006, recurring mpt2sas message in dmesg - https://phabricator.wikimedia.org/T89639#1041173 (mobrovac) Might be a HW issue, either with the controller or the disk or somewhere in between. Maybe try to rewire the disks differently to see if...
[11:31:14] operations: Our custom php packages need to create some conf.d links - https://phabricator.wikimedia.org/T89157#1041174 (Joe) I added fixes for all our php extensions here: - https://gerrit.wikimedia.org/r/#/c/190789/ - https://gerrit.wikimedia.org/r/#/c/190794/ - https://gerrit.wikimedia.org/r/#/c/190796...
[11:31:26] operations, and 1 other: Our custom php packages need to create some conf.d links - https://phabricator.wikimedia.org/T89157#1041175 (Joe)
[11:33:08] operations, RESTBase, Services: /dev/sdc offline in restbase1006, recurring mpt2sas message in dmesg - https://phabricator.wikimedia.org/T89639#1041179 (fgiunchedi) found some info here too http://serverfault.com/questions/407703/deciphering-continuing-mpt2sas-syslog-messages and launched a smart l...
[11:36:47] RECOVERY - puppet last run on cp3011 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[11:44:42] akosiaris: for labs, https://gerrit.wikimedia.org/r/190798
[11:45:19] (CR) KartikMistry: "Now depends on Dummy key for Beta: https://gerrit.wikimedia.org/r/190798" [puppet] - https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) (owner: KartikMistry)
[11:51:44] operations, OTRS: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#1041222 (tommorris) If one is using tethered mobile broadband from Three (a UK mobile operator) using an Android handset to provide a personal hotspot, every slight jolt in mobile connectivity led to a ne...
[11:56:33] operations, Ops-Access-Requests, RESTBase, and 2 others: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1041230 (fgiunchedi) to clarify, I am proposing that we allow all commands except shells to be executed through sudo with the intent to have an audit trail of wha...
[12:02:07] akosiaris: can you check why patch still fails?
[12:02:11] Same error.
[12:02:41] kart_: check where?
[12:07:46] akosiaris: deployment-cxserver03
[12:08:05] puppet agent -tv
[12:18:50] kart_: probably the puppet master on Betalabs has not pulled the change yet.
Wait a bit
[12:22:17] (CR) Alexandros Kosiaris: [C: 2] RESTBase: add Icinga check_procs check for Cassandra [puppet] - https://gerrit.wikimedia.org/r/190688 (owner: Ori.livneh)
[12:22:25] <_joe_> or you need to add the relevant things to labs/private?
[12:22:42] _joe_: that is already done
[12:23:00] <_joe_> maybe beta doesn't auto-update that repo
[12:23:14] _joe_: hmm
[12:23:40] I see a git::clone for that reapo in modules/puppet/manifests/self/gitclone.pp
[12:23:44] repo*
[12:26:07] (CR) Alexandros Kosiaris: [C: -1] "Actually, I take back that +2. It's fine as a change, just in the wrong role. It needs to be in role::cassandra, not role::restbase" [puppet] - https://gerrit.wikimedia.org/r/190688 (owner: Ori.livneh)
[12:26:49] <_joe_> well, that doesn't update itself by default I guess
[12:27:02] <_joe_> or, puppet is failing to run there ;)
[12:28:14] it is running, but it maybe indeed does not update by default although the repo has some very recent changes
[12:28:42] akosiaris: still :/
[12:29:57] _joe_: I cherry picked patch to deployment-salt and ran puppet agent -tv on deployment-cxserver03
[12:30:10] usually, it is few minutes stuff.
[12:32:43] <_joe_> why cherry-pick
[12:32:59] <_joe_> oh your patch you mean?
[12:34:28] so _joe_ I did a #/var/lib/git/labs/private# GIT_SSH=../../ssh git pull
[12:34:34] and it got the changes
[12:34:36] kart_: ^
[12:35:04] !log GIT_SSH=../../ssh git pull to update labs/private on deployment-salt
[12:35:08] Logged the message, Master
[12:35:44] need to ask hashar about this
[12:36:35] lunch, bbl
[12:36:55] akosiaris: cool.
[12:41:25] akosiaris: more issue, will debug later.
[12:42:12] operations, Ops-Access-Requests, RESTBase, and 2 others: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1041255 (mobrovac) >>! In T89366#1041230, @fgiunchedi wrote: > I am proposing that we allow all commands except shells to be executed through sudo with the intent...
[12:55:27] 13operations, 2Ops-Access-Requests, 2RESTBase, and 2 others: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1041261 (10fgiunchedi) correct, interactive things like su, bash and so on. scripts with shebangs would be fine since the kernel interprets the shebang not sudo or... [12:59:49] 13operations, 2Ops-Access-Requests, 2RESTBase, and 2 others: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1041264 (10mobrovac) Ah, now I see, you meant direct root access :) Ok, this looks good to me. [13:17:48] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [13:27:54] 13operations, 2CirrusSearch: Poke ops about an elasticsearch cluster restart - https://phabricator.wikimedia.org/T88354#1041329 (10fgiunchedi) [13:27:54] 13operations: Rolling restart for Elasticsearch to pick up new version of wikimedia-extra plugin - https://phabricator.wikimedia.org/T86602#1041330 (10fgiunchedi) [13:31:14] grrrit-wm: ? [13:31:25] dead in the water eh [13:32:07] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:32:57] 13operations, 2Ops-Access-Requests, 2RESTBase, and 2 others: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1041353 (10faidon) OK, first off let me say that I'm generally open to the idea of (local) root. We've already done this for Parsoid without much extra thinking. I... [13:42:00] (03CR) 10ArielGlenn: "The image and css should be ok; I just verified that it can be loaded fine, so please add that back in." [puppet] - 10https://gerrit.wikimedia.org/r/190683 (https://phabricator.wikimedia.org/T87328) (owner: 10Hoo man) [13:44:03] wtf [13:44:07] why does that work now? 
[13:44:16] I've bounced it
[13:44:43] No, I'm talking about apergos's comment above
[13:45:01] maybe you tested it with the old links
[13:45:33] Probably
[13:45:53] the version I looked at had a http url embedded that also 404ed
[13:46:15] $ curl --head http://bits.wikimedia.org/skins/MonoBook/headbg.jpg
[13:46:15] HTTP/1.1 200 OK
[13:46:19] but the new link works
[13:46:25] meh
[13:46:29] I'll revert those
[13:46:51] weird... did you change that this morning? Because I think I made that change on Saturday against the deployed version
[13:46:54] paravoid: *waves*. Reverting seems to have worked
[13:47:02] valhallasw`cloud: hi!
[13:47:06] thanks!
[13:47:10] you rock :)
[13:47:19] it's your fault you know, you made the bot too useful :P
[13:47:31] paravoid: I thought I didn't have my ssh key, until I realized I did have my password manager... which also is my ssh agent :D
[13:47:42] I didn't
[13:47:46] mh
[13:47:46] that's what you get for working on a different computer
[13:48:21] mut ante made a sweep through and changed a bunch of them a while ago
[13:48:41] paravoid: also, please fill in your irc nick on phabricator :D
[13:48:56] because I always forget and have to google it
[13:49:32] apergos: mh... but I'm 100% sure that link was 404 when I made that change :P
[13:49:46] (PS2) Hoo man: dataset: Add link to other/wikidata/ from other/ [puppet] - https://gerrit.wikimedia.org/r/190683 (https://phabricator.wikimedia.org/T87328)
[13:49:54] doesn't matter now
[13:49:58] Not the link in puppet though... but the one that was live
[13:49:59] yep
[13:50:02] updated the change
[13:50:08] only adds the link now
[13:50:46] you coulda rebased that :-P I'll do it
[13:50:54] (PS3) ArielGlenn: dataset: Add link to other/wikidata/ from other/ [puppet] - https://gerrit.wikimedia.org/r/190683 (https://phabricator.wikimedia.org/T87328) (owner: Hoo man)
[13:51:21] Don't rebase and change in the same PS...
that's what I've been told ;) [13:52:04] (03CR) 10ArielGlenn: [C: 032] dataset: Add link to other/wikidata/ from other/ [puppet] - 10https://gerrit.wikimedia.org/r/190683 (https://phabricator.wikimedia.org/T87328) (owner: 10Hoo man) [13:55:01] live. thanks! [13:57:18] Thank you! :) [14:05:51] 3Datasets-General-or-Unknown, operations: Mirror more Kiwix downloads directories - https://phabricator.wikimedia.org/T57503#1041414 (10ArielGlenn) excerpt from discussion on irc: (06:01:59 μμ) Kelson: andre__: apergos: we currently working on a solution (with wmflabs) to create, each month, new version of all... [14:08:15] 3operations: Our custom php packages need to create some conf.d links - https://phabricator.wikimedia.org/T89157#1028676 (10Joe) [14:11:03] (03PS17) 10KartikMistry: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) [14:17:33] 3Ops-Access-Requests, operations: Access to stat1003 (statistics-users) for Ananth Ramakrishnan - https://phabricator.wikimedia.org/T85828#1041472 (10Ottomata) [14:25:57] akosiaris: ping me when free :) [14:26:05] (03PS18) 10KartikMistry: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) [14:28:42] ACKNOWLEDGEMENT - RAID on restbase1006 is CRITICAL: CRITICAL: Active: 8, Working: 8, Failed: 1, Spare: 0 Filippo Giunchedi T89639 [14:35:50] (03PS19) 10KartikMistry: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) [14:36:26] PROBLEM - puppet last run on restbase1004 is CRITICAL: CRITICAL: Puppet last ran 2 days ago [14:39:40] 3Ops-Access-Requests, operations, RESTBase: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1041537 (10mobrovac) @faidon, thank you for raising these important concerns (and taking the time to describe them so clearly and thoroughly). 
I'll try to chime in and give my two cents.... [14:48:24] (03PS3) 10Amire80: Enable EducationProgram in the Hebrew Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190357 (https://phabricator.wikimedia.org/T89393) [14:48:32] (03PS4) 10Amire80: Enable EducationProgram in the Hebrew Wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190357 (https://phabricator.wikimedia.org/T89393) [14:53:28] 3operations: Scap error on mw1111: "Error reading response length from authentication socket." - https://phabricator.wikimedia.org/T86545#1041556 (10Anomie) I haven't seen it, let's close this. [14:53:36] 3operations: Scap error on mw1111: "Error reading response length from authentication socket." - https://phabricator.wikimedia.org/T86545#1041557 (10Anomie) 5Open>3Resolved [14:54:44] !log shutting down hadoop cluster, starting upgrade to CDH 5.3.1 [14:54:51] Logged the message, Master [14:55:56] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [14:57:37] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [14:58:07] PROBLEM - Hadoop DataNode on analytics1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [14:58:07] PROBLEM - Hadoop ResourceManager on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager [14:58:16] PROBLEM - Hadoop DataNode on analytics1036 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [14:58:55] (03CR) 10Nikerabbit: [C: 031] $wgTranslateBlacklist: "en" conditional to the wiki being in English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190783 (owner: 10Nemo bis) [15:00:44] daw, i missed a downtime on one [15:00:51] or 2 [15:00:52] ! 
[15:05:57] PROBLEM - DPKG on analytics1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:06:17] (03PS20) 10Alexandros Kosiaris: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) (owner: 10KartikMistry) [15:06:57] RECOVERY - Hadoop ResourceManager on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager [15:07:06] (03CR) 10jenkins-bot: [V: 04-1] cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) (owner: 10KartikMistry) [15:07:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 7 below the confidence bounds [15:08:07] RECOVERY - DPKG on analytics1001 is OK: All packages OK [15:12:09] (03PS21) 10Alexandros Kosiaris: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) (owner: 10KartikMistry) [15:12:27] PROBLEM - puppet last run on restbase1004 is CRITICAL: CRITICAL: Puppet has 1 failures [15:15:01] restbase is me btw [15:15:06] RECOVERY - Disk space on stat1002 is OK: DISK OK [15:16:45] (03PS22) 10Alexandros Kosiaris: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) (owner: 10KartikMistry) [15:19:06] PROBLEM - Hadoop ResourceManager on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager [15:21:16] RECOVERY - Hadoop DataNode on analytics1017 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [15:21:17] RECOVERY - Hadoop DataNode on analytics1036 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [15:21:32] (03CR) 10Mvolz: [C: 
031] create shell user for Marielle Volz [puppet] - 10https://gerrit.wikimedia.org/r/190405 (https://phabricator.wikimedia.org/T89057) (owner: 10Dzahn) [15:22:26] RECOVERY - Hadoop ResourceManager on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.resourcemanager.ResourceManager [15:25:57] PROBLEM - puppet last run on elastic1021 is CRITICAL: CRITICAL: puppet fail [15:29:46] 3Services, RESTBase, operations: /dev/sdc offline in restbase1006, recurring mpt2sas message in dmesg - https://phabricator.wikimedia.org/T89639#1041580 (10mobrovac) [15:29:48] 3RESTBase, Scrum-of-Scrums, Services, operations: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1041579 (10mobrovac) [15:34:33] 3Services, operations, Citoid: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1041592 (10Mvolz) [15:37:06] 3Services, operations, Citoid: Zotero not running in production - https://phabricator.wikimedia.org/T76308#1041602 (10Mvolz) [15:38:09] 3Services, operations, Citoid: Zotero not running in production - https://phabricator.wikimedia.org/T76308#795651 (10Mvolz) [15:39:37] (03PS23) 10KartikMistry: cxserver: Add Yandex support [puppet] - 10https://gerrit.wikimedia.org/r/186538 (https://phabricator.wikimedia.org/T88512) [15:42:35] 3operations, RESTBase-Cassandra: use correct datacenter/rack for cassandra nodes - https://phabricator.wikimedia.org/T89657#1041613 (10fgiunchedi) 3NEW a:3fgiunchedi [15:42:45] 3operations, RESTBase-Cassandra: use correct datacenter/rack for cassandra nodes - https://phabricator.wikimedia.org/T89657#1041623 (10fgiunchedi) p:5Triage>3High [15:43:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [15:44:46] RECOVERY - puppet last run on elastic1021 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:46:37] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 21.43% of data above 
the critical threshold [500.0] [15:52:07] 3OTRS, operations: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#1041629 (10DoRD) I am also having issues remaining logged in to OTRS, and I often see, but not always, an "invalid session" message. I have a stable AT&T U-verse connection, which began allowing IPv6 just a few... [15:54:53] !log Updated Wikidata property suggester with data from today's dump [15:54:58] Logged the message, Master [15:55:00] sjoerddebruin: ^ ;) [15:55:13] Oh, will test. [15:55:20] or will it take some time? [15:55:32] No, should be there now [15:55:40] but I didn't yet update the configuration [15:56:02] What was the last time we updated btw? [15:56:17] January 5 [15:56:21] ok [15:56:22] https://github.com/wmde/wbs_propertypairs [15:56:35] that's where aude / I track the data we upload to it [15:57:51] (03CR) 10Nikerabbit: [C: 04-1] CX: Do not use $wmgParsoidURL from Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190776 (https://phabricator.wikimedia.org/T89558) (owner: 10KartikMistry) [15:58:35] Seems like we're already good at updating every month. [16:00:12] sjoerddebruin: Yeah... it's not that time eating... you just have to ask :P [16:00:53] (03PS1) 10Filippo Giunchedi: cassandra: deprecate cassandra::defaults class [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/190813 (https://phabricator.wikimedia.org/T76149) [16:00:53] Apparently today is a holiday in the US, thus no deploys :( [16:01:01] Stupid holidays. 
:)
[16:02:07] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[16:02:29] operations, RESTBase-Cassandra: use correct datacenter/rack for cassandra nodes - https://phabricator.wikimedia.org/T89657#1041661 (fgiunchedi) best to fix T76149 first since we need that for proper hiera usage anyway
[16:02:54] operations, RESTBase-Cassandra: use correct datacenter/rack for cassandra nodes - https://phabricator.wikimedia.org/T89657#1041662 (fgiunchedi)
[16:02:55] operations, RESTBase-Cassandra: Make the cassandra module use hiera properly - https://phabricator.wikimedia.org/T76149#1041663 (fgiunchedi)
[16:02:57] hoo: That only some Americans observe
[16:04:01] But apparently too many to push out configuration changes today
[16:04:27] (PS1) Giuseppe Lavagetto: service_unit: allow custom init script in a single initsystem [puppet] - https://gerrit.wikimedia.org/r/190815
[16:04:29] (PS1) Giuseppe Lavagetto: memcached: systemd compatibility [puppet] - https://gerrit.wikimedia.org/r/190816
[16:06:15] hoo: Don't see shocking changes yet. Not so weird with just a month
[16:06:46] sjoerddebruin: mh... how's it doing on Properties, btw?
[16:07:03] We're offtopic here, let's move this to #wikidata
[16:07:10] ok
[16:09:01] (PS4) KartikMistry: CX: Do not use internal $wmgParsoidURL [mediawiki-config] - https://gerrit.wikimedia.org/r/190776 (https://phabricator.wikimedia.org/T89558)
[16:09:11] (CR) jenkins-bot: [V: -1] CX: Do not use internal $wmgParsoidURL [mediawiki-config] - https://gerrit.wikimedia.org/r/190776 (https://phabricator.wikimedia.org/T89558) (owner: KartikMistry)
[16:11:45] (PS3) BryanDavis: Revert "Revert "beta: switch logstash transport from redis to syslog"" [mediawiki-config] - https://gerrit.wikimedia.org/r/190629 (https://phabricator.wikimedia.org/T88870)
[16:12:19] (CR) BryanDavis: [C: 2] Revert "Revert "beta: switch logstash transport from redis to syslog"" [mediawiki-config] - https://gerrit.wikimedia.org/r/190629 (https://phabricator.wikimedia.org/T88870) (owner: BryanDavis)
[16:12:23] (Merged) jenkins-bot: Revert "Revert "beta: switch logstash transport from redis to syslog"" [mediawiki-config] - https://gerrit.wikimedia.org/r/190629 (https://phabricator.wikimedia.org/T88870) (owner: BryanDavis)
[16:12:27] (PS1) Alexandros Kosiaris: Move url_downloader_ip configuration to hiera [puppet] - https://gerrit.wikimedia.org/r/190817
[16:13:04] (PS5) KartikMistry: CX: Do not use internal $wmgParsoidURL [mediawiki-config] - https://gerrit.wikimedia.org/r/190776 (https://phabricator.wikimedia.org/T89558)
[16:13:07] operations, RESTBase-Cassandra: Make the cassandra module use hiera properly - https://phabricator.wikimedia.org/T76149#1041687 (fgiunchedi) a:fgiunchedi
[16:13:11] !log bd808 Synchronized wmf-config/logging-labs.php: Switch beta to syslog logging, try #2 (45d25e2) (duration: 00m 05s)
[16:13:15] Logged the message, Master
[16:16:02] manybubbles ^d ottomata FYI each es-tool restart-fast takes between 45m and 55m, so yeah a full bounce is ~24h
[16:17:15] oo
[16:17:26] sounds like restart-SLOW
[16:18:20] hehe indeed
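The ~24h figure quoted for a full bounce follows from restarting nodes one at a time at 45 to 55 minutes each. A minimal Python sketch of that arithmetic, assuming a hypothetical cluster of 31 nodes (the log does not state the node count) and a strictly serial restart; bounce_hours and NODE_COUNT are illustrative names, not part of es-tool:

```python
# Rough estimate of a full cluster bounce when nodes restart one at a time.
# The 45 and 55 minute bounds come from the chat message; the node count
# below is an assumption for illustration only.

def bounce_hours(node_count, min_minutes=45, max_minutes=55):
    """Return (low, high) wall-clock hours for restarting nodes serially."""
    low = node_count * min_minutes / 60
    high = node_count * max_minutes / 60
    return low, high


NODE_COUNT = 31  # hypothetical cluster size, not taken from the log

low, high = bounce_hours(NODE_COUNT)
print(f"{NODE_COUNT} nodes: between {low:.1f}h and {high:.1f}h")
```

With a node count in the high twenties to low thirties the range brackets roughly one day, which is consistent with the estimate given in the channel.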
[16:34:29] 3operations, MediaWiki-Core-Team, Wikimedia-Logstash, Incident-20150205-SiteOutage: Prototype Monolog and rsyslog configuration to ship log events from MediaWiki to Logstash - https://phabricator.wikimedia.org/T88870#1041691 (10bd808) This code path is running in beta now and logs are showing up in Kibana as exp... [16:35:06] (03PS5) 10Filippo Giunchedi: Make `es-tool ban-node` handle both IP addressses and hostnames [puppet] - 10https://gerrit.wikimedia.org/r/180210 (owner: 10Chad) [16:35:26] PROBLEM - puppet last run on cp1052 is CRITICAL: CRITICAL: Puppet has 1 failures [16:36:10] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Make `es-tool ban-node` handle both IP addressses and hostnames [puppet] - 10https://gerrit.wikimedia.org/r/180210 (owner: 10Chad) [16:37:10] 3operations, MediaWiki-Core-Team, Wikimedia-Logstash, Incident-20150205-SiteOutage: Prototype Monolog and rsyslog configuration to ship log events from MediaWiki to Logstash - https://phabricator.wikimedia.org/T88870#1041701 (10bd808) The full set of changes requires several changes that will be included in 1.25... [16:37:36] RECOVERY - puppet last run on cp1052 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [16:38:02] (03CR) 10Filippo Giunchedi: [C: 031] dsh: create files based on exported resources [puppet] - 10https://gerrit.wikimedia.org/r/179121 (owner: 10Giuseppe Lavagetto) [16:40:26] (03CR) 10BryanDavis: "Working nicely in beta to parse direct syslog udp datagrams sent from Monolog in MediaWiki. Safe to merge in prod at any time. 
Note that i" [puppet] - 10https://gerrit.wikimedia.org/r/190231 (https://phabricator.wikimedia.org/T88870) (owner: 10BryanDavis) [16:43:20] (03PS2) 10Alexandros Kosiaris: Move url_downloader_ip configuration to hiera [puppet] - 10https://gerrit.wikimedia.org/r/190817 [16:44:27] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [16:44:57] (03PS1) 10Gilles: Set up beacon endpoint for virtual media views [puppet] - 10https://gerrit.wikimedia.org/r/190821 (https://phabricator.wikimedia.org/T89088) [16:48:44] 3Wikimedia-Logstash, operations, Incident-20150205-SiteOutage: Decouple logging infrastructure failures from MediaWiki logging - https://phabricator.wikimedia.org/T88732#1041725 (10bd808) I have everything working in beta to ship log events to Logstash using syslog UDP datagrams sent directly from MediaWiki to t... [16:52:33] 3operations: Our custom php packages need to create some conf.d links - https://phabricator.wikimedia.org/T89157#1041741 (10Joe) 5Open>3stalled [17:01:13] 3ops-eqiad, operations: Rack Setup new diskshelf for labstore1001 - https://phabricator.wikimedia.org/T88802#1041758 (10Andrew) Is this waiting on Yuvi because we need his help with the expansion, or because without him we're just too busy for any additional self-inflicted breakage? 
[17:01:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] restbase: allocate LVS service ip (032 comments) [dns] - 10https://gerrit.wikimedia.org/r/190784 (https://phabricator.wikimedia.org/T89636) (owner: 10Filippo Giunchedi) [17:02:09] (03CR) 10Alexandros Kosiaris: "Also add the word "internal" to the commit subject to make it more clear please" [dns] - 10https://gerrit.wikimedia.org/r/190784 (https://phabricator.wikimedia.org/T89636) (owner: 10Filippo Giunchedi) [17:07:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Premise looks good, minor comments" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/190786 (https://phabricator.wikimedia.org/T89636) (owner: 10Filippo Giunchedi) [17:08:59] (03CR) 10Alexandros Kosiaris: [C: 032] Move url_downloader_ip configuration to hiera [puppet] - 10https://gerrit.wikimedia.org/r/190817 (owner: 10Alexandros Kosiaris) [17:09:35] 3Labs, operations: MySQL on wikitech keeps dying - https://phabricator.wikimedia.org/T88256#1041786 (10Andrew) Thanks to the wikitech move to silver, we've now gone several days without this happening! Still, virt1000 is ailing a bit. I propose throwing more memory at the problem, as per T89266 [17:10:00] 3hardware-requests, Labs, ops-eqiad, operations: virt1000 memory upgrade - https://phabricator.wikimedia.org/T89266#1031811 (10Andrew) [17:10:01] 3Labs, operations: MySQL on wikitech keeps dying - https://phabricator.wikimedia.org/T88256#1041789 (10Andrew) [17:11:18] 3Labs, operations: OOM on virt1000 - https://phabricator.wikimedia.org/T88256#1041793 (10Andrew) p:5Unbreak!>3High [17:20:27] 3Ops-Access-Requests, operations, RESTBase: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1041817 (10GWicke) [17:20:47] 3hardware-requests, Labs, ops-eqiad, operations: virt1000 memory upgrade - https://phabricator.wikimedia.org/T89266#1041819 (10Andrew) Thursday is fine. 
Chris, if you can pick a window that's not already blocked on the deployment calendar, I'll notify greg and the labs list. The good news is, this outage won't... [17:35:26] (03PS1) 10Ottomata: Some fixes for Hive in CDH 5.3.1 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/190827 [17:36:10] (03CR) 10Ottomata: [C: 032] Some fixes for Hive in CDH 5.3.1 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/190827 (owner: 10Ottomata) [17:37:21] (03PS1) 10Ottomata: Use SNAPPY as default hive parquet compression, automatically load unversioned symlink of hcatalog jar [puppet] - 10https://gerrit.wikimedia.org/r/190828 [17:39:08] (03CR) 10Ottomata: [C: 032] Use SNAPPY as default hive parquet compression, automatically load unversioned symlink of hcatalog jar [puppet] - 10https://gerrit.wikimedia.org/r/190828 (owner: 10Ottomata) [17:41:06] PROBLEM - Disk space on dataset1001 is CRITICAL: DISK CRITICAL - free space: /data 1519044 MB (3% inode=99%): [17:41:08] (03PS1) 10Ottomata: Update yarn-env.sh.erb with new (commented) setting for CDH 5.3.1 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/190829 [17:41:40] (03PS2) 10Filippo Giunchedi: restbase: allocate LVS internal service ip [dns] - 10https://gerrit.wikimedia.org/r/190784 (https://phabricator.wikimedia.org/T89636) [17:41:45] (03CR) 10Ottomata: [C: 032] Update yarn-env.sh.erb with new (commented) setting for CDH 5.3.1 [puppet/cdh] - 10https://gerrit.wikimedia.org/r/190829 (owner: 10Ottomata) [17:42:14] (03PS1) 10Ottomata: Fix typo in parquet_compression parameter [puppet/cdh] - 10https://gerrit.wikimedia.org/r/190830 [17:42:17] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: puppet fail [17:42:30] (03CR) 10Ottomata: [C: 032] Fix typo in parquet_compression parameter [puppet/cdh] - 10https://gerrit.wikimedia.org/r/190830 (owner: 10Ottomata) [17:42:46] (03PS1) 10Ottomata: Update cdh module with typo fix [puppet] - 10https://gerrit.wikimedia.org/r/190832 [17:42:54] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module 
with typo fix [puppet] - 10https://gerrit.wikimedia.org/r/190832 (owner: 10Ottomata) [17:44:16] PROBLEM - puppet last run on amssq44 is CRITICAL: CRITICAL: Puppet has 1 failures [17:45:27] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:46:27] PROBLEM - puppet last run on amssq32 is CRITICAL: CRITICAL: Puppet has 1 failures [17:47:01] (03CR) 10Filippo Giunchedi: restbase: add LVS configuration (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/190786 (https://phabricator.wikimedia.org/T89636) (owner: 10Filippo Giunchedi) [17:47:30] (03PS3) 10Filippo Giunchedi: restbase: add internal LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/190786 (https://phabricator.wikimedia.org/T89636) [17:58:28] PROBLEM - Disk space on dataset1001 is CRITICAL: DISK CRITICAL - free space: /data 1519277 MB (3% inode=99%): [18:00:39] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:02:48] RECOVERY - puppet last run on amssq44 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [18:04:57] RECOVERY - puppet last run on amssq32 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:08:19] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 59741 bytes in 0.148 second response time [18:21:37] 3ops-eqiad, operations: Rack Setup new diskshelf for labstore1001 - https://phabricator.wikimedia.org/T88802#1041959 (10coren) I don't think it's waiting on Yuvi; Chris may just have misunderstood my comment he being on vacation. @cmjohnson: feel free to give us plausible windows for this with a couple days' ad... 
[18:57:27] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [18:58:37] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [18:58:43] (03PS1) 10Ottomata: CDH 5.3.1 includes a fix for HUE-1398, so we no longer need this init.d script hack [puppet/cdh] - 10https://gerrit.wikimedia.org/r/190843 [18:59:08] (03CR) 10Bmansurov: [C: 04-1] Adding original language of this work campaign for WikiGrok (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/188731 (owner: 10Kaldari) [18:59:11] (03CR) 10Ottomata: [C: 032 V: 032] CDH 5.3.1 includes a fix for HUE-1398, so we no longer need this init.d script hack [puppet/cdh] - 10https://gerrit.wikimedia.org/r/190843 (owner: 10Ottomata) [18:59:31] (03PS5) 10Ori.livneh: Temporarily log message key lookups on four app servers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190777 (https://phabricator.wikimedia.org/T65416) [18:59:33] (03PS1) 10Ottomata: Update cdh module with Hue fix [puppet] - 10https://gerrit.wikimedia.org/r/190844 [18:59:57] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module with Hue fix [puppet] - 10https://gerrit.wikimedia.org/r/190844 (owner: 10Ottomata) [19:02:37] PROBLEM - puppet last run on analytics1002 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:02:47] PROBLEM - puppet last run on analytics1019 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:02:57] PROBLEM - puppet last run on analytics1028 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:03:46] RECOVERY - puppet last run on analytics1002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:04:06] RECOVERY - puppet last run on analytics1028 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [19:04:57] RECOVERY - puppet last run on analytics1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:06:07] 
PROBLEM - puppet last run on analytics1041 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:07] PROBLEM - puppet last run on analytics1014 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:07] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:07] PROBLEM - puppet last run on analytics1035 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:16] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:17] PROBLEM - puppet last run on analytics1034 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:17] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:17] PROBLEM - puppet last run on analytics1013 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:26] PROBLEM - puppet last run on analytics1016 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:27] PROBLEM - puppet last run on analytics1030 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:27] PROBLEM - puppet last run on analytics1032 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:36] PROBLEM - puppet last run on analytics1038 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:37] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:37] PROBLEM - puppet last run on analytics1040 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:46] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:56] PROBLEM - puppet last run on analytics1017 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:06:56] PROBLEM - puppet last run on analytics1020 is CRITICAL: CRITICAL: Puppet last ran 4 hours ago [19:07:16] RECOVERY - puppet last run on analytics1041 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [19:07:16] RECOVERY - puppet last run on analytics1014 is OK: OK: Puppet is currently 
enabled, last run 7 seconds ago with 0 failures [19:07:16] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [19:07:16] RECOVERY - puppet last run on analytics1035 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [19:07:17] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [19:07:17] RECOVERY - puppet last run on analytics1034 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [19:07:26] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [19:07:26] RECOVERY - puppet last run on analytics1013 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [19:07:27] RECOVERY - puppet last run on analytics1016 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [19:07:27] RECOVERY - puppet last run on analytics1030 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [19:07:36] RECOVERY - puppet last run on analytics1032 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [19:07:37] RECOVERY - puppet last run on analytics1038 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [19:07:37] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [19:07:47] RECOVERY - puppet last run on analytics1040 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [19:07:47] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [19:07:57] RECOVERY - puppet last run on analytics1017 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [19:07:57] RECOVERY - puppet last run on analytics1020 is OK: 
OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [19:08:03] WAaahhhh [19:08:46] looking gooOOoood [19:08:47] :) [19:09:53] 3Ops-Access-Requests, RESTBase, operations: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1042112 (10GWicke) @Faidon, we all agree that this is a project that ops & services are doing together. Even more than with Parsoid, we'll be responsible for preventing & dealing with iss... [19:49:36] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [20000.0] [19:54:29] ^ Feb 16 19:54:09 cp3022 varnishkafka[27708]: PRODUCE: Failed to produce Kafka message (seq 1051486696): No buffer space available (500000 messages in outq) [19:54:35] spamming incessantly to syslog [19:55:29] ottomata: ^ [19:56:07] it was the webrequest instance, which I think we're not using anymore? [19:56:34] I stopped it, gonna see if puppet restarts it or not. it could be that it's deconfigured in puppet but still alive on some hosts [19:56:50] HMM [19:56:55] bblack, we are using webrequest [19:56:57] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0] [19:57:01] that is an eternal esams problem. [19:57:17] possibly fixable, but i've been told to wait til mark does some stuff with switches before I expend energy on it [19:57:28] yeah puppet restarted it [19:58:45] bblack, fyi, it happens more when analytics1021 drops out of leadership, which happens occasionally. [19:58:49] it is a compound problem. [19:59:10] ok [19:59:11] i just ran a replica election, so an21 should be back in the list of leaders now, which should reduce the produce errors from esams [19:59:32] i have some plans to try and fix these, but they are a bit down the road...:/ [20:00:03] i mean, other than the tons of time i've spent on those already. i feel like this should be fixable... 
[20:01:57] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 6038.01959976 [20:24:44] (03PS1) 10Ottomata: Ensure that CDH zookeeper package is installed on Hadoop nodes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/190855 [20:25:00] (03CR) 10Ottomata: [C: 032] Ensure that CDH zookeeper package is installed on Hadoop nodes [puppet/cdh] - 10https://gerrit.wikimedia.org/r/190855 (owner: 10Ottomata) [20:25:29] (03PS1) 10Ottomata: Update cdh module with fix for zookeeper class not found exception [puppet] - 10https://gerrit.wikimedia.org/r/190856 [20:25:40] (03CR) 10Ottomata: [C: 032 V: 032] Update cdh module with fix for zookeeper class not found exception [puppet] - 10https://gerrit.wikimedia.org/r/190856 (owner: 10Ottomata) [20:33:14] 3Datasets-General-or-Unknown, operations: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1042223 (10Henrik) No, about 100KB/sec is with a single connection. I can code up a parallel downloader to try, but fortunately there's no pressing need at the moment: it's caught u... 
[21:08:53] (03PS1) 10Reedy: Update size related dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190930 [21:09:54] !log updated Parsoid to version 86e76a30 [21:09:58] Logged the message, Master [21:11:31] (03CR) 10Reedy: [C: 032] Update size related dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190930 (owner: 10Reedy) [21:11:36] (03Merged) 10jenkins-bot: Update size related dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190930 (owner: 10Reedy) [21:12:34] !log reedy Synchronized database lists: Update size related dblists (duration: 00m 06s) [21:12:37] Logged the message, Master [21:23:36] 3Datasets-General-or-Unknown, operations: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1042289 (10Nemo_bis) Now, with three parallel downloads, the speed is about 1.3 MB/s for each of them, from a server in Finland. [21:30:28] (03PS1) 10Andrew Bogott: Invoke our first parted with 'script -c' [puppet] - 10https://gerrit.wikimedia.org/r/190934 [21:32:18] (03CR) 10Andrew Bogott: [C: 032] Invoke our first parted with 'script -c' [puppet] - 10https://gerrit.wikimedia.org/r/190934 (owner: 10Andrew Bogott) [21:33:38] 3Datasets-General-or-Unknown, operations: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1042316 (10ori) Disk I/O is saturated: ``` root@dataset1001:~# iostat -h -m -t -x -d 5 dm-0 Linux 3.2.0-75-generic (dataset1001) 02/16/2015 _x86_64_ (4 CPU) 02/16/2015 09:33:01... [21:35:22] 3Datasets-General-or-Unknown, operations: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1042320 (10Reedy) >>! In T45647#1042316, @ori wrote: > Disk I/O is saturated: I guess that explains the erratic speeds I saw to a server in the UK - from over 2MB/s to barely a few... 
[21:39:20] 3Datasets-General-or-Unknown, operations: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1042333 (10faidon) >>! In T45647#1042316, @ori wrote: > Disk I/O is saturated: This is known; this is the reason bandwidth & connection limits were instituted (this was troubleshot... [21:42:54] Hi paravoid :) any feedback on the impact of our temporary "fix" for Special:RecordImpression? Just curious, np if you're busy with other stuff, of course :) [21:44:56] AndyRussG: I don't see the amount of hits I was seeing before, no [21:45:17] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00666666666667 [21:45:19] but I wish this won't be put back on the backburner now that we have a workaround [21:45:26] ori: ^^ btw [21:45:34] paravoid: yeah... [21:45:58] paravoid: any idea how substantial the decrease is? [21:47:01] AndyRussG: what was the fix? Can you summarize the status quo for me? I am a bit behind [21:47:08] ori: sure! 
[21:47:52] We turned Special:RecordImpression down to 1/100th of its previous volume for users that are not targeted by any campaign [21:48:11] This is determined after all filtering criteria are applied, including those that are applied on the client [21:48:26] The exact sampling rate can be tuned by a config variable [21:49:53] For users that are in a campaign, we're still sending Special:RecordImpressions for everyone, including users that have the banner hidden by in-banner Javascript logic (for example, logic that waits a few page views to show the banner, or avoids showing it more than x number of times even if they haven't clicked on the close button) [21:50:18] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [21:50:55] !log catrope Synchronized php-1.25wmf17/includes/MediaWiki.php: I34028206 (duration: 00m 06s) [21:50:56] AndyRussG: is there a reason for such special pages to be called from wiki/Special rather than w/index.php?title=Special ? [21:51:05] Logged the message, Master [21:51:11] The main issue will be when massive anon-only campaigns are turned up again (for example, WLM or our December FR campaign), since the temporary fix would include those in non-sampled group [21:51:46] PROBLEM - puppet last run on db1004 is CRITICAL: CRITICAL: Puppet has 1 failures [21:51:57] Nemo_bis: good question: there's actually a fix in the pipeline to change that, because the current setup causes an extra round trip (redirect) on mobile [21:52:18] (03PS1) 10Ori.livneh: dumps: improve nginx disk utilisation via directio [puppet] - 10https://gerrit.wikimedia.org/r/190940 [21:52:25] ^ paravoid [21:52:31] Nemo_bis: We just deployed the first change needed to fix that, so now it's just a config change [21:52:37] AndyRussG: sounds like an improvement over the previous status quo [21:53:04] why would we do direct I/O? [21:53:04] AndyRussG: nice to hear [21:53:32] Nemo_bis: thanks! 
did you have any concerns specifically about the wiki/Special URL format? [21:54:11] ori: I think so! yeah something is better than nothing. We also reduced some JS execution for users excluded from a campaign by server-side criteria [21:54:48] AndyRussG: it's non-standard to build such URLs manually, it should never happen [21:55:00] paravoid: the files are too large to cache in ram anyway [21:55:08] We have things like {{fullurl}} and the URL global functions for a reason [21:55:10] Nemo_bis: ah OK, mmm it's from the config... [21:55:44] Among other things, it causes problems to pageview stats and to any wiki which doesn't use the Wikimedia-specific shortURL config [21:55:57] in their entirety, yes, but why not cache on a best effort basis [21:56:00] (if it's in the code) [21:56:10] Ah hmmm [21:56:13] popular files that get requested by multiple users can be still served from pagecache instead of hitting disk [21:56:22] 3Ops-Access-Requests, RESTBase, operations: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1042351 (10RobLa-WMF) Hi folks, I'm generally supportive of Marko and Gabriel getting root on these machines, but I share many of Faidon's concerns. Planning to use root to disable Puppe... [21:56:52] the machine isn't running out of memory, it's running out of disk I/O [21:56:57] paravoid: ori: I agree that this shouldn't be on the back burner... though there is other work on the table that a more permanent solution could be rolled in with... So I think we should decide on that based on what timeline agree to for a more permanent S:RI solution [21:57:28] If you have a sec to post on the Phab task some (even very general) results about the impact of the temporary solution, that'd be really fantastic, BTW :) [21:58:54] paravoid: http://unix.stackexchange.com/a/40774 [21:59:17] AndyRussG: OK, but I'll need a few hours to dig into the changes and wrap my head around exactly what you guys did. 
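(Editor's note: the sampling rule AndyRussG describes at 21:47 to 21:49 can be sketched roughly as below. This is only an illustration in Python, not the deployed code, which the log does not show. The function name, the constant name, and the injectable random source are all assumptions made for the example.)

```python
import random

# Hypothetical name; the log only says the rate "can be tuned by a config variable".
NO_CAMPAIGN_SAMPLE_RATE = 0.01  # 1/100th of the previous volume


def should_record_impression(in_campaign, sample_rate=NO_CAMPAIGN_SAMPLE_RATE,
                             rng=random.random):
    """Decide whether to send a Special:RecordImpression call for one page view.

    in_campaign is assumed to be computed after all filtering criteria,
    including client-side ones, have been applied.
    """
    if in_campaign:
        # Users targeted by a campaign are never sampled: always record.
        return True
    # Everyone else is recorded only for a small random fraction of views.
    return rng() < sample_rate
```

(Passing the random source as a parameter is just a convenience so the rule can be checked deterministically. It also makes the concern raised at 21:51 visible: a large anon-only campaign puts most users in the always-record branch again.)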
[22:00:05] paravoid: in other words: I don't think the file cache is helping here, because the dumps are so huge [22:00:20] ori: you set the threshold to 4k [22:00:39] so the cache pollution part does not count [22:00:44] and there's no mysql or anything on the box [22:01:00] ori: sure! thanks much, no rush, feel free to ping if u have any questions [22:01:04] thanks also paravoid :) [22:01:06] you can set the threshold higher; it doesn't really matter [22:01:12] pagecache not helping much here is probably true (although I can imagine that there may be popular files?); the question is whether it hurts, though :) [22:01:39] let's try it and see [22:01:47] if you want, you can experiment on the box itself [22:01:52] I don't mind [22:01:55] * ori does [22:01:59] (experiments) [22:02:04] try restarting nginx gracefully [22:02:22] those users aren't happy already, let's not cut their downloads in half :) [22:02:26] !log disabled puppet on dataset1001 to experiment w/ https://gerrit.wikimedia.org/r/190940 [22:02:31] Logged the message, Master [22:04:22] !log reloading nginx on dataset1001 for same [22:04:29] Logged the message, Master [22:06:45] no aio for nginx 1.1.19 [22:07:30] paravoid (or anyone), is there a puppet fix for https://phabricator.wikimedia.org/T87309 or is the package broken? I’m only slowly adjusting to systemd. [22:07:55] oh dammit, I never updated that ticket [22:08:06] I've debugged this and told both Yuvi/Coren already in person [22:08:16] I'm replying now [22:09:47] RECOVERY - puppet last run on db1004 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [22:11:05] paravoid: thanks! [22:20:29] 3Datasets-General-or-Unknown, operations: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1042390 (10Nemo_bis) The speed of downloads from dumps varies considerably depending on the *file* you download, probably due to different disks involved. If some disk/partition hap... 
[22:23:15] I wonder at what point bandwidth caps start making the machine *slower*, because each file takes longer and disk has to jump more [22:23:47] (03PS1) 10Hashar: contint: hhvm-dev on Trusty slaves [puppet] - 10https://gerrit.wikimedia.org/r/190946 (https://phabricator.wikimedia.org/T89649) [22:29:08] PROBLEM - HTTP on dataset1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:30:56] ori? [22:31:13] hm, sec [22:31:35] not sure why it complained; i hadn't stopped it [22:31:48] that's usually indicative of nginx being blocked on I/O [22:32:07] logical, because there's no aio setting [22:32:07] RECOVERY - HTTP on dataset1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5503 bytes in 0.002 second response time [22:32:25] i.e. the machine is so I/O starved that nginx workers get blocked and can't respond to network I/O [22:32:31] yeah, aio would probably help here [22:32:45] yeah, i had aio in the patch, but it's not supported by the version of nginx we have on dumps [22:33:11] !log reloaded nginx on dumps with original config; re-enabled puppet. [22:33:16] Logged the message, Master [22:34:12] mutante: you there? [22:34:19] ori: can we replace the apache graceful dsh script with slat stuff ? [22:34:45] !log catrope Synchronized php-1.25wmf16/includes/MediaWiki.php: I34028206 (duration: 00m 05s) [22:34:49] Logged the message, Master [22:35:00] i think we have already, but mutante had some reason for keeping the script around [22:35:03] *salt [22:35:13] (03PS1) 10Andrew Bogott: On debian, ensure that idmapd is running on labs instances. [puppet] - 10https://gerrit.wikimedia.org/r/190948 [22:35:29] mutante: please enlighten us at some point :) [22:35:32] matanya: https://gerrit.wikimedia.org/r/#/c/177080/ [22:36:17] gmta [22:36:22] thanks ori [22:36:27] gmta? [22:37:15] got my thetans aligned? [22:37:26] greater memphis transportation authority? 
[22:37:57] great minds think alike [22:37:57] great minds think alike [22:38:13] whoa two great minds right there ^^ [22:38:33] * bd808 googled it [22:39:01] google may translate acronyms [22:39:35] I thought it is one of the well knowns, like iirc or afaik [22:40:03] ok, my battery is dying, back soon [22:42:20] who would review lvs stuff ? [22:49:58] (03CR) 10Faidon Liambotis: [C: 04-1] On debian, ensure that idmapd is running on labs instances. (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/190948 (owner: 10Andrew Bogott) [22:51:32] (03PS1) 10Ottomata: Parameterize secure_proxy_ssl_header for hue [puppet/cdh] - 10https://gerrit.wikimedia.org/r/190951 [22:51:55] (03CR) 10Ottomata: [C: 032] Parameterize secure_proxy_ssl_header for hue [puppet/cdh] - 10https://gerrit.wikimedia.org/r/190951 (owner: 10Ottomata) [23:09:07] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: puppet fail [23:22:24] andrewbogott: https://gerrit.wikimedia.org/r/190192 please ? [23:23:06] 3Ops-Access-Requests, RESTBase, operations: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1042509 (10Manybubbles) >>! In T89366#1041353, @faidon wrote: > * //"disable puppet"// — that's a bad idea. Puppet being disabled is a problem; we should never do that as a matter of proc... [23:24:32] (03CR) 10Andrew Bogott: [C: 032] toollabs: selector out of resource [puppet] - 10https://gerrit.wikimedia.org/r/190192 (owner: 10Matanya) [23:24:50] matanya: can you verify that that applies cleanly, or direct me to a system that’s affected? [23:25:54] (03PS2) 10Andrew Bogott: On debian, ensure that idmapd is running on labs instances. 
[puppet] - 10https://gerrit.wikimedia.org/r/190948 [23:25:59] andrewbogott: I think any tools box wouldn't be able to send mail if i broke this [23:26:16] matanya: yeah, but that’s not a feature I use so I won’t notice [23:27:30] i'll test with mailx [23:27:52] thanks [23:28:18] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [23:33:34] andrewbogott: didn't break [23:33:48] no surprise :) But thanks for checking. [23:34:02] thanks for review and merge :)