[00:00:22] Ok, just wrote a mail to my WMDE colleagues not to use the maint. scripts that caused the troubles earlier [00:01:37] what scripts? did you cc ops? [00:01:40] I'll write more about this tomorrow or sometime this week and open bugs accordingly [00:02:01] Krenair: Nope... it's very Wikibase related, so no one else is going to touch it anyway [00:02:08] ok [00:04:48] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [00:05:56] still seeing consistent core dumps from hhvm, not sure what to do about them... [00:13:29] ebernhardson, we can try https://gerrit.wikimedia.org/r/#/c/211926/1 to see if the errors stop [00:13:50] the stream thing must be related since it was only occasional in the prior day log [00:14:16] sure, can't hurt to try it [00:14:31] should that return too? [00:14:52] like the PHP_SAPI block below does? [00:15:03] it does return [00:15:21] yeah that /1 link was stale [00:15:21] Why don't you? [00:15:27] *nod* [00:17:52] ebernhardson, I was thinking of just cherry picking and leaving master [00:18:12] AaronSchulz: oh, i suppose i've never done that but it makes complete sense [00:26:27] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [00:36:52] !log ebernhardson Synchronized php-1.26wmf6/includes/jobqueue/JobQueueGroup.php: Undefer push() in lazyPush() temporarily (duration: 00m 12s) [00:36:57] Logged the message, Master [00:37:52] !log ebernhardson Synchronized php-1.26wmf5/includes/jobqueue/JobQueueGroup.php: Undefer push() in lazyPush() temporarily (duration: 00m 12s) [00:37:55] Logged the message, Master [00:38:08] PROBLEM - puppet last run on mw1123 is CRITICAL puppet fail [00:38:58] AaronSchulz: the constant stream of errors has stopped. Will wait a few minutes and see if the core dumps stop as well [00:43:08] (03PS1) 10Aaron Schulz: Removed "refreshLinks" from $wgJobBackoffThrottling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211930 [00:46:28] RECOVERY - puppet last run on mw1123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [00:46:34] core dumps still coming out :S [00:48:19] i'm mildly suspicious of the constant ' [00:48:30] May 19 00:47:59 mw1053: #012Fatal error: unexpected N4HPHP13DataBlockFullE: Attempted to emit 1 byte(s) into a 209715200 byte DataBlock with 0 bytes available. This almost certainly means the TC is full.' messages, but they seem to have been occurring for days so not likely the culprit [00:50:41] ebernhardson: That's an HHVM JIT tuning error [00:51:33] and I think that cache filling up can lead to core dump crashes [00:53:55] yeah [00:54:02] that is not necessarily related to deployments [00:54:10] is there more than one from today from the same host?
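A rough sketch of how the per-host tally reported in the next messages could be produced on the central log host; the /a/mw-log/hhvm.log path appears later in this log, but the exact match strings and the syslog field positions are assumptions for illustration only:

    # how many distinct app servers emitted the "TC is full" fatal today, and which ones most often
    grep 'DataBlockFullE' /a/mw-log/hhvm.log | awk '{print $4}' | sort | uniq -c | sort -rn | head
    # rough count of hosts that produced HHVM core dumps (the matched string is a guess)
    grep -i 'core dumped' /a/mw-log/hhvm.log | awk '{print $4}' | sort -u | wc -l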
[00:55:53] ori: looks like those messages have only come from 4 hosts today, but the core dumps have been from 241 hosts, so indeed not the culprit [00:56:04] ori: although, 55 times it's been mw1237 [00:56:11] (the TC is full message) [00:56:33] the core dumps are about half the servers 1 core dump, and half the servers 2 core dumps [00:56:46] * ori looks [00:57:01] better news: http://graphite.wikimedia.org/render/?width=586&height=308&_salt=1431990359.151&target=mw.performance.save.median [00:57:21] (that's the echo change AaronSchulz and you reviewed, ebernhardson) [00:57:27] *AaronSchulz wrote [00:57:48] that's excellent :) [00:58:56] May 19 00:52:57 mw1200: #0 CirrusSearch\Hooks::onLinksUpdateCompleted(Object of class LinksUpdate could not be converted to string) called at [/srv/mediawiki/php-1.26wmf6/includes/Hooks.php:209] [00:58:56] May 19 00:52:57 mw1200: #1 Hooks::run(LinksUpdateComplete, Array) called at [/srv/mediawiki/php-1.26wmf6/includes/deferred/LinksUpdate.php:146] [00:58:56] May 19 00:52:57 mw1200: #2 LinksUpdate->doUpdate() called at [/srv/mediawiki/php-1.26wmf6/includes/deferred/DataUpdate.php:101] [00:58:58] May 19 00:52:57 mw1200: #3 DataUpdate::runUpdates(Array) called at [/srv/mediawiki/php-1.26wmf6/includes/page/WikiPage.php:2203] [00:59:00] is this one known? [01:00:13] ori: haven't seen it before, but i'm just slowly getting into cirrussearch stuff, manybubbles really knows what's going on there. i'll check it out [01:01:28] oddly, on the couple servers i've checked /var/log/hhvm/stacktrace*.log are all 0 length files [01:01:55] but that's not a new occurrence [01:10:48] so what's the status of the job queue defer push thing? [01:10:51] is it reverted? [01:11:00] has it been reverted, I mean [01:11:13] ori: lazyPush() now calls push() and returns, and that killed the 1500-2000/minute messages to hhvm.log [01:11:34] doesn't sound very lazy to me :P [01:11:46] do we know what broke? [01:11:54] i don't think so [01:15:48] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [01:24:34] ori, there is one bug I am aware of, we'll see if that fixes all of it [01:32:05] more curious than anything ... but i see in the puppet that hhvm fatals get logged, and i see handling for them in the logstash config, but i don't see anything for type:hhvm-fatal in logstash.wikimedia.org. Where should i be looking? [01:32:18] err, not fatals but stack traces, via HHVM_TRACE_HANDLER [01:37:13] (03PS1) 10Gage: Install-server: fix partman/logstash.cfg [puppet] - 10https://gerrit.wikimedia.org/r/211931 [01:42:42] ebernhardson, I guess https://gerrit.wikimedia.org/r/#/c/211930/ goes to next swat then [01:43:58] AaronSchulz: makes sense, nothing ended up getting deployed today in SWAT [01:45:13] heh [01:48:51] hm gerrit linked to the phab task i mentioned in that ^ change, but the phab task didn't get updated with "patch-for-review" tag or a link to the change. [01:49:07] 6operations: Degraded RAID-1 arrays on new logstash hosts: [UU__] - https://phabricator.wikimedia.org/T98620#1295632 (10Gage) p:5Unbreak!>3Normal [01:49:27] PROBLEM - puppet last run on mw2097 is CRITICAL puppet fail [01:49:47] jgage: it has to be in the format "Bug: T###" at the bottom of the commit message [01:50:00] (03PS2) 10Manybubbles: Enable the CirrusSearch per-user pool counter everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210622 (https://phabricator.wikimedia.org/T76497) (owner: 10EBernhardson) [01:50:07] (03CR) 10Manybubbles: "Bumped limit to 15. 
Should be a ton." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210622 (https://phabricator.wikimedia.org/T76497) (owner: 10EBernhardson) [01:50:14] hmm ok, thanks legoktm [01:51:24] (03PS2) 10Gage: Install-server: fix partman/logstash.cfg [puppet] - 10https://gerrit.wikimedia.org/r/211931 (https://phabricator.wikimedia.org/T98620) [01:52:45] hmph still no update. i guess it doesn't watch for revised commit messages. [01:55:03] 6operations: Degraded RAID-1 arrays on new logstash hosts: [UU__] - https://phabricator.wikimedia.org/T98620#1295633 (10Gage) Patch: https://gerrit.wikimedia.org/r/#/c/211931/ [01:59:33] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: Implement a big IPsec off switch - https://phabricator.wikimedia.org/T88536#1295634 (10Gage) 5Open>3Resolved Deployed, tested, documented: https://wikitech.wikimedia.org/wiki/IPsec#Emergency_shutdown [02:01:49] ebernhardson: The syslog hhvm-fatal messages should end up in logstash as "type:hhvm AND level:Fatal" [02:02:01] (03CR) 10Gage: [C: 031] add_ip6_mapped: enable token-based SLAAC for all jessie/trusty [puppet] - 10https://gerrit.wikimedia.org/r/202725 (https://phabricator.wikimedia.org/T94417) (owner: 10BBlack) [02:03:04] ebernhardson: But I'm not 100% certain that those are actually working right. [02:05:14] (03PS3) 10Legoktm: Install-server: fix partman/logstash.cfg [puppet] - 10https://gerrit.wikimedia.org/r/211931 (https://phabricator.wikimedia.org/T98620) (owner: 10Gage) [02:05:23] jgage: no, it's just picky about where you put it [02:06:05] jgage: https://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines#Auto-linking_and_cross-referencing [02:07:02] bd808: that gets the ones that show up in /a/mw-log/hhvm.log, but not the ones in /a/mw-log/hhvm-fatal.log. not sure why [02:07:37] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:08:06] ebernhardson: yeah. I think there is something wrong with my logstash rules for the hhvm-fatal messages forwarded from rsyslog [02:08:18] they worked at one point but don't seem to any more [02:08:54] anyways the fatals now look to have calmed down, only one new fatal in last 20 minutes. 36 servers core dumped once, 194 core dumped twice over a two and a half hour period [02:09:03] legoktm: ah, thanks! i see the problem was the blank line between Bug: and Change-Id:. [02:21:24] !log l10nupdate Synchronized php-1.26wmf5/cache/l10n: (no message) (duration: 06m 11s) [02:21:38] Logged the message, Master [02:25:08] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (92345s 90000s) [02:25:32] ebernhardson: Continuing deployment? [02:25:38] What's the status of SWAT? [02:25:52] It's been 4 hours. Somewhat confused [02:26:08] !log LocalisationUpdate completed (1.26wmf5) at 2015-05-19 02:25:05+00:00 [02:26:12] Logged the message, Master [02:29:02] Krinkle: canceled after 200 hhvm instances core dumped [02:29:29] ebernhardson: Yes, because one of the commits had a side effect and was reverted since, right? [02:29:30] I saw that. [02:29:42] Or do we not know what caused it yet? [02:30:15] Krinkle: well, yes and no. They were all caused by the hhvm translation cache filling up. But ori bumped that cache from 100MB to 200MB about 10 days ago, so not sure we should just bump it again [02:30:35] (not i18n translation, but JIT translation) [02:31:41] there seems to have been a similar spate of core dumps on the 14th, and nothing before that. 
I'm not sure when we updated to hhvm 3.6.1, but that might be the culprit [02:32:52] i'll just start a thread on -ops [02:35:58] PROBLEM - puppet last run on db2050 is CRITICAL puppet fail [02:37:11] ebernhardson: Is it safe to run sync-file? [02:37:18] What's our state? [02:37:34] I'd like to fix the logging of the data gathering I'm doing for RL performance. [02:37:38] Minor fixup commit. [02:37:57] Krinkle: i'd say go for it [02:42:00] !log l10nupdate Synchronized php-1.26wmf6/cache/l10n: (no message) (duration: 05m 43s) [02:42:05] Logged the message, Master [02:43:32] !log krinkle Synchronized php-1.26wmf6/includes/resourceloader/ResourceLoader.php: Ic0df4fb5cff (duration: 00m 12s) [02:43:36] Logged the message, Master [02:46:21] !log LocalisationUpdate completed (1.26wmf6) at 2015-05-19 02:45:17+00:00 [02:46:25] Logged the message, Master [02:54:08] RECOVERY - puppet last run on db2050 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [03:04:55] 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1295682 (10Gage) I followed this: https://git.wikimedia.org/blob/operations%2Fpuppet.git/2cdd08f9686b040816bd0dd8e63e712f4b084a7a/modules%2Fpackage_builder%2FREADM... [03:05:08] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 62.50% of data above the critical threshold [35.0] [03:05:16] 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1295683 (10Gage) [03:39:36] 6operations, 6Services, 10Traffic, 7Service-Architecture: Proxying new services through RESTBase - https://phabricator.wikimedia.org/T96688#1295702 (10GWicke) @bblack, some answers below: >>! In T96688#1294919, @BBlack wrote: > So here, are the outstanding questions about all related things: > 1. Are all... 
[05:04:39] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (5526 90000s) [05:04:58] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [05:04:59] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue May 19 05:03:56 UTC 2015 (duration 3m 55s) [05:05:05] Logged the message, Master [06:00:58] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [06:29:27] PROBLEM - puppet last run on mw2095 is CRITICAL puppet fail [06:30:58] PROBLEM - puppet last run on cp1056 is CRITICAL Puppet has 1 failures [06:31:18] PROBLEM - puppet last run on cp1061 is CRITICAL Puppet has 1 failures [06:31:48] PROBLEM - puppet last run on cp3010 is CRITICAL Puppet has 1 failures [06:32:47] PROBLEM - puppet last run on cp3003 is CRITICAL puppet fail [06:32:48] PROBLEM - puppet last run on cp3016 is CRITICAL Puppet has 1 failures [06:33:57] PROBLEM - puppet last run on cp3034 is CRITICAL puppet fail [06:33:58] PROBLEM - puppet last run on db1002 is CRITICAL Puppet has 1 failures [06:34:09] PROBLEM - puppet last run on db2059 is CRITICAL Puppet has 1 failures [06:34:39] PROBLEM - puppet last run on elastic1027 is CRITICAL Puppet has 1 failures [06:34:58] PROBLEM - puppet last run on db2019 is CRITICAL Puppet has 1 failures [06:35:27] PROBLEM - puppet last run on lvs3001 is CRITICAL Puppet has 1 failures [06:35:37] PROBLEM - puppet last run on mw1046 is CRITICAL Puppet has 1 failures [06:35:58] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures [06:36:28] PROBLEM - puppet last run on mw1088 is CRITICAL Puppet has 1 failures [06:36:39] PROBLEM - puppet last run on mw1150 is CRITICAL Puppet has 1 failures [06:36:48] PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures [06:36:58] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 1 failures [06:36:58] PROBLEM - puppet last run on mw2013 is CRITICAL Puppet has 1 failures [06:37:38] PROBLEM - puppet last run on mw2073 is CRITICAL Puppet has 1 failures [06:43:19] RECOVERY - puppet last run on db2019 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [06:44:58] RECOVERY - puppet last run on mw1150 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [06:44:59] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:45:19] RECOVERY - puppet last run on lvs3001 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:45:28] RECOVERY - puppet last run on db1002 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:45:37] RECOVERY - puppet last run on mw1046 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:45:48] RECOVERY - puppet last run on cp1056 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:45:48] RECOVERY - puppet last run on db2059 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:45:58] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [06:45:58] RECOVERY - puppet last run on cp3016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:08] RECOVERY - puppet last run on cp1061 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:18] RECOVERY - puppet last run on elastic1027 is 
OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:28] RECOVERY - puppet last run on mw1088 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:58] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:46:58] RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:28] RECOVERY - puppet last run on mw2095 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:47:37] RECOVERY - puppet last run on mw2073 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:49:18] RECOVERY - puppet last run on cp3003 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [06:51:49] (03PS1) 10Jcrespo: Add some prompt coloring and parsing for user jynus [puppet] - 10https://gerrit.wikimedia.org/r/211950 [06:51:58] RECOVERY - puppet last run on cp3034 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:54:49] (03CR) 10Jcrespo: [C: 032] Add some prompt coloring and parsing for user jynus [puppet] - 10https://gerrit.wikimedia.org/r/211950 (owner: 10Jcrespo) [07:26:08] (03PS1) 10Giuseppe Lavagetto: hhvm: add monitoring for translation cache space [puppet] - 10https://gerrit.wikimedia.org/r/211952 [07:36:38] 6operations, 7database: On a maintenance window, upgrade db1063 to 14.04 and its MariaDB package to 10.0.16 - https://phabricator.wikimedia.org/T99520#1295799 (10jcrespo) @Moritzm provided some concerns about the update procedure, so I will temporary stall this to get some feedback. [07:40:03] (03PS2) 10Giuseppe Lavagetto: hhvm: add monitoring for translation cache space [puppet] - 10https://gerrit.wikimedia.org/r/211952 [07:40:47] <_joe_> ori: still around? [07:40:58] <_joe_> if so, would you mind taking a look? 
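For context on the check being added in 211952: it presumably works by asking HHVM's admin server how full each translation-cache block is. A minimal sketch of that kind of probe, assuming the admin server answers on port 9002 and treating 90% as an illustrative threshold; the merged script's exact output parsing and limits may differ:

    # dump per-block translation cache usage (code.main, code.cold, code.frozen, data) from the admin server
    curl -s http://localhost:9002/vm-tcspace
    # naive threshold check; treating columns 2 and 3 as used/capacity is an assumption about the output format
    curl -s http://localhost:9002/vm-tcspace | awk '$3 > 0 && $2 / $3 > 0.9 {print $1, "is over 90% used"}'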
[07:41:21] <_joe_> I already tested the script, btw [07:51:36] (03PS3) 10Giuseppe Lavagetto: hhvm: add monitoring for translation cache space [puppet] - 10https://gerrit.wikimedia.org/r/211952 [07:52:19] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: add monitoring for translation cache space [puppet] - 10https://gerrit.wikimedia.org/r/211952 (owner: 10Giuseppe Lavagetto) [07:57:42] (03PS1) 10Giuseppe Lavagetto: hhvm: fix path for nrpe TC check [puppet] - 10https://gerrit.wikimedia.org/r/211953 [07:58:53] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: fix path for nrpe TC check [puppet] - 10https://gerrit.wikimedia.org/r/211953 (owner: 10Giuseppe Lavagetto) [07:59:48] PROBLEM - puppet last run on mw1104 is CRITICAL Puppet has 1 failures [07:59:58] PROBLEM - puppet last run on mw1207 is CRITICAL Puppet has 1 failures [08:00:08] PROBLEM - puppet last run on mw1255 is CRITICAL Puppet has 1 failures [08:00:09] PROBLEM - puppet last run on mw1107 is CRITICAL Puppet has 1 failures [08:00:17] PROBLEM - puppet last run on mw1128 is CRITICAL Puppet has 1 failures [08:00:47] PROBLEM - puppet last run on mw1037 is CRITICAL Puppet has 1 failures [08:00:47] PROBLEM - puppet last run on mw1103 is CRITICAL Puppet has 1 failures [08:00:57] PROBLEM - puppet last run on mw1047 is CRITICAL Puppet has 1 failures [08:00:57] PROBLEM - puppet last run on mw2142 is CRITICAL Puppet has 1 failures [08:00:57] PROBLEM - puppet last run on mw2176 is CRITICAL Puppet has 1 failures [08:00:57] PROBLEM - puppet last run on mw2207 is CRITICAL Puppet has 1 failures [08:00:57] PROBLEM - puppet last run on mw2211 is CRITICAL Puppet has 1 failures [08:00:58] PROBLEM - puppet last run on mw2024 is CRITICAL Puppet has 1 failures [08:00:58] PROBLEM - puppet last run on mw2039 is CRITICAL Puppet has 1 failures [08:00:58] PROBLEM - puppet last run on mw2052 is CRITICAL Puppet has 1 failures [08:00:59] PROBLEM - puppet last run on mw2060 is CRITICAL Puppet has 1 failures [08:01:00] PROBLEM - puppet last run on mw2047 is CRITICAL Puppet has 1 failures [08:01:00] PROBLEM - puppet last run on mw2084 is CRITICAL Puppet has 1 failures [08:01:01] PROBLEM - puppet last run on mw2110 is CRITICAL Puppet has 1 failures [08:01:17] PROBLEM - puppet last run on mw1075 is CRITICAL Puppet has 1 failures [08:01:17] PROBLEM - puppet last run on mw1179 is CRITICAL Puppet has 1 failures [08:01:18] PROBLEM - puppet last run on mw1085 is CRITICAL Puppet has 1 failures [08:01:18] RECOVERY - puppet last run on mw1107 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [08:01:26] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures [08:01:27] PROBLEM - puppet last run on mw1230 is CRITICAL Puppet has 1 failures [08:01:36] PROBLEM - puppet last run on mw1199 is CRITICAL Puppet has 1 failures [08:01:56] PROBLEM - puppet last run on mw1019 is CRITICAL Puppet has 1 failures [08:02:07] PROBLEM - puppet last run on mw1196 is CRITICAL Puppet has 1 failures [08:02:07] PROBLEM - puppet last run on mw1017 is CRITICAL Puppet has 1 failures [08:02:07] PROBLEM - puppet last run on mw1078 is CRITICAL Puppet has 1 failures [08:02:17] PROBLEM - puppet last run on mw2178 is CRITICAL Puppet has 1 failures [08:02:17] PROBLEM - puppet last run on mw2035 is CRITICAL Puppet has 1 failures [08:02:17] PROBLEM - puppet last run on mw2061 is CRITICAL Puppet has 1 failures [08:02:17] PROBLEM - puppet last run on mw2098 is CRITICAL Puppet has 1 failures [08:02:17] PROBLEM - puppet last run on mw2174 is CRITICAL Puppet has 1 failures [08:02:18] 
PROBLEM - puppet last run on mw2025 is CRITICAL Puppet has 1 failures [08:02:27] PROBLEM - puppet last run on mw2070 is CRITICAL Puppet has 1 failures [08:02:27] PROBLEM - puppet last run on mw2131 is CRITICAL Puppet has 1 failures [08:02:27] PROBLEM - puppet last run on mw2053 is CRITICAL Puppet has 1 failures [08:02:27] PROBLEM - puppet last run on mw2138 is CRITICAL Puppet has 1 failures [08:02:36] PROBLEM - puppet last run on mw1234 is CRITICAL Puppet has 1 failures [08:02:38] PROBLEM - puppet last run on mw2195 is CRITICAL Puppet has 1 failures [08:02:46] PROBLEM - puppet last run on mw1058 is CRITICAL Puppet has 1 failures [08:02:47] PROBLEM - puppet last run on mw1184 is CRITICAL Puppet has 1 failures [08:02:47] PROBLEM - puppet last run on mw2101 is CRITICAL Puppet has 1 failures [08:02:48] PROBLEM - puppet last run on mw1073 is CRITICAL Puppet has 1 failures [08:02:57] PROBLEM - puppet last run on mw1020 is CRITICAL Puppet has 1 failures [08:02:57] PROBLEM - puppet last run on mw1096 is CRITICAL Puppet has 1 failures [08:02:58] PROBLEM - puppet last run on mw1127 is CRITICAL Puppet has 1 failures [08:02:58] PROBLEM - puppet last run on mw1083 is CRITICAL Puppet has 1 failures [08:02:58] PROBLEM - puppet last run on mw1094 is CRITICAL Puppet has 1 failures [08:03:06] PROBLEM - puppet last run on mw1137 is CRITICAL Puppet has 1 failures [08:03:07] PROBLEM - puppet last run on mw2037 is CRITICAL Puppet has 1 failures [08:03:17] PROBLEM - puppet last run on mw1015 is CRITICAL Puppet has 1 failures [08:03:27] PROBLEM - Translation cache space on mw1058 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:03:27] PROBLEM - Translation cache space on mw1047 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:03:27] PROBLEM - Translation cache space on mw1075 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:03:27] PROBLEM - Translation cache space on mw1199 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:03:27] PROBLEM - puppet last run on mw2135 is CRITICAL Puppet has 1 failures [08:03:38] PROBLEM - puppet last run on mw2049 is CRITICAL Puppet has 1 failures [08:03:38] PROBLEM - puppet last run on mw2111 is CRITICAL Puppet has 1 failures [08:03:38] PROBLEM - puppet last run on mw2119 is CRITICAL Puppet has 1 failures [08:03:38] PROBLEM - puppet last run on mw2133 is CRITICAL Puppet has 1 failures [08:03:46] PROBLEM - Translation cache space on mw1135 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:03:47] PROBLEM - Translation cache space on mw1194 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:03:47] PROBLEM - puppet last run on mw1138 is CRITICAL Puppet has 1 failures [08:03:47] PROBLEM - puppet last run on mw1136 is CRITICAL Puppet has 1 failures [08:03:47] PROBLEM - Translation cache space on mw1207 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:03:47] PROBLEM - puppet last run on mw2202 is CRITICAL Puppet has 1 failures [08:03:47] PROBLEM - puppet last run on mw1095 is CRITICAL Puppet has 1 failures [08:03:48] PROBLEM - puppet last run on mw2055 is CRITICAL Puppet has 1 failures [08:03:48] PROBLEM - puppet last run on mw2077 is CRITICAL Puppet has 1 failures [08:03:58] PROBLEM - Translation cache space on mw1017 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:03:58] PROBLEM - Translation cache space on mw1102 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:06] PROBLEM - Translation cache space on mw2060 is CRITICAL: NRPE: Command check_hhvm_tc_space not 
defined [08:04:07] PROBLEM - Translation cache space on mw2101 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:07] PROBLEM - puppet last run on mw2177 is CRITICAL Puppet has 1 failures [08:04:07] PROBLEM - puppet last run on mw2203 is CRITICAL Puppet has 1 failures [08:04:16] PROBLEM - puppet last run on mw1070 is CRITICAL Puppet has 1 failures [08:04:16] PROBLEM - puppet last run on mw1169 is CRITICAL Puppet has 1 failures [08:04:17] PROBLEM - Translation cache space on mw1127 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:17] PROBLEM - Translation cache space on mw2047 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:17] PROBLEM - Translation cache space on mw2024 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:26] PROBLEM - puppet last run on mw1035 is CRITICAL Puppet has 1 failures [08:04:27] PROBLEM - puppet last run on mw2198 is CRITICAL Puppet has 1 failures [08:04:28] PROBLEM - Translation cache space on mw1015 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:28] PROBLEM - Translation cache space on mw1085 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:36] PROBLEM - Translation cache space on mw2039 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:36] PROBLEM - Translation cache space on mw2111 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:38] PROBLEM - puppet last run on mw2120 is CRITICAL Puppet has 1 failures [08:04:47] PROBLEM - Translation cache space on mw1095 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:47] PROBLEM - puppet last run on mw1036 is CRITICAL Puppet has 1 failures [08:04:47] PROBLEM - Translation cache space on mw1101 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:47] PROBLEM - puppet last run on mw1191 is CRITICAL Puppet has 1 failures [08:04:47] PROBLEM - Translation cache space on mw2049 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:47] PROBLEM - Translation cache space on mw2052 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:48] PROBLEM - Translation cache space on mw2055 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:49] PROBLEM - Translation cache space on mw2084 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:49] PROBLEM - Translation cache space on mw2119 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:49] PROBLEM - Translation cache space on mw2176 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:50] PROBLEM - Translation cache space on mw1104 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:50] PROBLEM - Translation cache space on mw2029 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:51] PROBLEM - Translation cache space on mw2129 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:51] PROBLEM - Translation cache space on mw2182 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:57] PROBLEM - Translation cache space on mw1020 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:04:58] PROBLEM - Translation cache space on mw1070 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:06] PROBLEM - puppet last run on mw1232 is CRITICAL Puppet has 1 failures [08:05:06] PROBLEM - Translation cache space on mw2070 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:06] PROBLEM - Translation cache space on mw2077 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined 
[08:05:06] PROBLEM - Translation cache space on mw2131 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:16] PROBLEM - Translation cache space on mw1169 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:17] PROBLEM - puppet last run on mw1214 is CRITICAL Puppet has 1 failures [08:05:17] PROBLEM - puppet last run on mw1246 is CRITICAL Puppet has 1 failures [08:05:17] PROBLEM - Translation cache space on mw2018 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:17] PROBLEM - Translation cache space on mw2110 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:17] PROBLEM - Translation cache space on mw2165 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:27] PROBLEM - Translation cache space on mw1073 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:27] PROBLEM - Translation cache space on mw1179 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:27] PROBLEM - puppet last run on mw1182 is CRITICAL Puppet has 1 failures [08:05:28] PROBLEM - puppet last run on mw2005 is CRITICAL Puppet has 1 failures [08:05:28] PROBLEM - puppet last run on mw2029 is CRITICAL Puppet has 1 failures [08:05:28] PROBLEM - Translation cache space on mw2198 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:28] PROBLEM - puppet last run on mw2172 is CRITICAL Puppet has 1 failures [08:05:28] PROBLEM - Translation cache space on mw2195 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:37] PROBLEM - Translation cache space on mw1128 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:37] PROBLEM - Translation cache space on mw1137 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:46] PROBLEM - Translation cache space on mw1255 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:46] PROBLEM - Translation cache space on mw2178 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:46] PROBLEM - puppet last run on mw2072 is CRITICAL Puppet has 1 failures [08:05:46] PROBLEM - Translation cache space on mw2207 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:46] PROBLEM - puppet last run on mw2181 is CRITICAL Puppet has 1 failures [08:05:47] PROBLEM - Translation cache space on mw1019 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:48] PROBLEM - Translation cache space on mw1078 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:48] PROBLEM - Translation cache space on mw1103 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:56] PROBLEM - Translation cache space on mw2025 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:56] PROBLEM - Translation cache space on mw2061 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:56] PROBLEM - Translation cache space on mw2120 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:56] PROBLEM - Translation cache space on mw2142 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:05:57] PROBLEM - puppet last run on mw2165 is CRITICAL Puppet has 1 failures [08:06:06] PROBLEM - Translation cache space on mw1018 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:06:07] PROBLEM - Translation cache space on mw1037 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:06:07] PROBLEM - Translation cache space on mw1230 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:06:07] PROBLEM - Translation cache space on mw2211 is CRITICAL: NRPE: Command 
check_hhvm_tc_space not defined [08:06:17] PROBLEM - Translation cache space on mw2138 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:06:17] PROBLEM - Translation cache space on mw2203 is CRITICAL: NRPE: Command check_hhvm_tc_space not defined [08:12:04] <_joe_> that's my fault apparently [08:12:50] (03PS1) 10Giuseppe Lavagetto: hhvm:fix name of the check [puppet] - 10https://gerrit.wikimedia.org/r/211954 [08:13:40] <_joe_> uhm it should have worked... [08:13:55] <_joe_> well, disregard those errors while I get what I did wrong [08:15:41] <_joe_> ah! just puppet ran before I fixed it [08:16:26] RECOVERY - Translation cache space on mw2129 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:16:37] RECOVERY - Translation cache space on mw2070 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:16:47] RECOVERY - Translation cache space on mw2018 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:16:57] RECOVERY - Translation cache space on mw1135 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:16:57] RECOVERY - Translation cache space on mw1194 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:16:57] RECOVERY - Translation cache space on mw1207 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:16:58] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [08:17:07] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [08:17:08] RECOVERY - Translation cache space on mw2195 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:16] RECOVERY - Translation cache space on mw1137 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:17] RECOVERY - Translation cache space on mw1255 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:17] RECOVERY - puppet last run on mw1194 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [08:17:17] RECOVERY - Translation cache space on mw2207 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:17] RECOVERY - puppet last run on mw2070 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [08:17:17] RECOVERY - Translation cache space on mw2060 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:18] RECOVERY - puppet last run on mw1207 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [08:17:27] RECOVERY - Translation cache space on mw1103 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:27] RECOVERY - Translation cache space on mw2120 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:27] RECOVERY - Translation cache space on mw2142 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:27] RECOVERY - puppet last run on mw1255 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:17:28] RECOVERY - puppet last run on mw2195 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [08:17:36] RECOVERY - Translation cache space on mw2047 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:36] RECOVERY - Translation cache space on mw2024 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:37] RECOVERY - Translation cache space on mw1018 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:37] RECOVERY - Translation cache space on mw1037 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:37] RECOVERY - puppet last run on mw1073 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [08:17:37] RECOVERY - Translation cache space on mw1230 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:46] RECOVERY - Translation cache space on mw2211 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:47] RECOVERY - Translation cache space 
on mw1085 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:47] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [08:17:47] RECOVERY - puppet last run on mw1128 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [08:17:47] RECOVERY - Translation cache space on mw2039 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:47] RECOVERY - Translation cache space on mw2111 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:56] RECOVERY - puppet last run on mw1230 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [08:17:56] RECOVERY - puppet last run on mw1137 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [08:17:57] RECOVERY - puppet last run on mw2120 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [08:17:57] RECOVERY - Translation cache space on mw2203 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:57] RECOVERY - puppet last run on mw1199 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [08:17:57] RECOVERY - Translation cache space on mw1101 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:58] RECOVERY - Translation cache space on mw2055 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:58] RECOVERY - Translation cache space on mw2119 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:58] RECOVERY - Translation cache space on mw2052 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:59] RECOVERY - Translation cache space on mw2084 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:17:59] RECOVERY - Translation cache space on mw2049 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:06] RECOVERY - Translation cache space on mw2176 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:06] RECOVERY - Translation cache space on mw1104 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:06] RECOVERY - puppet last run on mw1015 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [08:18:06] RECOVERY - Translation cache space on mw2029 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:07] RECOVERY - Translation cache space on mw2182 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:16] RECOVERY - Translation cache space on mw1020 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:16] RECOVERY - Translation cache space on mw1070 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:17] RECOVERY - Translation cache space on mw1058 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:17] RECOVERY - Translation cache space on mw1047 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:17] RECOVERY - Translation cache space on mw2077 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:17] RECOVERY - Translation cache space on mw1075 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:17] RECOVERY - Translation cache space on mw2131 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:18] RECOVERY - Translation cache space on mw1199 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:18] RECOVERY - puppet last run on mw1037 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:18:19] RECOVERY - puppet last run on mw1103 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:18:27] RECOVERY - Translation cache space on mw1169 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:27] RECOVERY - puppet last run on mw1104 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:18:28] RECOVERY - puppet last run on mw1214 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [08:18:28] RECOVERY - Translation cache space 
on mw2110 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:28] RECOVERY - puppet last run on mw2111 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [08:18:28] RECOVERY - puppet last run on mw2049 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [08:18:33] <_joe_> sorry for the spam [08:18:36] RECOVERY - Translation cache space on mw2165 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:37] RECOVERY - puppet last run on mw1017 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [08:18:37] RECOVERY - puppet last run on mw1047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:18:37] RECOVERY - puppet last run on mw1078 is OK Puppet is currently enabled, last run 39 seconds ago with 0 failures [08:18:37] RECOVERY - puppet last run on mw2142 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [08:18:37] RECOVERY - Translation cache space on mw1073 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:38] RECOVERY - puppet last run on mw1095 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [08:18:38] RECOVERY - Translation cache space on mw1179 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:18:39] RECOVERY - puppet last run on mw2176 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:18:39] RECOVERY - puppet last run on mw2178 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures [08:18:47] RECOVERY - puppet last run on mw2211 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [08:18:47] RECOVERY - puppet last run on mw1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:18:47] RECOVERY - puppet last run on mw2084 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:18:47] RECOVERY - puppet last run on mw2039 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [08:18:47] RECOVERY - puppet last run on mw2052 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [08:18:47] RECOVERY - puppet last run on mw2055 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [08:18:48] RECOVERY - puppet last run on mw2060 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:18:49] RECOVERY - puppet last run on mw2047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:18:49] RECOVERY - puppet last run on mw2029 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [08:18:50] RECOVERY - puppet last run on mw2024 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [08:18:50] RECOVERY - puppet last run on mw2077 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [08:18:51] RECOVERY - puppet last run on mw2182 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:19:07] RECOVERY - puppet last run on mw1075 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:19:07] RECOVERY - Translation cache space on mw1019 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:19:07] RECOVERY - Translation cache space on mw1078 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:19:07] RECOVERY - puppet last run on mw1070 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:19:07] RECOVERY - puppet last run on mw1179 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:19:07] RECOVERY - puppet last run on mw1169 is OK 
Puppet is currently enabled, last run 43 seconds ago with 0 failures [08:19:08] RECOVERY - Translation cache space on mw2025 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:19:08] RECOVERY - Translation cache space on mw2061 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:19:09] RECOVERY - Translation cache space on mw1127 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:19:09] RECOVERY - puppet last run on mw1058 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [08:19:16] RECOVERY - puppet last run on mw2165 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [08:19:16] RECOVERY - puppet last run on mw1184 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [08:19:16] RECOVERY - puppet last run on mw1085 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:19:17] RECOVERY - puppet last run on mw2101 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [08:19:26] RECOVERY - Translation cache space on mw1015 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:19:27] RECOVERY - puppet last run on mw2198 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [08:19:27] RECOVERY - puppet last run on mw1020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:19:36] RECOVERY - puppet last run on mw1083 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [08:19:37] RECOVERY - puppet last run on mw1127 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [08:19:37] RECOVERY - puppet last run on mw1094 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [08:19:37] RECOVERY - Translation cache space on mw2138 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:19:38] RECOVERY - Translation cache space on mw1095 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:19:47] RECOVERY - puppet last run on mw1191 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures [08:19:58] RECOVERY - puppet last run on mw1232 is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [08:20:06] RECOVERY - puppet last run on mw1019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:20:07] RECOVERY - puppet last run on mw2135 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:20:17] RECOVERY - puppet last run on mw1246 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [08:20:17] RECOVERY - puppet last run on mw2119 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:20:18] RECOVERY - puppet last run on mw1196 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [08:20:18] RECOVERY - puppet last run on mw2133 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [08:20:18] RECOVERY - puppet last run on mw1138 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [08:20:18] RECOVERY - puppet last run on mw1136 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:20:27] RECOVERY - puppet last run on mw1182 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:20:31] RECOVERY - puppet last run on mw2035 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:20:31] RECOVERY - puppet last run on mw2061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:20:31] RECOVERY - puppet last run on mw2005 is OK Puppet is currently enabled, last run 0 
seconds ago with 0 failures [08:20:31] RECOVERY - puppet last run on mw2174 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [08:20:37] RECOVERY - puppet last run on mw2098 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures [08:20:37] RECOVERY - puppet last run on mw2172 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:20:37] RECOVERY - puppet last run on mw2025 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:20:46] RECOVERY - puppet last run on mw2181 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [08:20:47] RECOVERY - puppet last run on mw2053 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [08:20:47] RECOVERY - puppet last run on mw1234 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [08:20:47] RECOVERY - puppet last run on mw2177 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:21:07] RECOVERY - puppet last run on mw1035 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [08:21:17] RECOVERY - puppet last run on mw1096 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:21:45] RECOVERY - puppet last run on mw2072 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [08:21:55] RECOVERY - puppet last run on mw1036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:22:44] RECOVERY - puppet last run on mw2202 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:22:55] RECOVERY - puppet last run on mw2037 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [08:25:06] PROBLEM - Translation cache space on mw1053 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 100% [08:25:25] PROBLEM - Translation cache space on mw1143 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 100% [08:25:44] PROBLEM - Translation cache space on mw1012 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94% [08:25:44] PROBLEM - Translation cache space on mw1008 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94% [08:25:45] PROBLEM - Translation cache space on mw1114 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 100% [08:25:45] RECOVERY - HHVM rendering on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 66097 bytes in 0.228 second response time [08:25:55] PROBLEM - Translation cache space on mw1004 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93% [08:26:04] PROBLEM - Translation cache space on mw1002 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93% [08:26:11] <_joe_> these alarms here ^^ mean we need to restart HHVM there [08:26:15] PROBLEM - Translation cache space on mw1044 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 100% [08:26:22] <_joe_> I'm doing it now [08:26:38] <_joe_> !log restarting a few HHVM instances with a full TC space [08:26:43] Logged the message, Master [08:26:45] RECOVERY - Apache HTTP on mw1169 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.107 second response time [08:27:05] PROBLEM - Translation cache space on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[08:27:15] PROBLEM - Translation cache space on mw1007 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 93% [08:27:15] PROBLEM - Translation cache space on mw1013 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 96% [08:27:15] PROBLEM - Translation cache space on mw1003 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94% [08:27:26] PROBLEM - Translation cache space on mw1001 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94% [08:27:26] PROBLEM - Translation cache space on mw1011 is CRITICAL: HHVM_TC_SPACE CRITICAL code.main: 94% [08:27:55] RECOVERY - Translation cache space on mw1017 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:27:55] RECOVERY - Translation cache space on mw1044 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:28:36] RECOVERY - Translation cache space on mw1053 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:28:54] RECOVERY - Translation cache space on mw1143 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:30:54] RECOVERY - Translation cache space on mw1001 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:31:06] RECOVERY - Translation cache space on mw1002 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:32:25] RECOVERY - Translation cache space on mw1003 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:32:45] RECOVERY - Translation cache space on mw1004 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:33:07] 6operations, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1295894 (10fgiunchedi) doh, I completely forgot about the beta varnishtop collector! It'll need to be adjusted to e.g. consider different varnish instances and (later on) timings but otherwise fine [08:33:55] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [08:33:55] RECOVERY - HHVM busy threads on mw1169 is OK Less than 30.00% above the threshold [76.8] [08:34:05] RECOVERY - Translation cache space on mw1007 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:34:14] RECOVERY - HHVM queue size on mw1169 is OK Less than 30.00% above the threshold [10.0] [08:34:15] RECOVERY - Translation cache space on mw1008 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:35:55] RECOVERY - Translation cache space on mw1012 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:35:55] RECOVERY - Translation cache space on mw1114 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:36:04] RECOVERY - Translation cache space on mw1011 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:36:28] 6operations, 10Graphoid, 6Services: Configure Graphoid Logstash Dashboard - https://phabricator.wikimedia.org/T97615#1295910 (10mobrovac) p:5Low>3High a:3mobrovac [08:37:35] RECOVERY - Translation cache space on mw1013 is OK: HHVM_TC_SPACE OK TC sizes are OK [08:46:32] (03PS3) 10Muehlenhoff: Use 3.19 on jessie by default (Bug: T97411) [puppet] - 10https://gerrit.wikimedia.org/r/211688 [08:58:23] (03PS1) 10Mobrovac: service::node: fix logstash port [puppet] - 10https://gerrit.wikimedia.org/r/211955 (https://phabricator.wikimedia.org/T97615) [09:03:03] (03CR) 10Hashar: [C: 031] jenkins,package_builder,labs_bootstrapvz: lint [puppet] - 10https://gerrit.wikimedia.org/r/211350 (owner: 10Dzahn) [09:13:55] (03CR) 10Giuseppe Lavagetto: [C: 032] service::node: fix logstash port [puppet] - 10https://gerrit.wikimedia.org/r/211955 (https://phabricator.wikimedia.org/T97615) (owner: 10Mobrovac) [09:39:40] (03Abandoned) 10Giuseppe Lavagetto: hhvm:fix name of the check [puppet] - 10https://gerrit.wikimedia.org/r/211954 (owner: 10Giuseppe Lavagetto) [09:45:35] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 66.67% of data above 
the critical threshold [35.0] [09:50:13] 6operations, 10Graphoid, 6Services: Configure Graphoid Logstash Dashboard - https://phabricator.wikimedia.org/T97615#1296010 (10mobrovac) [09:54:03] 6operations, 10Graphoid, 6Services: Configure Graphoid Logstash Dashboard - https://phabricator.wikimedia.org/T97615#1296023 (10mobrovac) 5Open>3Resolved The problem was in the wrong logstash port. It's all good now and Graphoid's dashboard can be found at https://logstash.wikimedia.org/#/dashboard/elast... [10:00:57] 6operations, 10Graphoid, 6Services: Configure Graphoid Logstash Dashboard - https://phabricator.wikimedia.org/T97615#1296028 (10mobrovac) >>! In T97615#1294302, @GWicke wrote: > @joe, I see why you went for separate config files, and am okay with that as long as we can integrate that with the regular service... [10:04:55] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [10:10:00] 6operations, 10Graphoid, 6Services: Configure Graphoid Logstash Dashboard - https://phabricator.wikimedia.org/T97615#1296041 (10Joe) @mobrovac I don't see a reason to allow our users to shoot themselves in the foot, but maybe you and @Gwicke see a use-case I don't see. [10:13:14] 6operations, 10Graphoid, 6Services: Configure Graphoid Logstash Dashboard - https://phabricator.wikimedia.org/T97615#1296042 (10Joe) I mean - if someone wants to create a completely custom service installation, the right way to go is to create an independent module and not to use service::node. [10:26:54] 6operations, 10Graphoid, 6Services: Configure Graphoid Logstash Dashboard - https://phabricator.wikimedia.org/T97615#1296054 (10mobrovac) >>! In T97615#1296041, @Joe wrote: > @mobrovac I don't see a reason to allow our users to shoot themselves in the foot, but maybe you and @Gwicke see a use-case I don't se... [10:31:22] (03PS4) 10Muehlenhoff: Use 3.19 on jessie by default (Bug: T97411) [puppet] - 10https://gerrit.wikimedia.org/r/211688 [10:36:14] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [10:44:56] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [10:45:07] (03CR) 10Filippo Giunchedi: [C: 031] "note this is also a partial fix for T94177 (packages not upgraded post-install)" [puppet] - 10https://gerrit.wikimedia.org/r/211688 (owner: 10Muehlenhoff) [11:22:25] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [11:59:12] (03CR) 10Alexandros Kosiaris: [C: 032] Typo: Fix typo in cxserver module [puppet] - 10https://gerrit.wikimedia.org/r/211392 (owner: 10KartikMistry) [12:12:39] mmhh mw1152 is still pushing xhprof metrics? _joe_ ? 
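The "bounce hhvm on mw1152" a bit further down is the same remedy applied earlier when the translation-cache-space alerts fired. A minimal sketch of doing it by hand on a single app server, assuming the service unit is named "hhvm" and the admin server is on port 9002; depooling the host first is left out here:

    # restart HHVM on the affected host, then confirm the fresh process reports a nearly empty TC
    ssh mw1152.eqiad.wmnet 'sudo service hhvm restart'
    ssh mw1152.eqiad.wmnet 'curl -s http://localhost:9002/vm-tcspace'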
[12:17:15] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [12:18:34] <_joe_> godog: no idea about that [12:18:57] <_joe_> godog: "restart hhvm and it will go away" [12:19:19] <_joe_> but maybe something funny's happened to that machine [12:19:36] 23 Matching Service Entries Displayed [12:32:06] (03CR) 10BBlack: [C: 031] Use 3.19 on jessie by default (Bug: T97411) [puppet] - 10https://gerrit.wikimedia.org/r/211688 (owner: 10Muehlenhoff) [12:48:18] 6operations, 7Epic, 10Wikimedia-Mailing-lists: Rename all mailing lists with -l suffixes to get rid of that suffix - https://phabricator.wikimedia.org/T99138#1296159 (10Aklapper) [12:50:33] _joe_: ack [12:50:54] 6operations: Wikis down due to brief AMS-IX outage - https://phabricator.wikimedia.org/T98952#1296162 (10Aklapper) [12:50:57] !log bounce hhvm on mw1152 [12:51:05] Logged the message, Master [12:53:45] 6operations, 10wikitech.wikimedia.org: transient failures of wiki page saves - https://phabricator.wikimedia.org/T98084#1296164 (10Aklapper) >>! In T98084#1278103, @greg wrote: > (please don't close until we can confirm this stays working for more than a day) One week later... any updates? Still happening? @... [12:55:37] 6operations, 6Commons, 10MediaWiki-Database, 7Wikimedia-log-errors: internal_api_error_DBQueryError: Database query error - https://phabricator.wikimedia.org/T98706#1296165 (10Aklapper) [12:56:15] 6operations, 6WMF-Legal, 10Wikimedia-General-or-Unknown: dbtree loads third party resources (from jquery.com) - https://phabricator.wikimedia.org/T96499#1296168 (10Aklapper) [13:00:04] godog, mobrovac: Respected human, time to deploy RESTBase (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150519T1300). Please do the needful. [13:02:45] mobrovac: yt? [13:03:14] godog: yup, gimme 2 mins please [13:03:28] np [13:05:24] godog: ok, ready when you are [13:06:02] i'll disable puppet on restbase100x [13:06:18] mobrovac: kk [13:06:34] then you +2 the config change and then we enable puppet on each node in seq [13:06:44] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1296179 (10Aklapper) @Nemo_bis: See T95184#1244757 - if you think this is a blocker, explain why you think so by replying to my explanation why I do not see it as... [13:07:05] (03PS8) 10Filippo Giunchedi: Enable group1 wikis in RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) (owner: 10GWicke) [13:07:56] 6operations, 7Performance: edits seem to be very slow (due to Redis / jobrunners) - https://phabricator.wikimedia.org/T97930#1296180 (10Aklapper) [13:08:19] godog: ok, puppet disabled, you can go ahead and +2 [13:08:28] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Enable group1 wikis in RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) (owner: 10GWicke) [13:08:58] mobrovac: yep it is merged [13:09:07] !log disabled puppet on restbase100x [13:09:12] Logged the message, Master [13:09:20] enabling on restbase1001 ... [13:09:37] well, running it actually [13:11:50] mobrovac: ok! [13:11:57] no good [13:12:44] need to set cass consistency to 1 temporarily to make it start up [13:13:40] mmhh to create the initial storage (?) 
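A sketch of what "disable puppet on restbase100x" amounts to, using the same host loop that appears further down in this log; as the later messages reveal, puppet agent --disable only takes effect when run as root, which is exactly the pitfall in play here:

    # disable the agent (as root!) on each RESTBase node, leaving a reason for colleagues
    for i in 1 2 3 4 5 6; do
      ssh restbase100${i} "sudo puppet agent --disable 'enabling group1 wikis in RESTBase'"
    done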
[13:16:36] yeah, there were some schema changes [13:16:41] godog: didn't help though [13:16:51] will need to stop the other cassandra nodes apparently [13:17:21] mobrovac: mmhh and then? [13:17:51] and leave one running, and with consistency of 1 it should manage to do the needed changes [13:17:53] right? [13:18:47] ah no, wait [13:19:29] i'll manually erase the new keyspaces, then start one restbase instance with consistency 1 [13:19:32] that should do it [13:19:47] since right now we have the problem: Operation timed out - received only 1 responses. [13:19:58] on the restbase side? [13:20:15] which means consistency of 1 is ignored (as it tried to create with localQuorum prior to that) [13:20:23] yeah, it's a cassandra driver error [13:20:25] PROBLEM - nutcracker port on silver is CRITICAL - Socket timeout after 2 seconds [13:20:42] happens when running select * from "local_group_wikiquote_T_parsoid_html"."meta" where "key" = ? limit 1 [13:20:55] PROBLEM - LVS HTTP IPv4 on restbase.svc.eqiad.wmnet is CRITICAL: Connection refused [13:21:14] mobrovac: mhh so looks like it is down now? [13:21:26] lemme check the otehrs [13:21:26] getting paged, what's up? [13:22:03] paravoid: see above, new restbase/cassandra wikis being added [13:22:14] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [13:22:20] yup, down, will try to log into c* and remove these keyspaces [13:22:39] alright [13:22:41] anything I can help with? [13:22:50] mobrovac: I take it we can rebuild it all if need be? [13:23:00] rebuild? [13:23:02] paravoid: not sure ATM, I'll silence restbase tho [13:23:17] mobrovac: I mean repopulate cassandra [13:23:35] it's not primary data so it should be possible, but last time it took days, no ? [13:23:37] the offending keyspaces are being created, so they're empty [13:23:53] a specific keyspaces, that's better [13:24:22] mobrovac: ok! [13:24:25] RECOVERY - LVS HTTP IPv4 on restbase.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 - 6496 bytes in 0.011 second response time [13:24:27] do we need to revert the group1 thing until this is fixed, or? [13:24:44] <_joe_> bblack: I guess this screwed up cassandra gloriously [13:24:52] <_joe_> or something [13:25:06] well, not gloriously, we still haven't lost data :P [13:25:26] <_joe_> akosiaris: oh I misinterpreted what was said [13:25:30] * akosiaris knocks on wood [13:25:38] <_joe_> but I just woke up, so... [13:25:42] so did I at first [13:26:53] bblack: no I think it is fine creating the new keyspaces and resuming [13:27:01] ok [13:28:09] which node is 10.64.0.220 ? [13:28:13] cass doesn't seem to like it [13:28:25] 1001 [13:28:27] <_joe_> mobrovac: dig -x [13:28:29] k [13:29:13] damn getting t-o even for dropping these stupid keyspaces [13:29:28] cass logs seem ok thouhg [13:30:16] nodetool reports all nodes are UN [13:31:00] yup, I'm also looking at http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad [13:32:53] !log rebooting ulsfo caches (cp40xx - currently depooled from all traffic + downtimed in icinga) [13:32:57] Logged the message, Master [13:34:08] godog: rb is 503, so let's stop cassandra on all nodes but 1001 and continue from there? [13:35:25] mobrovac: and delete new keyspaces and manually create them? 
[13:35:51] ok, i figured out why this is happening [13:36:13] * mobrovac needs to punish himself [13:36:22] puppet ran on all of the nodes afterall [13:36:36] because i didn't run it as root [13:36:37] duh [13:36:38] stupid [13:36:54] <_joe_> mobrovac: ouch [13:37:15] mobrovac: ok, so is it really disabled now though? [13:37:17] <_joe_> mobrovac: is there a way to recover? [13:37:39] ok, let's try this [13:37:47] godog: please revert the patch [13:37:57] we'll let puppet run [13:37:57] mobrovac: ok [13:38:02] and then disable it for real [13:38:13] and run the re-applied patch on only one ndoe [13:38:30] i'll keep the config there so i can run it manually on one node [13:39:08] (03PS1) 10Filippo Giunchedi: Revert "Enable group1 wikis in RESTBase" [puppet] - 10https://gerrit.wikimedia.org/r/211982 [13:39:10] <_joe_> mobrovac: restbase is donw at the moment? [13:39:22] <_joe_> if so, what's the impact? VE broken? [13:39:38] yes, unfortunately [13:39:47] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Revert "Enable group1 wikis in RESTBase" [puppet] - 10https://gerrit.wikimedia.org/r/211982 (owner: 10Filippo Giunchedi) [13:40:10] will run puppet agent on rb nodes now [13:40:11] mobrovac: done [13:40:30] <_joe_> mobrovac: lemme assist in running puppet [13:40:52] <_joe_> godog: puppet-merged? [13:41:00] _joe_: for i in 1 2 3 4 5 6; do ssh restbase100${i} 'sudo puppet agent -tv'; done :) [13:41:02] i'll let you do it [13:41:06] _joe_: yep [13:41:16] ok as long as only one person operates at a time [13:41:42] <_joe_> mobrovac: with salt I run them in parallel [13:41:57] <_joe_> godog: luckily puppet has locking [13:42:13] (03CR) 10Andrew Bogott: [C: 032] contint: Use device=none in tmpfs [puppet] - 10https://gerrit.wikimedia.org/r/204542 (owner: 10Krinkle) [13:42:19] <_joe_> mobrovac: so, run puppet on all nodes, then disable it, right? [13:42:20] k, rb is back up [13:42:40] _joe_: as a general statement, not puppet specifically [13:42:43] _joe_: no need to disable it now, i have the copy of the config i need [13:42:46] hold off that [13:43:17] <_joe_> mobrovac: puppet would overwrite your config/restart restbase upon the next cron exec [13:43:24] <_joe_> which may be.. anytime [13:43:27] _joe_: copied it over :) [13:43:37] to a diff location [13:44:06] <_joe_> ok, just disable puppet anyways where you're operating. [13:47:06] !log done with cp40xx reboot process [13:47:10] Logged the message, Master [13:48:59] gr, still can't remove those damn keyspaces [13:49:11] !log test [13:49:15] Logged the message, Master [13:49:16] on the upside, RB is working [13:49:19] uh, evil. [13:49:38] <_joe_> mobrovac: good to know [13:49:53] <_joe_> Steinsplitter: you've just been logged to the SAL [13:49:55] mobrovac: ok progress [13:50:40] mobrovac: so timeout now while deleting? or error? 
[13:50:52] t-o still [13:51:07] _joe_ : reverted Logsbot :) [13:51:13] <_joe_> oh i love distributed bdynamo stores [13:51:20] godog: OperationTimedOut: errors={}, last_host=10.64.0.220 [13:51:40] i'm on the last_host [13:52:04] <_joe_> mobrovac: try from another :P [13:52:15] did that already [13:52:18] but, wait wait [13:52:20] it seems the keyspace is gone [13:52:26] lemme double-check that [13:54:02] <_joe_> new rule: all WMF's distributed datastores wil now obey the lolCAP theorem [13:55:15] ok, good news, i'm getting t-o's when dropping them, but they seem to be indeed dropped [13:55:15] will let you know when i'm done deleting all of them [13:55:16] (03PS1) 10Andrew Bogott: Use m5-master.eqiad.wmnet for the openstack/labs db server [puppet] - 10https://gerrit.wikimedia.org/r/211987 [13:55:17] <_joe_> mobrovac: apart from VE, other things that might have been affected? [13:55:46] _joe_: possibly ocg and cxserver [13:55:57] <_joe_> mobrovac: ok [13:56:10] <_joe_> because they call parsoid via restbase [13:56:10] <_joe_> right? [13:58:58] yup [13:59:02] mobrovac: please !log too your actions so there's a trail [13:59:46] yeah will do once i remove them all manually [14:09:40] mobrovac: will you write an incident report about this? [14:09:42] PROBLEM - puppet last run on labnet1001 is CRITICAL Puppet last ran 23 hours ago [14:09:50] RECOVERY - puppet last run on labnet1001 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [14:10:04] (03PS1) 10BBlack: repool ulsfo (fiber cut resolved) [dns] - 10https://gerrit.wikimedia.org/r/211989 [14:10:18] (03CR) 10BBlack: [C: 032] repool ulsfo (fiber cut resolved) [dns] - 10https://gerrit.wikimedia.org/r/211989 (owner: 10BBlack) [14:10:22] (03CR) 10Filippo Giunchedi: "note this has been reverted shortly after, operational error (puppet wasn't disabled)" [puppet] - 10https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) (owner: 10GWicke) [14:10:48] of course new column family == many new graphite metrics [14:25:59] mobrovac: what's the status? [14:26:02] <_joe_> godog: of course, or the side effects of a deploy. [14:26:02] gwicke: status: rb is up, but without g1 wikis, i'm removing their keyspaces by hand now and will then start only one rb to re-populate it [14:26:07] yes, but i had a feeling _joe_ was doing one already? [14:26:07] I'll help with the timeline too [14:26:08] mobrovac: okay; lets also make sure that the nodes agree on the schema before re-trying [14:26:08] <_joe_> mobrovac: not until now [14:26:10] ah ok, was just being optimistic [14:26:10] <_joe_> let's wait to have solved this fully :) [14:26:10] gwicke: they do, as all of the nodes have been restarted after the revert [14:26:10] okay [14:26:11] did you end up running puppet on all nodes at once? [14:26:11] !log restbase100x: removed superfluous keyspaces by hand from Cassandra [14:26:11] Logged the message, Master [14:26:15] <_joe_> gwicke: puppet runs happened as scheduled by cron, we ran it all at once to revert and recover rb after that [14:26:17] gwicke: should i put defaultConsistency to one out of precaution? 
[14:26:17] kk; I'm mainly wondering as a single node starting up & migrating the schema ought to work fine now [14:26:18] mobrovac: that shouldn't be needed [14:26:18] k [14:26:19] gwicke: but yeah, that's what (unintentionally) happened [14:26:21] what cassandra doesn't like is many workers on multiple nodes racing to change the schema in overlapping ways [14:26:21] yeah I remember we had this conversation re: a restbase tool to manage schemas as opposed to when it starts up [14:26:21] 6operations, 10wikitech.wikimedia.org: transient failures of wiki page saves - https://phabricator.wikimedia.org/T98084#1296258 (10Eevans) >>! In T98084#1296164, @Aklapper wrote: >>>! In T98084#1278103, @greg wrote: >> (please don't close until we can confirm this stays working for more than a day) > > One we... [14:26:22] !log starting manually RB with group1 wikis enabled on restbase1001 [14:26:22] Logged the message, Master [14:26:22] a single node now starts its workers sequentially, so as long as it's a rolling restart things should be fine in theory [14:26:22] * gwicke wished puppet supported rolling restarts with checks [14:26:23] still creating CFs, so far so good [14:26:26] "restbase listening on port 7231" [14:26:26] yupii [14:26:26] cool [14:26:26] !log restbase group1 wiki keyspaces created [14:26:26] Logged the message, Master [14:26:26] godog: ok, we're now safe to re-enable that patch [14:26:26] (03CR) 10Greg Grossmeier: [C: 031] Add Jan Zerebecki to contint-admins [puppet] - 10https://gerrit.wikimedia.org/r/210692 (https://phabricator.wikimedia.org/T98961) (owner: 10Hashar) [14:26:26] godog: no, wait [14:26:26] <_joe_> mobrovac: and it can run freely? [14:26:26] all rb restarts should really be coordinated [14:26:26] mobrovac: yep I'll wait [14:26:26] actually, yeah, godog, _joe_, we're good now [14:26:27] mobrovac: ok I'll re apply the patch [14:26:27] all of the CFs are in place [14:26:27] <_joe_> gwicke: if so, we can do as we do with apache and use salt [14:26:27] yeah, or ansible ;) [14:26:27] godog: waiiit [14:26:27] <_joe_> gwicke: really? in this moment, after an outage, you come and pitch ansible _again_ [14:26:28] wtf is this? [14:26:29] mobrovac: haha ok [14:26:29] Query 0x4dd0361d6908c6ec88867e0513909fae not prepared on host 10.64.0.221, preparing and retrying [14:26:30] hm, no, ok, getting overparanoid [14:29:18] <_joe_> mobrovac: overparanoid is good when doing operations [14:29:18] _joe_: I'm pitching ssh [14:29:20] well, it's easy to be a general after the battle, especially one that you f***ed up [14:29:20] (i mean myself in this battle) [14:29:20] mobrovac: the logs are looking good so far [14:29:20] yup, monitoring constantly [14:29:20] <_joe_> mobrovac: stop blaming yourself, and if we're GTG I'd re-revert the patch [14:29:20] yeah we're fgood [14:29:20] !log enabled puppet on restbase1001 [14:29:21] (03PS1) 10Giuseppe Lavagetto: Revert "Revert "Enable group1 wikis in RESTBase"" [puppet] - 10https://gerrit.wikimedia.org/r/211993 [14:29:21] Logged the message, Master [14:29:21] <_joe_> mobrovac: can you +1 please? :) [14:29:21] ? [14:29:21] sorry I'm not following all the backscroll here, but we're really ready to go again on this? [14:29:21] (03CR) 10Mobrovac: [C: 031] Revert "Revert "Enable group1 wikis in RESTBase"" [puppet] - 10https://gerrit.wikimedia.org/r/211993 (owner: 10Giuseppe Lavagetto) [14:29:21] bblack: yup [14:29:21] yes, the keyspaces were created by starting up one node at a time [14:29:22] ok [14:29:22] <_joe_> mobrovac: so the procedure now is? 
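For reference, the manual Cassandra cleanup described above boils down to two things: confirming the nodes agree on the schema, and dropping the half-created keyspaces before retrying. A minimal sketch, assuming shell access to one of the restbase hosts; the hostname and cqlsh invocation below are assumptions, not the exact commands that were run:

```bash
# 1. Check the cluster is healthy and agrees on the schema before re-applying
#    the config change (expect a single entry under "Schema versions" and
#    every node reporting UN, Up/Normal).
nodetool describecluster
nodetool status

# 2. Drop a half-created keyspace left over from the racing restarts, then
#    confirm it is gone. The keyspace name is one of those quoted above.
cqlsh restbase1001.eqiad.wmnet \
  -e 'DROP KEYSPACE "local_group_wikiquote_T_parsoid_html";'
cqlsh restbase1001.eqiad.wmnet -e 'DESCRIBE KEYSPACES;'
```

As the log shows, the OperationTimedOut errors are client-side timeouts; the DROP can still complete on the cluster, which is why the keyspaces turned out to be gone despite the errors, so re-check with DESCRIBE rather than blindly retrying.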
I merge, disable puppet, then run puppet one node at a time? [14:29:22] _joe_: preferably, yes [14:29:23] _joe_: it should be fine to let puppet do it now, but the risk is lack of checking and coordination, so users could see request failures [14:29:23] <_joe_> I'm not sure I got what you mean, but ok [14:29:23] hence "preferably" [14:29:23] puppet will happily restart all nodes, even if some already failed [14:29:24] there's no check for availability before proceeding [14:29:24] <_joe_> gwicke: how can I check? via a curl maybe? [14:29:32] (03PS2) 10Giuseppe Lavagetto: Revert "Revert "Enable group1 wikis in RESTBase"" [puppet] - 10https://gerrit.wikimedia.org/r/211993 [14:29:32] a cheap option is connecting to port 9042 [14:29:33] !log temporarily going read-only for virt1000 for database migration [14:29:37] Logged the message, Master [14:29:48] <_joe_> gwicke: doesn't pybal do that already? [14:29:51] (03CR) 10Andrew Bogott: [C: 032] Use m5-master.eqiad.wmnet for the openstack/labs db server [puppet] - 10https://gerrit.wikimedia.org/r/211987 (owner: 10Andrew Bogott) [14:30:04] _joe_: no, pybal doesn't deal with cassandra [14:30:16] <_joe_> oh so the check is for cassandra [14:30:51] oh right, this is RB [14:30:57] in that case, port 7231 [14:31:21] even if you are allergic to it, but here's how that looks in the ansible script I've been using: https://github.com/gwicke/ansible-playground/blob/master/roles/restbase/tasks/check.yml [14:32:15] it's not perfect, should add some actual requests in there too [14:32:16] <_joe_> !log disabled puppet on the eqiad restbase cluster [14:33:00] gwicke: there's ~43G of old (>1month) cassandra metrics, mostly from the time system cf were enabled, the new wikis are creating a lot of metrics so I'd like to purge the old ones [14:33:20] <_joe_> (btw, I'd like to log here that I don't dislike ansible, and you continue to willfully mis-interpret my remarks on this) [14:33:35] (03PS3) 10Giuseppe Lavagetto: Revert "Revert "Enable group1 wikis in RESTBase"" [puppet] - 10https://gerrit.wikimedia.org/r/211993 [14:33:43] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Revert "Revert "Enable group1 wikis in RESTBase"" [puppet] - 10https://gerrit.wikimedia.org/r/211993 (owner: 10Giuseppe Lavagetto) [14:34:15] godog: removing the system cf metrics sounds fine to me [14:34:22] which others are old? [14:34:43] _joe_: sorry if I misinterpreted your reaction [14:34:45] <_joe_> mobrovac: running puppet on restbase1001 [14:35:14] <_joe_> mobrovac: it's a noop there, expected? [14:35:28] euh, lemme check [14:35:40] gwicke: some are renames, like RangeTotalLatency vs RangeTotalLatency.count [14:35:43] <_joe_> mobrovac: sorry my error [14:36:07] _joe_: config.yaml is still the old one there, and RB wasn't restared afaict [14:36:21] <_joe_> mobrovac: yeah I got distracted from puppet-merging [14:36:26] godog: are those renames cassandra-specific? [14:36:50] or is that related to the switch to extended counters? 
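The rolling re-apply agreed on here (merge, disable puppet, run it one node at a time, check the port before moving on) can be scripted. A sketch, assuming the fleet is restbase1001 through 1006 and that RESTBase binding port 7231 is the readiness signal, as noted above (checking 9042 instead would test Cassandra rather than RESTBase):

```bash
#!/bin/bash
# Rolling puppet run across the RESTBase cluster, waiting for the service
# port on each node before moving on. Hostnames and timeouts are assumptions.
for i in 1 2 3 4 5 6; do
    host="restbase100${i}.eqiad.wmnet"
    echo "=== ${host}"
    ssh "${host}" 'sudo puppet agent -tv'
    # Wait until RESTBase accepts connections on 7231 before the next node.
    until timeout 2 bash -c "echo > /dev/tcp/${host}/7231" 2>/dev/null; do
        echo "waiting for ${host}:7231 ..."
        sleep 5
    done
done
```

The ansible check.yml linked by gwicke above serves the same purpose: gate each restart on the node actually coming back up.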
[14:37:08] gwicke: I think the former, cassandra is pushing straight to graphite, not statsd IIRC [14:37:42] ah, good point [14:38:01] maybe those metrics were caught up in a rename script that was really targeted at statsd metrics [14:38:06] _joe_: looking good on rb1001 [14:38:07] <_joe_> mobrovac: I'm on rb1002 atm [14:38:18] godog: +1 for nuking anything that's really old [14:38:26] in the cassandra hierarchy [14:38:27] <_joe_> yeah I verified as gabriel suggested, did a telnet to that port [14:38:52] <_joe_> mobrovac: rb1002 done, and cassandra still responds [14:39:13] _joe_: rb1002 good as well [14:39:16] gwicke: ok thanks, btw on the same topic some metrics don't have parsable values https://phabricator.wikimedia.org/T97024 [14:39:35] PROBLEM - nova-compute process on labvirt1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:39:46] <_joe_> andrewbogott: ^^ [14:39:55] PROBLEM - nova-compute process on labvirt1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [14:40:06] _joe_: ah, a side-effect of a db migration, I will fix [14:40:06] _joe_: in general, once rb binds to 7231, it means start-up went ok [14:40:26] <_joe_> mobrovac: oh, that's even simpler [14:40:59] _joe_: rb1003 ok [14:41:09] <_joe_> mobrovac: 03 done, doing 04, now in loop with a check for the open port [14:41:15] !log purge cassandra system CF metrics from graphite1001 [14:41:19] Logged the message, Master [14:41:25] RECOVERY - nova-compute process on labvirt1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:41:40] godog: if that's coming from all nodes then it would be an issue with the graphite reporter jar, or the combination of that & our graphite version [14:41:44] RECOVERY - nova-compute process on labvirt1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [14:42:01] _joe_: rb1004 ok [14:42:45] <_joe_> rb1005 too [14:42:52] gwicke: yeah I think it is likely the former [14:43:11] !log back to read/write after virt1000 database migration - migration seems ok [14:43:15] Logged the message, Master [14:43:20] _joe_: rb1006 too [14:43:26] <_joe_> mobrovac: {{done}} [14:43:39] yup, thnx _joe_ godog gwicke [14:44:07] mobrovac, _joe_, godog: yay for having all public wikis enabled now! [14:44:22] gwicke: yeah, finally :) [14:44:29] the list at http://rest.wikimedia.org/ is getting rather long [14:44:37] heehe nice indeed, I'll keep an eye on graphite cassandra metrics [14:46:05] so once we have verified that OCG has switched, we can now disable the Parsoid update jobs, which will almost half the load on the Parsoid cluster & indirectly on the PHP API cluster [14:46:42] <_joe_> \o/ [14:46:55] <_joe_> well, we're back to pre-restbase levels at least [14:46:57] <_joe_> :) [14:47:54] actually, a bit below; there are some tricks we like If-Unmodified-Since deduplication that we have implemented in RB, but aren't available with the Parsoid jobs [14:48:04] s/we // [14:48:16] <_joe_> If-UNmodified-Since? [14:48:31] yup, part of the HTTP spec [14:48:47] only update this if it hasn't been updated already [14:49:00] <_joe_> yeah I think it's the second time that I read about it in my life [14:49:07] <_joe_> the first was in the rfc :P [14:49:38] _joe_: you are on a safari, seeing all these rare animals ;) [14:50:03] legoktm, MatmaRex: Ping for SWAT in 10 minutes. [14:50:28] hi. 
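The purge logged above ("purge cassandra system CF metrics from graphite1001") amounts to deleting whisper files in the Cassandra hierarchy that have not been written to in over a month. A sketch, with the whisper path an assumption based on the /var/lib/carbon disk alert; verify the path and the dry run output before deleting anything:

```bash
# Dry run: list stale candidates (no writes in 30+ days) and total their size.
find /var/lib/carbon/whisper/cassandra -name '*.wsp' -mtime +30 | head
find /var/lib/carbon/whisper/cassandra -name '*.wsp' -mtime +30 -print0 \
  | du -ch --files0-from=- | tail -n1

# Delete them, then remove any directories left empty.
find /var/lib/carbon/whisper/cassandra -name '*.wsp' -mtime +30 -delete
find /var/lib/carbon/whisper/cassandra -mindepth 1 -type d -empty -delete
```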
[14:50:29] anomie: pong [14:51:37] (03PS1) 10Andrew Bogott: Move the keystone token cleanup to a keystone manifest. [puppet] - 10https://gerrit.wikimedia.org/r/211994 [14:52:55] PROBLEM - Disk space on graphite2001 is CRITICAL: DISK CRITICAL - free space: /var/lib/carbon 34059 MB (3% inode=99%) [14:53:26] (03CR) 10Jcrespo: [C: 032] Move the keystone token cleanup to a keystone manifest. [puppet] - 10https://gerrit.wikimedia.org/r/211994 (owner: 10Andrew Bogott) [14:53:57] (03PS1) 10Andrew Bogott: No longer include a database server as part of the nova controller. [puppet] - 10https://gerrit.wikimedia.org/r/211995 [14:54:25] 6operations, 10Traffic: Reboot caches for kernel 3.19.6 globally - https://phabricator.wikimedia.org/T96854#1296322 (10BBlack) Various things have been blocking me from getting around to these reboots lately. At this point, all of ulsfo is on the new kernel, as well as cp3030 (in esams) and cp1008 (non-prod t... [14:55:32] _joe_: in hiera / puppet, is there a way to reference a role variable for a $::site if that role isn't applied to the host it's targeted at? [14:55:54] <_joe_> gwicke: can I get a practical example? [14:56:02] (03CR) 10Jcrespo: [C: 032] No longer include a database server as part of the nova controller. [puppet] - 10https://gerrit.wikimedia.org/r/211995 (owner: 10Andrew Bogott) [14:56:21] <_joe_> gwicke: in general, it looks like a constraints violation to me [14:56:34] graphoid defines its host in the graphoid role; I was wondering if I could reference that from RB [14:56:55] rather than copying the value [14:57:29] <_joe_> gwicke: well, if it's common to all eqiad, we should e.g. move that data in eqiad.yaml instead that keeping it specific to that role [14:58:50] <_joe_> gwicke: then, in the rb module you do hiera('grahoid::url') to fetch it [14:59:13] <_joe_> (you can do uglier tricks, but you _don't_ want it [14:59:15] would that work with role/common too? [14:59:39] <_joe_> gwicke: whatever is in role/* is only visible to the servers who apply that role [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, anomie, legoktm, MatmaRex: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150519T1500). [15:00:13] * anomie begins SWAT [15:00:13] <_joe_> so if some hiera variable needs to be accessed by a more vast number of hosts, it should stay in common/ or in $::site/ hierarchies [15:00:25] legoktm: I'll do yours first [15:00:29] _joe_: ok [15:00:39] * ^d unpings himself [15:00:50] <_joe_> and now, coffee :P [15:01:13] _joe_: also, thanks! [15:01:46] yeah i need another round of atomic coffee myself [15:01:58] (03PS1) 10Andrew Bogott: Typo fix in the keystone token cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/211996 [15:02:21] (03CR) 10Andrew Bogott: [C: 032] Typo fix in the keystone token cleanup cron [puppet] - 10https://gerrit.wikimedia.org/r/211996 (owner: 10Andrew Bogott) [15:09:48] (03PS3) 10GWicke: Enable graphoid in labs & production [puppet] - 10https://gerrit.wikimedia.org/r/211758 [15:12:57] (03CR) 10GWicke: [C: 04-1] "Adding -1 to prevent a merge until https://github.com/wikimedia/restbase/pull/247 is deployed." 
[puppet] - 10https://gerrit.wikimedia.org/r/211758 (owner: 10GWicke) [15:14:09] (03PS4) 10GWicke: Enable graphoid in labs & production [puppet] - 10https://gerrit.wikimedia.org/r/211758 [15:14:23] (03CR) 10GWicke: [C: 04-1] Enable graphoid in labs & production [puppet] - 10https://gerrit.wikimedia.org/r/211758 (owner: 10GWicke) [15:15:10] !log anomie Synchronized php-1.26wmf6/includes/registration/ExtensionRegistry.php: SWAT: registration: Don't array_unique() over the queue before loading it [[gerrit:211947] (duration: 00m 12s) [15:15:12] legoktm: ^ Test please [15:15:15] Logged the message, Master [15:15:41] anomie: looks good! CologneBlue is back! [15:16:09] !log anomie Synchronized php-1.26wmf5/includes/registration/ExtensionRegistry.php: SWAT: registration: Don't array_unique() over the queue before loading it [[gerrit:211948] (duration: 00m 12s) [15:16:11] legoktm: ^ Test please [15:16:13] Logged the message, Master [15:16:25] MatmaRex: You're next [15:17:01] anomie: I don't have a wmf5 reproduction case, going to assume since it was fine on wmf6 it'll be ok on wmf5 too [15:17:05] ok [15:17:19] (03CR) 10GWicke: "@mobrovac, I think it would be nice to share production & labs configs more than we do right now, perhaps by supporting a generic yaml con" [puppet] - 10https://gerrit.wikimedia.org/r/211758 (owner: 10GWicke) [15:17:20] yup [15:19:18] _joe_: godog: gwicke: mark: i'm writing an incident report for this RB outage, will send it to ops-l once i'm done [15:19:46] thank you [15:20:01] <_joe_> mobrovac: are you writing it on wikitech? [15:20:06] yup [15:20:17] <_joe_> ok, not everyone knows :) [15:23:42] mobrovac: kk, thanks! [15:26:42] Why so slow, Jenkins? [15:28:22] !log anomie Synchronized php-1.26wmf6/includes/skins/SkinTemplate.php: SWAT: Revert "output mw-content-{ltr,rtl} unconditionally" [[gerrit:211894]] (duration: 00m 13s) [15:28:23] MatmaRex: ^ Test please [15:28:28] Logged the message, Master [15:29:15] looking [15:30:22] 6operations, 7Graphite: audit graphite retention schemas - https://phabricator.wikimedia.org/T96662#1296423 (10fgiunchedi) e.g. dropping 1h retention from 5y to 2y yields ~27% decrease: ``` $ whisper-resize 5MinuteRate.wsp 1m:7d 5m:30d 15m:1y 1h:2y Retrieving all data from the archives Creating new whisper da... [15:30:34] anomie: can you do wmf5? there are no RTL wikis on wmf6 yet, i think [15:30:41] ok [15:31:04] !log anomie Synchronized php-1.26wmf5/includes/skins/SkinTemplate.php: SWAT: Revert "output mw-content-{ltr,rtl} unconditionally" [[gerrit:211893]] (duration: 00m 12s) [15:31:05] MatmaRex: ^ Test please [15:31:07] Logged the message, Master [15:32:45] (hold on) [15:33:03] (03CR) 10Gilles: [C: 031] Removed "refreshLinks" from $wgJobBackoffThrottling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211930 (owner: 10Aaron Schulz) [15:34:36] anomie: are you sure it's deployed? https://fa.wikipedia.org/w/index.php?title=%D9%85%D8%AF%DB%8C%D8%A7%D9%88%DB%8C%DA%A9%DB%8C:Common.css&action=history still displays wrong (LTR rather than RTL, check it without JS enabled because there's a local JS hack to fix it) [15:34:47] anomie: Special:Version https://fa.wikipedia.org/wiki/ویژه:نسخه says it's running the older version [15:35:15] (or, like, a really old version. from a few days ago) [15:35:34] MatmaRex: What does "wrong" look like? [15:36:35] MatmaRex: At that link I'm not seeing any "mw-content-rtl" or "mw-content-ltr" in the HTML. [15:36:37] oh, wait. seems i was not logged in, and it's cached. 
false alarm [15:36:56] anomie: wrong looked like this: http://i.imgur.com/kSGPUaz.png [15:37:18] correct is this: http://i.imgur.com/6HMxroG.png [15:37:39] thanks! all good :) [15:38:50] gwicke mobrovac new cassandra graphite metrics are still being created btw, anything we can remove? just the three test machines together are 100G... [15:41:19] godog: I'm not aware of anything old that we could remove at this point [15:41:48] how much space is left at this point, and how long do we have to hold out until we have more space? [15:43:47] ~180G in total on graphite1001, I'm working towards putting graphite1002 in service (basically graphite on jessie) and move the heaviest hitters there, there's 1.7TB [15:44:55] I was also looking at reducing retention from 5yr to 2yr, https://phabricator.wikimedia.org/T96662 [15:44:59] kk, that should be good for a year or so ;) [15:45:44] SSDs are pretty cheap these days [15:46:24] did you just volunteer to own graphite? :) [15:46:36] hehe, no [15:47:06] just saying that it might be better to add some bigger SSDs in that box before we commission it [15:47:47] heh, anyways what if I drop from 5yr to 2yr now on the cassandra metrics, that'd be ~27% less space used [15:48:28] gwicke: ^ [15:50:23] it's better than no metrics, we'd certainly survive [15:50:48] cassandra would have to survive more than 2yr too at that point :) [15:50:50] !log anomie Synchronized php-1.26wmf6/extensions/AbuseFilter/: SWAT: Fix boolean response in API action=abusefiltercheckmatch [[gerrit:211744]] (duration: 00m 10s) [15:50:52] is this to address the short-term shortage, or something you'd like to do in general? [15:50:55] Logged the message, Master [15:50:59] Yay, works. [15:51:24] gwicke: short-term now, mostly on the heavy hitters [15:51:34] godog: yeah, and C* versions will probably be bumped in that period too; but, that can sometimes be an interesting thing to compare [15:51:35] !log anomie Synchronized php-1.26wmf5/extensions/AbuseFilter/: SWAT: Fix boolean response in API action=abusefiltercheckmatch [[gerrit:211743]] (duration: 00m 12s) [15:51:38] Logged the message, Master [15:51:40] Yay, works. [15:51:45] * anomie is done with SWAT [15:52:08] godog: will we lose data in the process, and can we reverse it later? [15:52:39] gwicke: yeah basically we tune how long each archive (1h aggregation) lasts, so since no metric has data for >2yr anyway there's no loss [15:53:00] and for the same reason that's reversable [15:53:35] kk, that sounds like it has basically no short-term downsides [15:53:54] if so, +1 from me [15:54:58] 6operations, 7Graphite: audit graphite retention schemas - https://phabricator.wikimedia.org/T96662#1296498 (10fgiunchedi) from irc, given the space shortage and the fact that cassandra metrics are heavy hitters on graphite we can reduce retention for those only for now cc @gwicke [15:55:02] gwicke: kk thanks, related is ^ [15:55:27] godog: re graphite1002, do you plan to migrate all stats over to it eventually? [15:55:37] or are you aiming for a multi-host setup? 
[15:56:53] gwicke: more inclined for the latter, there's two/three heavy hitters on metrics but the rest is fine [15:59:22] 7Puppet, 6Reading-Infrastructure-Team, 6Release-Engineering, 5Patch-For-Review, 15User-Bd808-Test: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1296507 (10bd808) [16:00:24] from a user perspective a clustered graphite might be nicer, but if it's two or three apps only on a different host then it should still be doable [16:01:13] just need to parametrize the host etc in alerts and dashboards [16:02:33] I think the graphite frontend can ask multiple machines, anyways [16:14:00] gwicke: looks like creation is slowing down, I'm not going ahead with reducing retention at least for now [16:15:20] mobrovac: also being able to manage rb keyspaces from the command line would be a good action out of the incident report [16:15:59] e.g. in this case we would have run rb with a new config and create the keyspaces "offline" without the restart dance [16:17:24] godog: yeah, perhaps even a separate tool that could read the new config and apply cassandra stuff as necessary [16:18:11] yep, basically exposing a little bit the cassandra model to an operator [16:18:32] which I think rb has to have already anyway [16:18:38] cool, i'll add that [16:19:09] mobrovac: nice, thanks! [16:30:12] 6operations, 10Datasets-General-or-Unknown: snaphot1004 running dumps very slowly, investigate - https://phabricator.wikimedia.org/T98585#1296628 (10DCDuring) Do the partial-dump wikis in the queue have the place indicated or will they jump before those that have had full dumps more recent than their last full... [16:39:19] jynus: your GPG key is 1024D, you should issue a new one [16:41:36] paravoid, yep, that was ok some time ago [16:45:07] bd808: what do you think about you and i meeting for 20 minutes each week? i'll set it up if you're game [16:45:57] dr0ptp4kt: Sure. [16:50:51] godog, mobrovac: starting a second rb instance basically does that [16:51:17] ah right [16:52:50] basically [16:53:40] s/basically/exactly/ ;) [16:54:12] if cp3019 icinga alerts show up here, it's because my first iteration of a script had a screwup. ignore them! :) [16:54:25] we also just discussed disabling puppet's auto-restart for RB [16:54:40] 6operations, 6Labs, 10hardware-requests: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1296709 (10Andrew) ok, I am back to thinking this is the right way forward. So, Rob, over to you. [16:55:43] (03PS1) 10Anomie: Add ApiFeatureUsage in production and enable on group0 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212014 (https://phabricator.wikimedia.org/T1272) [16:56:44] gwicke: doing the patch as we speak [16:59:09] (03PS1) 10Mobrovac: Disable RESTBase restarts by Puppet [puppet] - 10https://gerrit.wikimedia.org/r/212016 [16:59:21] gwicke: godog: there ^^ [17:00:06] !log automated reboots of esams/eqiad non-upload caches starting (should auto-downtime, should be no real impact)... [17:00:13] Logged the message, Master [17:01:21] RobH: Respected human, time to deploy Mailman Maintainance (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150519T1700). Please do the needful. 
[17:01:46] PROBLEM - Host labcontrol1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:02:48] (03PS1) 10Florianschmidtwelzow: Disable WikiGrok in WMF production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212017 (https://phabricator.wikimedia.org/T98142) [17:03:12] 6operations, 7HHVM: mw1123 has defunct unkillable hhvm process - https://phabricator.wikimedia.org/T99594#1296741 (10Dzahn) root@mw1123:~# service hhvm start start: Job is already running: hhvm but the only thing running is: www-data 30342 12.5 0.0 0 0 ? Zsl May18 132:06 [hhvm] a... [17:04:02] (03CR) 10Mobrovac: "I agree @GWicke. My main concern when creating these separate configs was the list of domains served by prod/beta. The alternative to the " [puppet] - 10https://gerrit.wikimedia.org/r/211758 (owner: 10GWicke) [17:04:14] RECOVERY - Host labcontrol1001 is UPING OK - Packet loss = 0%, RTA = 2.51 ms [17:04:27] (03CR) 10Filippo Giunchedi: "I'd clarify the first line with 'on config change' too, the reason being that puppet will still restart rb if it is down (service is ensur" [puppet] - 10https://gerrit.wikimedia.org/r/212016 (owner: 10Mobrovac) [17:05:24] !log starting mailman downtime window to scrub content off list archive per T99098 [17:05:28] Logged the message, Master [17:06:09] [stopping mailman process basically while web is still up] [17:06:11] !log puppet stopped on sodium (dont need it restarting mailman while im working) [17:06:15] Logged the message, Master [17:06:28] (03CR) 10Kaldari: [C: 031] Disable WikiGrok in WMF production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212017 (https://phabricator.wikimedia.org/T98142) (owner: 10Florianschmidtwelzow) [17:06:34] yea i didnt think we needed to kill web access [17:06:43] JohnLewis: correct me if you disagree =] [17:07:07] I don't disagree, I agree with it :p [17:07:19] 6operations, 7HHVM: mw1123 has defunct unkillable hhvm process - https://phabricator.wikimedia.org/T99594#1296748 (10BBlack) The reason the process is unkillable and untraceable is Horrible Things Happened. From dmesg: ``` [13354952.662343] systemd-udevd[1634]: starting version 204 [13358051.779364] init: hh... [17:08:05] PROBLEM - salt-minion processes on labcontrol1001 is CRITICAL: Connection refused by host [17:08:06] PROBLEM - dhclient process on labcontrol1001 is CRITICAL: Connection refused by host [17:08:16] PROBLEM - configured eth on labcontrol1001 is CRITICAL: Connection refused by host [17:08:25] PROBLEM - RAID on labcontrol1001 is CRITICAL: Connection refused by host [17:08:26] PROBLEM - DPKG on labcontrol1001 is CRITICAL: Connection refused by host [17:08:44] PROBLEM - Disk space on labcontrol1001 is CRITICAL: Connection refused by host [17:08:44] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: Connection refused by host [17:09:04] (03PS2) 10Mobrovac: Disable RESTBase restarts by Puppet on config change [puppet] - 10https://gerrit.wikimedia.org/r/212016 [17:09:19] what's up with labcontrol1001? [17:09:31] is this known ongoing work, or random $outage? 
[17:09:44] PROBLEM - mailman_ctl on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/mailmanctl [17:09:45] PROBLEM - mailman_qrunner on sodium is CRITICAL: PROCS CRITICAL: 0 processes with UID = 38 (list), regex args /mailman/bin/qrunner [17:09:54] (03CR) 10GWicke: [C: 031] Disable RESTBase restarts by Puppet on config change [puppet] - 10https://gerrit.wikimedia.org/r/212016 (owner: 10Mobrovac) [17:13:48] argh, painful mbox hacking [17:13:52] * robh hacks away [17:14:53] andrewbogott: ping [17:15:02] Coren: ping? [17:15:32] I think labcontrol1001 rebooted, no idea why. I also can't log into the machine to look at it (root@ and bblack@ ask for passwords). [17:15:41] don't see any downtime/SAL stuff... [17:19:55] PROBLEM - Host labcontrol1001 is DOWN: PING CRITICAL - Packet loss = 100% [17:20:16] https://phabricator.wikimedia.org/T96048 <- seems to be setup ticket for relatively-new labcontrol1001, but looks like it was finished back on Apr 28th for basics? [17:22:00] in any case, nobody seems to be screaming about fallout, so I'll assume for now it's a new host not in real use yet, and that it's rebooting for some good reason... [17:22:25] RECOVERY - Host labcontrol1001 is UPING OK - Packet loss = 0%, RTA = 1.50 ms [17:25:18] (03PS2) 10Ottomata: Adjust Hadoop memory settings [puppet] - 10https://gerrit.wikimedia.org/r/209642 [17:26:14] (03CR) 10Ottomata: [C: 032 V: 032] "We will apply this setting and keep an eye on jobs and load." [puppet] - 10https://gerrit.wikimedia.org/r/209642 (owner: 10Ottomata) [17:26:18] bblack: i think it was replacing virt0 and stuff but dunno [17:26:22] andrewbogott: ^ [17:26:35] bblack: I’m reimaging labcontrol1001, it’s not used for anything at the moment. [17:26:35] (labcontrol1001 reboot?) [17:26:37] cool [17:26:49] I’m surprised it threw alerts! It looked to me like there was nothing there [17:27:13] I couldn’t log in, couldn’t connect using the new_install key, I figured it was blank. Sorry for the alerts. [17:28:13] 6operations, 10Traffic: Reboot caches for kernel 3.19.6 globally - https://phabricator.wikimedia.org/T96854#1296800 (10BBlack) For future reference, this is what I'm doing now for the non-upload caches: ``` $ for h in `cat rebooters`; do hs=${h%.*.wmnet}; echo ======================; echo === $hs @ $(date) ==... [17:28:27] bblack: sorry about the noise [17:28:54] np [17:29:11] .... ok, rebuilding the wiki-research-l archives... [17:29:17] what i thought was the rebuild was the index. [17:29:25] JohnLewis: you were right, it was too fast! [17:29:56] :p [17:30:00] I think we'll still be within the window... in 2009 already. [17:30:28] andrewbogott: so, not sure if jynus told you already; that labnet1001 issue was fixed [17:30:37] andrewbogott: that said, I'm unsure why labnet1001 is on the labs host subnet [17:30:37] robh: don't forget though, there are other tasks (well one requiring a rebuild) [17:30:57] paravoid: yes! We migrated to the new db already, everything is working great. [17:31:07] JohnLewis: reviewing the other open tasks now whiel this runs [17:31:12] kk [17:31:52] the rename has to rebiuld right? [17:32:01] https://meta.wikimedia.org/wiki/Special:CentralAuth/Mistymoonlight something broken :O [17:32:04] paravoid: everything that every VM does is bridged through labnet1001, seems like it needs to be in the same subnet for that? 
[17:32:08] robh: yeah [17:32:15] But maybe I should open a ticket about auditing all of our vlan decisions for labs [17:32:20] Exception encountered, of type "Exception" [17:32:42] robh: may be best to do that next as the other tasks can be done quickly and at the end of a window/without the window still on-going [17:32:49] yep, agreed [17:32:53] that CA exception is pretty awesome, wtf? [17:32:59] Steinsplitter: looking [17:33:03] robh: see https://gerrit.wikimedia.org/r/#/c/211047/ as well as a puppet merge is needed [17:33:03] no 5xx, just whole content replaced with an exception? :) [17:33:30] legoktm: thx (after rename, maybe a jobqueue/runner lag) [17:33:30] bblack: idk, MediaWiki doesn't seem to be catching them properly, and that's how HHVM renders it [17:33:49] well, that's very very broken behavior. You know we cache 200s [17:34:13] (03CR) 10Dzahn: "is this missing an approval or good to go?" [puppet] - 10https://gerrit.wikimedia.org/r/210692 (https://phabricator.wikimedia.org/T98961) (owner: 10Hashar) [17:34:46] !log starting reboots of analytics worker nodes in order to enable hyperthreading Bug: https://phabricator.wikimedia.org/T90640 [17:34:51] Logged the message, Master [17:35:57] Steinsplitter: hmm, they didnt' get renamed on wikidata for some reason [17:36:55] PROBLEM - NTP on cp1043 is CRITICAL: NTP CRITICAL: Offset unknown [17:37:48] Steinsplitter: um, I think this is some crazy race condition [17:38:14] wikidata account was created at 17:19:22 [17:38:24] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to contint-admins for Jan Zerebecki - https://phabricator.wikimedia.org/T98961#1296825 (10Dzahn) Checked yesterday's meeting etherpad, the access requests section, i see 2 other requests but this one wasn't mentioned. Shouldn't it have... 
[17:38:31] log entry for rename is 2015-05-19T17:19:22 [17:38:49] wow [17:39:20] legoktm: oh, ok [17:41:11] Steinsplitter: https://phabricator.wikimedia.org/T99688?workflow=create [17:41:53] thx [17:44:44] RECOVERY - Disk space on labcontrol1001 is OK: DISK OK [17:44:45] !log mailing lists still down, scrubbing list archives is painful and error prone [17:44:52] Logged the message, Master [17:45:56] RECOVERY - dhclient process on labcontrol1001 is OK: PROCS OK: 0 processes with command name dhclient [17:46:05] RECOVERY - configured eth on labcontrol1001 is OK - interfaces up [17:46:05] RECOVERY - RAID on labcontrol1001 is OK: NRPE: Unable to read output [17:46:05] RECOVERY - DPKG on labcontrol1001 is OK: All packages OK [17:46:25] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:46:58] (03PS1) 10Florianschmidtwelzow: Enable alternate and canonical links for mobile/desktop pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212022 (https://phabricator.wikimedia.org/T99587) [17:47:35] RECOVERY - salt-minion processes on labcontrol1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:47:35] RECOVERY - NTP on cp1043 is OK: NTP OK: Offset 0.002826929092 secs [17:48:19] andrewbogott: i think i may have finished the install on that before handoff, if so you'll have to clear out old puppet/salt keys and stuffs [17:48:23] (you may have already done so) [17:48:47] robh: yeah, looks like it was ready to go and I duplicated your work… [17:48:54] it’s back now in any case [17:48:57] something is busted with mailman - members of a private mailing list keep getting bounced out when they try to email within the list. anyone around know about it or should I file a phab ticket? [17:49:05] manybubbles: its in downtime [17:49:06] planned maint [17:49:09] (03CR) 10Florianschmidtwelzow: [C: 04-1] "needs some clarification first, see bug" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212022 (https://phabricator.wikimedia.org/T99587) (owner: 10Florianschmidtwelzow) [17:49:15] an email was sent and its on the deploy calendar [17:49:20] well, deploy wikitech page [17:49:29] robh: so it bounces back emails with a totally wrong message? [17:49:37] it shouldnt do that, they should just sit and not hit [17:49:40] =/ [17:50:09] manybubbles: you can file a ticket and kick to me, and we can ensure its workign again post maint, but seems odd to reject rather than just sit [17:50:29] my understanding is they just hit our mail server, and it fails to deliver so it waits [17:50:41] so yea, task with details pls =] [17:50:43] robh: its been doing it every once in a while for weeks. I'll file a ticket [17:50:47] oh [17:50:56] thats likely unrelated, but a problem [17:51:05] right now though it wont work ;D [17:51:18] manybubbles: over a few weeks? definitely an issue then. 
if you can attach the bounce back they get - that'll help even more [17:51:29] got it [17:52:14] wiki-research-l rebuild (2nd attempt) is in 2012-july =P [17:52:48] +1 [17:52:58] (03CR) 10Dzahn: [C: 031] transparency - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/211394 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [17:54:01] (03PS2) 10BBlack: transparency - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/211394 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [17:54:16] (03CR) 10BBlack: [C: 032] transparency - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/211394 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [17:54:25] (03CR) 10BBlack: [V: 032] transparency - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/211394 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [17:54:33] sighhhh, cmjohnson1, yt? [17:54:42] yes, I am here whats up? [17:54:53] i just rebootted analytics1028 with hyperthreading on. [17:54:53] breaking the HP's? [17:54:58] not sure what it's doing now [17:55:07] not coming back up, no output on console [17:55:09] okay..lemme go look [17:55:10] can't exit console [17:55:15] can log into new console [17:55:18] i bet it's the power firmware..i have an updat [17:55:22] oo? [17:55:33] i'll be rebotting a bunch of these, should I try to apply that as I do? [17:55:41] yeah..in fact when I get back I wanna update all the 720's [17:55:45] especially yours [17:55:53] do you have to do it from the DC? [17:56:16] it's an ISO ..so you can tftp if you want [17:56:22] 6operations, 10Wikimedia-Mailing-lists: Mailman rejecting emails by members of private mailing list - https://phabricator.wikimedia.org/T99690#1296893 (10JohnLewis) [17:56:24] hrmm [17:56:26] I can add it to my home dir on iron if you want it [17:56:28] root@sodium:/usr/lib/mailman/bin# /etc/init.d/mailman start [17:56:28] close failed in file object destructor: [17:56:28] Error in sys.excepthook: [17:56:30] Original exception was: [17:56:31] if you give me the commands to run i'll do it as I reboot them [17:56:32] * Starting Mailman master qrunner mailmanctl [ OK ] [17:56:33] ja that would be cool [17:56:34] I get that restarting mailman... [17:56:40] so it looks like it restarts, but throws error during [17:56:52] and its totally running [17:56:59] ps aux shows a bunch of qrunners [17:57:02] i'm doing 1028-1041 now [17:57:10] JohnLewis: ^ it looks like it still renumbered the archives =[ [17:57:11] robh: it's running? [17:57:20] well, there are a ton olist 17885 0.0 0.3 64292 29120 ? S 17:35 0:00 /usr/bin/python /var/lib/mailman/bin/qrunner --runner=BounceRunner:0:1 -s [17:57:24] of those [17:57:36] thats mailman firing off various items right? [17:57:53] arch/command/bounce runners [17:57:54] robh: yeah it's running + they didn't re-renumber which makes me thing someone did a no-no in the past [17:57:58] (which is the issue_ [17:58:20] i think if we restore the mbox and rebuild [17:58:22] same issue [17:58:33] this is someone before me broke something [17:59:38] cmjohnson1: i'm going to walk home from this cafe, back on shortly, will ping you when I get back online to ask about an28, k? 
[17:59:46] ok [17:59:53] JohnLewis: Well, the info is scrubbed, and I'm inclined to leave it as such, but renumbering is painful [18:00:02] and im not sure about the error when mailman restarts [18:00:28] and mail delivery is less than one minute to a list [18:00:34] robh: 1. we can't fix it now really 2. need associated logs to debug so I can't :) [18:00:39] 6operations, 10Wikimedia-Mailing-lists: Mailman rejecting emails by members of private mailing list - https://phabricator.wikimedia.org/T99690#1296906 (10Manybubbles) Here is a mail I sent: ```... [18:00:49] JohnLewis: So, on to the rename? [18:01:02] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1296907 (10Dzahn) [18:01:09] robh: I'm giving a +1 to it so your call [18:01:21] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150519T1800). Please do the needful. [18:01:25] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1289436 (10Dzahn) Added dba and database tags re: providing the static snapshot [18:02:09] manybubbles: I see the issue! [18:02:19] that was quick! [18:02:21] JohnLewis: you've created the new list part already? or no steps done yet? [18:02:24] (well, I don't directly see it but... I can guess the issue :) ) [18:02:31] k [18:02:35] robh: the wikidata list needs to be deleted/disabled [18:02:47] robh: or! [18:03:09] ahh, it was already used, got it [18:03:10] cp the directory to another one and then reuse the wikidata directory existing [18:03:34] 7Blocked-on-Operations, 6Commons, 10Wikimedia-Site-requests: Add *.wmflabs.org to `wgCopyUploadsDomains` - https://phabricator.wikimedia.org/T78167#1296915 (10Steinsplitter) 5stalled>3Open [18:04:10] JohnLewis: hrmm, seems lots of folks hate the idea of renaming this list as the test =P [18:04:23] but cp over seems legit and easy enough [18:04:37] manybubbles: well, it can be multiple issues. as I don't have access to the site password, I either can go through a bunch of possible issues or you can change the admin password to something dummy-ish and send me and I'll fix it? either works [18:04:37] and your on record saying it shouldnt break their filters, sicne we keep the list id the same [18:05:04] JohnLewis: I can change the admin password - sure [18:05:11] JohnLewis: cp the exiting wikidata (non used list) directory for what reason, backup? [18:05:20] robh: yeah backup [18:05:30] its only like 6 or 7 messages anyway [18:06:26] (03PS3) 10Dzahn: admin: ebernhardson for elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/210250 (https://phabricator.wikimedia.org/T98766) [18:06:54] done, and the next is the cp of wikidata-l into the wikidata directory, as long as you have the old list up to change the name when we finish? 
[18:07:07] (03CR) 10Manybubbles: [C: 031] admin: ebernhardson for elasticsearch-roots [puppet] - 10https://gerrit.wikimedia.org/r/210250 (https://phabricator.wikimedia.org/T98766) (owner: 10Dzahn) [18:07:24] (03CR) 10Dzahn: [C: 032] "https://phabricator.wikimedia.org/T98766#1294377" [puppet] - 10https://gerrit.wikimedia.org/r/210250 (https://phabricator.wikimedia.org/T98766) (owner: 10Dzahn) [18:07:42] robh: the on-wiki instructions for renaming seem solid so follow them step by step [18:08:16] (03PS3) 10BBlack: noc - redirect HTTP to HTTPS; enable HSTS 7 days [puppet] - 10https://gerrit.wikimedia.org/r/199515 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:08:38] ebernhardson: ^ got elastic root [18:09:11] (03CR) 10BBlack: [C: 032 V: 032] "I looked at check_http behavior in this case, and it seems to be ok with a 301. I also did some varnishlog-watching on the misc cluster, " [puppet] - 10https://gerrit.wikimedia.org/r/199515 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:09:18] rephrase: when im done, the 'wikidata' list has to have the real name of wikidata or wikidata-l? [18:09:47] it has been the goal to remove the -l usually [18:09:55] yes... but the isntructions arent clear [18:10:00] and i dont want to break shit [18:10:04] https://wikitech.wikimedia.org/wiki/Lists.wikimedia.org#Rename_a_mailing_list [18:10:22] before the next step be prepared to change the "real_name" value of the list in the web ui, but don't send it yet. have the mailman master pass ready. [18:10:25] step 3 [18:10:40] the old list has real name wikidata-l [18:10:43] the new list has name wikidata [18:10:56] now, this makes a big deal about the name migrating, but we dont want to do that right? [18:11:05] then i dont get what this is for, other than to export the member list for folks? [18:11:17] (it even says to leave old web archive in place?) [18:11:33] robh: copy the pck conf files over to the new list location (wikidata), done that? [18:11:43] No, because im not comfortable with all the steps [18:11:46] im still askign about step 3 [18:12:20] so the real name is rewritten by the pck copy [18:12:24] yes [18:12:27] and we just have to quicly name it back to wikidata? [18:12:40] so it doesnt break shit, then we disable the old list after we copy the archives [18:12:42] when you move the config.pck over, wikidata-l will be called wikidata-l and wikidata will be called wikidata-l [18:12:47] ok [18:12:54] after you've moved it, wikidata should be called wikidata [18:13:06] you are creating a new list with the new name [18:13:15] then you copy the config from the old list over to the new list [18:13:29] mutante: thank you [18:13:35] and doing that will also change the name back to old, as it's included in the config you copied [18:13:37] (03PS1) 1020after4: Group1 wikis to 1.26wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212024 [18:13:42] yea, got it. [18:13:46] so you want to change that back quickly to the right new name [18:13:49] mutante: you are on time delay ;D [18:13:57] ok [18:14:26] robh: I just received a mailing list email [18:14:36] stop the mailman process please :) [18:14:41] yea, its back up for now, i'll have to take it back down before i copy [18:14:45] im not copying things yet! 
[18:14:53] (figured i'd let it catch up for a short while) [18:14:55] that was a gentle reminder :p [18:14:58] =] [18:15:00] no worries [18:15:01] and good idea actually [18:16:34] Ok, I think I get what I have to change around now [18:16:44] * robh just wanted to pull up each directory, make backups, understand what he was doing [18:17:02] !log stopping mailman again for further planned work T99098 [18:17:06] Logged the message, Master [18:17:25] robh: okay :) [18:17:45] ebernhardson: and.. puppet created it on elastic1001 , the others will follow automatically [18:18:46] cmjohnson1: back! [18:18:50] any thing? [18:19:27] ok, copy of config done and realname changed back [18:19:28] (03PS2) 1020after4: Group1 wikis to 1.26wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212024 [18:19:31] going to copy of the archives now
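For the record, the rename being walked through here (per the wikitech "Rename a mailing list" page) looks roughly like the following with mailman stopped. Every path and command below is an assumption based on the Debian Mailman 2 layout visible in the log (/usr/lib/mailman, /var/lib/mailman), not the exact commands run; the wikitech page is authoritative.

```bash
cd /var/lib/mailman

# Back up the (nearly empty) pre-existing "wikidata" list before reusing it.
cp -a lists/wikidata lists/wikidata.pre-rename.bak

# Copy the old list's configuration onto the new list. This also copies
# real_name=wikidata-l, so flip real_name back to "Wikidata" in the web
# admin UI (General Options) right afterwards, as discussed above.
cp -a lists/wikidata-l/config.pck lists/wikidata/config.pck

# Move the raw archive mbox over and rebuild the HTML archive for the new name.
mkdir -p archives/private/wikidata.mbox
cp -a archives/private/wikidata-l.mbox/wikidata-l.mbox \
      archives/private/wikidata.mbox/wikidata.mbox
/usr/lib/mailman/bin/arch --wipe wikidata archives/private/wikidata.mbox/wikidata.mbox

# Fix ownership/permissions on anything that was copied as root.
/usr/lib/mailman/bin/check_perms -f
```

The old wikidata-l list and its web archive stay in place until the switch is verified, which matches the plan above of only disabling the old list after the archives have been copied.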