[00:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150730T0000).
[00:04:06] operations, Monitoring: Collect and report nutcracker statistics to Ganglia and/or Graphite - https://phabricator.wikimedia.org/T107381#1493632 (ori) NEW
[00:04:39] (CR) Dzahn: "logstash.wikimedia.org listens on 80 but shouldn't that be allowed from anywhere, not just from INTERNAL. kibana is on 9200 but only liste" [puppet] - https://gerrit.wikimedia.org/r/227197 (https://phabricator.wikimedia.org/T104939) (owner: Muehlenhoff)
[00:06:33] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[00:12:40] (CR) Dzahn: [C: 1] "@Ottomata adding rules like this won't break anything in labs but also won't do anything unless base::firewall is applied to instances" [puppet] - https://gerrit.wikimedia.org/r/226071 (owner: Muehlenhoff)
[00:14:00] (CR) Dzahn: "why is gmond related to mariadb? should probably come from a ganglia class instead of this, no?" [puppet] - https://gerrit.wikimedia.org/r/226267 (https://phabricator.wikimedia.org/T104699) (owner: Muehlenhoff)
[00:17:49] (PS1) Yuvipanda: labstore: Make create-dbusers create users for users too [puppet] - https://gerrit.wikimedia.org/r/227915 (https://phabricator.wikimedia.org/T104453)
[00:18:32] (CR) Ottomata: "Ah, don't review yet. I'm going to make this diff against the old debian/ dir, so that we can only review the changes." [debs/kafka] (debian) - https://gerrit.wikimedia.org/r/227768 (https://phabricator.wikimedia.org/T106581) (owner: Ottomata)
[00:20:12] (CR) BryanDavis: "> logstash.wikimedia.org listens on 80 but shouldn't that be allowed from anywhere, not just from INTERNAL" [puppet] - https://gerrit.wikimedia.org/r/227197 (https://phabricator.wikimedia.org/T104939) (owner: Muehlenhoff)
[00:21:43] (CR) Dzahn: "easy enough, confirmed on labsdb1001. there's only the standard mysql port" [puppet] - https://gerrit.wikimedia.org/r/226068 (https://phabricator.wikimedia.org/T104699) (owner: Muehlenhoff)
[00:21:49] (PS2) Dzahn: Add ferm rules for mariadb labsdb [puppet] - https://gerrit.wikimedia.org/r/226068 (https://phabricator.wikimedia.org/T104699) (owner: Muehlenhoff)
[00:24:37] (PS1) GWicke: Add {ar,ca}wiki and remove ee-prototype in beta labs RB config [puppet] - https://gerrit.wikimedia.org/r/227918
[00:25:19] RoanKattouw: ^^
[00:25:38] (PS2) Catrope: Add {ar,ca}wiki and remove ee-prototype in beta labs RB config [puppet] - https://gerrit.wikimedia.org/r/227918 (https://phabricator.wikimedia.org/T107342) (owner: GWicke)
[00:25:48] (CR) Catrope: [C: 2] Add {ar,ca}wiki and remove ee-prototype in beta labs RB config [puppet] - https://gerrit.wikimedia.org/r/227918 (https://phabricator.wikimedia.org/T107342) (owner: GWicke)
[00:25:52] Thanks gwicke
[00:26:29] oh, nice; forgot that you have +2
[00:27:00] (CR) Dzahn: [C: 2] Add ferm rules for mariadb labsdb [puppet] - https://gerrit.wikimedia.org/r/226068 (https://phabricator.wikimedia.org/T104699) (owner: Muehlenhoff)
[00:28:30] (CR) Dzahn: "you are right, i checked for "kibana" in misc-web config but of course it's logstash" [puppet] - https://gerrit.wikimedia.org/r/227197 (https://phabricator.wikimedia.org/T104939) (owner: Muehlenhoff)
[00:28:36] (PS2) Dzahn: Add ferm rules for Kibana [puppet] - https://gerrit.wikimedia.org/r/227197 (https://phabricator.wikimedia.org/T104939) (owner: Muehlenhoff)
[00:30:04] (CR) Dzahn: [C: 2] Add ferm rules for Kibana [puppet] - https://gerrit.wikimedia.org/r/227197 (https://phabricator.wikimedia.org/T104939) (owner: Muehlenhoff)
[00:31:08] gwicke, RoanKattouw, why remove ee-prototype?
[00:31:35] It wasn't in the Parsoid config
[00:31:42] Also isn't that wiki unmaintained?
[00:31:51] (PS2) Dzahn: Enable base::firewall on multatuli [puppet] - https://gerrit.wikimedia.org/r/227416 (owner: Muehlenhoff)
[00:31:54] maybe we should close it...
[00:34:20] matt_flaschen: Do we still use ee-prototype?
[00:34:35] RoanKattouw, also, VE is enabled there
[00:34:52] RoanKattouw, not that I know of.
[00:34:54] but of course you just get the mis-matching IDs error (0 from restbase/parsoid)
[00:35:18] It has only one page: Main_Page
[00:35:24] in the main namespace
[00:36:17] !log ori Synchronized php-1.26wmf16/includes/Message.php: eb281630ce: Debug logging for T102199 (duration: 00m 11s)
[00:36:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:36:36] only 9 pages total
[00:37:45] Main_Page, User:Hydriz, User:Hydriz/vector.css, User:Hydriz/vector.js, User:Meno25, User:Riley Huntley/vector.js, User talk:Meno25, User talk:ShanmugamTesting, and User talk:ShanmugamTest
[00:37:59] can probably just be deleted...
[00:40:59] operations, Performance-Team, Reading-Web: Remove docroot:/images/mobile in favour of docroot:/static/images/mobile - https://phabricator.wikimedia.org/T107395#1493813 (Krinkle) NEW
[00:42:07] only 22 log entries if you exclude user creations, mostly usermerge/renameuser related
[00:42:16] operations, Traffic, Mobile, Varnish: Static image files from en.m.wikipedia.org are served with cache-suppressing headers - https://phabricator.wikimedia.org/T86993#1493833 (Jdlrobson)
[00:44:26] operations, Performance-Team, Reading-Web: Remove docroot:/images/mobile in favour of docroot:/static/images/mobile - https://phabricator.wikimedia.org/T107395#1493837 (Krinkle)
[00:44:28] operations, Traffic, Mobile, Varnish: Static image files from en.m.wikipedia.org are served with cache-suppressing headers - https://phabricator.wikimedia.org/T86993#1493836 (Krinkle)
[00:45:28] [17:26] gwicke oh, nice; forgot that you have +2
[00:45:38] gwicke: I didn't even *realize* that I did
[00:45:41] I just auto-piloted
[00:45:46] But crap, I can't deploy that change
[00:46:04] Oh, meh it's labs only anyway
[00:46:19] (PS1) Ori.livneh: HHVM: Limit wall execution time of FCGI reqs to 145s [puppet] - https://gerrit.wikimedia.org/r/227922
[00:48:21] (PS23) Gergő Tisza: [WIP] Basic role for Sentry [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[00:48:32] !log ori Synchronized php-1.26wmf15/includes/Message.php: 160f69871c: Debug logging for T102199 (duration: 00m 13s)
[00:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:50:14] RoanKattouw: lol ;)
[00:50:22] all good in the end
[00:50:25] RoanKattouw: Beg ori to push it? ;-)
[00:50:32] James_F: It's labs
[00:50:34] push what?
[00:50:47] no pushing needed
[00:50:50] OK.
[00:51:05] ori: I accidentally +2ed an ops/puppet change, not realizing I had +2 there without deploy rights. But it's a labs-specific file so it should be fine
[00:51:13] it still needs to be puppet-merged
[00:51:23] even if it won't impact production, we'll get alerts eventually
[00:51:25] about unmerged changes
[00:51:28] Oh, right, alerts
[00:51:37] OK, in that case, can you rescue me and puppet-merge it?
[00:51:41] (PS3) Ori.livneh: Add {ar,ca}wiki and remove ee-prototype in beta labs RB config [puppet] - https://gerrit.wikimedia.org/r/227918 (https://phabricator.wikimedia.org/T107342) (owner: GWicke)
[00:51:51] it didn't submit because puppet now requires rebasing
[00:51:56] (CR) Ori.livneh: [C: 2 V: 2] Add {ar,ca}wiki and remove ee-prototype in beta labs RB config [puppet] - https://gerrit.wikimedia.org/r/227918 (https://phabricator.wikimedia.org/T107342) (owner: GWicke)
[00:51:56] lol
[00:51:59] Oh even better
[00:52:13] submitted, merged
[00:52:26] all good in the very end
[00:53:43] Thanks ori
[00:53:51] np
[01:14:53] (CR) 20after4: [C: -1] "I think we should try this instead:" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/227489 (owner: Chad)
[01:28:52] (CR) 20after4: [C: 1] Add service deploy via scap [tools/scap] - https://gerrit.wikimedia.org/r/224374 (owner: Thcipriani)
[01:30:14] !log MIMEsearchPage::reallyDoQuery queries with crazy eg, LIMIT 10405000,501, on commonswiki vslow slave, from tide***.microsoft.com bots. log noise is queries hitting 5min limit and auto-killed
[01:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:33:50] operations, Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1493937 (Springle) Continuing on db1042 right now and queries regularly hitting 5min limit. The LIMIT 10405000,501 is most of the problem here; sho...
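Editor's sketch, not part of the channel log: in MySQL, `LIMIT 10405000,501` means "skip 10,405,000 rows, then return 501", so the server has to walk and discard every skipped row. The example below contrasts that with keyset ("seek") pagination. It uses an in-memory SQLite database and a hypothetical `files` table with made-up names; it is not the MediaWiki schema or the actual MIMEsearch query, only an illustration of the general technique under those assumptions.

```python
import sqlite3


def build_db(n_rows=1000):
    """Create a throwaway in-memory table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO files (id, name) VALUES (?, ?)",
        [(i, f"file_{i}") for i in range(1, n_rows + 1)],
    )
    return conn


def page_by_offset(conn, offset, limit):
    # The engine must step over `offset` rows before returning anything,
    # so the cost grows with how deep the requested page is.
    return conn.execute(
        "SELECT id, name FROM files ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()


def page_by_keyset(conn, last_id, limit):
    # Seeks straight to the first row after `last_id` via the primary key,
    # so the cost does not depend on how deep the page is.
    return conn.execute(
        "SELECT id, name FROM files WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, limit),
    ).fetchall()


conn = build_db()
offset_page = page_by_offset(conn, 900, 50)
keyset_page = page_by_keyset(conn, 900, 50)
```

With contiguous ids both calls return the same rows; the difference is only in how much work the engine does to reach them. Keyset pagination needs the caller to carry the last seen key forward instead of a page number.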
[01:34:38] operations, Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1493938 (Springle) Also, all bot driven, eg: tide543.microsoft.com
[01:37:25] I don't know if that host necessarily shows it's a bot, springle
[01:38:41] Krenair: possibly. 100+ concurrent similar queries from tide[0-9]{3}.microsoft.com seems automated, at least
[01:39:04] the concurrent queries, yes
[01:41:21] several old posts say that host pattern is used for the proxies for the MS corp network
[01:41:35] (CR) Springle: Add ferm rules for dbstore systems (1 comment) [puppet] - https://gerrit.wikimedia.org/r/226267 (https://phabricator.wikimedia.org/T104699) (owner: Muehlenhoff)
[01:41:49] operations, Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1493946 (MaxSem) Quick plug: make this particular special page noindex, nofollow?
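Editor's sketch, not part of the channel log: the host pattern quoted above, `tide[0-9]{3}.microsoft.com`, can be applied to a list of client connections to count how many concurrent queries come from matching hosts. The sample data and the function name `count_matching_hosts` are invented for illustration; only `tide543.microsoft.com` appears in the log, and nothing here reflects real processlist output.

```python
import re
from collections import Counter

# Anchored version of the pattern from the discussion, with the dots escaped.
HOST_PATTERN = re.compile(r"^tide[0-9]{3}\.microsoft\.com$")


def count_matching_hosts(connections):
    """Count connections per client host, keeping only hosts that match."""
    return Counter(
        host for host, _query in connections if HOST_PATTERN.match(host)
    )


# Made-up sample: (client host, query text) pairs.
sample = [
    ("tide543.microsoft.com", "SELECT ... LIMIT 10405000,501"),
    ("tide543.microsoft.com", "SELECT ... LIMIT 10405501,501"),
    ("tide101.microsoft.com", "SELECT ... LIMIT 10406002,501"),
    ("tide1234.microsoft.com", "SELECT 1"),  # four digits: no match
    ("client.example.org", "SELECT 1"),
]

counts = count_matching_hosts(sample)
total_matching = sum(counts.values())
```

As noted in the conversation, a matching host name alone does not prove a bot, since the pattern is reportedly used by corporate proxies; the count of concurrent, similar queries is the stronger signal.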
[01:42:29] (PS24) Gergő Tisza: Basic role for Sentry [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles)
[01:50:47] (PS1) Ori.livneh: coal: log all Navigation Timing metrics, not just four [puppet] - https://gerrit.wikimedia.org/r/227928 (https://phabricator.wikimedia.org/T104902)
[01:51:16] (CR) Ori.livneh: [C: 2 V: 2] coal: log all Navigation Timing metrics, not just four [puppet] - https://gerrit.wikimedia.org/r/227928 (https://phabricator.wikimedia.org/T104902) (owner: Ori.livneh)
[02:03:29] !log LocalisationUpdate failed (1.26wmf15) at 2015-07-30 02:03:29+00:00
[02:03:30] !log LocalisationUpdate failed (1.26wmf16) at 2015-07-30 02:03:29+00:00
[02:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:03:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:04:35] it did that yesterday too
[02:04:44] I wonder what's up with it
[02:07:04] hmm
[02:07:40] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 30 02:07:40 UTC 2015 (duration 7m 39s)
[02:07:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:07:50] tin:/var/log/l10nupdatelog/l10nupdate.log is still being written to
[02:08:43] PROBLEM - IPsec on cp4009 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (not-connected: 1)
[02:08:53] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - 41 ESP transports installed, 1 problems (not-connected: 1)
[02:15:10] operations, Database: Reduce memory commitment on database hosts with many objects, specially s3, dbstore/research and labs - https://phabricator.wikimedia.org/T107282#1493983 (Springle) Just to clarify, I don't think we've seen actual OOM killer on s[1-7], right? The only front-line //production// concer...
[02:16:26] hoo, SMalyshev: are either of you two doing something with l10nupdate?
[02:16:38] nope
[02:16:51] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 866.120573496
[02:21:06] Krenair: so it's pretending it's broken and then continuing??
[02:21:14] Maybe.
[02:21:27] Or retrying and succeeding for some stupid reason.
[02:21:31] lol
[02:21:50] Or failing and then trying again and failing again but failing to properly report failure, instead reporting success.
[02:23:39] Who knows?
[02:24:47] WIKIUSER ALL=NOPASSWD: /home/wikipedia/sbin/wikiuser_pass_real
[02:24:48] (PS1) Ori.livneh: Double $wgMemoryLimit (330 => 660) [mediawiki-config] - https://gerrit.wikimedia.org/r/227930
[02:24:52] why is this still in puppet?
[02:25:16] (CR) Ori.livneh: [C: 2] Double $wgMemoryLimit (330 => 660) [mediawiki-config] - https://gerrit.wikimedia.org/r/227930 (owner: Ori.livneh)
[02:25:22] (Merged) jenkins-bot: Double $wgMemoryLimit (330 => 660) [mediawiki-config] - https://gerrit.wikimedia.org/r/227930 (owner: Ori.livneh)
[02:25:47] krenair@snapshot1001:~$ cd /home/wikipedia
[02:25:47] -bash: cd: /home/wikipedia: No such file or directory
[02:26:11] !log ori Synchronized wmf-config/InitialiseSettings.php: I3c6217f06: Double $wgMemoryLimit (330 => 660) (duration: 00m 12s)
[02:26:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:27:16] hmmm
[02:27:42] I'm not sure those files in snapshot1001:/etc/sudoers.d are puppetised
[02:27:57] who was it saying earlier it needs re-imaging? :)
[02:27:58] (PS1) Gergő Tisza: Add sentry-phabricator package [software/sentry] - https://gerrit.wikimedia.org/r/227931 (https://phabricator.wikimedia.org/T97136)
[02:31:35] actually the list doesn't seem so different from some other hosts
[02:36:25] !log l10nupdate Synchronized php-1.26wmf15/cache/l10n: (no message) (duration: 07m 45s)
[02:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:40:38] !log LocalisationUpdate completed (1.26wmf15) at 2015-07-30 02:40:38+00:00
[02:40:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:42:10] All done at 2015-07-30 02:40:38+00:00
[02:42:10] Running updates for 1.26wmf16 (on cawikibooks)
[02:42:10] Warning: LU_Updater::readMessages: Unable to parse messages from /var/lib/l10nupdate/mediawiki/extensions/Validator/Validator.i18n.php in /srv/mediawiki-staging/php-1.26wmf16/extensions/LocalisationUpdate/Updater.php on line 63
[02:42:10] Warning: LU_Updater::readMessages: Unable to parse messages from /var/lib/l10nupdate/mediawiki/extensions/SemanticResultFormats/SemanticResultFormats.i18n.php in /srv/mediawiki-staging/php-1.26wmf16/extensions/LocalisationUpdate/Updater.php on line 63
[02:46:13] other error I found was DB connection error: Can't connect to MySQL server on '208.80.154.136' (4) (208.80.154.136)
[02:46:30] which of course is silver, there's a ticket for that
[02:48:47] why is it reading from .i18n.php? it should ignore those files in favor of json
[02:52:14] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (not-connected: 1)
[02:58:42] PROBLEM - IPsec on cp1065 is CRITICAL: Strongswan CRITICAL - 41 ESP transports installed, 1 problems (not-connected: 1)
[02:59:32] !log l10nupdate Synchronized php-1.26wmf16/cache/l10n: (no message) (duration: 04m 25s)
[02:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:01:16] legoktm, that failure message is sent if `mwscript extensions/LocalisationUpdate/update.php --wiki="$mwDbName"` fails
[03:01:49] !log LocalisationUpdate completed (1.26wmf16) at 2015-07-30 03:01:49+00:00
[03:01:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:07:16] I've noticed the pagecounts data set has stopped updating since last night. (https://dumps.wikimedia.org/other/pagecounts-all-sites/2015/2015-07/). I've been looking for outage info to no avail. Does anyone have info on this? Thanks
[03:33:52] !log killing processes by ellery on stat1002 - load avg was over 1500 and users reported pagecounts are broken (possibly all other crons as well)
[03:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:34:38] raidex: ^
[03:35:38] raidex: ^ yea, that, i hope it will be created again soon
[03:41:51] PROBLEM - configured eth on stat1002 is CRITICAL: Connection refused by host
[03:42:11] PROBLEM - DPKG on stat1002 is CRITICAL: Connection refused by host
[03:43:12] PROBLEM - RAID on stat1002 is CRITICAL: Connection refused by host
[03:46:23] legoktm, mutante: thank you very much. Hoping it gets sorted out soon
[03:47:42] (PS2) Ori.livneh: HHVM: Limit wall execution time of FCGI reqs to 145s [puppet] - https://gerrit.wikimedia.org/r/227922
[03:47:50] (CR) Ori.livneh: [C: 2 V: 2] HHVM: Limit wall execution time of FCGI reqs to 145s [puppet] - https://gerrit.wikimedia.org/r/227922 (owner: Ori.livneh)
[04:05:28] !log LocalisationUpdate failed: git pull of core failed
[04:05:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:05:57] that was my fault, trying to debug this damn thing and ran it under my own user
[04:06:13] !log Ignore that last error
[04:06:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:08:13] legoktm, I don't understand why it runs twice though
[04:08:30] nfi
[04:08:31] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 17.24% of data above the critical threshold [100000000.0]
[04:16:11] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds
[04:17:12] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail
[04:34:22] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[04:39:54] (CR) Ori.livneh: [C: 2] Make the files relocatable [software/sentry] - https://gerrit.wikimedia.org/r/227597 (owner: Gergő Tisza)
[04:40:02] (CR) Ori.livneh: [V: 2] Make the files relocatable [software/sentry] - https://gerrit.wikimedia.org/r/227597 (owner: Gergő Tisza)
[04:40:15] (CR) Ori.livneh: [C: 2 V: 2] Version update: 7.6.2 -> 7.7.0 [software/sentry] - https://gerrit.wikimedia.org/r/227899 (owner: Gergő Tisza)
[04:40:54] (CR) Ori.livneh: "Let's get Sentry running first, IMO." [software/sentry] - https://gerrit.wikimedia.org/r/227931 (https://phabricator.wikimedia.org/T97136) (owner: Gergő Tisza)
[04:42:54] legoktm / Krenair -- any idea what this is about?
[04:42:55] [2015-07-30 03:58:44] Fatal error: /srv/mediawiki/wikiversions.cdb has no version entry for `cawikiquot`.
[04:42:56] at /srv/mediawiki-staging/multiversion/MWMultiVersion.php on line 367
[04:43:03] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures
[04:43:07] that was me
[04:43:16] I guess that reached the actual logs?
[04:43:23] on fluorine?
[04:43:24] yes
[04:43:28] yeah, ignore it
[04:43:31] cool
[04:44:04] I'm wondering what DB l10nupdate-1 tries to provide the maint script to get a failure
[04:44:43] passing in a nonsense one like that would make it fail
[04:45:00] I wonder if those warnings it outputs can make it get considered as a failure
[04:45:01] have you checked the logs? (are they world-readable?)
[04:45:07] yeah
[04:45:13] it doesn't log the right thing
[04:45:17] planning to debug after some sleep
[04:47:09] Warning: LU_Updater::readMessages: Unable to parse messages from /var/lib/l10nupdate/mediawiki/extensions/Validator/Validator.i18n.php in /srv/mediawiki-staging/php-1.26wmf16/extensions/LocalisationUpdate/Updater.php on line 63
[04:47:09] Warning: LU_Updater::readMessages: Unable to parse messages from /var/lib/l10nupdate/mediawiki/extensions/SemanticResultFormats/SemanticResultFormats.i18n.php in /srv/mediawiki-staging/php-1.26wmf16/extensions/LocalisationUpdate/Updater.php on line 63
[05:00:22] Krenair: not doing anything AFAIK
[05:09:00] operations, Incident-20150205-SiteOutage, Availability, Patch-For-Review: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1494071 (ori) Open>Resolved
[05:10:03] operations, Monitoring: Collect and report nutcracker statistics to Ganglia and/or Graphite - https://phabricator.wikimedia.org/T107381#1494072 (ori)
[05:43:42] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[05:44:21] operations, Datasets-General-or-Unknown, HHVM: Package mwbzutils for Trusty - https://phabricator.wikimedia.org/T107405#1494078 (ori) NEW
[05:48:21] operations, Incident-20150205-SiteOutage, Availability, Patch-For-Review: Nutcracker needs to automatically recover from MC failure - rebalancing issues - https://phabricator.wikimedia.org/T88730#1494085 (ori) Resolved>Open @joe noticed that the default value in the source code (//pace// the d...
[06:05:25] operations: Backport ffmpeg 2.7.3 to Trusty - https://phabricator.wikimedia.org/T107313#1494101 (ori) a:fgiunchedi
[06:34:01] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 2 failures
[06:44:58] operations, Datasets-General-or-Unknown, HHVM: Package mwbzutils for Trusty - https://phabricator.wikimedia.org/T107405#1494143 (ArielGlenn) a:ArielGlenn
[06:58:28] (PS2) Muehlenhoff: Add ferm rules for rsync server [puppet] - https://gerrit.wikimedia.org/r/226071
[06:58:37] (CR) Muehlenhoff: [C: 2 V: 2] Add ferm rules for rsync server [puppet] - https://gerrit.wikimedia.org/r/226071 (owner: Muehlenhoff)
[06:59:33] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:10:39] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 30 07:10:39 UTC 2015 (duration 10m 38s)
[07:10:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:11:54] (PS1) Muehlenhoff: Enable base::firewall on labsdb1001 [puppet] - https://gerrit.wikimedia.org/r/227944
[07:11:56] (PS1) Muehlenhoff: Enable base::firewall on the remaining labsdb hosts [puppet] - https://gerrit.wikimedia.org/r/227945
[07:15:12] (PS2) Yuvipanda: labstore: Make create-dbusers create users for users too [puppet] - https://gerrit.wikimedia.org/r/227915 (https://phabricator.wikimedia.org/T104453)
[07:15:32] (CR) Yuvipanda: [C: 2 V: 2] labstore: Make create-dbusers create users for users too [puppet] - https://gerrit.wikimedia.org/r/227915 (https://phabricator.wikimedia.org/T104453) (owner: Yuvipanda)
[07:39:48] (PS1) Yuvipanda: labstore: Fix typo [puppet] - https://gerrit.wikimedia.org/r/227946
[07:40:12] (PS2) Yuvipanda: labstore: Fix typo [puppet] - https://gerrit.wikimedia.org/r/227946
[07:40:21] (CR) Yuvipanda: [C: 2 V: 2] labstore: Fix typo [puppet] - https://gerrit.wikimedia.org/r/227946 (owner: Yuvipanda)
[07:40:29] (CR) Giuseppe Lavagetto: [C: -2] "1) This should all be in the nrpe:: module, given those are nrpe checks" [puppet] - https://gerrit.wikimedia.org/r/227887 (owner: coren)
[07:50:36] operations, Traffic, HTTPS-by-default, Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1494174 (Chmarkine) wikipedia.org is already on the preload list! Among Alexa Top 10 websites, Wikipedia is the only one that has all subdomains preloaded! https://chromium.googlesource....
[07:53:36] (PS1) Yuvipanda: labstore: Monitor that create-dbusers is running [puppet] - https://gerrit.wikimedia.org/r/227947
[07:53:44] _joe_: ^ I guess that's right
[07:55:43] <_joe_> YuviPanda: what is?
[07:55:56] can't english today, apparently
[07:56:10] <_joe_> YuviPanda: it is
[07:56:16] <_joe_> YuviPanda: also, go to bed
[07:56:19] I guess I meant 'that patch I just put up is right for what I wanted to do'
[07:56:31] need to get this albatross off my head before this weekend
[07:56:36] <_joe_> uhm
[07:56:38] (PS2) Yuvipanda: labstore: Monitor that create-dbusers is running [puppet] - https://gerrit.wikimedia.org/r/227947
[07:56:46] (CR) Yuvipanda: [C: 2 V: 2] labstore: Monitor that create-dbusers is running [puppet] - https://gerrit.wikimedia.org/r/227947 (owner: Yuvipanda)
[07:56:47] <_joe_> what was declare_service => false?
[07:56:58] <_joe_> it doesn't define the service?
[07:57:01] _joe_: yup
[07:57:35] I wanted to start the unit manually
[07:57:39] while it was still in development
[07:57:45] now that it isn't in development anymore..
[07:58:45] hahaha, I love how we've been putting in place an upstart script to run a perl script on a jessie box
[07:58:51] * YuviPanda gets rid of them all
[08:00:08] <_joe_> YuviPanda: so you should declare the service
[08:00:13] <_joe_> or the alarm will go off :)
[08:00:23] _joe_: well, there's this whole start-nfs mess as well
[08:00:39] _joe_: well, if I have puppet manage the service, alarm will stop going off every puppet run
[08:00:45] (PS1) Yuvipanda: labstore: Cleanup mysql user account creation for tools [puppet] - https://gerrit.wikimedia.org/r/227948
[08:01:04] but that's ok, we'll probably notice the swapping :)
[08:01:11] _joe_: ^ is the cleanup patch
[08:02:00] (CR) Muehlenhoff: "It seems your git repo is outdated: This is covered in misc::syslog-server, the patch was merged in https://gerrit.wikimedia.org/r/#/c/226" [puppet] - https://gerrit.wikimedia.org/r/227697 (owner: Muehlenhoff)
[08:04:47] (PS2) Yuvipanda: labstore: Cleanup mysql user account creation for tools [puppet] - https://gerrit.wikimedia.org/r/227948
[08:04:58] (CR) Yuvipanda: [C: 2 V: 2] labstore: Cleanup mysql user account creation for tools [puppet] - https://gerrit.wikimedia.org/r/227948 (owner: Yuvipanda)
[08:15:51] operations, Discovery, Wikimedia-Logstash, Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch & Logstash on jessie - https://phabricator.wikimedia.org/T98042#1494215 (MoritzMuehlenhoff) 1.7.1 has been released, which fixes a data-loss bug, I'll import that version as well: "...
[08:19:05] (PS2) Hashar: nodepool: point to Zuul DNS service entry [puppet] - https://gerrit.wikimedia.org/r/225370
[08:19:47] (CR) Hashar: "Will ease migration later on and lift an ambiguation between the Jenkins and Zuul background tasks which are named according to the target" [puppet] - https://gerrit.wikimedia.org/r/225370 (owner: Hashar)
[08:22:25] good morning
[08:23:27] (PS1) Yuvipanda: labstore: Add NRPE check for nfx-exports [puppet] - https://gerrit.wikimedia.org/r/227950
[08:23:38] (PS2) Yuvipanda: labstore: Add NRPE check for nfx-exports [puppet] - https://gerrit.wikimedia.org/r/227950
[08:23:46] (CR) Yuvipanda: [C: 2 V: 2] labstore: Add NRPE check for nfx-exports [puppet] - https://gerrit.wikimedia.org/r/227950 (owner: Yuvipanda)
[08:24:00] (PS6) Hashar: nodepool: setup python logger [puppet] - https://gerrit.wikimedia.org/r/224106
[08:29:40] operations, Continuous-Integration-Isolation, Nodepool: Bump our Nodepool Debian package to 0.1.1 - https://phabricator.wikimedia.org/T107266#1494222 (hashar)
[08:29:42] operations, Continuous-Integration-Isolation, Nodepool, Patch-For-Review: Bump our Nodepool package to 0.1.0 - https://phabricator.wikimedia.org/T104971#1494220 (hashar) Open>Resolved I forgot to get any Nodepool package uploaded, so when the server got reinstalled puppet complained. Andrew ki...
[08:36:25] PROBLEM - Ensure mysql credential creation for tools users is running on labstore2001 is CRITICAL: NRPE_CHECK_SYSTEMD_STATE CRITICAL - Service is in state failed
[08:36:26] PROBLEM - Ensure mysql credential creation for tools users is running on labstore1001 is CRITICAL: NRPE_CHECK_SYSTEMD_STATE CRITICAL - Service is in state failed
[08:38:32] hmm
[08:38:36] so it shouldn't be running on 2001
[08:38:38] nor on 1001
[08:38:40] lol
[08:38:40] ok
[08:38:53] so we need the notion of an 'active'
[08:38:59] and have it active only on that host
[08:46:12] (PS1) Yuvipanda: labstore: Use hiera to set currently active labstore [puppet] - https://gerrit.wikimedia.org/r/227953 (https://phabricator.wikimedia.org/T106590)
[08:46:17] (CR) jenkins-bot: [V: -1] labstore: Use hiera to set currently active labstore [puppet] - https://gerrit.wikimedia.org/r/227953 (https://phabricator.wikimedia.org/T106590) (owner: Yuvipanda)
[08:46:33] (PS2) Yuvipanda: labstore: Use hiera to set currently active labstore [puppet] - https://gerrit.wikimedia.org/r/227953 (https://phabricator.wikimedia.org/T106590)
[08:50:03] operations, Labs, Labs-Sprint-102, Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1494233 (yuvipanda) Anything else? /root still has test scripts, but that's presumably ok?
[09:06:25] operations, Discovery, Wikimedia-Logstash, Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch & Logstash on jessie - https://phabricator.wikimedia.org/T98042#1494249 (MoritzMuehlenhoff) elasticsearch-1.7.1 has been imported for jessie-wikimedia and trusty-wikimedia.
[09:06:42] (PS1) Filippo Giunchedi: cassandra: check for heap dumps [puppet] - https://gerrit.wikimedia.org/r/227956 (https://phabricator.wikimedia.org/T106346)
[09:15:33] (PS2) Filippo Giunchedi: cassandra: check for heap dumps [puppet] - https://gerrit.wikimedia.org/r/227956 (https://phabricator.wikimedia.org/T106346)
[09:17:52] operations, Continuous-Integration-Infrastructure: Upload new Zuul .deb package on apt.wikimedia.org for precise-wikimedia and trusty-wikimedia - https://phabricator.wikimedia.org/T106499#1494261 (hashar) Upgraded on all CI slaves. Thank you very much!
[09:18:57] !log Upgraded Zuul on all CI slaves. Should be a noop for zuul-cloner.
[09:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:19:51] operations: Jessie imaging installs nfs-common needlessly - https://phabricator.wikimedia.org/T107412#1494262 (MoritzMuehlenhoff) NEW
[09:25:54] operations: Puppet catalog compiler is broken - https://phabricator.wikimedia.org/T96802#1494270 (fgiunchedi) FWIW I think the puppet compiler is a pretty valuable tool, given this also blocks {T98129} I think it should be priority high
[09:26:16] operations, Services, hardware-requests: Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1494272 (mark) We can also buy (or rent) servers, if that would be better. :)
[09:27:38] (CR) Mobrovac: "Two minor comments, LGTM otherwise" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/227956 (https://phabricator.wikimedia.org/T106346) (owner: Filippo Giunchedi)
[09:30:18] operations, Database: Reduce memory commitment on database hosts with many objects, specially s3, dbstore/research and labs - https://phabricator.wikimedia.org/T107282#1494275 (jcrespo) Regarding s3, I am unsure if we would get any significative advantage reducing physical files, but I do not disagree with...
[09:30:30] RECOVERY - puppet last run on restbase1008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:31:34] operations: Puppet catalog compiler is broken - https://phabricator.wikimedia.org/T96802#1494277 (hashar) The slave is offline : https://integration.wikimedia.org/ci/computer/puppet-compiler02.puppet3-diffs.eqiad.wmflabs/ and the Jenkins job is tied to it so it can never run. #release-engineering can assis...
[09:35:32] operations, CirrusSearch, Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7. 1 - https://phabricator.wikimedia.org/T106090#1494278 (dcausse)
[09:35:39] operations, CirrusSearch, Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7.1 - https://phabricator.wikimedia.org/T106090#1458711 (dcausse)
[09:36:02] operations, CirrusSearch, Discovery, Discovery-Cirrus-Sprint: Upgrade beta to Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106164#1494282 (dcausse)
[09:36:04] operations, CirrusSearch, Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7.1 - https://phabricator.wikimedia.org/T106090#1458711 (dcausse)
[09:36:15] operations, CirrusSearch, Discovery, Discovery-Cirrus-Sprint, Patch-For-Review: Upgrade production to elasticsearch 1.7.1 - https://phabricator.wikimedia.org/T106165#1494283 (dcausse)
[09:37:30] operations, CirrusSearch, Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7.1 - https://phabricator.wikimedia.org/T106090#1458711 (dcausse)
[09:37:32] operations, CirrusSearch, Discovery, Discovery-Cirrus-Sprint: Validate Cirrus against Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106160#1494285 (dcausse) Open>Resolved
[09:56:27] (PS1) Muehlenhoff: Add ferm rules for Logstash/Elasticsearch [puppet] - https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964)
[10:01:22] (CR) Muehlenhoff: "I had considered multiport, I can rework the patch to use that instead." [puppet] - https://gerrit.wikimedia.org/r/227216 (https://phabricator.wikimedia.org/T104981) (owner: Muehlenhoff)
[10:09:32] moritzm: are these rcstream dynamic ports actually supposed to be used from across the network?
[10:09:48] moritzm: I don't remember the details but they may only be used on localhost, from nginx
[10:10:44] root@rcs1001:~# netstat -nap | grep python |grep -c 127.0.0.1
[10:10:44] 49
[10:11:18] # Spawn as many instances as there are CPU cores, less two.
[10:11:18] $backends = range(10080, 10080 + $::processorcount - 2)
[10:11:20] bind_address => '127.0.0.1',
[10:11:20] ports => $backends,
[10:11:26] so yeah, probably a bit moot :)
[10:19:04] (PS1) Zhuyifei1999: Project logos: Updated commonswiki.png from File:Wiki-commons.png [mediawiki-config] - https://gerrit.wikimedia.org/r/227962 (https://phabricator.wikimedia.org/T106375)
[10:22:32] (PS1) Giuseppe Lavagetto: new_wmf_service: preserve order in yaml output [puppet] - https://gerrit.wikimedia.org/r/227963
[10:22:34] (PS1) Giuseppe Lavagetto: new_wmf_service: use a slightly less ugly anchor/alias template [puppet] - https://gerrit.wikimedia.org/r/227964
[10:22:36] (PS1) Giuseppe Lavagetto: new_wmf_service: add conftool service config as well [puppet] - https://gerrit.wikimedia.org/r/227965
[10:29:16] ah, you're totally right, that makes it much simpler :-)
[10:31:55] (Abandoned) Muehlenhoff: WIP/RfC: Allow multiple/dynamic range of ports for ferm services [puppet] - https://gerrit.wikimedia.org/r/227216 (https://phabricator.wikimedia.org/T104981) (owner: Muehlenhoff)
[10:34:16] (PS4) Faidon Liambotis: mail: remove secondary MX role from sodium (2nd take) [puppet] - https://gerrit.wikimedia.org/r/216642
[10:34:28] (CR) Faidon Liambotis: [C: 2] mail: remove secondary MX role from sodium (2nd take) [puppet] - https://gerrit.wikimedia.org/r/216642 (owner: Faidon Liambotis)
[10:37:58] (CR) Filippo Giunchedi: [C: 1] Add ferm rules for new
Logstash ingestion module logstash::input::udp [puppet] - 10https://gerrit.wikimedia.org/r/227723 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [10:39:47] if you hear of any lists-related issues, let me know :) [10:40:08] it's a relatively simple config change but it has gone wrong before [10:40:53] (03PS1) 10Giuseppe Lavagetto: imagescalers: convert the last two servers to HAT [puppet] - 10https://gerrit.wikimedia.org/r/227967 (https://phabricator.wikimedia.org/T84842) [10:41:30] (03PS1) 10Muehlenhoff: Add ferm rules for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/227968 (https://phabricator.wikimedia.org/T104981) [10:46:13] (03PS4) 10Faidon Liambotis: exim: remove $smart_route_list [puppet] - 10https://gerrit.wikimedia.org/r/216643 [10:46:21] (03CR) 10Faidon Liambotis: [C: 032 V: 032] exim: remove $smart_route_list [puppet] - 10https://gerrit.wikimedia.org/r/216643 (owner: 10Faidon Liambotis) [10:50:33] (03PS4) 10Faidon Liambotis: exim: inline @local_domains [puppet] - 10https://gerrit.wikimedia.org/r/216644 [10:50:40] (03CR) 10Faidon Liambotis: [C: 032 V: 032] exim: inline @local_domains [puppet] - 10https://gerrit.wikimedia.org/r/216644 (owner: 10Faidon Liambotis) [10:51:12] (03PS4) 10Faidon Liambotis: exim: kill unused exim::roled parameters [puppet] - 10https://gerrit.wikimedia.org/r/216645 [10:51:34] 6operations, 10CirrusSearch, 6Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7.1 - https://phabricator.wikimedia.org/T106090#1494356 (10dcausse) Upgrade is delayed to next week because we'll switch directly to elasticsearch-1.7.1. I propose Tue, Aug 4, same time (5PM UTC) I will test that everything ru... 
[10:54:16] (03CR) 10Faidon Liambotis: [C: 032] exim: kill unused exim::roled parameters [puppet] - 10https://gerrit.wikimedia.org/r/216645 (owner: 10Faidon Liambotis) [10:55:15] (03PS3) 10Filippo Giunchedi: cassandra: check for heap dumps [puppet] - 10https://gerrit.wikimedia.org/r/227956 (https://phabricator.wikimedia.org/T106346) [10:56:23] (03PS1) 10Muehlenhoff: zim: Add missing ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/227969 (https://phabricator.wikimedia.org/T105040) [10:56:56] (03CR) 10Mobrovac: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/227956 (https://phabricator.wikimedia.org/T106346) (owner: 10Filippo Giunchedi) [10:57:04] (03CR) 10Filippo Giunchedi: cassandra: check for heap dumps (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/227956 (https://phabricator.wikimedia.org/T106346) (owner: 10Filippo Giunchedi) [10:57:32] (03PS4) 10Faidon Liambotis: exim: kill all exim::* classes except for ::roled [puppet] - 10https://gerrit.wikimedia.org/r/216646 [10:57:50] (03CR) 10Faidon Liambotis: [C: 032 V: 032] exim: kill all exim::* classes except for ::roled [puppet] - 10https://gerrit.wikimedia.org/r/216646 (owner: 10Faidon Liambotis) [10:58:05] https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/AyBars [10:58:07] something is blocked here [10:58:09] can somone look pls? [10:58:13] __joe__? 
[10:59:13] (03PS4) 10Faidon Liambotis: exim: remove defer_domains for single-domain MXes [puppet] - 10https://gerrit.wikimedia.org/r/216647 [11:00:48] (03PS2) 10Muehlenhoff: Add ferm rules for new Logstash ingestion module logstash::input::udp [puppet] - 10https://gerrit.wikimedia.org/r/227723 (https://phabricator.wikimedia.org/T104964) [11:01:10] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add ferm rules for new Logstash ingestion module logstash::input::udp [puppet] - 10https://gerrit.wikimedia.org/r/227723 (https://phabricator.wikimedia.org/T104964) (owner: 10Muehlenhoff) [11:01:13] <_joe_> Steinsplitter: I don't know a lot about GlobalRename, I see some errors coming from that url in logstash though [11:01:44] _joe_ which erros exactly? [11:02:05] <_joe_> [GET] Expectation (masterConns <= 0) by MediaWiki::main not met: [11:02:07] <_joe_> [connect to 10.64.16.22 (centralauth)] [11:03:08] (03PS1) 10Faidon Liambotis: exim: fix otrs config's mysql_password reference [puppet] - 10https://gerrit.wikimedia.org/r/227971 [11:03:40] (03PS2) 10Faidon Liambotis: exim: fix otrs config's mysql_password reference [puppet] - 10https://gerrit.wikimedia.org/r/227971 [11:03:47] (03CR) 10Faidon Liambotis: [C: 032 V: 032] exim: fix otrs config's mysql_password reference [puppet] - 10https://gerrit.wikimedia.org/r/227971 (owner: 10Faidon Liambotis) [11:04:30] centralauth is s7 [11:04:33] (03PS5) 10Faidon Liambotis: exim: remove defer_domains for single-domain MXes [puppet] - 10https://gerrit.wikimedia.org/r/216647 [11:05:04] <_joe_> jynus: maybe it's a red herring, maybe not [11:05:51] I see spikes on connections, but nothing worring and no connections aborted [11:08:31] PROBLEM - puppet last run on sodium is CRITICAL puppet fail [11:08:54] (03PS6) 10Faidon Liambotis: exim: remove defer_domains for single-domain MXes [puppet] - 10https://gerrit.wikimedia.org/r/216647 [11:08:56] (03PS1) 10Faidon Liambotis: exim: fix compilation failure due to ordering [puppet] - 
10https://gerrit.wikimedia.org/r/227972 [11:09:19] (03CR) 10Faidon Liambotis: [C: 032 V: 032] exim: fix compilation failure due to ordering [puppet] - 10https://gerrit.wikimedia.org/r/227972 (owner: 10Faidon Liambotis) [11:09:38] (03CR) 10Faidon Liambotis: [C: 032] exim: remove defer_domains for single-domain MXes [puppet] - 10https://gerrit.wikimedia.org/r/216647 (owner: 10Faidon Liambotis) [11:10:30] maybe the jobrunners are just slow [11:12:32] RECOVERY - puppet last run on sodium is OK Puppet is currently enabled, last run 14 seconds ago with 0 failures [11:13:05] Steinsplitter, I am not an expert, but if for some reason jobs fail, it may take some time to try again, but they eventually work [11:14:47] I think I see activity related to that now [11:15:01] RECOVERY - IPsec on cp1065 is OK: Strongswan OK - Security Associations: 42 ESP transports installed [11:15:41] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [11:15:42] the problem is that the user can't login until the renam is done [11:15:51] RECOVERY - IPsec on cp4009 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [11:16:01] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - Security Associations: 42 ESP transports installed [11:18:23] (03PS5) 10Faidon Liambotis: exim: fold exim::roled into role::mail::mx [puppet] - 10https://gerrit.wikimedia.org/r/216651 [11:18:25] (03PS4) 10Faidon Liambotis: exim: use exim4 directly from role::otrs [puppet] - 10https://gerrit.wikimedia.org/r/216650 [11:18:27] (03PS4) 10Faidon Liambotis: exim: use exim4 directly from role::mail::lists [puppet] - 10https://gerrit.wikimedia.org/r/216649 [11:18:29] (03PS4) 10Faidon Liambotis: exim: use exim4 directly from Phab/RT [puppet] - 10https://gerrit.wikimedia.org/r/216648 [11:19:55] (03CR) 10Faidon Liambotis: [C: 032] exim: use exim4 directly from Phab/RT [puppet] - 10https://gerrit.wikimedia.org/r/216648 (owner: 10Faidon Liambotis) [11:22:12] (03CR) 
10Faidon Liambotis: [C: 032] exim: use exim4 directly from role::mail::lists [puppet] - 10https://gerrit.wikimedia.org/r/216649 (owner: 10Faidon Liambotis) [11:25:19] matanya you online? [11:25:28] or legoktm? [11:28:41] PROBLEM - puppet last run on sodium is CRITICAL puppet fail [11:31:47] 6operations, 5Interdatacenter-IPsec: Strongswan: security association reauthentication failure - https://phabricator.wikimedia.org/T96111#1494430 (10BBlack) 5Resolved>3Open Still seeing a similar symptom now that we have more hosts active. Two ipv6 associations (4009 + 1065, 4018 + 1067) died last night a... [11:32:32] (03PS1) 10BBlack: strongswan: Make SAs and rekey actions more robust, hopefully [puppet] - 10https://gerrit.wikimedia.org/r/227977 (https://phabricator.wikimedia.org/T96111) [11:33:09] 6operations, 7Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1494436 (10jcrespo) Why is a bot indexing /w/index.php?title=Special:MIMESearch if /w/ is disabled on robots.txt ? 
[11:33:33] (03PS1) 10Faidon Liambotis: mail: re-add $outbound_ips/$list_outbound_ips to role [puppet] - 10https://gerrit.wikimedia.org/r/227978 [11:33:55] (03CR) 10Faidon Liambotis: [C: 032] mail: re-add $outbound_ips/$list_outbound_ips to role [puppet] - 10https://gerrit.wikimedia.org/r/227978 (owner: 10Faidon Liambotis) [11:35:39] (03PS2) 10BBlack: strongswan: Make SAs and rekey actions more robust, hopefully [puppet] - 10https://gerrit.wikimedia.org/r/227977 (https://phabricator.wikimedia.org/T96111) [11:37:00] RECOVERY - puppet last run on sodium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:37:07] (03PS5) 10Faidon Liambotis: exim: use exim4 directly from role::otrs [puppet] - 10https://gerrit.wikimedia.org/r/216650 [11:37:13] (03CR) 10Faidon Liambotis: [C: 032 V: 032] exim: use exim4 directly from role::otrs [puppet] - 10https://gerrit.wikimedia.org/r/216650 (owner: 10Faidon Liambotis) [11:37:56] (03PS3) 10BBlack: strongswan: Make SAs and rekey actions more robust, hopefully [puppet] - 10https://gerrit.wikimedia.org/r/227977 (https://phabricator.wikimedia.org/T96111) [11:38:09] (03CR) 10BBlack: [C: 032 V: 032] strongswan: Make SAs and rekey actions more robust, hopefully [puppet] - 10https://gerrit.wikimedia.org/r/227977 (https://phabricator.wikimedia.org/T96111) (owner: 10BBlack) [11:40:17] (03PS1) 10Faidon Liambotis: exim: reference the proper system_filter for OTRS [puppet] - 10https://gerrit.wikimedia.org/r/227979 [11:40:32] (03CR) 10Faidon Liambotis: [C: 032] exim: reference the proper system_filter for OTRS [puppet] - 10https://gerrit.wikimedia.org/r/227979 (owner: 10Faidon Liambotis) [11:41:00] (03CR) 10Faidon Liambotis: [V: 032] exim: reference the proper system_filter for OTRS [puppet] - 10https://gerrit.wikimedia.org/r/227979 (owner: 10Faidon Liambotis) [11:45:06] (03PS1) 10BBlack: enable ipsec for mobile and bits clusters [puppet] - 10https://gerrit.wikimedia.org/r/227980 (https://phabricator.wikimedia.org/T92604) 
[11:46:10] (03CR) 10BBlack: [C: 032] enable ipsec for mobile and bits clusters [puppet] - 10https://gerrit.wikimedia.org/r/227980 (https://phabricator.wikimedia.org/T92604) (owner: 10BBlack) [11:46:19] (03PS6) 10Faidon Liambotis: exim: fold exim::roled into role::mail::mx [puppet] - 10https://gerrit.wikimedia.org/r/216651 [11:46:27] (03CR) 10Faidon Liambotis: [C: 032] exim: fold exim::roled into role::mail::mx [puppet] - 10https://gerrit.wikimedia.org/r/216651 (owner: 10Faidon Liambotis) [11:47:09] jynus: can you restart the failed jobs by hand please? [11:48:28] <_joe_> Steinsplitter: the best way to see that resolved is to open a ticket, I think. it's not something we can fix in 5 minutes nor is it an emergency [11:48:43] <_joe_> (is it? did I get that wrong?) [11:49:03] it isn't an emergency. but a user can't login. will file a ticket [11:49:53] (03PS5) 10Faidon Liambotis: mail: rename role::mail::lists to role::lists [puppet] - 10https://gerrit.wikimedia.org/r/216652 [11:50:29] <_joe_> Steinsplitter: sorry, but it would take me hours to figure out how to resolve this correctly [11:50:50] no problem, i ask on phab (cc hoo and lego) [11:51:03] This isn't the only one btw. Another user also reported it earlier today.
[11:51:09] Looks like the rename does actually get done [11:51:12] (even the moves) [11:51:25] <_joe_> Glaisher: yeah I suspect that is the case [11:51:26] but it's failing in the final step [11:51:28] which is setting the status back to success [11:51:50] PROBLEM - puppet last run on cp1069 is CRITICAL puppet fail [11:53:50] RECOVERY - puppet last run on cp1069 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [11:53:55] Glaisher: https://phabricator.wikimedia.org/T107415 [11:54:07] (03PS6) 10Faidon Liambotis: mail: rename role::mail::lists to role::lists [puppet] - 10https://gerrit.wikimedia.org/r/216652 [12:04:01] 7Puppet, 6operations: merge swift_new and swift puppet modules/classes - https://phabricator.wikimedia.org/T107416#1494480 (10fgiunchedi) 3NEW a:3fgiunchedi [12:36:10] !log downgrade openjdk-7-jre on restbase1001, nodetool flush and cassandra restart [12:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:26] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 4 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1494562 (10faidon) Sounds good to me, although I'd really prefer it if we reinstalled labstore1001 at this point before we switch over t... [12:47:28] (03CR) 10John F. Lewis: [C: 031] "Looks good to me as a class rename. Labs will need to be updated but that can be done shortly after the merge." 
[puppet] - 10https://gerrit.wikimedia.org/r/216652 (owner: 10Faidon Liambotis) [12:47:29] !log downgrade openjdk-7-jre on restbase1002, nodetool flush and cassandra restart [12:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:47:42] JohnFLewis: hi :) [12:48:07] paravoid: morning :) [12:48:09] (03CR) 10Faidon Liambotis: [C: 032] mail: rename role::mail::lists to role::lists [puppet] - 10https://gerrit.wikimedia.org/r/216652 (owner: 10Faidon Liambotis) [12:48:36] thanks for the review :) [12:48:46] also see the merges I did earlier today and yesterday [12:49:16] If only my client stayed connected. I'll check the repo [12:49:26] yeah, just git log [12:49:34] it's topic:kill-exim-roled in gerrit too if you prefer that [12:59:03] paravoid: nice. We probably should hiera-ise the outbound IPs in lists as well. I didn't do it with interface IPs because the outbound weren't direct blockers for mailman in labs but apart from that, die exim::roled [12:59:17] yeah we probably should [13:02:25] !log downgrade openjdk-7-jre on restbase1003, nodetool flush and cassandra restart [13:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:04:00] (03PS1) 10Faidon Liambotis: mail: bump spamassassin's max_children to 16 [puppet] - 10https://gerrit.wikimedia.org/r/227986 [13:04:19] (03PS2) 10Faidon Liambotis: mail: bump MX's spamassassin max_children to 16 [puppet] - 10https://gerrit.wikimedia.org/r/227986 [13:05:48] (03CR) 10Faidon Liambotis: [C: 032] mail: bump MX's spamassassin max_children to 16 [puppet] - 10https://gerrit.wikimedia.org/r/227986 (owner: 10Faidon Liambotis) [13:06:38] (03CR) 10Faidon Liambotis: [V: 032] mail: bump MX's spamassassin max_children to 16 [puppet] - 10https://gerrit.wikimedia.org/r/227986 (owner: 10Faidon Liambotis) [13:17:41] !log downgrade openjdk-7-jre on restbase1004, nodetool flush and cassandra restart [13:17:47] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:24:02] PROBLEM - Restbase root url on restbase1008 is CRITICAL: Connection refused [13:25:11] !log installed openjdk updates on gallium, restarting jenkins [13:25:13] (03PS1) 10Faidon Liambotis: mail: fix smarthost route_list for OTRS [puppet] - 10https://gerrit.wikimedia.org/r/227987 [13:25:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:26:10] (03CR) 10Faidon Liambotis: [C: 032] mail: fix smarthost route_list for OTRS [puppet] - 10https://gerrit.wikimedia.org/r/227987 (owner: 10Faidon Liambotis) [13:28:51] (03PS1) 10Ottomata: Recreate debianization on top of 0.8.2.1 tag [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227988 [13:29:14] (03PS2) 10Ottomata: Recreate debianization on top of 0.8.2.1 tag [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227988 (https://phabricator.wikimedia.org/T106581) [13:29:26] !log downgrade openjdk-7-jre on restbase1005, nodetool flush and cassandra restart [13:29:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:35:57] (03PS1) 10Ottomata: Debianize 0.8.2.1 tag [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227989 [13:36:41] (03Abandoned) 10Ottomata: Debianize 0.8.2.1 tag [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227768 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [13:39:39] !log downgrade openjdk-7-jre on restbase1006, nodetool flush and cassandra restart [13:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:41:04] godog: just from looking at the latency graphs I would not have been able to tell that you are restarting nodes [13:41:49] easy to get used to that ;) [13:42:16] gwicke: heheh restbase's latencies? 
that's great news [13:42:52] http://grafana.wikimedia.org/#/dashboard/db/restbase [13:44:18] !log downgrade openjdk-7-jre on restbase1007, nodetool flush and cassandra restart [13:44:23] indeed [13:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:44:45] (03PS3) 10Andrew Bogott: nodepool: point to Zuul DNS service entry [puppet] - 10https://gerrit.wikimedia.org/r/225370 (owner: 10Hashar) [13:46:02] (03CR) 10Andrew Bogott: [C: 032] nodepool: point to Zuul DNS service entry [puppet] - 10https://gerrit.wikimedia.org/r/225370 (owner: 10Hashar) [13:48:09] (03PS2) 10Ottomata: Debianize 0.8.2.1 tag [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227989 [13:49:19] (03CR) 10Andrew Bogott: "One typo, in line. Do you really want to set INFO and DEBUG as the defaults? Seems like it could get awfully noisy." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [13:52:50] (03PS2) 10Andrew Bogott: Remove custom fact ec2id (2nd try), unused [puppet] - 10https://gerrit.wikimedia.org/r/227201 (owner: 10Faidon Liambotis) [13:53:24] (03PS2) 10BBlack: tlsproxy: refactor/cleanup, beta work [puppet] - 10https://gerrit.wikimedia.org/r/227404 (https://phabricator.wikimedia.org/T97593) [13:55:49] (03CR) 10Andrew Bogott: [C: 032] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/227201 (owner: 10Faidon Liambotis) [13:57:49] (03CR) 10Andrew Bogott: [C: 04-1] nodepool: setup python logger [puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [14:00:53] 6operations: Opendj on Neptunium running java 6, on Nembus java 7 - https://phabricator.wikimedia.org/T107424#1494682 (10Andrew) 3NEW a:3Andrew [14:01:02] RECOVERY - RAID on stat1002 is OK optimal, 1 logical, 12 physical [14:01:13] !log fstrim -v /var on restbase1008 [14:01:31] RECOVERY - configured eth on stat1002 is OK - interfaces up [14:03:26] RECOVERY - puppet last run on stat1002 is OK Puppet is currently enabled, last run 53 seconds ago with 
0 failures [14:03:27] PROBLEM - Certificate expiration on neptunium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [14:03:27] (03CR) 10Dzahn: [C: 031] "yes, this server has no more roles. it used to be authdns but that has been removed in I9029f2b049438aa6927" [puppet] - 10https://gerrit.wikimedia.org/r/227416 (owner: 10Muehlenhoff) [14:03:27] morebots: :( [14:03:27] RECOVERY - dhclient process on stat1002 is OK: PROCS OK: 0 processes with command name dhclient [14:03:27] RECOVERY - DPKG on stat1002 is OK: All packages OK [14:03:27] RECOVERY - Disk space on stat1002 is OK: DISK OK [14:03:27] (03PS3) 10Dzahn: Enable base::firewall on multatuli [puppet] - 10https://gerrit.wikimedia.org/r/227416 (owner: 10Muehlenhoff) [14:03:27] RECOVERY - salt-minion processes on stat1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:03:27] !log restarted opendj on nembus/neptunium to effect OpenJDK security updates [14:03:47] godog: hopefully fstrim doesn't emit pipelined TRIM commands to the SSD [14:03:58] (03CR) 10BBlack: Add legacy bits.wm.o support to text-lb VCL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [14:04:31] godog: https://phabricator.wikimedia.org/T89584#1368434 [14:05:00] I am a logbot running on tools-exec-1210. [14:05:00] Messages are logged to wikitech.wikimedia.org/wiki/Server_Admin_Log. [14:05:00] To log a message, type !log <msg>. [14:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:24] (03CR) 10Andrew Bogott: [C: 04-1] "Even though the hiera lookups default to false, I'd still like to see" [puppet] - 10https://gerrit.wikimedia.org/r/227953 (https://phabricator.wikimedia.org/T106590) (owner: 10Yuvipanda) [14:06:49] gwicke: generally I think one tends to assume fstrim will interrupt the I/O pipeline.
maybe some newer hosts/busses/drives can do better, but the assumption would be not. [14:07:18] it will flush the queued i/o to the device and basically pause all I/O while it trims in the normal/legacy case [14:07:39] yeah, that should be fine [14:07:45] (03CR) 10Andrew Bogott: [C: 031] "This is clearly right but someone will still need to watch closely when it applies." [puppet] - 10https://gerrit.wikimedia.org/r/227944 (owner: 10Muehlenhoff) [14:07:48] (03CR) 10Muehlenhoff: [C: 04-1] "This needs additional, unmerged changes." [puppet] - 10https://gerrit.wikimedia.org/r/227944 (owner: 10Muehlenhoff) [14:08:03] (03CR) 10Andrew Bogott: [C: 031] "Looks good but will need babysitting" [puppet] - 10https://gerrit.wikimedia.org/r/227945 (owner: 10Muehlenhoff) [14:08:31] I don't know how TRIM plays out with those drives. with the intels, we don't use the trim mount options or periodic fstrim [14:08:32] (03CR) 10Dzahn: [C: 032] Enable base::firewall on multatuli [puppet] - 10https://gerrit.wikimedia.org/r/227416 (owner: 10Muehlenhoff) [14:09:43] basically, they have enough built-in overprovisioned capacity that TRIM isn't necessary anyways and doesn't really buy any perf boost [14:10:20] yeah, we've basically done the same with the samsungs by not filling them up all the way [14:10:35] they have less overprovisioning of course [14:11:18] yeah that's what we do on the older cache boxes with legacy drives, too. we allocate an extra partition of free space at the end to give the drive overprovisioned headroom [14:11:24] it's just that a disk test script ended up filling up the ones on 1008 [14:11:44] hence the disire to fstrim once [14:11:57] *desire [14:12:07] if you mean the partitions fill the disk but you're overprovisioning space within the partitions, that's not the same thing [14:12:09] (03CR) 10Hashar: "I found debug output to be valuable on Zuul setup. Though that surely consumes a few GBytes in Zuul case."
[puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [14:12:27] ideally you want the overprovisioned space to be a fixed chunk at the end of the drive that's a separate partition not in use by anything [14:12:31] bblack: I know it's not the same thing, but it can work out the same way if you really never fill it up [14:12:38] ack, so for this to work it is enough to leave unpartitioned space at the end of the disk vs an empty partition? [14:13:06] well that's not much different too [14:13:09] gwicke: I donno, maybe. depends on lvm/md and filesystem block allocation strategies, etc. if nothing else, you'll get superblock copies out there [14:13:20] godog: all that counts is that the logical blocks aren't recorded as having been written to [14:13:22] making it a separate unused disk partition ensures no write activity [14:14:07] if it's a fresh disk, you can just make an extra partition at the end and never use it and you're done. if you don't know whether that area was written to in the past, you have to trim that partition once when you set it up.
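Editor's note: the spare-partition approach discussed above comes down to simple arithmetic. The following is a minimal sketch in Python, not anything taken from the channel. It assumes overprovisioning is measured as spare flash divided by the space the host actually writes to, and that the drive's hidden factory overprovisioning is known as a fraction of its advertised capacity. The function names and the example figures in the comments are hypothetical.

```python
def effective_overprovisioning(advertised_gb, factory_op, spare_gb):
    """Spare flash as a fraction of the space the host actually writes to.

    advertised_gb: capacity the drive reports to the host (assumed known)
    factory_op:    hidden factory overprovisioning as a fraction of
                   advertised capacity, e.g. 0.07 (illustrative value)
    spare_gb:      size of the unused, never-written partition at the end
    """
    physical_gb = advertised_gb * (1.0 + factory_op)  # raw flash on the drive
    used_gb = advertised_gb - spare_gb                # partitions in real use
    if used_gb <= 0:
        raise ValueError("spare partition must be smaller than the drive")
    return (physical_gb - used_gb) / used_gb


def spare_partition_for_target(advertised_gb, factory_op, target_op):
    """Size of the spare partition needed to reach target_op (a fraction)."""
    physical_gb = advertised_gb * (1.0 + factory_op)
    used_gb = physical_gb / (1.0 + target_op)
    # If the factory overprovisioning already meets the target,
    # no spare partition is needed.
    return max(0.0, advertised_gb - used_gb)
```

For instance, `spare_partition_for_target(1000, 0.07, 0.40)` gives the size in GB to leave unused on a hypothetical 1000 GB drive. The result only holds if the spare area has never been written to, or has been trimmed once, as noted in the discussion.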
[14:14:22] SSDs are running a kind of filesystem in their firmware that keeps track of the mapping from logical blocks (what the kernel speaks) to actual flash blocks [14:14:59] it uses the blocks it hasn't seen writes for to wear level [14:15:16] yeah, mostly, kinda :) [14:15:47] if the disk was fully written once, then it only has the overprovisioning headroom for the wear leveling, which means that it'll be slightly slower and will wear out a bit more quickly [14:16:33] yeah but, without a trim mount option or periodic fstrim, the drive never knows when the filesystem stops using a block [14:16:46] consumer SSDs tend to have ~12-20% more flash than advertised, while server ones have closer to 40 [14:16:54] picture a simple scenario where you put one giant fs on one partition that fills the available partitionable space [14:17:18] but you only ever fill that partition to 10% capacity, but you are constantly overwriting data within that 10% limit [14:17:42] actually, the 850 pro has 7.6% only according to http://www.anandtech.com/show/8216/samsung-ssd-850-pro-128gb-256gb-1tb-review-enter-the-3d-era [14:18:05] the FS will keep overwriting blocks and allocating new blocks, but when an old block goes into the fs-level "free" pool, the disk doesn't know it can discard it without TRIM. So over time, the disk firmware eventually assumes all of the available blocks are in-use and can't be discarded [14:19:05] so I tend to put this all into 3 general case solution categories: [14:19:09] generally though, large SSDs don't wear out nearly as quickly as small ones just by having a lot more flash to burn through [14:19:30] 1) The drive has sufficient hidden overprovisioning that you don't have to care, so don't bother with anything related and treat it like a spinny disk. 
[14:20:14] 2) The drive doesn't, so you still allocate all the space, but you use TRIM so the device knows what blocks are truly free (via periodic fstrim, or trim mount options, or both) [14:20:36] 3) The drive doesn't, so you allocate a whole unused partition at the end and never write to it, and then don't use TRIM on the allocated partitions you're using [14:21:17] the tradeoff on 2 vs 3 is space vs performance. 3 will perform better (no TRIM overhead from mount option, and/or no i/o spike from periodic fstrim), 2 wastes less available FS space. [14:22:03] bblack: most filesystems don't write all over the disk, but use the blocks they have already touched first [14:22:26] that's what enables relatively quick fs shrinking, for example [14:23:00] so I'd say that there's an option 2b) where you create a large fs, but don't use trim [14:23:14] and never fill the partition more than X% [14:23:48] main advantage is that you still have the headroom available for unforeseen events [14:24:01] eh I donno about that [14:24:20] I mean, I really don't know, so I prefer the safe option of assuming the worst [14:24:46] re: how block allocation/usage maps between filesystem strategy, lvm/md layer, logical blocks for the SSD itself, etc [14:25:26] it may vary depending on your app's data usage patterns at the POSIX filesystem API level too [14:25:41] how it uses rename(2) vs overwriting files in place, etc [14:26:43] I'm not aware of any filesystem that will touch all of the logical blocks if all you do is write heavily but only ever use 1% of the space [14:27:19] and the LVS etc layer mappings all tend to be deterministic [14:27:40] yeah but you're not using anywhere near as small as 1%, and there's fragmentation to consider, and overlap of allocation (write new replacement large file before deleting old one), concurrency from many threads/processes writing, etc [14:28:13] yeah, you definitely touch a bit more than your du output [14:28:33] I just don't think you have any real 
guarantees, without TRIM or explicit overprovisioned partition, that you won't be slowly killing off the intended overprovisioning if it's inside an FS [14:29:30] yeah, for hard guarantees partitioning wins [14:29:52] (03CR) 10Chad: Phabricator: Fetch all references in Git (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/227489 (owner: 10Chad) [14:30:02] (also, if switching from strategy 2 to 3, you can use blkdiscard to TRIM off the whole partition if it was possibly previously written to) [14:30:11] PROBLEM - puppet last run on analytics1018 is CRITICAL Puppet has 2 failures [14:31:28] however, I am not actually so sure that it still matters as much as it used to [14:31:51] as the write volumes are fairly stable, but disk sizes and thus available flash to burn have gone up a lot [14:31:53] on newer/badasser drives, it really doesn't. better firmware, more internal overprovision, etc [14:32:44] with good firmware, if you have lots of write-idle periods of downtime, or just spiky write rate in general, that helps a lot too [14:32:54] but I'm assuming a scenario with 24/7 fairly constant/heavy write ops [14:33:06] (that you don't want latency spiking into from TRIM and such, either) [14:33:13] yeah [14:36:40] (03PS1) 10Filippo Giunchedi: cassandra: add restbase1008 [puppet] - 10https://gerrit.wikimedia.org/r/227994 (https://phabricator.wikimedia.org/T102015) [14:37:07] (03CR) 10Tim Landscheidt: [C: 04-1] labstore: Use hiera to set currently active labstore (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227953 (https://phabricator.wikimedia.org/T106590) (owner: 10Yuvipanda) [14:37:42] !log installed openjdk security updates on analytics* [14:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:37:55] I have a weird problem with listpages.py, someone has some experience with it? [14:38:49] (03CR) 10GWicke: [C: 031] "Moar nodes!"
[puppet] - 10https://gerrit.wikimedia.org/r/227994 (https://phabricator.wikimedia.org/T102015) (owner: 10Filippo Giunchedi) [14:39:48] (03CR) 10Mobrovac: [C: 031] "go go go" [puppet] - 10https://gerrit.wikimedia.org/r/227994 (https://phabricator.wikimedia.org/T102015) (owner: 10Filippo Giunchedi) [14:40:33] (03CR) 10Andrew Bogott: [C: 032] Labs: Let Puppet install mpt-status [puppet] - 10https://gerrit.wikimedia.org/r/226232 (https://phabricator.wikimedia.org/T104779) (owner: 10Tim Landscheidt) [14:40:45] (03PS2) 10Andrew Bogott: Labs: Let Puppet install mpt-status [puppet] - 10https://gerrit.wikimedia.org/r/226232 (https://phabricator.wikimedia.org/T104779) (owner: 10Tim Landscheidt) [14:41:29] (03PS2) 1020after4: Phabricator: Setup git config for all repositories [puppet] - 10https://gerrit.wikimedia.org/r/227488 (owner: 10Chad) [14:42:02] (03CR) 1020after4: [C: 031] "This is needed on the phabricator box in order for phabricator to track github repos. Someone +2 please" [puppet] - 10https://gerrit.wikimedia.org/r/227488 (owner: 10Chad) [14:42:49] (03PS2) 10Filippo Giunchedi: cassandra: add restbase1008 [puppet] - 10https://gerrit.wikimedia.org/r/227994 (https://phabricator.wikimedia.org/T102015) [14:42:56] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1008 [puppet] - 10https://gerrit.wikimedia.org/r/227994 (https://phabricator.wikimedia.org/T102015) (owner: 10Filippo Giunchedi) [14:43:04] (03PS1) 10Dzahn: remove multatuli from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/227997 [14:45:58] (03PS2) 10Dzahn: remove multatuli from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/227997 [14:50:49] (03PS2) 1020after4: Phabricator: Fetch all gerrit references in Git [puppet] - 10https://gerrit.wikimedia.org/r/227489 (owner: 10Chad) [14:50:49] (03CR) 10John F. Lewis: [C: 031] "For the remove since it's not used since the AuthDNS remove." 
[puppet] - 10https://gerrit.wikimedia.org/r/227997 (owner: 10Dzahn) [14:50:49] (03PS3) 1020after4: Phabricator: Fetch all gerrit references in Git [puppet] - 10https://gerrit.wikimedia.org/r/227489 (owner: 10Chad) [14:50:49] (03PS3) 1020after4: Phabricator: Setup git config for all repositories [puppet] - 10https://gerrit.wikimedia.org/r/227488 (owner: 10Chad) [14:50:49] (03PS4) 1020after4: Phabricator: Fetch all gerrit references in Git [puppet] - 10https://gerrit.wikimedia.org/r/227489 (owner: 10Chad) [14:50:49] RECOVERY - Restbase root url on restbase1008 is OK: HTTP OK: HTTP/1.1 200 - 15145 bytes in 0.009 second response time [14:50:49] (03CR) 1020after4: [C: 031] Phabricator: Fetch all gerrit references in Git [puppet] - 10https://gerrit.wikimedia.org/r/227489 (owner: 10Chad) [14:50:49] when trying to make changes to cassandra module: The branch 'master' does not exist on the given remote 'gerrit'. submodules once again .. sigh [14:53:50] (03PS3) 1020after4: Don't match Phabricator task IDs inside URLs [puppet] - 10https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: 10Ricordisamoa) [14:55:46] twentyafterfour: hey [14:56:54] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy [14:57:11] (03CR) 10Alex Monk: [C: 04-1] "I don't think you ran this through optipng." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/227962 (https://phabricator.wikimedia.org/T106375) (owner: 10Zhuyifei1999) [14:57:25] RECOVERY - puppet last run on analytics1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:57:31] (03CR) 10Muehlenhoff: "One nitpick, but looks good to me (I didn't try to build it, only from review of the debian/ directory changes)" [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227989 (owner: 10Ottomata) [14:58:21] (03CR) 10Muehlenhoff: [C: 031] Debianize 0.8.2.1 tag (031 comment) [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227989 (owner: 10Ottomata) [14:58:34] (03CR) 1020after4: "> It looks like e3fe7011e360804f1c7ab4b0c93b9bb5f6cf8b39 should have fixed this but a8999ced55773efd7a93d21bbbf77d52ed3055b2 reverted it." [puppet] - 10https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: 10Ricordisamoa) [14:58:55] (03PS2) 10Alex Monk: Project logos: Updated commonswiki.png from File:Wiki-commons.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227962 (https://phabricator.wikimedia.org/T106375) (owner: 10Zhuyifei1999) [14:59:54] RECOVERY - Restbase endpoints health on restbase1008 is OK: All endpoints are healthy [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150730T1500). [15:00:49] (03CR) 10Alex Monk: [C: 032] Project logos: Updated commonswiki.png from File:Wiki-commons.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227962 (https://phabricator.wikimedia.org/T106375) (owner: 10Zhuyifei1999) [15:00:54] (03Merged) 10jenkins-bot: Project logos: Updated commonswiki.png from File:Wiki-commons.png [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227962 (https://phabricator.wikimedia.org/T106375) (owner: 10Zhuyifei1999) [15:01:17] cool, thanks moritzm! 
[15:02:02] cmjohnson1: just a poke about row b hadoop nodes. If we can get those up this week, I can plan a kafka upgrade and expansion next week. [15:02:06] (03PS3) 10Ottomata: Debianize 0.8.2.1 tag [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227989 [15:02:07] !log manually cleaned up RB code on 1007 and 1008 [15:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:29] (03CR) 10Ottomata: [C: 032 V: 032] Recreate debianization on top of 0.8.2.1 tag [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227988 (https://phabricator.wikimedia.org/T106581) (owner: 10Ottomata) [15:02:32] !log bootstrap cassandra on restbase1008 [15:02:38] !log krenair Synchronized w/static/images/project-logos/commonswiki.png: https://gerrit.wikimedia.org/r/#/c/227962/ (duration: 00m 13s) [15:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:43] ottomata: I am working on decomming virts today...pushing for tomorrow to get the new servers racked [15:02:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:53] awesome, danke! [15:03:10] !log disabled old restbase checkout on tin to make sure it doesn't start up [15:03:10] mutante: poke ? 
[15:03:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:36] RECOVERY - Cassandra database on restbase1008 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [15:05:29] (03CR) 1020after4: Don't match Phabricator task IDs inside URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: 10Ricordisamoa) [15:05:32] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/227956 (https://phabricator.wikimedia.org/T106346) (owner: 10Filippo Giunchedi) [15:05:54] 6operations, 10Beta-Cluster, 10Traffic, 5Patch-For-Review: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1494796 (10demon) Got instances of `deployment-cache-*04` running jessie, all succeeding with puppet (minus TLS stuff, which I'm skipping for now to prevent the eterna... [15:07:49] (03PS2) 10GWicke: Remove trebuchet setup from restbase config [puppet] - 10https://gerrit.wikimedia.org/r/219253 [15:09:02] (03CR) 10Muehlenhoff: [C: 031] Debianize 0.8.2.1 tag [debs/kafka] (debian) - 10https://gerrit.wikimedia.org/r/227989 (owner: 10Ottomata) [15:12:02] cool, thanks moritzm, i'm building it from that now, and then will do a bunch of testing in labs with upgrades and migrations before I merge, just in case I find something missing i want to amend [15:12:42] (03CR) 10GWicke: "We had old code start up on new production nodes earlier today, which is a bad thing. 
Thankfully those nodes weren't pooled yet, and I hav" [puppet] - 10https://gerrit.wikimedia.org/r/219253 (owner: 10GWicke) [15:12:45] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [15:13:05] ottomata: ok [15:15:17] (03Abandoned) 10Chad: es-tool: Restart ganglia after restarting Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/224435 (owner: 10Chad) [15:15:39] (03PS7) 10Chad: Elastic: move auto_create_index into hiera instead of role [puppet] - 10https://gerrit.wikimedia.org/r/207140 [15:15:52] Krenair: Can I throw two things into the SWAT? [15:15:55] yep [15:16:02] Krenair: (Well, one thing but two branches.) [15:16:10] (03PS1) 10Alex Monk: Add interwiki import sources for mrwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227999 (https://phabricator.wikimedia.org/T105116) [15:16:16] Gah, CI is always slow when you want it not to be. [15:16:21] (03CR) 10Faidon Liambotis: [C: 031] Add legacy bits.wm.o support to text-lb VCL [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [15:20:26] (03PS2) 10Alex Monk: Add interwiki import sources for mrwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227999 (https://phabricator.wikimedia.org/T105116) [15:20:44] (03CR) 10Alex Monk: [C: 032] Add interwiki import sources for mrwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227999 (https://phabricator.wikimedia.org/T105116) (owner: 10Alex Monk) [15:20:49] (03CR) 10Faidon Liambotis: [C: 031] tlsproxy: refactor/cleanup, beta work [puppet] - 10https://gerrit.wikimedia.org/r/227404 (https://phabricator.wikimedia.org/T97593) (owner: 10BBlack) [15:20:51] (03Merged) 10jenkins-bot: Add interwiki import sources for mrwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227999 (https://phabricator.wikimedia.org/T105116) (owner: 10Alex Monk) [15:21:05] Krenair: 
https://gerrit.wikimedia.org/r/#/q/I434acd82c89a3d0a3cad1e4add36134072db26e9,n,z [15:21:15] * James_F adds to the calendar. [15:21:40] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/227999/ (duration: 00m 12s) [15:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:00] 6operations, 10Traffic: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#1494832 (10BBlack) 3NEW [15:23:10] 6operations, 10Traffic: Decom bits.wikimedia.org hostname - https://phabricator.wikimedia.org/T107430#1494841 (10BBlack) [15:23:11] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1494842 (10BBlack) [15:24:49] (03CR) 10Muehlenhoff: "That patch was actually merged already, so retracting my -1" [puppet] - 10https://gerrit.wikimedia.org/r/227944 (owner: 10Muehlenhoff) [15:24:50] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1190727 (10BBlack) [15:26:56] (03PS2) 10Dzahn: zim: Add missing ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/227969 (https://phabricator.wikimedia.org/T105040) (owner: 10Muehlenhoff) [15:29:59] !log krenair Synchronized php-1.26wmf16/extensions/MobileFrontend/includes/Resources.php: https://gerrit.wikimedia.org/r/#/c/228000/ (duration: 00m 11s) [15:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:25] !log krenair Synchronized php-1.26wmf15/extensions/MobileFrontend/includes/Resources.php: https://gerrit.wikimedia.org/r/#/c/228001/ (duration: 00m 12s) [15:30:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:43] James_F, ^ [15:30:49] Thanks. [15:30:51] * James_F tests. 
[15:30:52] (03PS10) 10BBlack: Add legacy bits.wm.o support to text-lb VCL [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) [15:31:36] Hmm. [15:31:41] Works on enwiki. [15:32:22] (03CR) 10BBlack: [C: 032] Add legacy bits.wm.o support to text-lb VCL [puppet] - 10https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [15:33:45] (03CR) 10Dzahn: [C: 032] zim: Add missing ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/227969 (https://phabricator.wikimedia.org/T105040) (owner: 10Muehlenhoff) [15:34:08] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1494881 (10BBlack) [15:35:22] it would be nice if Patch-For-Review tag killed itself when the outstanding patch(es) merge [15:38:10] bblack: https://phabricator.wikimedia.org/T104413 :) [15:38:29] 6operations, 10CirrusSearch, 6Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7.1 - https://phabricator.wikimedia.org/T106090#1494895 (10chasemp) >>! In T106090#1494356, @dcausse wrote: > Upgrade is delayed to next week because we'll switch directly to elasticsearch-1.7.1. > I propose Tue, Aug 4, same t... [15:38:53] I asked about that and was told it has no state info on other patches on a task so [15:39:04] it is basically too dumb to do such [15:39:13] better idea is to not use gerrit? :) [15:39:15] I would bet it will end up pending on switching to arc [15:39:25] then it gest better integration and ability to wipe the tag via state, etc [15:39:30] s/gest/gets/ [15:39:57] there is a whole other interface for changesets etc in that case and it doesn't do the tag thing at all (natively) [15:40:06] ah [15:40:29] that tag things is a hack port of a hack port of a hack integration between bz and gerrit. 
it's all kinds of un [15:40:31] fun even [15:40:35] PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=84%) [15:40:39] unfun? [15:41:23] IMHO, in the interim it might be better to just not automatically apply the Patch-for-Review tag at all [15:41:36] and then maybe people will manually add the tag when they're really searching for reviewers [15:41:40] agree, i doubt we actually use it more than we are removing it [15:41:43] I agree I don't pay attention to it at all but some rely on it religiously [15:42:04] yeah I only notice it when I'm on clinic duty [15:42:06] bblack: fwiw I just do "ref T123" or whatever and has it link post merge to the task and it shouldn't do the tag thing [15:42:12] in the commit message I mean [15:42:24] I //think// only "task: T123" does the tag [15:42:41] (Bug|Task): T123 [15:42:52] ah [15:42:54] as opposed to that, i love how phab tasks get auto-updated on merge when we add the reference in the commit message. we should always keep that [15:44:11] 6operations, 7discovery-system: implement write locking in conftool - https://phabricator.wikimedia.org/T107286#1494916 (10Joe) a:3Joe [15:44:22] 6operations, 7discovery-system: implement write locking in conftool - https://phabricator.wikimedia.org/T107286#1491599 (10Joe) p:5Triage>3Normal [15:44:31] I just today realized I can go into the emailpreferences link I've been ignoring at the bottom of every phab email, and switch " A task's subscribers changed" to "Ignore" [15:44:43] I think that's going to be a big phab-spam reduction for me heh [15:45:09] i changed all the email to notifications in phab web ui [15:45:31] and removed that one "subscribers changes" as well, yea [15:46:49] also 'workboard move' was generating some spam for me [15:47:14] moved to "doing" :p [15:48:18] (03CR) 10Thcipriani: [C: 031] "Since the repo_config has move into hiera this patch wouldn't require any local cherry picks in labs to use trebuchet for beta/labs testin" 
[puppet] - 10https://gerrit.wikimedia.org/r/219253 (owner: 10GWicke) [15:51:34] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 5077.36395871 [16:03:34] (03CR) 10Mobrovac: [C: 031] "> I think @mobrovac's comments about startup have since been resolved (we don't start up on boot)." [puppet] - 10https://gerrit.wikimedia.org/r/219253 (owner: 10GWicke) [16:11:28] 6operations, 10ops-eqiad: ms-be1005.eqiad.wmnet: slot=9 dev=sdj failed - https://phabricator.wikimedia.org/T106654#1494978 (10fgiunchedi) a:3Cmjohnson @cmjohnson did it happen? [16:17:34] ori: that l10n.php stuff that Krenair was trying to debug last night is whatever voodoo you did to build the LCStoreStaticArray data files on tin. That was the thing I asked you to look at and undo as you were about to get on a plane to leave Mexico City. [16:18:16] l10nupdate is running twice, once to build the cdbs and once to build the php caches that we aren't using. The php cache run is failing every night I think [16:18:35] Sam poked about on tin and couldn't find the second cron [16:18:55] there is only one cron [16:19:15] it doesn't make sense [16:19:43] where is /usr/local/bin/l10nupdate being called from anyway? the cron uses the -l one [16:21:20] sorry, "-1", not "-l" [16:21:56] Krenair: my guess would be that ori cron'ed it from his personal crontab but I don't have the server rights to verify that [16:22:06] maybe [16:22:28] 10Ops-Access-Requests, 6operations: Requesting stat1003 access for mholloway - https://phabricator.wikimedia.org/T95506#1495007 (10Mholloway) I'm on a loaner laptop while my main work laptop is out for repairs. Could you please add the SSH key in this paste to my account so I can log in from this machine? ht... 
[16:22:32] Steinsplitter: hi [16:22:34] It was all a hurry up project he was doing right before wikimania [16:22:48] 10Ops-Access-Requests, 6operations: Requesting stat1003 access for mholloway - https://phabricator.wikimedia.org/T95506#1495011 (10Mholloway) 5Resolved>3Open [16:25:55] bd808: umm, is it known that exception.log on fluorine is missing stuff? [16:26:00] legoktm: jobserver stucked, but user now renamed sucessfully after some hours :) [16:26:07] (03CR) 10Dzahn: "doing the secondary server first, apergos told me in the past this is not used much as oppposed to dataset1001" [puppet] - 10https://gerrit.wikimedia.org/r/227713 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [16:26:30] legoktm: define "missing stuff"? [16:26:46] well the traceback for one [16:26:57] welcome to hhvm [16:27:19] I thought that was just fatal.log? [16:28:35] oh... I may have broken that actually :/ [16:29:19] do we set $wgLogExceptionBacktrace? [16:29:47] (03PS3) 10Dzahn: Enable base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/227713 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [16:29:59] yes we do [16:30:30] MWExceptionHandler isn't checking the value of $wgLogExceptionBacktrace [16:30:42] it has the "global ..." 
line and then it's unused [16:30:49] (03PS7) 10Hashar: nodepool: setup python logger [puppet] - 10https://gerrit.wikimedia.org/r/224106 [16:30:51] so I made this change -- https://github.com/wikimedia/mediawiki/blob/master/includes/debug/logger/LegacyLogger.php#L219-L227 -- but $wgLogExceptionBacktrace=true on enwiki at least [16:30:53] (03PS4) 10Dzahn: Enable base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/227713 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [16:31:26] (03PS1) 10RobH: replacing mholloway's ssh pub key [puppet] - 10https://gerrit.wikimedia.org/r/228013 [16:31:28] (03CR) 10Hashar: "Fixed the puppet typo" [puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [16:32:25] 10Ops-Access-Requests, 6operations: Requesting stat1003 access for mholloway - https://phabricator.wikimedia.org/T95506#1495021 (10Krenair) Should really be a new ticket instead of reopening this [16:32:33] (03CR) 10RobH: [C: 032] replacing mholloway's ssh pub key [puppet] - 10https://gerrit.wikimedia.org/r/228013 (owner: 10RobH) [16:32:57] (03CR) 10Dzahn: [C: 032] Enable base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/227713 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [16:33:13] (03PS5) 10Dzahn: Enable base::firewall on ms1001 [puppet] - 10https://gerrit.wikimedia.org/r/227713 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [16:34:39] legoktm: stacktraces are in logstash and exception-json.log but that probably doesn't make you happier [16:34:56] grrrit-wm: missing stuff [16:35:46] :< [16:36:20] legoktm: I'm looking into it. 
This wasn't my intent [16:36:29] ok, thanks :) [16:39:14] 10Ops-Access-Requests, 6operations: Requesting stat1003 access for mholloway - https://phabricator.wikimedia.org/T95506#1495038 (10RobH) 5Open>3Resolved a:3RobH I've chatted with @mholloway about this request in irc already, so he is aware of the following: * the old key has been replaced with the new o... [16:39:29] 10Ops-Access-Requests, 6operations: Requesting stat1003 access for mholloway - https://phabricator.wikimedia.org/T95506#1495042 (10RobH) a:5RobH>3None [16:40:34] PROBLEM - salt-minion processes on ms1001 is CRITICAL: Timeout while attempting connection [16:41:24] PROBLEM - HTTP on ms1001 is CRITICAL: Connection timed out [16:41:34] PROBLEM - configured eth on ms1001 is CRITICAL: Timeout while attempting connection [16:41:52] legoktm: doh! we don't use LegacyLogger in prod any more [16:41:54] PROBLEM - NFS on ms1001 is CRITICAL: Connection timed out [16:42:03] so yeah I caused this [16:42:04] PROBLEM - dhclient process on ms1001 is CRITICAL: Timeout while attempting connection [16:42:42] i am handling the ms1001 issue [16:46:07] (03PS1) 10Dzahn: dumps: remove odysseus.ip6.fi.muni.cz from rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/228014 [16:48:21] (03PS2) 10Dzahn: dumps: remove odysseus.ip6.fi.muni.cz from rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/228014 [16:48:54] (03PS3) 10Dzahn: dumps: remove odysseus.ip6.fi.muni.cz from rsync clients [puppet] - 10https://gerrit.wikimedia.org/r/228014 [16:49:25] (03CR) 10Dzahn: [C: 032] "short-term fix for ms1001" [puppet] - 10https://gerrit.wikimedia.org/r/228014 (owner: 10Dzahn) [16:50:37] (03CR) 10Dzahn: "this makes me think again that traditionally the mantra has always been to not use DNS names in firewall rules" [puppet] - 10https://gerrit.wikimedia.org/r/228014 (owner: 10Dzahn) [16:51:25] RECOVERY - HTTP on ms1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5506 bytes in 0.014 second response time [16:51:28] (03CR) 10Dzahn: "but 
also: ferm should be able to handle v6-only hosts AND it should not just bail out completely just because a lookup fails" [puppet] - 10https://gerrit.wikimedia.org/r/228014 (owner: 10Dzahn) [16:51:36] RECOVERY - configured eth on ms1001 is OK - interfaces up [16:51:55] RECOVERY - NFS on ms1001 is OK: TCP OK - 0.003 second response time on port 2049 [16:52:05] RECOVERY - dhclient process on ms1001 is OK: PROCS OK: 0 processes with command name dhclient [16:52:36] RECOVERY - salt-minion processes on ms1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:54:50] (03PS1) 10Giuseppe Lavagetto: conftool: add write-locks to syncer and confctl [software/conftool] - 10https://gerrit.wikimedia.org/r/228015 (https://phabricator.wikimedia.org/T107286) [16:55:01] (03CR) 10jenkins-bot: [V: 04-1] conftool: add write-locks to syncer and confctl [software/conftool] - 10https://gerrit.wikimedia.org/r/228015 (https://phabricator.wikimedia.org/T107286) (owner: 10Giuseppe Lavagetto) [16:55:26] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [16:56:31] <_joe_> ugh, the jenikins job fails because python-etcd 0.4.0 is not on pip (yet) [16:56:55] <_joe_> time to poke my friend jose :) [17:06:02] 6operations, 10ops-eqiad, 6Discovery, 10Wikidata, and 2 others: Change hardware RAID controller on wmf3543, wmf3544 - https://phabricator.wikimedia.org/T107152#1495103 (10ksmith) @Cmjohnson @Joe: Is this likely to be done in the next couple days? We're trying to allocate resources so a timeline for getting... 
[17:11:41] (03CR) 10Dzahn: "besides that the rules work:" [puppet] - 10https://gerrit.wikimedia.org/r/228014 (owner: 10Dzahn) [17:12:30] RECOVERY - RAID on ms-be1005 is OK optimal, 13 logical, 13 physical [17:17:46] (03PS3) 10BBlack: tlsproxy: refactor/cleanup, beta work [puppet] - 10https://gerrit.wikimedia.org/r/227404 (https://phabricator.wikimedia.org/T97593) [17:17:50] 6operations, 10ops-eqiad: ms-be1005.eqiad.wmnet: slot=9 dev=sdj failed - https://phabricator.wikimedia.org/T106654#1495130 (10Cmjohnson) 5Open>3Resolved yes, the disk was replaced...all are online, spun up cmjohnson@ms-be1005:~$ sudo megacli -PDList -aALL |grep "Firmware state" Firmware state: Online, Spu... [17:19:19] (03CR) 10BBlack: [C: 032] tlsproxy: refactor/cleanup, beta work [puppet] - 10https://gerrit.wikimedia.org/r/227404 (https://phabricator.wikimedia.org/T97593) (owner: 10BBlack) [17:25:09] RECOVERY - puppet last run on ms-be1005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:25:09] (03CR) 10Tim Landscheidt: "This should be good to go now." [puppet] - 10https://gerrit.wikimedia.org/r/227493 (owner: 10Tim Landscheidt) [17:30:11] (03PS2) 10Yuvipanda: Labs: Remove various obsolete migration code [puppet] - 10https://gerrit.wikimedia.org/r/227493 (owner: 10Tim Landscheidt) [17:30:22] (03CR) 10Yuvipanda: [C: 032 V: 032] "Thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/227493 (owner: 10Tim Landscheidt) [17:30:49] (03PS1) 10BBlack: bits.wm.o -> text-cluster [dns] - 10https://gerrit.wikimedia.org/r/228021 (https://phabricator.wikimedia.org/T95448) [17:31:03] 6operations, 10ops-eqiad: ms-be1005.eqiad.wmnet: slot=9 dev=sdj failed - https://phabricator.wikimedia.org/T106654#1495198 (10Cmjohnson) Return Shipping Information USPS 9202 3946 5301 2428 0396 97 FEDEX 9611918 2393026 49531373 [17:32:39] 6operations, 10Wikimedia-Mailing-lists: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1495202 (10Krenair) Looks like a task that someone with sodium access would be needed to do: https://wikitech.wikimedia.org/wiki/Lists.wikimedia.org#Rename_a_mailing_list [17:34:00] jouncebot: next [17:34:00] In 0 hour(s) and 25 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150730T1800) [17:34:08] (03PS3) 10Yuvipanda: labstore: Use hiera to set currently active labstore [puppet] - 10https://gerrit.wikimedia.org/r/227953 (https://phabricator.wikimedia.org/T106590) [17:34:13] (03CR) 10jenkins-bot: [V: 04-1] labstore: Use hiera to set currently active labstore [puppet] - 10https://gerrit.wikimedia.org/r/227953 (https://phabricator.wikimedia.org/T106590) (owner: 10Yuvipanda) [17:34:15] 6operations, 10Wikimedia-Mailing-lists: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1495214 (10Krenair) Might require downtime though? 
[17:34:25] (03PS1) 10BryanDavis: logging: Enable stacktrace printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228023 (https://phabricator.wikimedia.org/T107440) [17:35:21] (03PS3) 10BBlack: Remove multiple subdomain wiki rewrites from wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/227172 (https://phabricator.wikimedia.org/T102814) (owner: 10Reedy) [17:35:59] (03CR) 10BryanDavis: "One downside of this is that the stacktraces printed will be full stacktraces rather than redacted traces as produced by MWExceptionHandle" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228023 (https://phabricator.wikimedia.org/T107440) (owner: 10BryanDavis) [17:36:30] 6operations, 10Wikimedia-Mailing-lists: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1495224 (10Selsharbaty-WMF) Thanks, Krenair! There is no problem with time. We are not in a hurry. Does your comment about sodium access mean that my ticket here is in the wrong place? [17:36:42] (03PS4) 10BBlack: Remove multiple subdomain wiki rewrites from wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/227172 (https://phabricator.wikimedia.org/T102814) (owner: 10Reedy) [17:36:49] (03CR) 10BBlack: [C: 032 V: 032] Remove multiple subdomain wiki rewrites from wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/227172 (https://phabricator.wikimedia.org/T102814) (owner: 10Reedy) [17:37:28] bd808: hmm, will that affect the stack traces we output on test wikis? [17:37:40] 6operations, 10Wikimedia-Mailing-lists: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1495228 (10Krenair) >>! In T107445#1495224, @Selsharbaty-WMF wrote: > Does your comment about sodium access mean that my ticket here is in the wrong place? Nope, just means someone from #oper... [17:38:35] legoktm: nope! 
The testwiki2?.log files run through the legacy handler that uses our internal magic [17:38:45] ok [17:39:07] (03PS4) 10Yuvipanda: labstore: Use hiera to set currently active labstore [puppet] - 10https://gerrit.wikimedia.org/r/227953 (https://phabricator.wikimedia.org/T106590) [17:39:11] but we could subclass the normal monolog formatter to get the traces to always be redacted [17:39:19] andrewbogott: Coren ^^ [17:39:26] I didn't think of that until I uploaded the config change [17:39:38] that == redacted traces [17:40:45] bd808: and we'd do the subclassing in core? [17:40:52] yeah [17:41:15] do we still need the ObjectFactory change then? [17:41:39] no, we wouldn't but it doesn't hurt anything [17:41:50] so ... I'll whip up a new formatter [17:42:12] and we can ditch the backport of the ObjectFactory change [17:42:35] ok [17:45:02] (03CR) 10Yuvipanda: [C: 04-1] "I'd suggest testing on labsdb1005 (which has a different config but not too different) first, and then on labsdb1003 (same config as 1002)" [puppet] - 10https://gerrit.wikimedia.org/r/227944 (owner: 10Muehlenhoff) [17:45:17] (03CR) 10Dzahn: "please see https://gerrit.wikimedia.org/r/#/c/228014/ before doing anything with dataset1001" [puppet] - 10https://gerrit.wikimedia.org/r/227712 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) [17:45:32] !log killing some long running queries on db1042 [17:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:46:09] (03CR) 10Dzahn: "< cirrus> if you want to query v6 addresses, you must specify type=AAAA" [puppet] - 10https://gerrit.wikimedia.org/r/228014 (owner: 10Dzahn) [17:46:24] expect a spike on 500 that you can safely ignore [17:47:51] (03CR) 10Dzahn: [C: 04-1] "we need to talk about the issue with v6-only hosts first (and separately that a failed DNS lookup breaks it so easily)" [puppet] - 10https://gerrit.wikimedia.org/r/227712 (https://phabricator.wikimedia.org/T104991) (owner: 10Muehlenhoff) 
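[editor's note] The two concerns raised above about hostnames in firewall rules (v6-only hosts need an AAAA lookup, and one failed lookup should not take down the whole ruleset) can be sketched roughly as follows. This is only an illustration in Python, not how ferm itself resolves names; the function name `resolve_all`, the injectable `resolver` parameter, and the example hostname are hypothetical and do not come from the log.

```python
import socket


def resolve_all(hostname, resolver=socket.getaddrinfo):
    """Look up A and AAAA records separately.

    Returns (ipv4_addresses, ipv6_addresses). A failed lookup for one
    address family yields an empty list for that family instead of
    raising, so a v6-only host still produces usable addresses.
    """
    v4, v6 = [], []
    for family, bucket in ((socket.AF_INET, v4), (socket.AF_INET6, v6)):
        try:
            infos = resolver(hostname, None, family, socket.SOCK_STREAM)
        except socket.gaierror:
            # No record for this family (or the lookup failed): skip it
            # rather than aborting the whole rule generation.
            continue
        for info in infos:
            addr = info[4][0]  # sockaddr tuple: first element is the IP
            if addr not in bucket:
                bucket.append(addr)
    return v4, v6


if __name__ == "__main__":
    # Hypothetical hostname; replace with a real rsync client to try it.
    print(resolve_all("client.example.org"))
```

A rule generator built on something like this could emit IPv4 and IPv6 rules independently and log hosts for which both lists come back empty, instead of refusing to load any rules.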
[17:51:18] (03PS1) 10BBlack: decom bits service IPs [dns] - 10https://gerrit.wikimedia.org/r/228029 (https://phabricator.wikimedia.org/T95448) [17:55:33] 6operations, 6Labs, 3Labs-Sprint-107, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1495302 (10yuvipanda) @andrew can you send out a scheduling email? If labvirt1009 is ok with you I can provide exact toollabs failover mechanisms. [17:56:28] !log decom virt1001-virt1009 [17:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:00:04] twentyafterfour greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150730T1800). [18:01:31] any problems in wmf16 that should delay the train? I'm assuming not since everything seems quiet [18:01:55] (03PS1) 1020after4: all wikis to 1.26wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228030 [18:02:09] (03CR) 1020after4: [C: 032] all wikis to 1.26wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228030 (owner: 1020after4) [18:02:15] (03Merged) 10jenkins-bot: all wikis to 1.26wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228030 (owner: 1020after4) [18:02:32] (03CR) 10Alex Monk: [C: 04-1] Add python3 script to populate meta_p (031 comment) [software] - 10https://gerrit.wikimedia.org/r/227505 (https://phabricator.wikimedia.org/T107094) (owner: 10Alex Monk) [18:02:37] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: all wikis to 1.26wmf16 [18:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:07:33] (03PS1) 10Cmjohnson: Removing dhcpd entries for virt1001-1009 for decommissiong and rack removal [puppet] - 10https://gerrit.wikimedia.org/r/228031 [18:09:11] (03PS1) 10BBlack: switch bits to meta refs in dataset/snapshot html [puppet] - 10https://gerrit.wikimedia.org/r/228032 
(https://phabricator.wikimedia.org/T95448) [18:09:13] (03PS1) 10BBlack: Remove cache::bits roles from bits-cluster hosts [puppet] - 10https://gerrit.wikimedia.org/r/228033 (https://phabricator.wikimedia.org/T95448) [18:09:15] (03PS1) 10BBlack: Decom bits cluster varnish/lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/228034 (https://phabricator.wikimedia.org/T95448) [18:09:21] (03PS4) 10Alex Monk: Add python3 script to populate meta_p [software] - 10https://gerrit.wikimedia.org/r/227505 (https://phabricator.wikimedia.org/T107094) [18:10:25] (03CR) 10BBlack: [C: 032] switch bits to meta refs in dataset/snapshot html [puppet] - 10https://gerrit.wikimedia.org/r/228032 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [18:11:34] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1495410 (10BBlack) [18:12:17] (03PS1) 10Dzahn: add base::firewall on labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/228035 [18:12:25] (03PS1) 10Cmjohnson: Removing dns entries for virt1001-1009 including mgmt since these are going to be removed from the racks. [dns] - 10https://gerrit.wikimedia.org/r/228036 [18:12:33] 6operations, 10Traffic, 5Patch-For-Review, 7Varnish: Move bits traffic to text/mobile clusters - https://phabricator.wikimedia.org/T95448#1190727 (10BBlack) We probably shouldn't move forward on the bits->text switch at the DNS level (and patches beyond that) until T106966 is resolved, so that we don't hav... [18:14:06] 6operations, 6Commons, 10MediaWiki-Special-pages, 7Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1495424 (10Legoktm) [18:14:11] (03PS2) 10Cmjohnson: Removing dhcpd entries for virt1001-1009 for decommissiong and rack removal [puppet] - 10https://gerrit.wikimedia.org/r/228031 [18:15:00] (03CR) 10BBlack: [C: 04-1] "Pre-staged, do not merge yet!" 
[dns] - 10https://gerrit.wikimedia.org/r/228021 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [18:15:06] (03CR) 10Cmjohnson: [C: 032] Removing dns entries for virt1001-1009 including mgmt since these are going to be removed from the racks. [dns] - 10https://gerrit.wikimedia.org/r/228036 (owner: 10Cmjohnson) [18:15:10] (03CR) 10BBlack: [C: 04-1] "Pre-staged, do not merge yet!" [dns] - 10https://gerrit.wikimedia.org/r/228029 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [18:15:23] (03CR) 10BBlack: [C: 04-1] "Pre-staged, do not merge yet!" [puppet] - 10https://gerrit.wikimedia.org/r/228033 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [18:15:25] (03CR) 10Cmjohnson: [C: 032] Removing dhcpd entries for virt1001-1009 for decommissiong and rack removal [puppet] - 10https://gerrit.wikimedia.org/r/228031 (owner: 10Cmjohnson) [18:15:31] (03CR) 10BBlack: [C: 04-1] "Pre-staged, do not merge yet!" [puppet] - 10https://gerrit.wikimedia.org/r/228034 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack) [18:18:43] (03PS1) 1020after4: Update mediawiki version regex to support semantic version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228039 (https://phabricator.wikimedia.org/T67306) [18:19:13] YuviPanda: what's a good test to see if labsdb is still working. where should i be able to connect from. i am following your advice on gerrit and do it on labsdb1005 only first [18:20:04] looks up GRANTs [18:20:46] YuviPanda: was that ping for the hiera patch? Or something else? [18:21:24] andrewbogott: how to get root password for labsdb ?:) [18:21:39] private repo? 
[18:21:59] (03PS1) 10Jforrester: VisualEditor: Switch config from …Namespaces to …AvailableNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228040 [18:22:01] (03PS1) 10Jforrester: Enable VisualEditor on NS_PROJECT for meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228041 [18:22:36] mutante, passwords::misc::scripts::mysql_labsdb_root_pass [18:22:46] Krenair: :) thanks [18:23:11] (03PS2) 10Jforrester: Enable VisualEditor on NS_PROJECT for meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228041 (https://phabricator.wikimedia.org/T107003) [18:23:27] there's references to it in modules/mysql_wmf/templates/skrillex.yaml.erb [18:23:38] oh, the skrillex thing:) [18:23:44] andrewbogott: hiera patch [18:23:52] will always make me think of asher [18:24:02] YuviPanda: ok, will look shortly [18:24:08] andrewbogott: thanks [18:24:52] just that i dont see it in the repo yet.. keeps looking [18:25:24] mutante: why do you need the root password? [18:25:30] mutante: there's also a labsdbadmin user account now [18:25:42] YuviPanda: i want to see all the GRANTs [18:27:24] i found the labsdbadmin password, trying [18:28:01] access denied [18:28:14] it is ip restricted [18:28:17] user 'labsdbadmin'@'localhost' (using password: YES) [18:28:18] login as root [18:28:43] use pt-show-grants to show grants [18:29:10] jynus: i don't know the root password [18:29:20] i dont see that in the passwords class [18:30:03] (03PS3) 10coren: nagios_common: add new checks for systemd unit health [puppet] - 10https://gerrit.wikimedia.org/r/227887 [18:30:33] Hm, hah. More old crap. [18:31:15] (03PS4) 10coren: nrpe: add new checks for systemd unit health [puppet] - 10https://gerrit.wikimedia.org/r/227887 [18:32:01] YuviPanda: _joe_ ^^ should address all the issues, I think. 
[18:32:42] (03PS1) 10Ori.livneh: Use relative URLs to construct load.php requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228042 (https://phabricator.wikimedia.org/T106966) [18:32:57] (03CR) 10Ori.livneh: [C: 032] Use relative URLs to construct load.php requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228042 (https://phabricator.wikimedia.org/T106966) (owner: 10Ori.livneh) [18:33:02] (03Merged) 10jenkins-bot: Use relative URLs to construct load.php requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228042 (https://phabricator.wikimedia.org/T106966) (owner: 10Ori.livneh) [18:33:44] !log ori Synchronized wmf-config/CommonSettings.php: I6665bf31: Use relative URLs to construct load.php requests (duration: 00m 12s) [18:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:34:15] Coren: nice [18:34:19] not sure if I should CR perl tho [18:35:04] ori, are you running l10nupdate from your crontab on tin? [18:35:24] ....no [18:35:32] bd808, ^ [18:35:32] [tin:/srv/mediawiki-staging] $ crontab -l [18:35:33] no crontab for ori [18:35:37] ok [18:35:46] YuviPanda: FWIW, that code was actually tested to run properly. [18:35:52] (03CR) 10Andrew Bogott: [C: 031] "Yep, I like this way much better!" [puppet] - 10https://gerrit.wikimedia.org/r/227953 (https://phabricator.wikimedia.org/T106590) (owner: 10Yuvipanda) [18:36:22] YuviPanda: But now we're kinda stuck with _joe_'s -2 anyways so he gets to review it. :-) [18:36:47] ori: question from Krenair is re https://phabricator.wikimedia.org/T106460 [18:37:51] yes, I know. I appreciate that Krenair is investigating that -- it is probably something I broke. I'm not looking at it only because I'm focussing on https://phabricator.wikimedia.org/T102199 and T106966. The l10nupdate is weird but AFAICT it does eventually work, so it is less urgent. 
[18:38:41] ori: yeah, just hadn't seen a response yet from you, so making sure you knew context [18:39:06] (03PS8) 10Andrew Bogott: nodepool: setup python logger [puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [18:39:10] Coren true! [18:39:19] coren do you want to setup the timers in the meantime? [18:39:22] sorry, I actually did some debugging last night and sent Krenair some findings. I should have updated Phab as well. [18:39:44] YuviPanda: Didn't you want to test the alarms first? Ima start a run of backups though. [18:40:08] ori: /me nods [18:40:25] coren wait don't [18:40:29] hmm [18:40:50] coren we could have the patches in parallel - _joe_ isn't around all day today I guess so we'll have to wait the night [18:41:05] Coren so we can work on and maybe not merge teh timer patches [18:41:15] coren also if you do a manual run now we can't test the alarms ;) [18:41:51] YuviPanda: Yes we can, just with reduced limits - the current backups are too old for my taste. [18:42:20] Coren hmm, fair enough. if that's the case then we can test the alarms that way too - so I'd prefer we not start them manually but just setup the timers and have them start it? [18:42:30] I guess if we set timer to start every other day it'll start immediately after merge? [18:43:30] YuviPanda: ... I honestly don't know how systemd behaves with new timers. [18:44:20] greg-g, yeah those warnings we saw probably aren't what's causing it to fail [18:44:29] gotcha [18:44:40] it comes from running this: [18:44:40] coren hmm, alright. I guess we could start them manually if systemd doesn't, but if you feel anxious do start a manual run :) [18:44:42] if /usr/local/bin/mwscript extensions/LocalisationUpdate/update.php --wiki="$mwDbName" [18:44:42] then [18:44:46] # success [18:44:47] else [18:44:48] # fail [18:45:30] YuviPanda: I do. Except for -others which you tested the current backups are nearly 6 days old. 
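Editor's note: the `if /usr/local/bin/mwscript ...; then ... else ... fi` fragment quoted above branches only on the command's exit status, not on anything the command prints. A minimal sketch of that behaviour follows, assuming nothing about the real update.php; the two child commands are hypothetical stand-ins, one that prints a warning and exits 0 and one that exits 1.

```python
import subprocess
import sys


def run_update(cmd):
    """Run cmd and classify it by exit status only, like a shell `if cmd; then`."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    status = "success" if proc.returncode == 0 else "fail"
    return status, proc.stderr.strip()


# Hypothetical stand-ins for the real maintenance script (not the actual command).
noisy_ok = [sys.executable, "-c",
            "import sys; print('Warning: something', file=sys.stderr)"]
hard_fail = [sys.executable, "-c", "import sys; sys.exit(1)"]

if __name__ == "__main__":
    print(run_update(noisy_ok))   # warning on stderr, exit status 0
    print(run_update(hard_fail))  # no output, exit status 1
```

In this sketch a warning on stderr does not flip the branch; only a non-zero exit status does.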
[18:45:35] {{done}} [18:45:40] (03CR) 10Andrew Bogott: [C: 032] nodepool: setup python logger [puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [18:45:43] and it fails sometimes. But run that on it's own (tested with mwDbName=cawikiquote), you get the warnings, and it still succeeds [18:45:47] coren cool [18:46:46] (03CR) 10Andrew Bogott: "This applied cleanly. I'll leave it to you to verify that it's doing something useful :)" [puppet] - 10https://gerrit.wikimedia.org/r/224106 (owner: 10Hashar) [18:47:27] YuviPanda: Did you expect monotonic timers or realtime? I'd lean towards the latter so that backups are predicatbly timed. [18:48:13] (03PS1) 10Ori.livneh: Use absolute URLs to construct load.php requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228044 (https://phabricator.wikimedia.org/T106966) [18:48:27] (03CR) 10Ori.livneh: [C: 032] Use absolute URLs to construct load.php requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228044 (https://phabricator.wikimedia.org/T106966) (owner: 10Ori.livneh) [18:48:33] (03Merged) 10jenkins-bot: Use absolute URLs to construct load.php requests [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228044 (https://phabricator.wikimedia.org/T106966) (owner: 10Ori.livneh) [18:48:45] coren let's start with realtime ones. eventually we should move to monotonic ones tho, since realtime ones make us have to predict how long backups take [18:49:17] !log ori Synchronized wmf-config/CommonSettings.php: I6db1771bf4: Use absolute URLs to construct load.php requests (duration: 00m 12s) [18:49:24] YuviPanda: Hm. I think monotonic timers specify the start-to-start interval, not the end-to-start so you still have to predict. [18:49:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:40] coren OnUnitInactiveSec= maybe [18:50:12] Ah, hm. Didn't know about that one. [18:50:29] Let's start with realtime first anyways - easy to change later at need. 
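Editor's note: the timer question above comes down to whether the interval is measured from the previous start or from the previous end of a run. A rough sketch, assuming the usual reading of systemd's `OnUnitActiveSec=` (relative to the last activation) and `OnUnitInactiveSec=` (relative to the last deactivation); the function and its `mode` names are made up for illustration, and it ignores the case where a run outlasts the interval.

```python
def next_starts(first_start, durations, interval, mode):
    """Return the start time of each run, given how long each run takes.

    mode="active":   next start = previous start + interval (start-to-start)
    mode="inactive": next start = previous end + interval   (end-to-start)
    """
    starts = []
    start = first_start
    for duration in durations:
        starts.append(start)
        end = start + duration
        if mode == "active":
            start = start + interval
        elif mode == "inactive":
            start = end + interval
        else:
            raise ValueError("unknown mode: %r" % (mode,))
    return starts


if __name__ == "__main__":
    # Times in hours; three backup runs taking 2h, 5h and 3h, 24h interval.
    print(next_starts(0, [2, 5, 3], 24, "active"))
    print(next_starts(0, [2, 5, 3], 24, "inactive"))
```

With end-to-start scheduling there is no need to predict how long a backup takes, at the cost of start times drifting by the length of each run.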
[18:50:32] yeah I agree [18:51:17] I go fetch food then I'll look at that hiera thing of yours then the timers. [18:51:27] cool [19:06:40] 6operations, 10ops-eqiad: Decom and wipe cisco virt servers virt1001-1009 then remove from racks - https://phabricator.wikimedia.org/T107159#1495727 (10Cmjohnson) Removed from icinga, cleaned puppet certs, delete salt-key, removed from dhcpd file, removed dns, removed from switch and deleted ports from labs vl... [19:09:23] (03CR) 10coren: [C: 031] "Sensical" [puppet] - 10https://gerrit.wikimedia.org/r/227953 (https://phabricator.wikimedia.org/T106590) (owner: 10Yuvipanda) [19:09:30] !log legoktm Synchronized php-1.26wmf16: Revert "Use OOUI HTMLForm for Special:Watchlist" (duration: 01m 46s) [19:09:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:09:51] (03PS5) 10Yuvipanda: labstore: Use hiera to set currently active labstore [puppet] - 10https://gerrit.wikimedia.org/r/227953 (https://phabricator.wikimedia.org/T106590) [19:10:00] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Use hiera to set currently active labstore [puppet] - 10https://gerrit.wikimedia.org/r/227953 (https://phabricator.wikimedia.org/T106590) (owner: 10Yuvipanda) [19:12:00] (03PS2) 10Dzahn: add base::firewall on labsdb1005 [puppet] - 10https://gerrit.wikimedia.org/r/228035 [19:12:00] andrewbogott: coren I'd like someone not me to do https://phabricator.wikimedia.org/T107058 - both to familiarize others with catchpoint and also to let me move on to the experimental goal :D [19:12:23] YuviPanda: "me" sounds okay. [19:12:29] (03PS2) 10BryanDavis: logging: Enable stacktrace printing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228023 (https://phabricator.wikimedia.org/T107440) [19:12:47] coren sweet. 
NFS definitely is higher priority tho [19:12:57] (03CR) 10Dzahn: [C: 032] "per IRC discussion we are doing only labsdb1005 first" [puppet] - 10https://gerrit.wikimedia.org/r/228035 (owner: 10Dzahn) [19:12:59] thanks Coren [19:13:21] PROBLEM - SSH on ms-be1005 is CRITICAL - Socket timeout after 10 seconds [19:13:35] coren chasemp helped me with interface questions about catchpoint :) [19:13:54] YuviPanda: I'll bug him at need then. [19:15:31] !log Deployed patch for T107170 to wmf/1.26wmf16 [19:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:15:50] can a root grep for "l10nupdate" across all crontabs on tin please? [19:16:22] PROBLEM - RAID on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:16:30] PROBLEM - puppet last run on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:16:45] per https://phabricator.wikimedia.org/T106460 I think might be being run regularly from somewhere it shouldn't be [19:16:51] Krenair: [19:16:52] 0 2 * * * /usr/local/bin/l10nupdate-1 --verbose >> /var/log/l10nupdatelog/l10nupdate.log 2>&1 [19:17:00] that it? [19:17:55] that's the only one I see, I have an alias that dumps all crons [19:18:05] ooohhh... hang on [19:18:10] bd808, what about l10nupdate on mira? [19:18:31] oh man [19:18:36] that's a great idea.... [19:18:46] but, can it contact anything? [19:18:49] Perhaps mira is trying to run it and failing? [19:19:20] RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2wmfprecise2 (protocol 2.0) [19:19:24] chasemp: I don't think any non-ops have access to mira, can you see if there's a /var/log/l10nupdatelog/l10nupdate.log full of failures? [19:19:38] We do [19:19:44] it's the same setup as tin [19:20:16] There it is! 
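Editor's note: the request to grep every crontab for "l10nupdate" can be sketched as a small parser over crontab text. This is an illustration only: it assumes the per-user crontab layout (five time fields, then the command) and would need adjusting for /etc/cron.d files, which carry an extra user field. The sample text reuses the entry pasted above; the second job is invented.

```python
def find_cron_jobs(crontab_text, needle):
    """Return (schedule, command) pairs for entries whose command mentions needle."""
    matches = []
    for line in crontab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # environment assignment or malformed entry
        schedule, command = " ".join(fields[:5]), fields[5]
        if needle in command:
            matches.append((schedule, command))
    return matches


SAMPLE = """\
# sample crontab; the second job is a made-up example
MAILTO=root
0 2 * * * /usr/local/bin/l10nupdate-1 --verbose >> /var/log/l10nupdatelog/l10nupdate.log 2>&1
15 * * * * /usr/local/bin/some-other-job
"""

if __name__ == "__main__":
    for schedule, command in find_cron_jobs(SAMPLE, "l10nupdate"):
        print(schedule, "->", command)
```

Running the same search on every deployment host, not just tin, is what surfaced the second copy of the job in the conversation above.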
[19:20:21] RECOVERY - RAID on ms-be1005 is OK optimal, 14 logical, 14 physical [19:20:22] RECOVERY - puppet last run on ms-be1005 is OK Puppet is currently enabled, last run 21 minutes ago with 0 failures [19:20:44] krenair@mira:/var/log/l10nupdatelog$ grep FAILED l10nupdate.log -c [19:20:44] 97 [19:21:18] nice :) [19:21:24] Krenair: awesome-sauce [19:21:36] Okay... now, how are we going to fix this [19:21:55] dunno, aaron's doing multidatacenter awareness stuff :P [19:22:13] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1495752 (10CCogdill_WMF) I see your point about the unreliability of the OS Marketshare metrics when it comes to overall Wikipedia usage. I just want to illustrate the significance of XP in this p... [19:22:52] Krenair: I owe you a drink for this one :) [19:23:39] It wasn't until recently that mira could actually log these failures to SAL [19:24:43] although the dates don't quite match up [19:25:43] PROBLEM - Persistent high iowait on labstore2001 is CRITICAL 55.56% of data above the critical threshold [35.0] [19:28:03] PROBLEM - puppet last run on labstore2001 is CRITICAL puppet fail [19:29:51] PROBLEM - Persistent high iowait on labstore2001 is CRITICAL 77.78% of data above the critical threshold [35.0] [19:31:42] Krenair: awesome find! Sounds like we may need a hiera var to turn the cron on/off. [19:31:49] yep [19:32:16] ori: my apologies for blaming you for the problem. timing lined up but I should have dug deeper [19:32:41] also... the messages we post probably should start giving the origin host [19:33:01] maybe when mira is actually usable :) [19:33:11] automated ones, yes [19:33:14] (03PS1) 10Dzahn: Revert "add base::firewall on labsdb1005" [puppet] - 10https://gerrit.wikimedia.org/r/228120 [19:33:35] yeah, that's what I meant. 
The logs that are relayed from neon [19:36:05] (03CR) 10Dzahn: [C: 032] "breaks geohack, missing rule" [puppet] - 10https://gerrit.wikimedia.org/r/228120 (owner: 10Dzahn) [19:42:41] bd808, okay so, haven't really messed with hiera before, is this how you do it? https://phabricator.wikimedia.org/P1503 [19:42:56] I looked at how is_mail_relay is used for toollabs [19:44:32] Krenair: yeah that looks right. Instead of the `if` you could use the param to control the ensure. That would actually let the cron get uninstalled when the values were flipped [19:44:58] I think we have a helper somewhere to turn a bool into a present/absent value [19:45:00] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1495797 (10Ironholds) Mikhail has attempted to get into stat2 and had his publickey denied; bastion likes it fine but stat1002 doesn't. Could someone sit down with him an... [19:46:12] (03PS1) 10Yuvipanda: labstore: Fix ensure to be properly boolean [puppet] - 10https://gerrit.wikimedia.org/r/228121 [19:46:16] (03CR) 10jenkins-bot: [V: 04-1] labstore: Fix ensure to be properly boolean [puppet] - 10https://gerrit.wikimedia.org/r/228121 (owner: 10Yuvipanda) [19:46:36] PROBLEM - High load average on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
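Editor's note: the point above about using the parameter to drive `ensure` instead of wrapping the cron in an `if` can be shown with a toy model. This is not Puppet; it only assumes that a resource which is not declared is left alone, while one declared `absent` is removed. All function names are hypothetical.

```python
def apply(installed, resources):
    """Apply declared (name, ensure) resources to the set of installed cron jobs."""
    result = set(installed)
    for name, ensure in resources:
        if ensure == "present":
            result.add(name)
        elif ensure == "absent":
            result.discard(name)
    return result


def with_if(enabled):
    """Declare the cron only when enabled; otherwise declare nothing."""
    return [("l10nupdate", "present")] if enabled else []


def with_ensure(enabled):
    """Always declare the cron; the flag picks present or absent."""
    return [("l10nupdate", "present" if enabled else "absent")]


if __name__ == "__main__":
    already_installed = {"l10nupdate"}
    print(apply(already_installed, with_if(False)))      # job stays behind
    print(apply(already_installed, with_ensure(False)))  # job is removed
```

Under this model, flipping the flag with the `if` form leaves a stale job on the host, which is why the `if` variant would need a manual purge.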
[19:47:05] (03PS1) 10Ori.livneh: Update URL configuration for mobile when entering mobile mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228122 (https://phabricator.wikimedia.org/T106966) [19:47:11] oh hey, maybe YuviPanda knows about it bd808 :) [19:47:11] (03PS2) 10Yuvipanda: labstore: Fix ensure to be properly boolean [puppet] - 10https://gerrit.wikimedia.org/r/228121 [19:47:30] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Fix ensure to be properly boolean [puppet] - 10https://gerrit.wikimedia.org/r/228121 (owner: 10Yuvipanda) [19:47:51] (03CR) 10Ori.livneh: [C: 032] Update URL configuration for mobile when entering mobile mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228122 (https://phabricator.wikimedia.org/T106966) (owner: 10Ori.livneh) [19:47:57] (03Merged) 10jenkins-bot: Update URL configuration for mobile when entering mobile mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228122 (https://phabricator.wikimedia.org/T106966) (owner: 10Ori.livneh) [19:48:29] !log ori Synchronized wmf-config: I0990ac5b: Update URL configuration for mobile when entering mobile mode (duration: 00m 12s) [19:48:35] RECOVERY - High load average on ms-be1005 is OK - load average: 69.66, 73.01, 69.57 [19:48:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:49:08] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1495811 (10BBlack) wikimediafoundation.org is HTTPS-only already. All of our existing primary domains are, except for wikimedia.org (which is only lagging because of unresolved exceptions like th... 
[19:50:09] 6operations, 6Performance-Team, 10Traffic, 7Performance: Optimize prod's resource domains for SPDY/HTTP2 - https://phabricator.wikimedia.org/T94896#1495822 (10ori) [19:52:39] maybe mutante knows [19:54:04] (03CR) 10Dzahn: "12:53 < mutante> the firewall hole we wanted is in role::mariadb::labs" [puppet] - 10https://gerrit.wikimedia.org/r/228035 (owner: 10Dzahn) [19:54:24] bd808: hah! no worries. My own guess was that I broke it, too, because the timing lined up. [19:55:05] (03PS1) 10Krinkle: rl-test: Add instrumentation for User-Agent and Remote IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228124 (https://phabricator.wikimedia.org/T105255) [19:55:15] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [19:56:11] (03PS1) 10BryanDavis: Send $USER and $HOSTNAME with dologmsg messages [puppet] - 10https://gerrit.wikimedia.org/r/228125 (https://phabricator.wikimedia.org/T106460) [19:56:23] Krenair: ? knows about what [19:56:47] Krenair https://phabricator.wikimedia.org/P1503 looks ok if you purge it from mria by hand [19:57:05] YuviPanda, right, that's what we wanted to change to ensure based on the value [19:57:11] but I think it needs to be present/absent [19:57:14] rather than true/false [19:57:16] or something [19:57:20] heh, exact same thing I need now [19:57:30] puppet takes true / false as well for ensure btw [19:57:34] someone will hit me for it tho [19:57:53] (03PS2) 10Ori.livneh: Send $USER and $HOSTNAME with dologmsg messages [puppet] - 10https://gerrit.wikimedia.org/r/228125 (https://phabricator.wikimedia.org/T106460) (owner: 10BryanDavis) [19:57:58] (03CR) 10Ori.livneh: [C: 032] Send $USER and $HOSTNAME with dologmsg messages [puppet] - 10https://gerrit.wikimedia.org/r/228125 (https://phabricator.wikimedia.org/T106460) (owner: 10BryanDavis) [19:58:23] bd808: is there an easy way to log a message to logstash from the cluster via the commandline? 
[19:58:25] there are a couple of functions in wmflib for that sort of thing but none for present/absent [19:58:32] yes, please use present/absent in single quotes [19:58:36] (03CR) 10Krinkle: "Set $wgStylePath instead of $wgStyleSheetPath :)" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228122 (https://phabricator.wikimedia.org/T106966) (owner: 10Ori.livneh) [19:58:38] ori: !log :) [19:58:47] otherwise there is also the "quoted boolean" warning [19:58:52] but l10nupdate shouldn't log to the channel [19:58:53] and they can be tricky [19:58:57] that was never appropriate, IMO. [19:59:07] it should go into logstash and logstash directly [19:59:28] we could make something pretty easily. nc to the right place [19:59:52] (03PS1) 10Alex Monk: Don't try to run l10nupdate on mira [puppet] - 10https://gerrit.wikimedia.org/r/228126 (https://phabricator.wikimedia.org/T106460) [20:00:05] bd808: where are the ingress mechanisms for logstash currently? does everything go through redis? [20:00:15] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 222 seconds ago with 0 failures [20:00:21] !log starting rolling wipe process on mobile cache contents for T106966 fixup [20:00:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:30] ori: to be fair, i only noticed it because I saw it on twitter (it == l10n failures) [20:00:41] (03PS2) 10Alex Monk: Don't try to run l10nupdate on mira [puppet] - 10https://gerrit.wikimedia.org/r/228126 (https://phabricator.wikimedia.org/T106460) [20:00:48] ori: no redis at all. 
we have several udp datagram inputs in prod [20:00:53] maybe successes aren't loud, but failures are [20:01:09] a la browsertests annoucements [20:01:13] ori: syslog is the most generic [20:01:16] RECOVERY - Persistent high iowait on labstore2001 is OK Less than 50.00% above the threshold [25.0] [20:01:42] (03PS1) 10Ori.livneh: Follow-up to I0990ac5b6850: set $wgStylePath as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228128 [20:01:59] (03CR) 10Ori.livneh: [C: 032] Follow-up to I0990ac5b6850: set $wgStylePath as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228128 (owner: 10Ori.livneh) [20:02:05] (03Merged) 10jenkins-bot: Follow-up to I0990ac5b6850: set $wgStylePath as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228128 (owner: 10Ori.livneh) [20:02:14] (03PS1) 10Alex Monk: Copy wdqs-admins group from tin to mira [puppet] - 10https://gerrit.wikimedia.org/r/228129 (https://phabricator.wikimedia.org/T105185) [20:02:21] so if you throw a syslog udp datagram at logstash100[1-3]:10514 it will end up in logstash [20:02:38] !log ori Synchronized wmf-config/mobile.php: (no message) (duration: 00m 12s) [20:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:03:11] greg-g: if, say, Flow logged all exceptions via !log, we may notice Flow exceptions that otherwise slip our notice. That doesn't mean it's the proper channel. [20:04:14] You noticed the issue because !log is otherwise high signal, low noise. But l10nupdate is inconsistent with that. [20:04:25] also fair [20:04:32] 7Puppet, 6operations: Clean up files/snapshot/sudoers.snapshot - https://phabricator.wikimedia.org/T107479#1495887 (10Krenair) 3NEW [20:04:58] just need to make sure we can find a way to make the failures easily found, 'tis all [20:05:38] it's like using Amber Alerts for "lost power cord" e-mails on wmfsf [20:05:49] it's like ten thousand spoons [20:05:55] etc. 
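Editor's note: "throw a syslog udp datagram at logstash100[1-3]:10514" could look roughly like the sketch below. Only the port and the short host pattern come from the chat; the tag, facility and severity are assumptions, and the payload is a stripped-down `<PRI>TAG: MESSAGE` line without the timestamp and hostname fields of a full RFC 3164 message, which the receiving input may or may not accept.

```python
import socket


def format_syslog(message, tag="l10nupdate", facility=1, severity=6):
    """Build a minimal syslog payload: <PRI>TAG: MESSAGE (PRI = facility*8 + severity)."""
    pri = facility * 8 + severity
    return "<{}>{}: {}".format(pri, tag, message).encode("utf-8")


def send_syslog(host, port, message, **kwargs):
    """Send one syslog-style UDP datagram; UDP gives no delivery confirmation."""
    payload = format_syslog(message, **kwargs)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload


if __name__ == "__main__":
    # Prints the payload only; nothing is sent.
    print(format_syslog("update finished"))
    # Hypothetical call; the host name would have to resolve where this runs:
    # send_syslog("logstash1001", 10514, "update finished")
```

The default facility 1 (user) with severity 6 (info) yields priority 14.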
[20:05:56] almost :) [20:06:25] I'm not disagreeing to remove from standard !log'ing, we just need an alert on it (icinga? something) [20:07:10] bd808: why don't we have an rsyslog rule on all hosts, specifying that messages on the logstash channel be forwarded to logstash? [20:07:11] so, before you remove it, I'd like to file a task for it and get input from the deployment team [20:07:56] /usr/bin/logger --tag logstash "This log message will make it onto logstash" [20:08:00] from anywhere on the cluster [20:08:03] greg-g: sure [20:08:17] filed [20:08:39] that was phast [20:08:43] (see what i did there?) [20:08:44] ori: sure. that would be trivial to setup. We do similar for apache2 & hhvm logs [20:08:56] bd808: patch incoming [20:08:57] ori: I've trained my phingers [20:09:06] greg-g: this is no longer phunny [20:09:23] it has gone on phar enough [20:09:36] this pharce has run its course [20:09:50] phrack [20:09:51] i should shut the phuck up [20:10:19] see, it's ok because it's spelled incorrectly [20:10:42] Sphelled? [20:10:55] Stop sphamming people [20:11:00] (03PS1) 10BryanDavis: Send $USER and $HOSTNAME with !log messages [tools/scap] - 10https://gerrit.wikimedia.org/r/228131 (https://phabricator.wikimedia.org/T106460) [20:11:04] This channel is phots only [20:11:31] YuviPanda: interesting dipherent angle [20:11:36] /wmf/puppet$ grep -r "escape key" * [20:12:58] grep Yuvi modules/wmflib/lib/hiera/mwcache.rb [20:14:49] do you remember that person asking about your.org, greg-g? 
[20:14:57] uh, yeah, one sec [20:15:04] tpearson [20:15:08] I /whois'd him [20:15:10] "# for your.org, contact: Kevin Day [20:15:10] # they manage it but we have access; see file in /home/wikipedia/docs/mirrors" [20:15:18] modules/dataset/files/rsync/rsyncd.conf.dumps_to_public [20:15:32] host | ~tpearson@edge.pearsoncomputing.net [20:15:53] (03PS1) 10Jcrespo: Deprecating mariadb::labsdb [puppet] - 10https://gerrit.wikimedia.org/r/228134 [20:16:00] (03PS1) 10Rush: diamond: start collecting nutcracker stats [puppet] - 10https://gerrit.wikimedia.org/r/228135 [20:16:43] (03CR) 10jenkins-bot: [V: 04-1] diamond: start collecting nutcracker stats [puppet] - 10https://gerrit.wikimedia.org/r/228135 (owner: 10Rush) [20:17:02] chasemp: woo, that was fast! [20:17:07] (03CR) 10BryanDavis: "Follow up to I1d7cccdcd32cb539ca2b730e34fa9b4d451d21fc that does something similar for l10nupdate logging." [tools/scap] - 10https://gerrit.wikimedia.org/r/228131 (https://phabricator.wikimedia.org/T106460) (owner: 10BryanDavis) [20:17:42] gotta satisfy the pep8 monster [20:18:48] (03CR) 10Greg Grossmeier: [C: 031] Send $USER and $HOSTNAME with !log messages [tools/scap] - 10https://gerrit.wikimedia.org/r/228131 (https://phabricator.wikimedia.org/T106460) (owner: 10BryanDavis) [20:20:09] AndyRussG, hey, you around? [20:20:17] (03CR) 10Dzahn: [C: 04-1] "not yet, the wrong roles have the holes, enabling it on labsdb1005 had to be reverted. there will be cleanup about which nodes have which " [puppet] - 10https://gerrit.wikimedia.org/r/227945 (owner: 10Muehlenhoff) [20:20:36] (03PS2) 10Rush: diamond: start collecting nutcracker stats [puppet] - 10https://gerrit.wikimedia.org/r/228135 [20:20:41] Hi Krenair! yes... just finishing standup... all good? [20:21:28] ohh... 
hang on [20:21:37] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1495966 (10CCogdill_WMF) @BBlack we're looking into the possibility, but if it's possible, it would be a solution we could implement in 2 weeks. That is one of the options we would talk about on 1... [20:21:39] atg, not arg [20:21:43] so that must be ariel? [20:21:59] Hmm dunno :) [20:22:21] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1495970 (10CCogdill_WMF) Sorry, it would *not be a solution we could implement in 2 weeks... [20:22:21] indeed git blame points to Ariel T. Glenn :) [20:22:31] (03PS2) 10Dzahn: Add ferm rules for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/227968 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [20:23:00] sorry AndyRussG! [20:23:01] apergos, hey [20:23:09] np! [20:25:21] (03CR) 10coren: [C: 032] "Just comments and notes." [puppet] - 10https://gerrit.wikimedia.org/r/228134 (owner: 10Jcrespo) [20:25:36] (03PS3) 10Rush: diamond: start collecting nutcracker stats [puppet] - 10https://gerrit.wikimedia.org/r/228135 [20:25:41] (03CR) 10Ori.livneh: [C: 04-1] "redis needs to be accessible from the MediaWiki servers" [puppet] - 10https://gerrit.wikimedia.org/r/227968 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [20:26:43] (03CR) 10Rush: [C: 032] diamond: start collecting nutcracker stats [puppet] - 10https://gerrit.wikimedia.org/r/228135 (owner: 10Rush) [20:27:19] chasemp: whoa [20:27:19] (03PS3) 10Alex Monk: Add ferm rules for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/227968 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [20:27:20] wait [20:27:26] can i have a chance to review? 
[20:27:53] please if I had known were willing I would have sorted it [20:28:13] kk, i'll just leave commenst on that changeset [20:28:24] I already jumped and don't expect any issues do you want me to repeal or can I make further changes from here? [20:30:35] so, we need to stop T107265, but I do not know if at application level or varnish [20:32:30] (03PS4) 10Dzahn: Add ferm rules for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/227968 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [20:33:51] i think tide.ms is not a regular crawler, it's this: [20:33:56] "Yep, those are our internet proxies." [20:34:01] https://channel9.msdn.com/forums/Coffeehouse/95630-Whos-tidemicrosoftcom/ [20:34:07] jynus: [20:36:32] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1495990 (10BBlack) So doers that mean the vendor can't give us a unique IP to handle the SNI problem at this time? If they supported it at all, I imagine it would be pretty straightforward. [20:36:52] 6operations, 6Commons, 10MediaWiki-Special-pages, 7Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1495992 (10Dzahn) It seems this isn't a regular crawler, but: "Yep, those are our internet proxies." https... [20:42:50] Where might MW_CRON_LOGS from mw-deployment-vars be used bd808? [20:43:21] (03CR) 10Ori.livneh: diamond: start collecting nutcracker stats (0315 comments) [puppet] - 10https://gerrit.wikimedia.org/r/228135 (owner: 10Rush) [20:43:59] Krenair: yikes. no idea. 
that directory path is from before my time with the WMF [20:44:07] looks like cruft [20:44:23] I think a lot of mw-deployment-vars is cruft these days [20:44:56] s/mw-deployment-vars/mw/ [20:45:02] (03CR) 10Ori.livneh: diamond: start collecting nutcracker stats (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/228135 (owner: 10Rush) [20:45:11] I'm pretty sure nothing uses MW_DSH_ARGS or MW_RSYNC_ARGS any more [20:45:14] (03PS1) 10Dzahn: add script to flush all iptables rules for emergencies [puppet] - 10https://gerrit.wikimedia.org/r/228137 [20:45:33] chasemp: 18:40 < icinga-wm> PROBLEM - Disk space on iridium is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=84%) [20:45:42] chasemp: 141MB now [20:45:52] 6operations, 6Commons, 10MediaWiki-Special-pages, 7Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1496014 (10Krenair) p:5Low>3High This is the majority of the warnings in fluorine:/a/mw-log/hhvm.log at... [20:45:55] CUT THE RED WIRE, THE RED WIRE [20:46:02] Coren, YuviPanda: both labstores have critical alerts [20:46:02] 6operations, 6Multimedia, 6Performance-Team, 10Wikimedia-Site-requests, and 2 others: Please offer larger image thumbnail sizes in Special:Preferences - https://phabricator.wikimedia.org/T65440#1496016 (10Tacsipacsi) [20:46:11] ostriches: twentyafterfour are you guys storing stuff on / at all? [20:46:16] on iridium [20:46:30] paravoid: yeah that's me patch in progress [20:46:35] 6operations, 6Commons, 10MediaWiki-Special-pages, 7Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1496020 (10jcrespo) p:5High>3Normal Proxies do not read an url each second monotonically, nor have agent... 
[20:46:53] 6operations, 6Commons, 10MediaWiki-Special-pages, 7Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1496022 (10jcrespo) p:5Normal>3High [20:47:04] andrewbogott: labnet1002 puppet last run two days ago? [20:47:28] paravoid: it’s a work in progress… I thought I ack’d it, maybe my ack expired? [20:47:37] !log iridium - apt-get clean - 1.7G avail [20:47:39] twentyafterfour: Unmerged changes on repository mediawiki_config check is broken due to /srv/mediawiki-staging/ not having a "readonly" remote anymore [20:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:47:44] (03CR) 10Ori.livneh: Add ferm rules for rcstream (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227968 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [20:49:18] RECOVERY - Disk space on iridium is OK: DISK OK [20:49:33] (03PS5) 10Dzahn: Add ferm rules for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/227968 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [20:49:42] (03CR) 10Ori.livneh: "Also, if you really want to catch all plausible errors, you should worry about nutcracker's response not being valid JSON. 
(In fact it alr" [puppet] - 10https://gerrit.wikimedia.org/r/228135 (owner: 10Rush) [20:50:17] (03CR) 10Ori.livneh: [C: 031] Add ferm rules for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/227968 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [20:51:29] chasemp: speaking of iridium, its puppet runs are very noisy [20:51:44] chasemp: a lot of notifies/execs happening on every run [20:51:48] (03PS6) 10Dzahn: Add ferm rules for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/227968 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [20:51:49] yes they are not if the repos are in sync [20:51:51] which is hte idea [20:51:55] I don't manage that anymore [20:52:52] paravoid, it's easy enough to re-add the readonly remote [20:52:55] where should it point? [20:53:21] where it was pointed before? no idea! :) [20:53:33] Probably somewhere non-readonly [20:54:20] #T83854 [20:54:20] ::monitoring::icinga::git_merge { 'mediawiki_config': [20:54:20] dir => '/srv/mediawiki-staging/', [20:54:20] user => 'root', [20:54:20] remote_branch => 'readonly/master' [20:54:22] } [20:54:39] so that's something in puppet relying on something unpuppetised? [20:54:59] probably [20:55:13] (03PS1) 10Smalyshev: T105080: add maintenance mode configs for nginx [puppet] - 10https://gerrit.wikimedia.org/r/228140 [20:55:19] (03PS1) 10Ori.livneh: Add a debug log channel for bug T102199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228141 [20:56:19] (03PS2) 10Smalyshev: T105080: add maintenance mode configs for nginx [puppet] - 10https://gerrit.wikimedia.org/r/228140 [20:56:28] paravoid, I would only rely on the "origin" remote [20:58:27] chasemp: what do you mean by "not manage that anymore? 
:) [20:58:44] I mean phabricator deployment and repo states is up to releng now [20:58:54] sure but that's not a phabricator issue, it's a puppet issue :) [20:59:21] well I guess not in my mind as it's a warning that the puppet tag and the on box repos are out of sync [20:59:37] which I don't manage, we could change how that is all done, which I think they want to do [21:00:54] twentyafterfour: ostriches why the bad states between puppet and phab on iridium? [21:01:53] (03PS3) 10BryanDavis: Don't try to run l10nupdate on mira [puppet] - 10https://gerrit.wikimedia.org/r/228126 (https://phabricator.wikimedia.org/T106460) (owner: 10Alex Monk) [21:03:34] (03PS1) 10Faidon Liambotis: git: remove out-of-sync alert [puppet] - 10https://gerrit.wikimedia.org/r/228143 [21:04:27] (03CR) 10Faidon Liambotis: [C: 032] git: remove out-of-sync alert [puppet] - 10https://gerrit.wikimedia.org/r/228143 (owner: 10Faidon Liambotis) [21:05:12] (03CR) 10Ori.livneh: [C: 032] Add a debug log channel for bug T102199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228141 (owner: 10Ori.livneh) [21:05:12] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1496145 (10RobH) The user groups for the analytics cluster are (imo) confusing and quite the mess within the admins module (and before.) As such, its a bit confusing for... [21:05:19] (03Merged) 10jenkins-bot: Add a debug log channel for bug T102199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228141 (owner: 10Ori.livneh) [21:05:41] bd808, actually MW_CRON_LOGS is still referenced [21:06:01] !log ori Synchronized wmf-config/InitialiseSettings.php: I1bbf3f0: Add a debug log channel for bug T102199 (duration: 00m 12s) [21:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:19] git grep didn't find it for me. Is it outside of puppet? 
[21:06:23] yep [21:06:31] !log ori Synchronized php-1.26wmf16/includes/Message.php: c72b7c435f: Debug logging for T102199 (take 2) (duration: 00m 11s) [21:06:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:06:52] bd808, tin:/usr/local/bin/purge-checkuser [21:07:35] just to write the log [21:08:08] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 861.679700384 [21:08:15] There is a puppetised cron called purge-checkuser [21:08:27] but it doesn't call this [21:08:30] which are /dev/null since /home/wikipedia doesn't exist [21:08:59] it just does the real mwscript command itself and sends the output straight to /dev/null [21:09:09] PROBLEM - puppet last run on cp4012 is CRITICAL Puppet has 1 failures [21:09:27] "# Script for CheckUser cron job, hume:/etc/cron.d/mw-purge-checkuser" [21:09:37] hume? [21:09:45] it was in pmtpa [21:09:54] pmtpa host, was for maintenance [21:09:59] replaced by terbium [21:10:03] *nod* [21:10:22] so a root can probably rm tin:/usr/local/bin/purge-checkuser [21:10:34] and we can remove the reference from mw-deployment-vars [21:10:47] the /home/wikipedia reference, I mean [21:12:32] The only thing I know are used in that file are PATH and the MW_STATSD_* vars [21:16:27] (03PS1) 10Alex Monk: Get rid of old /home/wikipedia references [puppet] - 10https://gerrit.wikimedia.org/r/228147 [21:21:50] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1496257 (10RobH) I'm wrong about the above, he should have access with statistics-privatedata-users. I'm working the issue with other opsen. 
[21:22:34] Yeah I don't think restricted group lets you into stats boxes robh :) [21:22:53] meh, no one konwsssssssss [21:22:54] its odd [21:25:24] https://gerrit.wikimedia.org/r/#/q/owner:krenair+project:operations/puppet+status:open,n,z - some of these should be easy [21:25:47] I would quite like the l10nupdate one to go through within the next 3 hours [21:26:34] (03PS4) 10Ori.livneh: Don't try to run l10nupdate on mira [puppet] - 10https://gerrit.wikimedia.org/r/228126 (https://phabricator.wikimedia.org/T106460) (owner: 10Alex Monk) [21:26:47] (03PS1) 10Yuvipanda: labstore: Fix ensure issues take 2 [puppet] - 10https://gerrit.wikimedia.org/r/228150 [21:26:49] (03CR) 10Ori.livneh: [C: 032 V: 032] Don't try to run l10nupdate on mira [puppet] - 10https://gerrit.wikimedia.org/r/228126 (https://phabricator.wikimedia.org/T106460) (owner: 10Alex Monk) [21:26:52] (03CR) 10jenkins-bot: [V: 04-1] labstore: Fix ensure issues take 2 [puppet] - 10https://gerrit.wikimedia.org/r/228150 (owner: 10Yuvipanda) [21:27:04] (03PS2) 10Yuvipanda: labstore: Fix ensure issues take 2 [puppet] - 10https://gerrit.wikimedia.org/r/228150 [21:27:30] (03CR) 10Ori.livneh: [C: 032] Get rid of old /home/wikipedia references [puppet] - 10https://gerrit.wikimedia.org/r/228147 (owner: 10Alex Monk) [21:27:37] (03PS2) 10Ori.livneh: Get rid of old /home/wikipedia references [puppet] - 10https://gerrit.wikimedia.org/r/228147 (owner: 10Alex Monk) [21:27:43] (03CR) 10Ori.livneh: [V: 032] Get rid of old /home/wikipedia references [puppet] - 10https://gerrit.wikimedia.org/r/228147 (owner: 10Alex Monk) [21:27:51] (03CR) 10jenkins-bot: [V: 04-1] labstore: Fix ensure issues take 2 [puppet] - 10https://gerrit.wikimedia.org/r/228150 (owner: 10Yuvipanda) [21:28:10] (03PS2) 10Ori.livneh: Copy wdqs-admins group from tin to mira [puppet] - 10https://gerrit.wikimedia.org/r/228129 (https://phabricator.wikimedia.org/T105185) (owner: 10Alex Monk) [21:28:17] (03CR) 10Ori.livneh: [C: 032 V: 032] Copy wdqs-admins group from 
tin to mira [puppet] - 10https://gerrit.wikimedia.org/r/228129 (https://phabricator.wikimedia.org/T105185) (owner: 10Alex Monk) [21:28:50] (03PS2) 10Ori.livneh: Resume running refreshLinks cron on enwiki [puppet] - 10https://gerrit.wikimedia.org/r/225569 (https://phabricator.wikimedia.org/T44180) (owner: 10Alex Monk) [21:28:56] (03CR) 10Ori.livneh: [C: 032 V: 032] Resume running refreshLinks cron on enwiki [puppet] - 10https://gerrit.wikimedia.org/r/225569 (https://phabricator.wikimedia.org/T44180) (owner: 10Alex Monk) [21:29:21] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1496314 (10RobH) So I check on stat1002, and the user doesnt exist (it had run puppet recently and has no errors). I run puppet, still doesnt exist. I ask Daniel to tak... [21:30:23] (03PS3) 10Yuvipanda: labstore: Fix ensure issues take 2 [puppet] - 10https://gerrit.wikimedia.org/r/228150 [21:30:28] (03CR) 10jenkins-bot: [V: 04-1] labstore: Fix ensure issues take 2 [puppet] - 10https://gerrit.wikimedia.org/r/228150 (owner: 10Yuvipanda) [21:30:53] (03PS4) 10Yuvipanda: labstore: Fix ensure issues take 2 [puppet] - 10https://gerrit.wikimedia.org/r/228150 [21:32:09] Krenair: {{done}}, and forced puppet runs on mira and tin to confirm. [21:33:28] thanks ori [21:33:30] (03PS5) 10Yuvipanda: labstore: Fix ensure issues take 2 [puppet] - 10https://gerrit.wikimedia.org/r/228150 [21:33:40] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Fix ensure issues take 2 [puppet] - 10https://gerrit.wikimedia.org/r/228150 (owner: 10Yuvipanda) [21:34:16] (03PS1) 10Dzahn: analytics refinery: remove jgage [puppet] - 10https://gerrit.wikimedia.org/r/228151 [21:34:54] "sudo -u l10nupdate crontab -l" shows no entries on mira [21:34:55] bd808, ^ [21:35:29] PROBLEM - RAID on ms-be1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:35:39] RECOVERY - puppet last run on cp4012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:37:21] Krenair: sweet. one less mystery error to wonder about [21:37:28] (03PS1) 10Catrope: Enable Flow on plwiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228152 (https://phabricator.wikimedia.org/T107301) [21:37:28] RECOVERY - RAID on ms-be1005 is OK optimal, 14 logical, 14 physical [21:37:53] 6operations, 10Wikimedia-Site-requests: Run "refreshLinks.php --dfn-only" on all wikis periodically - https://phabricator.wikimedia.org/T18112#1496369 (10Krenair) [21:38:22] 6operations, 10Traffic, 10fundraising-tech-ops, 5Patch-For-Review: Decide what to do with *.donate.wikimedia.org subdomain + TLS - https://phabricator.wikimedia.org/T102827#1496380 (10CCogdill_WMF) Update from IBM/Silverpop: it turns out there is a way for us to use https without losing all of our click da... [21:38:29] 10Ops-Access-Requests, 6operations: Access to stat1002, stat1003, and fluorine for user bearloga - https://phabricator.wikimedia.org/T107043#1496381 (10mpopov) Still //Permission denied (publickey)// :,\ [21:38:34] coren did you enable jessie-backports manually on labstore1002? [21:40:01] (03PS1) 10Dzahn: icinga: remove jgage from cgi.cfg and contacts [puppet] - 10https://gerrit.wikimedia.org/r/228153 [21:40:14] YuviPanda: Hm. I don't beleive I did, though that box is one of the first jessie installs from the net-install image before we had a d-i for it so the install was mostly by hand. [21:40:18] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore1001 is CRITICAL: NRPE_CHECK_SYSTEMD_STATE CRITICAL - Service is in state inactive [21:40:59] * Coren stares at icinga. [21:41:20] coren I see. so labstore1001 is failing because it doesn't have jessie-backports and pymysql is from jessie [21:41:26] Oh, the older version. I was beginning to worry someone pushed the patch through. [21:41:58] YuviPanda: Ah. 
That one was a very recent install so would be the "correct" one. Means we need to add -backports to the manifest then. [21:43:26] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1496410 (10CCogdill_WMF) As Eric mentioned earlier in this task, Trilogy runs on a multi-tenant system, so their system is not currently set up to add new IPs. They are looking into whether or not... [21:43:38] 6operations, 6Labs: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1496411 (10yuvipanda) 3NEW [21:43:53] coren ^ filed bug [21:44:04] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1496421 (10RobH) a:5RobH>3None So immediately after this was last updated Mark & Yuvi were chatting in IRC with others about the scope of this project and the network req... [21:48:17] 6operations, 6Labs: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1496436 (10coren) I'd rather enable jessie-backports, I think, since the packages there will be maintained actively by the community; and they are deactivated by default. [21:51:19] YuviPanda: Hm. The disk space check is being confused by the snapshots mounted for backups - although the error message is kinda odd. [21:53:49] RECOVERY - puppet last run on labstore2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:57:19] PROBLEM - Ensure NFS exports are maintained for new instances with NFS on labstore2001 is CRITICAL: NRPE_CHECK_SYSTEMD_STATE CRITICAL - Service is in state inactive [21:58:05] 6operations, 10Datasets-General-or-Unknown: Find docs on dataset mirrors - https://phabricator.wikimedia.org/T107510#1496469 (10Krenair) 3NEW a:3ArielGlenn [21:58:31] (03CR) 10Alex Monk: "Created T107510 to figure out the dataset one." 
[puppet] - 10https://gerrit.wikimedia.org/r/228147 (owner: 10Alex Monk) [21:59:53] 6operations, 6Labs: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1496478 (10scfc) Option c: Enable `jessie-backports` only for the hosts that need it and only for the package that is needed. [22:01:10] !log ori Synchronized php-1.26wmf16/includes/page/WikiPage.php: I73fba15c26c1: Defer the InfoAction purge in onArticleEdit() (duration: 00m 11s) [22:01:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:02:26] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1496483 (10CaitVirtue) Hi all, I appreciate all the concerns you've shared with us and I can see that in the very near future we need to do this differently. Caitlin Cogdill has proposed a plan... [22:02:47] 6operations, 10Math, 5Patch-For-Review: Convert Math to use extension registration - https://phabricator.wikimedia.org/T87941#1496489 (10Jdforrester-WMF) [22:02:59] 6operations, 10Math, 5Patch-For-Review: Convert Math to use extension registration - https://phabricator.wikimedia.org/T87941#1496490 (10Jdforrester-WMF) 5Open>3Resolved [22:06:27] RECOVERY - puppet last run on labstore1001 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [22:07:01] 6operations, 6Labs: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1496511 (10yuvipanda) I've just synced the sources for both of them, and puppet succeeds. Leaving this open to determine the right thing to do... [22:07:55] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1496513 (10yuvipanda) 5Open>3stalled After IRC discussions this requires networking expertise we do not have availability of atm, so stalling for now. 
[22:08:16] (03PS2) 10Dzahn: analytics refinery: remove jgage [puppet] - 10https://gerrit.wikimedia.org/r/228151 [22:08:58] coren I'm going to leave the disk alert to you - I fixed all the other ones [22:09:17] kk. Still trying to figure out why it goes cray-cray atm [22:09:28] check_disk by hand works with no error. [22:09:53] cajoel: ok [22:09:54] err [22:09:57] sorry cajoel. [22:15:32] (03CR) 10Dzahn: [C: 032] "thanks ori for the review" [puppet] - 10https://gerrit.wikimedia.org/r/227968 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [22:15:37] (03PS7) 10Dzahn: Add ferm rules for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/227968 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [22:16:21] 6operations, 7Monitoring: Collect and report nutcracker statistics to Ganglia and/or Graphite - https://phabricator.wikimedia.org/T107381#1496747 (10chasemp) [22:16:39] bd808, btw, MEDIAWIKI_DEPLOYMENT_DIR and MEDIAWIKI_STAGING_DIR in that file are also very much used [22:17:03] as in, in mwscript itself [22:17:49] MW_COMMON, MW_COMMON_SOURCE, MW_RSYNC_HOST, MW_DSH_ARGS, MW_RSYNC_ARGS though... 
don't think so [22:18:24] (03PS3) 10Dzahn: analytics refinery: remove jgage [puppet] - 10https://gerrit.wikimedia.org/r/228151 [22:18:52] (03CR) 10Dzahn: [C: 032] analytics refinery: remove jgage [puppet] - 10https://gerrit.wikimedia.org/r/228151 (owner: 10Dzahn) [22:19:18] (03PS2) 10Dzahn: icinga: remove jgage from cgi.cfg and contacts [puppet] - 10https://gerrit.wikimedia.org/r/228153 [22:20:30] (03CR) 10Dzahn: [C: 032] icinga: remove jgage from cgi.cfg and contacts [puppet] - 10https://gerrit.wikimedia.org/r/228153 (owner: 10Dzahn) [22:21:50] (03PS2) 10Dzahn: Icinga: fix varnishncsa warning on text & mobile caches [puppet] - 10https://gerrit.wikimedia.org/r/226110 (owner: 10Gage) [22:22:16] (03PS3) 10Ori.livneh: Send $USER and $HOSTNAME with dologmsg messages [puppet] - 10https://gerrit.wikimedia.org/r/228125 (https://phabricator.wikimedia.org/T106460) (owner: 10BryanDavis) [22:22:35] (03CR) 10Ori.livneh: [V: 032] Send $USER and $HOSTNAME with dologmsg messages [puppet] - 10https://gerrit.wikimedia.org/r/228125 (https://phabricator.wikimedia.org/T106460) (owner: 10BryanDavis) [22:23:32] (03PS3) 10Dzahn: Icinga: fix varnishncsa warning on text & mobile caches [puppet] - 10https://gerrit.wikimedia.org/r/226110 (owner: 10Gage) [22:24:39] (03CR) 10Dzahn: [C: 032] Icinga: fix varnishncsa warning on text & mobile caches [puppet] - 10https://gerrit.wikimedia.org/r/226110 (owner: 10Gage) [22:25:33] 6operations, 7Monitoring: Collect and report nutcracker statistics to Ganglia and/or Graphite - https://phabricator.wikimedia.org/T107381#1496902 (10chasemp) a:3chasemp [22:29:29] (03CR) 10Dzahn: "one more ping, what keeps us from merging this?" 
[puppet] - 10https://gerrit.wikimedia.org/r/195917 (owner: 10ArielGlenn) [22:30:20] (03CR) 10Ori.livneh: "@Dzahn: a merge conflict ;)" [puppet] - 10https://gerrit.wikimedia.org/r/195917 (owner: 10ArielGlenn) [22:31:43] (03PS2) 10Dzahn: toss mw-logs after 90 days, not 180 [puppet] - 10https://gerrit.wikimedia.org/r/195917 (owner: 10ArielGlenn) [22:32:04] (03PS3) 10Dzahn: toss mw-logs after 90 days, not 180 [puppet] - 10https://gerrit.wikimedia.org/r/195917 (owner: 10ArielGlenn) [22:32:18] (03CR) 10Dzahn: "fixed that" [puppet] - 10https://gerrit.wikimedia.org/r/195917 (owner: 10ArielGlenn) [22:32:34] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 16.67% of data above the critical threshold [500.0] [22:32:44] 6operations, 10Wikimedia-Mailing-lists: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1496934 (10JohnLewis) Will talk to @robh to schedule this for action. [22:33:00] (03PS4) 10Dzahn: toss mw-logs after 90 days, not 180 [puppet] - 10https://gerrit.wikimedia.org/r/195917 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [22:33:07] robh: ^ list rename request. Since you've handled the last few, up to it? [22:33:25] (03CR) 10Ori.livneh: [C: 032 V: 032] toss mw-logs after 90 days, not 180 [puppet] - 10https://gerrit.wikimedia.org/r/195917 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [22:34:06] yea can handle easily enough [22:34:14] i'll claim and follow up with a window next week =] [22:34:16] (03CR) 10Dzahn: "@ottomata should we still wait?" [puppet] - 10https://gerrit.wikimedia.org/r/197081 (https://phabricator.wikimedia.org/T83531) (owner: 10ArielGlenn) [22:34:20] ori: :) [22:34:24] 6operations, 10Wikimedia-Mailing-lists: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1496953 (10RobH) a:3RobH [22:35:21] what's with the 5xx spike? [22:36:02] robh: awesome :) [22:37:48] robh, to clarify, are renames something that need downtime? 
[22:39:30] i think if a message hits a list during rebuild its bad right? [22:39:32] JohnFLewis: ^ [22:40:22] 6operations, 6Commons, 10MediaWiki-Special-pages, 7Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1497013 (10ori) See {T15438} [22:40:22] if an email is set to education-coop during the process of a rename it either a) will never arrive or b) bounce on sodium. [22:40:36] yea [22:40:39] so bad [22:40:50] Krenair: so it should only be a short window, but dont schedule short windows ;D [22:40:52] A downtime is not the best word as if you stop mailman, apache serves the web pages and exim holds the email. So technically its a delay :) [22:40:58] indeed [22:41:01] but all mail routing stops [22:41:12] and that lists archives are borked during that, but the rest are fine [22:41:24] public archives aren't :) [22:41:27] oh? nice [22:41:35] private archives are though because it requires mailman to confirm [22:42:01] i meant that the list that is rebuilding archives dont have those archives accssible [22:42:15] all mail routing, or only mailman routing? [22:42:18] though education-coop is a bad example, its archives arent public [22:42:21] Krenair: mailman [22:42:22] well yeah of course [22:42:30] heh [22:42:43] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:42:49] just mailman now. secondarymx stuff were split from lists' exim recently by faidon [22:47:49] 6operations: Update wikimedia apt repo to include debs for shiny-server - https://phabricator.wikimedia.org/T106435#1497060 (10EBernhardson) I don't think there is any current plan to move this into production, there is no expectation of uptime for dashboards (yet). Puppetizing the setup was mostly to ensure p... [22:50:23] currently 1 item in swat. 
i wonder how many pile on in the next 10 minutes ;) [22:58:31] ebernhardson: the mwv response - it wasn't that your puppet code is complicated, it's just that it includes the mediawiki roles by default and those are :P [22:58:56] but I have no strong opinions, more of a 'here be composer errors that are going to confuse everyone else, be careful' [22:59:09] YuviPanda: i actualy looked briefly and turning mediawiki into a role in mwv, but turns out thats much harder than it sounded :P [22:59:10] (03CR) 10GWicke: "@ArielGlenn, ping!" [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/206849 (https://phabricator.wikimedia.org/T17017) (owner: 10ArielGlenn) [22:59:37] ebernhardson: indeed :) [22:59:52] ebernhardson: it also pins you to trusty while on labs you wannabejessie [23:00:04] RoanKattouw ostriches rmoen Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150730T2300). Please do the needful. [23:00:17] looks like i have the only swat patch, i'll just deploy it [23:01:32] robh: got any update re. meeting? [23:01:57] ebernhardson: can you add https://gerrit.wikimedia.org/r/228165 ? (once it merges) [23:02:07] on mailman? I haven't worked on it since we chatted last nope. i've been in monitoring tweaking stuff most of the week and the rest procurement [23:02:14] ori: sure can ship that too [23:02:22] thanks [23:02:31] robh: well I'm not available August 8th-15th [23:02:51] so next Saturday to the Saturday after [23:03:00] fun vacation i hope? =] [23:03:28] south of England - so fun :P [23:04:05] where are you going in the south JohnFLewis? 
[23:04:48] Krenair: Weymouth area [23:06:18] that's really close to where I'm going to be from later today through to the 8th [23:06:34] !log manually merged User:Mirwin's accounts (T107168) [23:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:43] (03CR) 10Dzahn: [C: 031] "Host de-beta.wikipedia.org not found: 3(NXDOMAIN)" [puppet] - 10https://gerrit.wikimedia.org/r/225041 (https://phabricator.wikimedia.org/T105981) (owner: 10Glaisher) [23:06:57] (03PS2) 10Krinkle: rl-test: Add instrumentation for User-Agent and Remote IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228124 (https://phabricator.wikimedia.org/T105255) [23:07:02] (03CR) 10Krinkle: [C: 032] rl-test: Add instrumentation for User-Agent and Remote IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228124 (https://phabricator.wikimedia.org/T105255) (owner: 10Krinkle) [23:07:14] (03Merged) 10jenkins-bot: rl-test: Add instrumentation for User-Agent and Remote IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228124 (https://phabricator.wikimedia.org/T105255) (owner: 10Krinkle) [23:07:18] (03CR) 10Dzahn: "path conflict though" [puppet] - 10https://gerrit.wikimedia.org/r/225041 (https://phabricator.wikimedia.org/T105981) (owner: 10Glaisher) [23:10:12] !log krinkle Synchronized w/rl-test.php: T105255 (duration: 00m 12s) [23:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:52] !log killing apache on magnesium to manually trigger an outage of racktables and test catchpoint alert formatting [23:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:56] ebernhardson: they merged. there's an update to SpamBlacklist too [23:16:01] i can sync if you like [23:17:24] ori: there look to be a few extra commits from adam as well [23:18:30] well, hmm not sure ordering is a bit wonky here, should be fine [23:18:54] w [23:18:55] what's that worst the happen can? 
[23:19:22] says vendor submodule update, so many things :P [23:21:17] !log ebernhardson Synchronized php-1.26wmf16/extensions/SpamBlacklist/: Update SpamBlacklist for SWAT (duration: 00m 11s) [23:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:52] !log ebernhardson Synchronized php-1.26wmf16/includes/specials/SpecialSearch.php: Fix search-suggest i18n for frwiki in SWAT (duration: 00m 14s) [23:21:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:23] (03PS1) 10Krinkle: rl-test: Fix IP detection to use WebRequest::getIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228173 (https://phabricator.wikimedia.org/T105255) [23:22:32] ori: pushed them all out...but i've left DonationInterface un-updated since it was just lingering in the deploy branch but not deployed. [23:22:40] !log ebernhardson Synchronized php-1.26wmf16/includes/specials/SpecialMIMEsearch.php: (no message) (duration: 00m 12s) [23:22:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:22:55] K4-713: ^ [23:23:38] awight was doing a lot of payments cluster updates today [23:25:10] 6operations, 6Commons, 10MediaWiki-Special-pages, 5Patch-For-Review, 7Wikimedia-log-errors: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1497228 (10ori) 5Open>3Resolved a:3ori This page is now marked as expensive, so i... [23:25:24] * K4-713 reads backscroll [23:25:41] K4-713: I pasted relevant bits into -fundraising [23:25:48] K4-713: basically i rebase 1.26wmf16 in the prod cluster on tin, and now DonationInterface reports itself as not being in sync [23:26:00] so something was merged but undeployed [23:26:02] Ah, I see. [23:26:20] 23:25 < awight> greg-g: thx. FWIW, the DonationInterface submodule can always be deployed to the main cluster. 
We use the "deploy" branch for paymentswiki stuff, but on the cluster we go out with the train. [23:26:24] 23:25 < awight> I hope that's reflected in the release configuration? [23:26:34] * greg-g just pasted [23:26:50] 23:26 < awight> On the main cluster, we only use DonationInterface for its translations, on donatewiki. [23:27:11] 6operations, 10RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1497232 (10GWicke) We downgraded this morning. So far it looks like G1GC new gen collection times might have gone up significantly: {F288218} {F288229} There is a bootstrap going on, but that was also... [23:27:14] All these things are true, yes. Or, should be. [23:27:35] ebernhardson: hey, thanks for being cautious on our behalf! I don't know anything about undeployed DonationInterface commits on the main cluster, you can blow those away any time (per above explanation) [23:27:49] 6operations, 6Commons, 10MediaWiki-Special-pages, 5Patch-For-Review, and 2 others: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1497233 (10ori) [23:27:51] ok, will ship them :) [23:27:59] Sounds like an accident happened at some point... I'm only deploying to paymentswiki today. [23:28:02] thanks! [23:28:26] And just to repeat my paste-self, we don't use the "deploy" branch on the main cluster... [23:28:48] it follows the usual release tagging on master--or should, at least. 
[23:28:56] !log disregard log entry about racktables, never offlined [23:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:29:12] !log ebernhardson Synchronized php-1.26wmf16/extensions/DonationInterface/: Bump DonationInterfae in 1.26wmf16 (duration: 00m 16s) [23:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:30:27] !log ebernhardson Synchronized php-1.26wmf16/extensions/DonationInterface/: Bump DonationInterfae in 1.26wmf16 again...its uses submodules (duration: 00m 15s) [23:30:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:33:01] ebernhardson: Hmm, crap, can I sneak something into SWAT? [23:33:18] RoanKattouw: i'm already done, do you want to just deploy it :) [23:33:24] Sure [23:33:26] Thanks [23:33:37] (03CR) 10Catrope: [C: 032] Enable Flow on plwiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228152 (https://phabricator.wikimedia.org/T107301) (owner: 10Catrope) [23:33:43] (03Merged) 10jenkins-bot: Enable Flow on plwiki and commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228152 (https://phabricator.wikimedia.org/T107301) (owner: 10Catrope) [23:35:25] csteipp, around? [23:36:15] (03CR) 10Alex Monk: "This should be OK now..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226971 (https://phabricator.wikimedia.org/T106206) (owner: 10Alex Monk) [23:37:29] Krenair: Yeah, waht's up [23:37:47] csteipp, https://gerrit.wikimedia.org/r/#/c/195886/ seems ready to go? [23:38:08] Yeah, I'd like to see that go out [23:38:10] relevant logging patch is in prod (wmf16, krenair@tin:~$ mwversionsinuse [23:38:10] 1.26wmf16) [23:39:49] csteipp, shall we do it after RoanKattouw is done with flow then? [23:40:28] Sorry will deploy that [23:41:01] Krenair: I have to leave a little after 5, but happy to +2 if you're going to be around in case something blows up. 
[23:41:12] !log catrope Synchronized flow.dblist: Enable Flow on plwiki and commonswiki (duration: 00m 11s) [23:41:14] (20 mins from now that is) [23:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:41:42] I can deploy it and will be around [23:45:28] (03CR) 10Alex Monk: [C: 032] Set $wgRateLimits['badcaptcha'] to counter bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: 10Nemo bis) [23:45:50] (03Merged) 10jenkins-bot: Set $wgRateLimits['badcaptcha'] to counter bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: 10Nemo bis) [23:46:51] !log krenair Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/195886/ (duration: 00m 12s) [23:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:30] RoanKattouw, hmm... are you sure your deployment is okay? [23:47:31] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/195886/ (duration: 00m 11s) [23:47:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:57] RoanKattouw, I'm monitoring the logs and there's a lot of entries for flow on commonswiki going into exception.log [23:48:57] [f7085c67] [no req] Flow\Exception\FlowException from line 291 of /srv/mediawiki-staging/php-1.26wmf16/extensions/Flow/includes/TalkpageManager.php: All of the candidate usernames exist, but they are not configured as expected. 
[23:49:04] Is what I just got from running a maintenance script
[23:49:10] Let me look at exception.log
[23:49:19] that's what's coming through for real user requests
[23:49:34] PROBLEM - puppet last run on ruthenium is CRITICAL Puppet has 6 failures
[23:49:53] Shit
[23:49:54] OK, undeploying
[23:50:15] plwiki seems fine though
[23:50:56] operations, RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1497335 (GWicke) C* read latency went up quite a bit as well: {F288370}
[23:51:45] (PS1) Catrope: Disable Flow on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/228177
[23:52:16] !log catrope Synchronized flow.dblist: remove commons (duration: 00m 14s)
[23:52:17] (CR) Catrope: [C: 2] Disable Flow on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/228177 (owner: Catrope)
[23:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:52:40] (Merged) jenkins-bot: Disable Flow on commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/228177 (owner: Catrope)
[23:53:49] Krenair: Hopefully that will make it stop.
plwiki doesn't seem to have problems
[23:53:49] (PS1) Ori.livneh: Use $sessionRedis to set redis server [mediawiki-config] - https://gerrit.wikimedia.org/r/228178
[23:53:59] (PS2) Ori.livneh: Use $sessionRedis to set redis server [mediawiki-config] - https://gerrit.wikimedia.org/r/228178
[23:54:05] (CR) Ori.livneh: [C: 2] Use $sessionRedis to set redis server [mediawiki-config] - https://gerrit.wikimedia.org/r/228178 (owner: Ori.livneh)
[23:54:10] (Merged) jenkins-bot: Use $sessionRedis to set redis server [mediawiki-config] - https://gerrit.wikimedia.org/r/228178 (owner: Ori.livneh)
[23:54:28] Hmm, also:
[23:54:29] 2015-07-30 23:52:42 mw1184 enwiki exception ERROR: [56f4455f] /wiki/Special:WhatLinksHere/Midwinter Flow\Exception\CatchableFatalErrorException from line 35 of /srv/mediawiki/php-1.26wmf16/extensions/Flow/includes/Model/WikiReference.php: Argument 6 passed to Flow\Model\WikiReference::__construct() must be an instance of Title, null given {"exception":"[object] (Flow\\Exception\\CatchableFatalEr
[23:54:31] rorException(code: 0): Argument 6 passed to Flow\\Model\\WikiReference::__construct() must be an instance of Title, null given at /srv/mediawiki/php-1.26wmf16/extensions/Flow/includes/Model/WikiReference.php:35)"}
[23:55:11] Krenair: I take it you figured out why proofreadpage didn't deploy at first?
[23:57:31] Oh, hm. Looks like Diffusion was being silly and told me today of yesterday's attempt. Odd.
[23:58:32] Coren, haven't looked into it, the patch is still sitting there -1'd
[23:58:34] yeah
[23:59:11] gwicke: what does 'C*' mean?