[00:06:52] !log ori Synchronized php-1.26wmf7/extensions/Echo/includes/DiffParser.php: 41d27c4a26: Update Echo for cherry-picks (duration: 00m 13s)
[00:06:56] Logged the message, Master
[00:07:07] !log ori Synchronized php-1.26wmf7/includes/diff/UnifiedDiffFormatter.php: d95cac90c7: Make the output of UnifiedDiffFormatter match diff -u (duration: 00m 14s)
[00:07:11] Logged the message, Master
[00:33:15] operations: raid1-lvm recipe broken for jessie, sets up available LVM space as swap - https://phabricator.wikimedia.org/T100636#1319422 (Dzahn) achernar: 951 acamar: 951 baham: 0 cobalt: n/a - doesn't exist except mgmt lead: 951 lithium: 951 polonium: 951 rhodium: 951 argon: 951 bast4001: 7628 copper:...
[00:39:31] operations: raid1-lvm recipe broken for jessie, sets up available LVM space as swap - https://phabricator.wikimedia.org/T100636#1319432 (Dzahn) It's a bit strange. For example i installed "subra" and "suhail" in codfw, so i know it's not that long ago and they are ok. A git log on the "raid1-lvm.cfg" shows...
[00:46:49] (CR) BryanDavis: "Fix by _joe_ in I45ddfd4c0ec63b4feeae19d9a42f7a870f34d451. Followup for non-canary servers in I526107099be5c9b9093110b94a7c3ec9856fdb3c." [puppet] - https://gerrit.wikimedia.org/r/211155 (https://phabricator.wikimedia.org/T98489) (owner: BryanDavis)
[00:50:57] operations, Phabricator, Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1319454 (csteipp) After Dzahn dropped the bugs_deleted table, I think this looks ok now. There are still a lot of full emails around, but I think those are all from places where th...
[00:56:39] (CR) Dzahn: [C: 2] access: add dbrant to researchers [puppet] - https://gerrit.wikimedia.org/r/213970 (https://phabricator.wikimedia.org/T99798) (owner: Dzahn)
[01:00:11] Ops-Access-Requests, operations, Patch-For-Review: Requesting addition to researchers group on stat1003 - https://phabricator.wikimedia.org/T99798#1319459 (Dzahn) alright, thanks. has approval and the waiting period is over. merged. ran puppet on stat1003. -- [stat1003:~] $ id dbrant uid=4910(dbrant...
[01:00:26] Ops-Access-Requests, operations, Patch-For-Review: Requesting addition to researchers group on stat1003 - https://phabricator.wikimedia.org/T99798#1319460 (Dzahn) Open>Resolved
[01:05:13] operations, Patch-For-Review: Alias docs.wikimedia.org to doc.wikimedia.org - https://phabricator.wikimedia.org/T100349#1319479 (Dzahn) p:Triage>Low
[02:31:07] !log l10nupdate Synchronized php-1.26wmf7/cache/l10n: (no message) (duration: 06m 54s)
[02:31:18] Logged the message, Master
[02:36:13] !log LocalisationUpdate completed (1.26wmf7) at 2015-05-29 02:35:10+00:00
[02:36:20] Logged the message, Master
[02:54:41] operations, network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1319531 (BBlack) >>! In T100690#1318724, @Dzahn wrote: > or did you mean to automatically add this in base and stop doing it on individual nodes? yes :)
[03:01:59] !log l10nupdate Synchronized php-1.26wmf8/cache/l10n: (no message) (duration: 10m 08s)
[03:02:06] Logged the message, Master
[03:07:55] operations, network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1319553 (BBlack) Just for reference, I looked into what facter returns for the `$ipaddress` fact, and it appears to be based on whatever it sees first in ifconfig output which isn't in the 127/...
[03:09:18] !log LocalisationUpdate completed (1.26wmf8) at 2015-05-29 03:08:15+00:00
[03:09:23] Logged the message, Master
[03:57:05] PROBLEM - puppet last run on ganeti2003 is CRITICAL puppet fail
[04:13:56] RECOVERY - puppet last run on ganeti2003 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[04:46:04] Krenair: oops. logged out now.
[04:46:13] Krenair: did it create any issue?
[04:51:22] (CR) Ori.livneh: hhvm: add memory leak isolation scripts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/212187 (owner: Ori.livneh)
[04:51:39] (PS3) Ori.livneh: hhvm: add memory leak isolation scripts [puppet] - https://gerrit.wikimedia.org/r/212187
[05:08:55] ok kids, I'm going to regen all the salt keys on prod now. if something goes awry it could be a half hour til we have new ones.
[05:19:45] regen is in process.
[05:23:37] PROBLEM - puppet last run on cp4007 is CRITICAL puppet fail
[05:24:26] operations, Epic, Wikimedia-Mailing-lists: Rename all some mailing lists with -l suffixes to get rid of that suffix - https://phabricator.wikimedia.org/T99138#1319714 (Dzahn)
[05:25:54] keys regened, now waiting for them to show up on the master to be accepted
[05:30:17] operations, Patch-For-Review: Alias docs.wikimedia.org to doc.wikimedia.org - https://phabricator.wikimedia.org/T100349#1319725 (Dzahn) a:Dzahn
[05:30:46] PROBLEM - puppet last run on tin is CRITICAL Puppet has 1 failures
[05:32:12] Ops-Access-Requests, operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1319728 (Dzahn) Open>stalled
[05:32:50] well it looks like we're going to be waiting the half hour. bah. not that things broke but just that the minions don't want to check back in without a kick
[05:35:45] heh as soon as I say that they start coming in. so prolly done in 2 minutes :-D
[05:40:15] PROBLEM - puppet last run on mw1113 is CRITICAL Puppet has 1 failures
[05:40:36] RECOVERY - puppet last run on cp4007 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[05:40:56] PROBLEM - puppet last run on mw1104 is CRITICAL Puppet has 1 failures
[05:41:43] waiting to make sure all the minions are happily reconnected
[05:41:51] apergos: !log
[05:42:01] I shall when complete
[05:42:12] ori
[05:42:22] :)
[05:42:30] prolly about 2 more minutes
[05:47:36] RECOVERY - puppet last run on tin is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[05:56:56] RECOVERY - puppet last run on mw1113 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[05:57:35] RECOVERY - puppet last run on mw1104 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:03:31] oops, of course I got lost in testing
[06:03:47] !log salt keys regenerated on all production hosts (minions, not master key)
[06:03:51] Logged the message, Master
[06:12:47] !log ori Synchronized php-1.26wmf8/includes/deferred/SiteStatsUpdate.php: Icc12c07ab: Update context stats in SiteStatsUpdate (duration: 00m 14s)
[06:12:50] Logged the message, Master
[06:13:00] !log ori Synchronized php-1.26wmf7/includes/deferred/SiteStatsUpdate.php: Icc12c07ab: Update context stats in SiteStatsUpdate (duration: 00m 13s)
[06:13:04] Logged the message, Master
[06:29:46] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 2 failures
[06:29:55] PROBLEM - puppet last run on mw1046 is CRITICAL Puppet has 2 failures
[06:29:55] PROBLEM - puppet last run on elastic1022 is CRITICAL puppet fail
[06:30:16] PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures
[06:30:46] PROBLEM - puppet last run on lvs2004 is CRITICAL Puppet has 2 failures
[06:31:08] (PS1) Ori.livneh: carbon-c-relay: blackhole stddev and sum_sq [puppet] - https://gerrit.wikimedia.org/r/214576
[06:31:15] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:31:15] PROBLEM - puppet last run on mw2145 is CRITICAL Puppet has 1 failures
[06:31:26] PROBLEM - puppet last run on multatuli is CRITICAL puppet fail
[06:31:27] PROBLEM - puppet last run on labvirt1003 is CRITICAL Puppet has 2 failures
[06:31:33] good morning puppetmaster
[06:32:46] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 2 failures
[06:33:46] PROBLEM - puppet last run on mw1025 is CRITICAL Puppet has 3 failures
[06:34:17] PROBLEM - puppet last run on mw1123 is CRITICAL Puppet has 1 failures
[06:34:45] PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures
[06:35:07] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures
[06:35:45] PROBLEM - puppet last run on mw1118 is CRITICAL Puppet has 1 failures
[06:44:08] operations: salt broken after the upgrade - https://phabricator.wikimedia.org/T100502#1319755 (ArielGlenn) After salt key regeneration, I get consistently good results with -b 100 cmd.run uptime (no timeout = default timeout of 5 seconds per batch). I'd suggest -b 50 for anything that does real work, and s...
[06:44:56] RECOVERY - puppet last run on mw1046 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:46:26] RECOVERY - puppet last run on mw2145 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:46:35] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:47:06] RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:56] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:16] RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:52:48] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri May 29 06:51:45 UTC 2015 (duration 51m 44s)
[06:52:52] Logged the message, Master
[07:06:46] RECOVERY - puppet last run on labvirt1003 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[07:07:06] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[07:07:25] RECOVERY - puppet last run on mw1025 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:07:37] RECOVERY - puppet last run on mw1118 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[07:07:46] RECOVERY - puppet last run on lvs2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:56] RECOVERY - puppet last run on mw1123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:08:06] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:08:16] RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:08:27] RECOVERY - puppet last run on elastic1022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:24:51] Blocked-on-Operations, operations, Maps, Scrum-of-Scrums, hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1319924 (akosiaris) Open>Resolved yurik confirmed access on IRC and at T100548. Resolving this.
[07:35:34] andre__: there is a complaint on otrs about your profile pictures. can you please attribute it?
[07:42:42] fuck otrs
[07:42:46] er. hi :D
[07:49:52] operations, network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1319959 (BBlack) Thinking through options how we could discover the primary interface more elegantly: 1. We could use ipresolve($fqdn) and match that against the set of $ipaddress_INTF facts,...
[07:50:06] operations, network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1319960 (BBlack)
[08:09:12] operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1319968 (akosiaris) @Robh, no the process is not that demanding, neither in CPU cycles or Disk I/O. For disk I/O, @Dzahn has added an I/O check in icinga that up to now only triggers on bacula backing up the machine, not during n...
[08:19:09] Steinsplitter: (does not sound like an operations topic?): It does not require attribution so I don't plan to do that.
[08:19:36] Steinsplitter: for the records, I commented on that a while ago already on https://www.mediawiki.org/wiki/User_talk:AKlapper_%28WMF%29#Your_profile_picture_on_Phabricator
[08:20:17] Steinsplitter, if I somehow misinterpret the requirements that I link to there, please feel free to correct me on that talk discussion and I'm happy to attribute if really required.
[08:22:55] (CR) Filippo Giunchedi: [C: -1] Add varnish request stats diamond collector (6 comments) [puppet] - https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: Ottomata)
[08:33:08] (CR) Filippo Giunchedi: [C: 1] "minor nit but LGTM otherwise" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/214261 (owner: Ottomata)
[08:36:16] (PS6) Filippo Giunchedi: add varnishstatsd [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh)
[08:36:33] andre__: thanks for the link to mw.
[08:36:41] heh, sure :)
[08:39:35] (CR) Filippo Giunchedi: [C: 1] add varnishstatsd [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh)
[08:41:03] (CR) Filippo Giunchedi: "personally I'm fine with the change but audience should be wider if we're removing stats e.g. phab" [puppet] - https://gerrit.wikimedia.org/r/214576 (owner: Ori.livneh)
[08:42:02] (CR) Mobrovac: [C: -1] CX: Log to logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: KartikMistry)
[08:46:36] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 856.488270475
[09:11:18] operations, Wikimedia-Mailing-lists: Rename pywikipedia-l to pywikibot - https://phabricator.wikimedia.org/T100707#1320035 (Multichill) Oh wait, what? We're renaming one of the lists? We actually have 4 lists (https://lists.wikimedia.org/mailman/listinfo) with 3 of them still using the old naming scheme....
[09:12:31] operations, pywikibot-core, Wikimedia-Mailing-lists: Rename pywikipedia-l to pywikibot - https://phabricator.wikimedia.org/T100707#1320038 (Multichill)
[09:13:26] operations, pywikibot-core, Wikimedia-Mailing-lists: Rename pywikipedia-l to pywikibot - https://phabricator.wikimedia.org/T100707#1320041 (JohnLewis) We can rename all four and regarding the announcement I for some reason though ladsgroup may poked the list(s). No worries, I'll send an email shortly...
[09:13:52] (CR) Giuseppe Lavagetto: [C: -1] "Without even looking at the code (which I'm sure is ok), I don't like the idea that we lose a priori any indication of which varnish serve" [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh)
[09:16:39] operations, pywikibot-core, Wikimedia-Mailing-lists: Rename pywikipedia list prefixes to pywikibot - https://phabricator.wikimedia.org/T100707#1320048 (JohnLewis)
[09:24:39] apergos: I might have upset the salt master by running salt-run jobs.active (not sure I did, just letting you know)
[09:39:08] operations: raid1-lvm recipe broken for jessie, sets up available LVM space as swap - https://phabricator.wikimedia.org/T100636#1320063 (faidon) >>! In T100636#1319432, @Dzahn wrote: > It's a bit strange. For example i installed "subra" and "suhail" in codfw, so i know it's not that long ago and they are ok....
[09:39:33] godog: if you did it's already happy again
[09:42:58] cool
[09:49:44] operations, network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1320076 (faidon) Note how add_ip6_mapped defaults to using interfaces[0], which is correctly set on all hosts but dataset1001. dataset1001 is set to eth2 because that's the 10G port, but I thin...
[10:09:47] Ops-Access-Requests, operations, Release-Engineering, Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1320153 (hashar) p:Normal>High
[10:11:41] operations, Continuous-Integration-Config: Suggestion: disable autoloader_layout checks in our jenkins puppet-lint - https://phabricator.wikimedia.org/T1289#1320156 (hashar)
[10:11:51] operations, Continuous-Integration-Config: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#1320157 (hashar)
[10:12:34] operations, Continuous-Integration-Config: Suggestion: disable autoloader_layout checks in our jenkins puppet-lint - https://phabricator.wikimedia.org/T1289#22553 (hashar) This is up to #operations to adjust the `/.puppet-lint.rc` file in operations/puppet.git.
[10:12:41] operations, Continuous-Integration-Config: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#768414 (hashar) This is up to #operations to adjust the `/.puppet-lint.rc` file in operations/puppet.git.
[10:13:08] operations, Continuous-Integration-Config: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#1320163 (hashar) p:Normal>Low
[10:14:05] operations, Continuous-Integration-Config: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#768414 (hashar)
[10:14:07] operations, Continuous-Integration-Config: Suggestion: disable autoloader_layout checks in our jenkins puppet-lint - https://phabricator.wikimedia.org/T1289#22553 (hashar)
[10:17:34] operations, Continuous-Integration-Config: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#1320176 (hashar) The lint check is currently disabled in `/.puppet-lint.rc` I don't think it can be ignored for a specific file hierarchy. So one...
[10:32:05] operations, Release-Engineering, Performance: performance testing environment - https://phabricator.wikimedia.org/T67394#1320211 (hashar)
[10:32:42] operations, Release-Engineering, Performance: performance testing environment - https://phabricator.wikimedia.org/T67394#1320215 (hashar) Open>stalled >>! In T282#1262238, @greg wrote: > Setting to Stalled, it's probably something that will come up again, but you're right, not on the plan for now.
[10:33:34] operations, Continuous-Integration-Infrastructure, Graphite, Upstream, Zuul: Let us customize Zuul metrics reported to statsd - https://phabricator.wikimedia.org/T1369#1320220 (hashar)
[10:36:57] operations: Backport & test firmware-linux 0.44 - https://phabricator.wikimedia.org/T100771#1320225 (faidon) NEW
[10:38:38] operations: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1320242 (faidon) NEW
[10:38:51] operations: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1320249 (faidon)
[10:38:53] operations: Backport and include linux-tools-3.19 to our jessie repository - https://phabricator.wikimedia.org/T100216#1320250 (faidon)
[10:39:06] moritzm: ^^ :)
[10:42:16] paravoid: the switch in d-i is done in gerrit, I wanted to hold back the push until I have perf ready, but I can go ahead with it earlier
[10:42:25] I'll claim the Phab tasks
[10:43:10] operations: Backport & test firmware-linux 0.44 - https://phabricator.wikimedia.org/T100771#1320253 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[10:43:16] yeah, I added a blocked by too, so that sounds sane to me
[10:43:34] just installed a host yesterday and realized it needed a reboot for 3.19 :)
[10:44:26] operations: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1320258 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[10:44:34] operations: raid1-lvm recipe broken for jessie, sets up available LVM space as swap - https://phabricator.wikimedia.org/T100636#1320260 (faidon) Ubuntu's partman-auto-lvm changelog for trusty [[ http://changelogs.ubuntu.com/changelogs/pool/main/p/partman-auto-lvm/partman-auto-lvm_51ubuntu1/changelog | mention...
[10:45:37] operations: LVM recipes broken for jessie, set up all remaining LVM space as swap - https://phabricator.wikimedia.org/T100636#1320261 (faidon)
[10:45:45] godog: ^ :)
[10:47:07] indeed
[10:55:34] operations, Wikimedia-Logstash, Elasticsearch, Monitoring: Icinga monitoring for elasticsearch doesn't notice OOM conditions - https://phabricator.wikimedia.org/T76090#1320280 (hashar)
[10:55:47] operations, Wikimedia-Logstash, Elasticsearch, Monitoring: Icinga monitoring for elasticsearch doesn't notice OOM conditions - https://phabricator.wikimedia.org/T76090#789335 (hashar) Moving that monitoring task from #releng to #ops
[10:55:53] operations: LVM recipes broken for jessie, set up all remaining LVM space as swap - https://phabricator.wikimedia.org/T100636#1320283 (fgiunchedi) on the debian side similar issues are reported as [[ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=517935 | #517935 ]] or [[ https://bugs.debian.org/cgi-bin/bu...
[10:57:53] operations, Traffic: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1320303 (faidon) >>! In T98003#1318539, @hashar wrote: > The operations-dns-lint job runs on Jenkins slaves in prod (gallium and lanthanum) and is one of the last job still running there. I tried earlie...
[11:00:42] operations, Icinga: Icinga: expose hosts in maintenance state so we can prevent scap from running on them - https://phabricator.wikimedia.org/T100777#1320316 (hashar) NEW
[11:02:44] operations, Icinga: Icinga: expose hosts in maintenance state so we can prevent scap from running on them - https://phabricator.wikimedia.org/T100777#1320333 (faidon) That doesn't sound like a very good idea. It should be the other way around: we set hosts under maintenance somewhere else (e.g. etcd ;)) an...
[11:08:44] (PS1) Faidon Liambotis: authdns: small fix for the Ganglia gdnsd plugin [puppet] - https://gerrit.wikimedia.org/r/214591
[11:08:57] (PS12) Giuseppe Lavagetto: confd: create module [puppet] - https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974)
[11:09:16] (CR) Faidon Liambotis: [C: 2] authdns: small fix for the Ganglia gdnsd plugin [puppet] - https://gerrit.wikimedia.org/r/214591 (owner: Faidon Liambotis)
[11:29:30] operations, MediaWiki-Logging, Release-Engineering, HHVM: SlowTimer logs should go to their own location, instead of hhvm.log - https://phabricator.wikimedia.org/T94855#1320406 (hashar) Open>Resolved a:hashar The hhvm SlowTimer errors are still written to hhvm.log. We got logstash now thou...
[11:41:02] !log redirecting ns0 traffic to baham (= ns1) in preparation for rubidium upgrade
[11:41:05] Logged the message, Master
[11:41:06] operations, Wikimedia-log-errors: mw1150 spams "memcached error for key" since May 29 3:00am UTC - https://phabricator.wikimedia.org/T100780#1320416 (hashar) NEW
[11:45:36] <_joe_> !log restart nutcracker on mw1150
[11:45:40] Logged the message, Master
[11:47:34] (PS1) Faidon Liambotis: autoinstall: change rubidium's recipe to raid1-lvm [puppet] - https://gerrit.wikimedia.org/r/214593
[11:48:07] (CR) Faidon Liambotis: [C: 2 V: 2] autoinstall: change rubidium's recipe to raid1-lvm [puppet] - https://gerrit.wikimedia.org/r/214593 (owner: Faidon Liambotis)
[11:48:17] operations, Wikimedia-log-errors: mw1150 spams "memcached error for key" since May 29 3:00am UTC - https://phabricator.wikimedia.org/T100780#1320424 (Joe) All servers have been ejected around 3 AM UTC and never recovered. We can probably monitor this kind of problems, and maybe also try to pin down a bit b...
[11:48:37] operations, Wikimedia-log-errors: mw1150 spams "memcached error for key" since May 29 3:00am UTC - https://phabricator.wikimedia.org/T100780#1320426 (Joe) Open>Resolved a:Joe
[11:49:09] <_joe_> hashar: it's basically a duplicate of the former ticket I named there, so resolving.
[11:49:23] sure
[11:49:28] just making sure something is filed :D
[11:53:25] PROBLEM - Host rubidium is DOWN: CRITICAL - Host Unreachable (208.80.154.40)
[11:53:33] !log reimaging rubidium
[11:53:37] Logged the message, Master
[11:55:16] PROBLEM - Host ns0-v6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:55:45] RECOVERY - Host rubidium is UPING OK - Packet loss = 0%, RTA = 1.24 ms
[11:56:17] <_joe_> hashar: if you see that happening again let me know, I may have some more time to understand what's the situation a bit better
[11:56:42] I found out another dashboard at https://logstash.wikimedia.org/#/dashboard/elasticsearch/memcached-serious
[11:57:00] do we have any Icinga checks that rely on logstash?
[11:59:06] PROBLEM - RAID on rubidium is CRITICAL: Connection refused by host
[11:59:06] PROBLEM - dhclient process on rubidium is CRITICAL: Connection refused by host
[11:59:06] PROBLEM - configured eth on rubidium is CRITICAL: Connection refused by host
[11:59:16] PROBLEM - Disk space on rubidium is CRITICAL: Connection refused by host
[11:59:45] PROBLEM - puppet last run on rubidium is CRITICAL: Connection refused by host
[12:00:06] PROBLEM - salt-minion processes on rubidium is CRITICAL: Connection refused by host
[12:00:15] PROBLEM - Auth DNS on rubidium is CRITICAL - Plugin timed out while executing system call
[12:00:46] PROBLEM - DPKG on rubidium is CRITICAL: Connection refused by host
[12:01:39] PROBLEM - puppet last run on mw2055 is CRITICAL Puppet has 1 failures
[12:02:37] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[12:04:08] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[12:09:37] PROBLEM - puppet last run on cp4003 is CRITICAL puppet fail
[12:15:39] (CR) Muehlenhoff: "At least for jessie installations we could rely on systemd-detect-virt (returns "kvm" on a jessie labs instance and "none" on standard har" [puppet] - https://gerrit.wikimedia.org/r/214377 (owner: Alexandros Kosiaris)
[12:18:07] RECOVERY - puppet last run on mw2055 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[12:20:48] _joe_: the ipresolve patch works, just tested it. going to merge it now
[12:21:02] <_joe_> YuviPanda: yeah seems good to me
[12:21:04] (PS5) Yuvipanda: wmflib: Add nameserver parameter to ipresolve function [puppet] - https://gerrit.wikimedia.org/r/212784 (https://phabricator.wikimedia.org/T99833)
[12:21:10] (PS2) Alexandros Kosiaris: install-server: Accomodate virtualization [puppet] - https://gerrit.wikimedia.org/r/214377
[12:21:13] (CR) Yuvipanda: [C: 2 V: 2] wmflib: Add nameserver parameter to ipresolve function [puppet] - https://gerrit.wikimedia.org/r/212784 (https://phabricator.wikimedia.org/T99833) (owner: Yuvipanda)
[12:22:06] (CR) Alexandros Kosiaris: "I came up with an alternative approach, blending both your proposals together. @Faidon, unfortunately stil not virtio-scsi for ganeti http" [puppet] - https://gerrit.wikimedia.org/r/214377 (owner: Alexandros Kosiaris)
[12:26:28] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:32:36] (PS2) Yuvipanda: tools: Make redis failover-able [puppet] - https://gerrit.wikimedia.org/r/212792 (https://phabricator.wikimedia.org/T99737)
[12:35:17] RECOVERY - Host ns0-v6 is UPING OK - Packet loss = 0%, RTA = 1.36 ms
[12:36:19] YuviPanda, FYI: lag on labsdb1 for s1 https://phabricator.wikimedia.org/P701 I am not doing anything about it for now
[12:36:53] jynus: alright. you should feel free to kill terrible queries on labsdb without prejudice
[12:37:18] everyone will highly appreciate it :)
[12:37:28] I will if it continues, but saw on the graphs it is a "common" occurrence
[12:37:31] operations, Traffic: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1320527 (hashar) > Well, time to start actively maintaining it then :) We probably have more jessie hosts than precise nowadays and testing our DNS config in a distribution that is 5 years older than pr...
[12:37:44] but I think I can fix it long term on configuration
[12:37:50] oooooh cool ;D
[12:41:40] (CR) Yuvipanda: [C: 2] tools: Make redis failover-able [puppet] - https://gerrit.wikimedia.org/r/212792 (https://phabricator.wikimedia.org/T99737) (owner: Yuvipanda)
[12:43:11] PROBLEM - Host rubidium is DOWN: CRITICAL - Host Unreachable (208.80.154.40)
[12:45:01] PROBLEM - Host ns0-v6 is DOWN: PING CRITICAL - Packet loss = 100%
[12:46:02] RECOVERY - Host ns0-v6 is UPING OK - Packet loss = 0%, RTA = 2.88 ms
[12:46:12] RECOVERY - Host rubidium is UPING OK - Packet loss = 0%, RTA = 0.79 ms
[12:53:26] (PS1) Yuvipanda: toollabs: Specify IPv4 as addresstype for ipresolve [puppet] - https://gerrit.wikimedia.org/r/214604
[12:53:34] (CR) Yuvipanda: [C: 2 V: 2] toollabs: Specify IPv4 as addresstype for ipresolve [puppet] - https://gerrit.wikimedia.org/r/214604 (owner: Yuvipanda)
[12:55:00] operations, network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1320554 (BBlack) The $interfaces list is completely arbitrary, much like $ipaddress. Looking back at add_ip6_mapped, it relies on that sort of magic as well. So yeah, perhaps a custom fact or...
[13:10:28] operations, discovery-system, services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1320581 (Joe) NEW
[13:11:12] <_joe_> paravoid, godog, akosiaris, bblack any input very welcome on ^^. A reasonably fast one if possible :)
[13:12:07] <_joe_> (as in: before the end of next week)
[13:14:48] _joe_: ack
[13:17:09] !log roll-restart cassandra on cerium / xenon / praseodymium following java upgrade
[13:17:16] Logged the message, Master
[13:18:02] PROBLEM - Host rubidium is DOWN: CRITICAL - Host Unreachable (208.80.154.40)
[13:18:42] PROBLEM - Host ns0-v6 is DOWN: PING CRITICAL - Packet loss = 100%
[13:18:51] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail
[13:19:39] (PS1) Yuvipanda: tools: Don't attempt to replicate from master to master [puppet] - https://gerrit.wikimedia.org/r/214605
[13:20:17] (PS10) Ottomata: Add varnishlog python module [puppet] - https://gerrit.wikimedia.org/r/214261
[13:21:01] jynus: I haven’t settled down to work properly yet, but let’s do the holmium db migration in a few minutes if you’re available.
[13:21:21] (03CR) 10Ottomata: Add varnishlog python module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/214261 (owner: 10Ottomata) [13:21:30] yep it's ok, let me check that everithing is working on my side [13:21:34] (03CR) 10Yuvipanda: [C: 032] tools: Don't attempt to replicate from master to master [puppet] - 10https://gerrit.wikimedia.org/r/214605 (owner: 10Yuvipanda) [13:21:38] andrewbogott, ^ [13:22:22] (03CR) 10Ottomata: "Just in case you haven't seen this: this one will have hostname reports: https://gerrit.wikimedia.org/r/#/c/212041/" [puppet] - 10https://gerrit.wikimedia.org/r/214147 (owner: 10Ori.livneh) [13:25:45] (03PS1) 10Yuvipanda: tools: Simplify and fix tools-redis master selection [puppet] - 10https://gerrit.wikimedia.org/r/214606 [13:27:09] (03Abandoned) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/212302 (https://phabricator.wikimedia.org/T92693) (owner: 10Jcrespo) [13:27:22] * aude is having some issues with css styling on test.wikidata [13:27:26] (03CR) 10Yuvipanda: [C: 032] tools: Simplify and fix tools-redis master selection [puppet] - 10https://gerrit.wikimedia.org/r/214606 (owner: 10Yuvipanda) [13:27:42] would like to touch those files and sync Wikibase stuff on wmf8 [13:27:56] suppose no one is deploying now or minds... [13:30:01] !log aude Synchronized php-1.26wmf8/extensions/Wikidata: touch js and css files to try to fix issues on test.wikidata (duration: 00m 26s) [13:30:05] Logged the message, Master [13:31:06] (03CR) 10Ottomata: "The diamond collector stuff needs reworked again anyway. The stuff I pushed through yesterday broke something in labs." 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata) [13:33:42] (03PS1) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 [13:34:05] jynus: I’m back now, mostly [13:34:45] (03PS2) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 [13:35:14] I have abandoned the other patch [13:35:23] ^this is the new one [13:35:29] but do not +2 [13:35:41] have to check connectivity [13:36:55] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [13:38:56] (03PS1) 10Filippo Giunchedi: install-server: add WMF5842 back as d-i-test [puppet] - 10https://gerrit.wikimedia.org/r/214608 (https://phabricator.wikimedia.org/T100636) [13:40:05] paravoid: ^ [13:40:16] (03PS2) 10Filippo Giunchedi: install-server: add WMF5842 back as d-i-test [puppet] - 10https://gerrit.wikimedia.org/r/214608 (https://phabricator.wikimedia.org/T100636) [13:46:09] jynus: nothing for me to do except watch and cross my fingers, right? [13:46:31] yep, cannot currently access the new db from holmium [13:46:42] something may be wrong with the grants [13:46:53] checking it now [13:48:59] andrewbogott, can you confirm that this is wrong: https://gerrit.wikimedia.org/r/#/c/214607/2/templates/mariadb/production-grants-m5.sql.erb [13:49:12] the host should be 20X.XXXXX [13:49:15] correct? 
[13:49:36] 208.80.154.12 [13:49:56] yes, I’m pretty sure holmium doesn’t have an internal IP [13:51:40] 208.80.154.12 is right [13:52:42] strange [13:52:55] on the current database, it is using live 10.x ips [13:54:49] I’m surprised that works [13:55:12] oh [13:55:22] I think it is because the proxy [13:56:20] yep, that is the ip of the proxy [13:56:37] for now, as there are no slaves, we will have to connect directly [13:57:06] (on the good side, you will have exclusive use of the server) [13:57:48] oh, m5 is just labs stuff? [13:58:23] m5 is right now openstack stuff (+pdns and designate) [13:59:09] (03PS3) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 [14:03:09] now pdns and pdns_admin work, but not designate [14:04:45] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 4961.00381896 [14:05:21] that password changed and is not puppetized [14:05:30] amending again [14:06:44] the designate db password isn’t puppetized? [14:07:03] it is [14:07:17] but not the user creation on maridb side [14:07:24] oh, I see [14:07:51] just check the last patch and you will see it [14:08:04] one thing to keep in mind about all these dbs — periodically openstack will do scripted upgrades from the client side. So we need a way for the users to make schema changes; it doesn’t have to be allowed all the time as long as it’s easy to switch on and off. 
[14:08:09] Probably you’re on top of that already [14:08:17] (03PS4) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 [14:10:58] yep, yep [14:11:09] ok, connections and grants testes [14:11:23] do a +1 [14:11:27] if it seems ok [14:11:48] and we will go into maintenance mode and deploy [14:12:47] you will now better than me what things may go down, andrewbogott [14:13:01] (03CR) 10Andrew Bogott: [C: 031] Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 (owner: 10Jcrespo) [14:13:09] yeah, I’m thinking about what to test [14:14:20] (03PS5) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 [14:14:53] (03CR) 10Jcrespo: [C: 032] Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 (owner: 10Jcrespo) [14:16:58] shall I do a puppet update on holmium or are you doing that already? [14:17:18] I won't be able to set the original server in read-only mode, as it has othe stuff [14:17:23] but we should be ok [14:17:38] not merged en puppet yet, can I? [14:17:44] yep, let’s do it [14:17:57] !log Moving pdns and designate databases from m1 to m5 [14:18:00] Logged the message, Master [14:18:43] "puppet already in progress" [14:18:50] that’s me! try again [14:18:54] :-) [14:19:57] I do not see accesses from dns, but could be normal [14:20:14] can you do something that may read/write to the db? [14:20:21] sure [14:21:09] we are looking mainly from "could not connect to mysql" errors [14:21:12] *for [14:21:30] I just created a new dns entry and it’s working [14:21:33] do you see the writes? [14:21:41] it should’ve writtent to both designate and pdns I think [14:22:16] I see homium still connected to de proxy [14:22:31] maybe it only gets the conf on service restart? [14:23:47] Surprised puppet didn’t do that... [14:23:49] shall I restart? 
[14:24:27] on netstat: holmium.wikimedia:41861 dbproxy1001.eqiad:mysql ESTABLISHED [14:24:51] if they are persistent connections, I wouldn't be surprised [14:25:08] I can drop the connections as an alternative [14:26:13] I will kill the mysql connections first and see if the new conf takes place [14:26:18] ok [14:28:17] nope, they reconnect to the same host- config has not been applied [14:29:13] <_joe_> jynus, andrewbogott do a netstat on the db server and the client [14:29:26] <_joe_> and find out which process on the client is recreating said connection [14:29:29] <_joe_> *s [14:30:09] puppet wasn't on the latesd config change, not it is [14:33:33] _joe_ what is easy- but I think it requires a restart [14:33:47] <_joe_> probably, yes [14:33:50] 24777/pdns_server-i [14:33:57] 30474/python [14:33:58] <_joe_> some services don't pick up config live [14:34:28] if only mysql did, it would make my lide 10000x easier [14:34:38] :-) [14:35:04] <_joe_> eheh, well you have ways to modify a great deal of config on a live mysql server [14:35:10] let me try killing mysql again after having run puppet [14:35:15] not mysql [14:35:20] the mysql connections [14:35:26] <_joe_> yeah I was about to ask :P [14:35:27] _joe_, not the important stuff [14:35:59] although 5.6 and 5.7 is getting better [14:36:03] <_joe_> jynus: of course, live-reloading the important stuff would be a pain to implemnent :) [14:36:14] dynamic buffer pool size [14:36:22] tablespace management [14:37:15] ops, we have contact, andrewbogott [14:37:26] jynus: great, want me to run another test? 
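(Editor's aside.) The check discussed above, finding which client process owns the connections to the old database proxy, can be scripted. This is only a minimal sketch: it assumes `netstat -tp`-style columns, and the sample rows are illustrative, pieced together from fragments quoted in the channel. The second row's host and port, and the pairing of each PID with a peer, are assumptions rather than captured output.

```python
from collections import defaultdict

# Illustrative rows in `netstat -tp` column order:
# proto recv-q send-q local-address foreign-address state pid/program
# The second row (host name and port) is hypothetical.
SAMPLE = """\
tcp 0 0 holmium.wikimedia:41861 dbproxy1001.eqiad:mysql ESTABLISHED 24777/pdns_server-i
tcp 0 0 holmium.wikimedia:41902 newdb.example:mysql ESTABLISHED 30474/python
"""


def mysql_peers_by_process(netstat_text):
    """Map 'pid/program' to the set of remote hosts it reaches on the mysql port."""
    peers = defaultdict(set)
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a full TCP row.
        if len(fields) < 7 or fields[0] != "tcp":
            continue
        foreign, state, owner = fields[4], fields[5], fields[6]
        host, _, port = foreign.rpartition(":")
        if state == "ESTABLISHED" and port in ("mysql", "3306"):
            peers[owner].add(host)
    return dict(peers)


for owner, hosts in sorted(mysql_peers_by_process(SAMPLE).items()):
    print(owner, "->", ", ".join(sorted(hosts)))
```

A process that still lists the old proxy after a puppet run is a candidate for a restart, which matches the suspicion voiced in the channel that some services only read their config at startup.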
[14:37:34] wait [14:37:38] * andrewbogott waits [14:37:47] let me see if there are threads hanging out on the old server [14:38:13] yes, they are [14:38:47] (03PS1) 10BBlack: new facts for canonical ipv4 addr/interface [puppet] - 10https://gerrit.wikimedia.org/r/214617 (https://phabricator.wikimedia.org/T100690) [14:40:03] and if I kill then they come back [14:41:10] I can force them to fail [14:42:06] does that mean there are still services on holmium with the old config? [14:43:05] yep [14:43:10] 30474/python is ok [14:43:22] 24777/pdns_server-i is not [14:43:37] (03CR) 10BBlack: [C: 032] new facts for canonical ipv4 addr/interface [puppet] - 10https://gerrit.wikimedia.org/r/214617 (https://phabricator.wikimedia.org/T100690) (owner: 10BBlack) [14:44:05] we can kill gracefully/restart 24777/pdns_server-i ? [14:44:13] sure [14:44:14] one second [14:44:16] (03CR) 10Alexandros Kosiaris: "@Muehnlenhoff, that's d-i. unfortunately systemd-detect-virt is not installed" [puppet] - 10https://gerrit.wikimedia.org/r/214377 (owner: 10Alexandros Kosiaris) [14:44:33] the pdns config is still pointing to m1 [14:44:42] on dick? [14:44:46] *disk [14:44:54] yes, let me check... [14:45:22] I will fix, will take a few minutes [14:45:57] maybe the old thread has the file open and it has been overwriten, but not for that thread? [14:46:22] or some cache issue? [14:46:45] RECOVERY - Host ns0-v6 is UPING OK - Packet loss = 0%, RTA = 2.41 ms [14:47:24] no, impossible, I can see it now [14:47:36] I think the puppet code is just wrong [14:48:42] (03PS1) 10Andrew Bogott: Switch labs pdns/mysql to the new db host [puppet] - 10https://gerrit.wikimedia.org/r/214618 [14:48:57] jynus: ^ [14:49:27] wow [14:49:43] I thought I had greped m1 everyware [14:49:50] is that recent? [14:50:14] I don’t think so… [14:50:15] (03CR) 10Giuseppe Lavagetto: "I assumed this patch was superseding that one, not complementing it. Sorry for the confusion." 
[puppet] - 10https://gerrit.wikimedia.org/r/214147 (owner: 10Ori.livneh) [14:50:24] *shrug* I missed it too. [14:50:29] any reason not to merge? [14:50:30] thanks :) [14:50:35] merge, merge [14:50:54] (03CR) 10Andrew Bogott: [C: 032] Switch labs pdns/mysql to the new db host [puppet] - 10https://gerrit.wikimedia.org/r/214618 (owner: 10Andrew Bogott) [14:51:10] good news is that this is not causing any outages, as both servers are coexisting [14:52:11] !log re-redirecting ns0 traffic back to rubidium [14:52:14] Logged the message, Master [14:52:41] jynus: puppet claims that the service refreshed… do things look right now? [14:53:16] yep [14:53:22] everithing is on db1009 now [14:53:35] let me do an aditional test [14:54:10] which is blocking connections to pdns, pdns_admin and designate to the old db [14:54:15] RECOVERY - puppetmaster backend https on rhodium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.642 second response time [14:54:20] and we can call it a day [14:58:15] ok,locked account on the old server [14:58:30] replication stopped [14:58:43] cool. One last test... [15:00:17] you may have had a smallist time where users cannot log [15:00:29] but it should be ok now [15:01:33] yeah, created a new instance, dns is working fine for it [15:03:13] I am getting ‘DuplicateRecord’ errors when I create new instances. I’m pretty sure that didn’t happen before. [15:03:19] Could that be because of master/master? [15:03:28] yep [15:03:48] do you have more debug info? [15:04:07] (03PS14) 10Mobrovac: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [15:04:23] not especially. [15:04:27] looking [15:04:36] but, also, we could just turn of syncing and see if that fixes it? 
[15:04:59] sync is off [15:05:13] here’s a log snippet https://dpaste.de/jwFP [15:05:43] that is ok [15:06:00] I just created a new entry, got the same exception [15:06:03] do you know what’s happening? [15:06:18] Looks like there are a couple of those from yesterday as well, so maybe this is unrelated to the switch-over [15:09:44] jynus? [15:10:08] (03PS15) 10Mobrovac: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh) [15:10:35] (03CR) 10Filippo Giunchedi: "overriding like that sounds good to me, it seems easy to miss the fact that echo is two files instead of one, a separate case statement af" [puppet] - 10https://gerrit.wikimedia.org/r/214377 (owner: 10Alexandros Kosiaris) [15:10:36] I am doing a hack, andrewbogott to fix it, be it caused by the migration or not [15:11:04] jynus: ok — things are working anyway, I’m just concerned that it might be a bug in designate or one of the drivers I wrote [15:11:33] lets talk in private [15:16:30] andrewbogott: q: virt1000 puppetmaster does not have a database backend, does it ? [15:16:41] no exported resources, no storedconfigs, no nothing [15:16:49] I don’t think it does [15:17:13] ok thanks. there will be a couple of PS on your plate for review soon [15:17:21] heads up ;-) [15:19:07] !log anomie Synchronized php-1.26wmf8/extensions/ConfirmEdit/: Update ConfirmEdit to fix API breakage [[gerrit:214620]] (duration: 00m 14s) [15:19:10] Logged the message, Master [15:25:47] andrewbogott, forgot: old db on virt1000 was depupetized but not deleted [15:36:07] jynus: that’s fine, I’m going to rebuild that box entirely sometime soon [15:36:40] can I delete the dbs on m1 once the backups are up too? [15:36:40] * andrewbogott has a migraine, laying low today [15:36:50] jynus: Sure, I don’t know why not. [15:37:00] thank you, take it easy! 
[15:42:16] (03CR) 10Jcrespo: "For the record, this patch was incomplete without https://gerrit.wikimedia.org/r/#/c/214618/" [puppet] - 10https://gerrit.wikimedia.org/r/214607 (owner: 10Jcrespo) [15:55:15] (03PS11) 10Ori.livneh: Add varnishlog python module [puppet] - 10https://gerrit.wikimedia.org/r/214261 (owner: 10Ottomata) [15:55:22] (03CR) 10Ori.livneh: [C: 032 V: 032] Add varnishlog python module [puppet] - 10https://gerrit.wikimedia.org/r/214261 (owner: 10Ottomata) [15:58:13] (03CR) 10Filippo Giunchedi: [C: 04-1] "holding this off" [puppet] - 10https://gerrit.wikimedia.org/r/214608 (https://phabricator.wikimedia.org/T100636) (owner: 10Filippo Giunchedi) [16:09:54] (03PS7) 10Ori.livneh: add varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214147 [16:10:00] (03CR) 10jenkins-bot: [V: 04-1] add varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214147 (owner: 10Ori.livneh) [16:10:17] (03PS8) 10Ori.livneh: add varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214147 [16:12:24] (03CR) 10Ori.livneh: [C: 032] add varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214147 (owner: 10Ori.livneh) [16:15:29] (03PS1) 10Alexandros Kosiaris: role::puppet::server::labs Remove unused configuration [puppet] - 10https://gerrit.wikimedia.org/r/214637 [16:15:31] (03PS1) 10Alexandros Kosiaris: role::puppet::server::labs clean up allow_from [puppet] - 10https://gerrit.wikimedia.org/r/214638 [16:15:33] (03PS1) 10Alexandros Kosiaris: lint: fully qualify puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/214639 [16:15:35] (03PS1) 10Alexandros Kosiaris: Move certmanager hostname configuration to hiera [puppet] - 10https://gerrit.wikimedia.org/r/214640 [16:15:37] (03PS1) 10Alexandros Kosiaris: Rename role::puppet::server::labs [puppet] - 10https://gerrit.wikimedia.org/r/214641 [16:15:39] (03PS1) 10Alexandros Kosiaris: Rename role::puppet::self to role::puppetmaster::self [puppet] - 10https://gerrit.wikimedia.org/r/214642 [16:18:43] (03PS1) 10Rush: 
dumps: create misc dir under other [puppet] - 10https://gerrit.wikimedia.org/r/214643 [16:20:48] (03CR) 10ArielGlenn: [C: 032] dumps: create misc dir under other [puppet] - 10https://gerrit.wikimedia.org/r/214643 (owner: 10Rush) [16:25:41] 6operations, 7Wikimedia-log-errors: mw1150 spams "memcached error for key" since May 29 3:00am UTC - https://phabricator.wikimedia.org/T100780#1321181 (10bd808) I'm going to work on getting log event rates into graphite with the hope of using that to set some general "go look at logstash" alerts (T100735) for... [16:29:12] ss -t -a dst "*:mysql" | wc -l -> 24654 [16:29:16] cat /proc/sys/net/ipv4/ip_local_port_range -> 32768 61000 [16:29:19] mmmm [16:30:50] echo 61000-32768 | bc -> 28232 [16:37:23] PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures [16:38:59] netstat -s -> 1723817 connections reset due to early user close [16:39:11] (03PS1) 10Ori.livneh: Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 [16:39:36] ^ ottomata, godog [16:39:52] (03CR) 10jenkins-bot: [V: 04-1] Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 (owner: 10Ori.livneh) [16:40:16] (03PS2) 10Ori.livneh: Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 [16:40:21] why does one host have 24K connections outbound to mysql? :) [16:40:46] I think we're supposed to play sad_trombone.wav when jenkins -1s [16:40:56] (03CR) 10jenkins-bot: [V: 04-1] Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 (owner: 10Ori.livneh) [16:42:37] bblack, ss -t -a dst "*:mysql" | grep TIME-WAIT | wc -l -> 22154 [16:42:47] (03PS3) 10Ori.livneh: Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 [16:43:33] maybe we are wrong for T98489 ? 
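(Editor's aside.) The port arithmetic pasted above can be reproduced in a few lines. This is a rough sketch using only the figures quoted in the channel; it assumes both ends of `ip_local_port_range` are usable, which gives one more port than the `bc` subtraction. As far as I understand, the kernel can reuse a local port toward different destinations, so treating the range as one shared pool probably overstates the pressure.

```python
def ephemeral_port_usage(open_sockets, port_range):
    """Return (ports in range, fraction consumed), treating the range as one pool."""
    low, high = port_range
    available = high - low + 1  # assumption: both ends of the range are usable
    return available, open_sockets / available


# Figures quoted in the channel.
MYSQL_SOCKETS = 24654        # ss -t -a dst "*:mysql" | wc -l
TIME_WAIT_SOCKETS = 22154    # same, filtered on TIME-WAIT
PORT_RANGE = (32768, 61000)  # /proc/sys/net/ipv4/ip_local_port_range

available, fraction = ephemeral_port_usage(MYSQL_SOCKETS, PORT_RANGE)
time_wait_share = TIME_WAIT_SOCKETS / MYSQL_SOCKETS

print(f"{available} ports in range, {fraction:.1%} matched by mysql sockets")
print(f"{time_wait_share:.1%} of those sockets are in TIME-WAIT")
```

The second figure is the more telling one: most of the sockets are already closed and merely waiting out TIME-WAIT, which points at connection churn rather than long-lived connections.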
[16:43:46] godog: i wrapped the new config with "if $::hostname == 'cp1048'", so that if there is some annoying puppet bug it doesn't cause puppet failures across all varnishes [16:44:06] it's currently impossible to test this on labs, the labs varnishes are really behind and some recent puppet patches don't apply there [16:44:11] it is not 1 host, it is ALL hosts [16:44:18] like _joe_ / bblack's backend retry thing [16:44:38] well, all mw* I mean [16:44:49] jynus: is this new? [16:45:20] ori: why do you need a local statsd? [16:45:31] ottomata: i don't [16:45:34] oh i see, you don't [16:45:34] worry [16:45:36] sorry [16:45:39] you are just reusing the class [16:45:39] ori [16:45:39] hmm [16:45:52] ori, I am digging right now, cannot say- it is not a problem, I am trying to debug the mentioned ticket [16:46:23] but running out of local ports could be a clue? [16:46:28] ori: yup I'll take a look in '15 [16:46:39] if it's all systems, then local_port_range isn't an issue [16:46:58] well, all systems are creating the same error [16:47:05] ok [16:47:25] logstash, filter by dberror [16:47:30] what I mean is, how on earth can the processes on one system need 24K outbound mysql connections [16:47:40] they are closing connections [16:48:04] even if it is not the reason, we should reduce time_wait time or disable it [16:48:31] disabling time_wait or reducing it drastically is usually a bad idea, usually there's a better way to fix the issue [16:48:38] yeah, i'm looking, so is bd808 [16:48:58] the good logstash dashboard for this is -- https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError [16:49:29] the errors seem to be spread across mw hosts and db hosts pretty evenly [16:49:30] that timeout change should have reduced the number of disconnect/reconnects, not increased them, but yeah who knows... 
[16:50:06] I see no change [16:50:12] the nominal error rate is the same for the last 24 hours (minus normal traffic gradient) [16:50:13] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [16:50:13] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: Connection refused [16:50:13] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures [16:50:21] so maybe client-related, not server-related [16:50:30] bd808, I agree [16:50:44] I was looking for an alternative explanation [16:51:28] it is know to be worse on higher load, which could explain it (but I have not proof) [16:51:37] *known [16:51:41] <_joe_> bd808: can you look at the canaries? [16:51:48] <_joe_> are they also having the same error? [16:52:18] oh, so it has not been aplied system-wide? [16:52:26] <_joe_> I think so, yes [16:52:27] _joe_: I think we have the same config everywhere now [16:52:30] <_joe_> ok [16:53:00] https://gerrit.wikimedia.org/r/#/c/214295/ is merged [16:53:50] it's not uniform: https://dpaste.de/mneu/raw [16:54:02] <_joe_> so, looking at the hhvm code I wasn't so sure that hhvm.mysql.connect_timeout would solve something [16:54:32] RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:54:33] ori, that is nice- is there any correlation [16:54:49] <_joe_> ori: try to correlate that with rows/racks maybe :) [16:54:58] ha ha [16:55:01] I've done this before [16:55:03] <_joe_> I have friends waiting for me at the pub [16:55:12] PROBLEM - check_puppetrun on barium is CRITICAL puppet fail [16:55:12] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [16:55:12] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: Connection refused [16:55:12] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures [16:55:27] we used to have DB connections errors when row D had just two uplinks 
[16:55:29] (03CR) 10Filippo Giunchedi: [C: 031] Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 (owner: 10Ori.livneh) [16:55:50] correlating this to rack/row wasn't trivial, but doable and useful [16:56:01] <_joe_> paravoid: the errors seem to concentrate on the newer appservers [16:56:02] so it was ip level error? [16:56:17] _joe_: are they weighted the same? [16:56:21] <_joe_> but as just stated, I'm bolting out :) [16:56:38] <_joe_> paravoid: not all of the servers, no [16:57:07] <_joe_> paravoid: the one listed here, some are API, some are normal appservers I guess [16:57:13] jynus: well at the time it was related, although that wasn't properly explainable either (the packet loss was tiny, shouldn't have affected php->mysql all that much) [16:57:14] I would expect more http connections -> more errors, but that would be very normal [16:57:19] <_joe_> and they're all from the group with higher weight [16:57:46] <_joe_> jynus: I suppose it's a superposition of the two, actually [16:57:53] (03CR) 10Ori.livneh: [C: 032 V: 032] Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 (owner: 10Ori.livneh) [16:58:07] try counting the cross-row traffic [16:58:07] <_joe_> but well, see you on monday! 
[16:58:17] take care _joe_ [16:58:25] have a good weekend [16:58:26] what we now now is that it is not application level error on server side, connections fail at tcp level [16:58:30] ori: you really should move some of your bash aliases to a global space [16:58:33] <_joe_> you too :) [16:58:42] same, _joe_ [16:59:00] ori: some of them are awesome :) [16:59:09] yeah could do [16:59:36] netstat -s is meaningless for me, but probably because the error rate is low [17:00:12] PROBLEM - check_puppetrun on barium is CRITICAL puppet fail [17:00:12] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [17:00:13] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: Connection refused [17:00:13] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures [17:00:29] I suppose the way to go should be tcpdump and wait [17:00:43] all of the hosts in ori's histogram list are in rows C & D [17:00:55] it's more complicated than that [17:00:59] but I don't know what total set of hosts that comes from, in that sense. maybe all possible affected are there [17:01:02] you have to account on where the dbs are [17:01:34] maybe crunching all combinations of clientes and hosts [17:01:49] (03CR) 10Alexandros Kosiaris: "@Filippo, I thought about that approach too. Then I realized that both would have the exact same case check:" [puppet] - 10https://gerrit.wikimedia.org/r/214377 (owner: 10Alexandros Kosiaris) [17:02:09] godog: applied on cp1048, metrics should show up in graphite shortly [17:02:16] ori: cool! 
[17:02:43] gah [17:02:51] the metric key prefix is not being honored [17:02:54] * ori fixes [17:03:06] I will do some statistics when I have time, which probably it will not be soon :-) [17:03:39] (03PS1) 10BBlack: Revert "new facts for canonical ipv4 addr/interface" [puppet] - 10https://gerrit.wikimedia.org/r/214648 [17:03:43] (03PS2) 10BBlack: Revert "new facts for canonical ipv4 addr/interface" [puppet] - 10https://gerrit.wikimedia.org/r/214648 [17:03:49] (03CR) 10BBlack: [C: 032 V: 032] Revert "new facts for canonical ipv4 addr/interface" [puppet] - 10https://gerrit.wikimedia.org/r/214648 (owner: 10BBlack) [17:03:51] ori: which prefix? [17:04:03] godog: it should include the dc name [17:04:14] but it's not, because the systemd file is not passing the argument to the service [17:04:20] so it's defaulting to the python script's internal default [17:04:42] 'varnish.backends.XXX' [17:04:45] ah ok [17:04:51] looks good otherwise [17:05:12] PROBLEM - check_puppetrun on barium is CRITICAL puppet fail [17:05:13] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [17:05:13] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: Connection refused [17:05:13] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures [17:05:31] 17:05:29.051947 IP cp1048.eqiad.wmnet.33463 > graphite1001.eqiad.wmnet.8125: UDP, length 1397 [17:08:04] (03PS1) 10Ori.livneh: varnishstatsd: honor key prefix parameter [puppet] - 10https://gerrit.wikimedia.org/r/214650 [17:08:08] ^ godog [17:10:12] PROBLEM - check_puppetrun on barium is CRITICAL puppet fail [17:10:13] PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx [17:10:13] PROBLEM - check_payments_wiki on payments1001 is CRITICAL: Connection refused [17:10:13] PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures [17:10:47] (03CR) 10Filippo Giunchedi: [C: 031] varnishstatsd: honor key prefix 
parameter [puppet] - 10https://gerrit.wikimedia.org/r/214650 (owner: 10Ori.livneh) [17:11:02] thanks :) [17:11:06] (03PS2) 10Ori.livneh: varnishstatsd: honor key prefix parameter [puppet] - 10https://gerrit.wikimedia.org/r/214650 [17:11:15] (03CR) 10Ori.livneh: [C: 032 V: 032] varnishstatsd: honor key prefix parameter [puppet] - 10https://gerrit.wikimedia.org/r/214650 (owner: 10Ori.livneh) [17:11:57] ori: I'll wipe the old hierarchy [17:12:21] godog: thanks. don't forget graphite2001 [17:12:32] yup, that too [17:14:09] much appreciated [17:15:12] PROBLEM - check_puppetrun on barium is CRITICAL puppet fail [17:15:12] RECOVERY - check_nginx on payments1001 is OK: PROCS OK: 49 processes with command name nginx [17:15:13] RECOVERY - check_payments_wiki on payments1001 is OK: HTTP OK: HTTP/1.1 200 OK - 249 bytes in 0.029 second response time [17:15:13] RECOVERY - check_puppetrun on payments1001 is OK Puppet is currently enabled, last run 177 seconds ago with 0 failures [17:15:21] (03PS1) 10Ori.livneh: Apply ::varnish::logging::statsd on all varnishes, not just cp1048 [puppet] - 10https://gerrit.wikimedia.org/r/214651 [17:15:49] mutante: ^ seems like it may be clearing, partially at least [17:15:53] ori: /var/lib/carbon/whisper/_varnish/eqiad/backends_/ [17:16:20] godog: is that after the change or before as well? [17:16:57] robh: that was me fixing it [17:17:02] well manually [17:17:05] after, looks like [17:17:07] still have to check if it's in puppet [17:17:12] doh [17:17:21] heh [17:18:17] ori: sad_trombone.wav [17:18:51] !log fix client_max_body_size syntax error in nginx config of payments1001 [17:18:58] Logged the message, Master [17:19:39] godog: does systemd treat quoting differently than bash? [17:19:47] i think it does [17:19:54] the " are being taken literally [17:20:12] PROBLEM - check_puppetrun on barium is CRITICAL puppet fail [17:20:13] PROBLEM - check_puppetrun on samarium is CRITICAL puppet fail [17:21:02] so what data are we wanting for the db problem? 
(mw host, db server) tuples? [17:22:31] ori: yep I think that's it, quotes in ps [17:24:16] (03PS1) 10Ori.livneh: varnishstatsd: fix argument quoting in systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/214655 [17:25:11] bd808, not sure, maybe each combination of host creating the error and the ip of the mysql shown in the error message [17:25:12] PROBLEM - check_puppetrun on barium is CRITICAL puppet fail [17:25:12] PROBLEM - check_puppetrun on samarium is CRITICAL puppet fail [17:27:06] o, no I see there is an specific field for db_server [17:28:19] I'll try to use my new logstash cli tool to make something useful -- https://github.com/bd808/ggml [17:29:08] bd808: can you puppetize that? it looks useful as hell [17:29:26] yep, looks fancy [17:29:55] I was going to try and figure out how to make a golang deb but haven't gotten to it [17:30:12] PROBLEM - check_puppetrun on barium is CRITICAL puppet fail [17:30:12] PROBLEM - check_puppetrun on samarium is CRITICAL puppet fail [17:31:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnishstatsd: fix argument quoting in systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/214655 (owner: 10Ori.livneh) [17:31:43] ori: merged, looks good [17:32:19] (03PS2) 10Ori.livneh: Apply ::varnish::logging::statsd on all varnishes, not just cp1048 [puppet] - 10https://gerrit.wikimedia.org/r/214651 [17:34:39] godog: last step ^ [17:35:05] ori: I think we can also wait monday for full rollout, it isn't a lot of metrics but potentially over >10k/s (metrics, less packets of course) [17:35:12] PROBLEM - check_puppetrun on barium is CRITICAL puppet fail [17:35:12] PROBLEM - check_puppetrun on samarium is CRITICAL puppet fail [17:36:09] ^ those are an additional issue in FR that Jeff is on now [17:36:17] godog: I am impatient as usual but I am OK with waiting [17:36:54] that also gives me a chance to play with the data from cp1048 a little over the weekend and tweak the script if i need to [17:37:19] ori: hehe just to 
avoid sabotaging ourselves [17:37:44] godog: monday is april fools, tho :P [17:37:58] (03CR) 10Ori.livneh: [C: 04-1] "On hold until Monday, April 1." [puppet] - 10https://gerrit.wikimedia.org/r/214651 (owner: 10Ori.livneh) [17:38:32] that will take a while [17:38:35] it's already almost June [17:38:44] what the fuck [17:38:47] i am insane [17:38:50] ori: yeah the april's fool will be "graphite doesn't suck" [17:38:53] please /clear right now [17:39:01] and forget the last 10 lines or so ever happened [17:39:07] june 1st, not april 1st [17:39:26] haha [17:39:41] brain fart [17:39:47] (03CR) 10Ori.livneh: "That's June 1st." [puppet] - 10https://gerrit.wikimedia.org/r/214651 (owner: 10Ori.livneh) [17:40:11] or maybe he meant April 1, next year, you slacker! [17:40:12] RECOVERY - check_puppetrun on barium is OK Puppet is currently enabled, last run 152 seconds ago with 0 failures [17:40:12] RECOVERY - check_puppetrun on samarium is OK Puppet is currently enabled, last run 218 seconds ago with 0 failures [17:40:19] ori: very nice how little cpu varnishstatsd uses [17:40:39] godog: varnishapi.so does all the heavy lifting [17:41:17] hehe yeah I guess as long as the interpreter doesn't see any of that it isn't a big deal [17:41:30] hahaha ori [17:41:41] I SAID /clear DAMN IT [17:42:30] the only thing that's weird is that i would have expected other http methods to show up by now [17:42:36] oh nevermind [17:42:39] i guess it's an upload varnish [17:42:45] not too many POSTs [17:42:57] none legitimate ones [17:43:10] and PURGEs don't result in a backend connection, and hence aren't measured [17:43:35] here's the raw host -> db server pairs for the last hour -- https://phabricator.wikimedia.org/P702 [17:44:28] bd808: make that | uniq -c, convert mw* hostnames to IPs, drop the last byte on both src and dst IP [17:44:48] godog: in that case, would you mind if i applied it on one additional varnish, a text varnish? [17:44:53] ah. vlan to vlan [17:44:59] yup! 
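(Editor's aside.) The aggregation suggested above (count the pairs, then drop the last byte of both source and destination IP to get vlan-to-vlan totals) might look like the sketch below. It assumes the mw* hostnames have already been resolved to IPv4 addresses; the addresses in the sample are hypothetical and not taken from the paste.

```python
from collections import Counter
import ipaddress


def subnet24(ip):
    """Drop the last byte of an IPv4 address: 10.64.32.17 -> 10.64.32"""
    # IPv4Address() validates the input before we truncate it.
    return ".".join(str(ipaddress.IPv4Address(ip)).split(".")[:3])


def vlan_pair_counts(pairs):
    """Count (client subnet, db subnet) combinations, like `sort | uniq -c`."""
    return Counter((subnet24(src), subnet24(dst)) for src, dst in pairs)


# Hypothetical (mw host IP, db server IP) pairs, one per logged error.
pairs = [
    ("10.64.32.17", "10.64.32.25"),
    ("10.64.32.18", "10.64.32.25"),
    ("10.64.32.17", "10.64.16.102"),
    ("10.64.48.40", "10.64.16.102"),
]

for (src, dst), count in vlan_pair_counts(pairs).most_common():
    print(f"{count:5d} {src} -> {dst}")
```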
[17:45:13] I think I can do that [17:45:16] ori: WFM [17:46:30] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1321314 (10chasemp) 5Open>3declined We decided to create a public dump of "safe" data instead http://dumps.wikimedia.org/other/misc/phabricator_public.dump [17:46:47] 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1321318 (10chasemp) [17:47:52] cp1048 has a varnishkafka alert [17:47:56] WARNING: 11.11% of data above the warning threshold [0.0] [17:47:57] (03PS1) 10Ori.livneh: Apply varnishstatsd on cp1066 (text varnish) as well [puppet] - 10https://gerrit.wikimedia.org/r/214657 [17:48:17] paravoid: hasn't that been flapping for some days now? [17:48:26] s/days/months/ :) [17:49:33] ganglia is super slow for some reason [17:51:31] paravoid: https://phabricator.wikimedia.org/P703 [17:52:40] (03CR) 10Ori.livneh: [C: 032] Apply varnishstatsd on cp1066 (text varnish) as well [puppet] - 10https://gerrit.wikimedia.org/r/214657 (owner: 10Ori.livneh) [17:54:33] 6operations, 7HHVM, 5Patch-For-Review: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1321350 (10bd808) Maybe we have some cross vlan communication issues: {P703} [17:57:37] https://phabricator.wikimedia.org/P704 [17:57:46] bd808: the servers aren't balanced per row (neither mw* nor db*) so you should account for that as well :/ [17:58:22] hmmm... count of uniq ips per subnet is needed then? [17:58:34] ugly [17:58:58] 450 for same 10.64.32 -> 10.64.32, in any case [18:00:31] is there a better way to count the hosts than trying to grok the regex patterns in site.pp? 
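(Editor's aside.) Because neither the mw* nor the db* hosts are spread evenly across rows, raw vlan-to-vlan counts need normalising before they are compared. One possible sketch divides each count by the number of possible client/server host pairs for that subnet pair. The host counts and the second error count below are hypothetical; only the 450 same-subnet figure comes from the channel.

```python
def errors_per_host_pair(pair_errors, mw_hosts, db_hosts):
    """Divide each subnet-pair error count by the number of possible host pairs."""
    rates = {}
    for (src, dst), errors in pair_errors.items():
        possible = mw_hosts[src] * db_hosts[dst]  # every client can reach every db
        rates[(src, dst)] = errors / possible
    return rates


# Error counts per (client subnet, db subnet); the second entry is hypothetical.
pair_errors = {
    ("10.64.32", "10.64.32"): 450,
    ("10.64.32", "10.64.16"): 300,
}
# Hypothetical host counts per subnet.
mw_hosts = {"10.64.32": 50}
db_hosts = {"10.64.32": 9, "10.64.16": 12}

for pair, rate in errors_per_host_pair(pair_errors, mw_hosts, db_hosts).items():
    print(pair, f"{rate:.2f} errors per host pair")
```

With these made-up host counts the same-subnet pair comes out with the higher per-pair rate, but the real ranking depends entirely on the actual host counts per subnet.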
[18:02:43] but I get 2323 combinations out of 4994 possible [18:02:51] bd808, salt, did it before [18:03:29] but not all db* hosts are production hosts [18:11:15] 6operations, 6Commons, 10MediaWiki-Database, 6Multimedia, 7Wikimedia-log-errors: internal_api_error_DBQueryError: Database query error while (mass) deleting file over api - https://phabricator.wikimedia.org/T98706#1321408 (10Umherirrender) [18:11:51] which role installs the apache on the puppetmaster? [18:12:08] not puppetmaster::frontend? [18:13:02] I have a great correlation [18:13:08] using tendril [18:13:30] top db offenders are also top QPS [18:14:51] the opposite is also true [18:16:00] mutante: puppetmaster::passenger looks like it installs apache [18:17:34] (03PS1) 10RobH: blog.wikimedia.org sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214659 [18:17:49] JohnFLewis: :) ah, tx [18:18:13] jynus: cool! not sure I got the last part, what's opposite? [18:18:37] low qps -> low error rate? [18:19:56] less QPS -> less errors [18:20:00] (03CR) 10RobH: [C: 032] blog.wikimedia.org sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214659 (owner: 10RobH) [18:20:09] JohnFLewis: alright, it does, and the site config is there, cool, just that the relevant config is in apache2.conf not sites-enabled/foo [18:20:09] fewer if you will [18:24:16] (03PS1) 10Ori.livneh: varnishstatsd: coerce BackendXID to unsigned [puppet] - 10https://gerrit.wikimedia.org/r/214660 [18:24:22] paravoid: ^ this may amuse you [18:24:46] jynus: well that makes sense doesn't it? 
[18:25:28] the ratio makes more sense as a metric [18:25:36] if you're looking from a db perspective [18:25:39] (03PS2) 10Ori.livneh: varnishstatsd: coerce BackendXID to unsigned [puppet] - 10https://gerrit.wikimedia.org/r/214660 [18:25:57] (03CR) 10Ori.livneh: [C: 032 V: 032] varnishstatsd: coerce BackendXID to unsigned [puppet] - 10https://gerrit.wikimedia.org/r/214660 (owner: 10Ori.livneh) [18:27:03] paravoid, agreed [18:27:06] ottomata: ^ be aware, in case this bites you [18:27:10] (see commit message) [18:27:20] ori: wth really [18:27:33] "Work around that by coercing BackendXID to a signed integer." -- s/signed/unsigned/? [18:27:43] right [18:28:31] (03CR) 10Ori.livneh: "(unsigned, that is)" [puppet] - 10https://gerrit.wikimedia.org/r/214660 (owner: 10Ori.livneh) [18:28:47] problem with ratio is that we do not have real time data [18:30:04] PROBLEM - puppet last run on db1059 is CRITICAL Puppet has 1 failures [18:31:20] ^not really [18:31:44] RECOVERY - puppet last run on db1059 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [18:32:59] 6operations, 7network: asw2-a5-eqiad.mgmt.eqiad.wmnet xe-0/0/36 reporting errors - https://phabricator.wikimedia.org/T100820#1321485 (10fgiunchedi) [18:37:37] (03PS1) 10RobH: wikitech.wikimeida.org certificate sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214666 (https://phabricator.wikimedia.org/T92709) [18:38:21] 6operations, 10wikitech.wikimedia.org, 7HTTPS, 5Patch-For-Review: wikitech.wikimedia.org SSL certificate considered "outdated security" in Chrome - https://phabricator.wikimedia.org/T92709#1321521 (10RobH) once the above patchset is merged live and wikitech is using the sha256, please assign this task to m... 
[18:38:24] 6operations, 6Labs, 7database: Santitize recent wikis: wikimania 2016 and cn.wikimedia.org at labs dbs - https://phabricator.wikimedia.org/T100441#1321522 (10Krenair) [18:39:02] PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail [18:39:30] (03CR) 10RobH: [C: 04-1] "still needs the config changes for wikitech done in the SAME patchset or it will break things." [puppet] - 10https://gerrit.wikimedia.org/r/214666 (https://phabricator.wikimedia.org/T92709) (owner: 10RobH) [18:44:07] jynus: dbstore1002 error [18:44:15] bblack: plenty of ocsp warnings [18:44:30] jynus: error = replag [18:44:56] 6operations: meta ticket - open up RT permissions on domains queue - https://phabricator.wikimedia.org/T83378#1321559 (10Dzahn) [18:45:05] 6operations, 10Wikimedia-Mailing-lists: mailman emails taking long time for delivery, getting stuck in sodium - https://phabricator.wikimedia.org/T61731#1321562 (10chasemp) 5Open>3declined a:3chasemp Since this is nonspecific and a few months since anyone was bitten enough to update I am closing for now.... 
[18:45:10] 6operations: meta ticket - open up RT permissions on domains queue - https://phabricator.wikimedia.org/T83378#1321565 (10Dzahn) 5Invalid>3Resolved [18:46:55] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321574 (10RobH) [18:47:25] 6operations, 7database: Drop database table "optin_survey" from Wikimedia wikis - https://phabricator.wikimedia.org/T54934#1321576 (10chasemp) [18:47:32] 6operations, 7database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1321578 (10chasemp) [18:47:43] 6operations, 7database: Drop database table "namespaces" from Wikimedia wikis - https://phabricator.wikimedia.org/T54929#1321580 (10chasemp) [18:47:55] 6operations, 7database: Drop database table "hashs" from Wikimedia wikis - https://phabricator.wikimedia.org/T54927#1321582 (10chasemp) [18:48:30] 6operations, 7HTTPS: replace ganglia's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100825#1321584 (10RobH) 3NEW [18:49:33] (03PS1) 10RobH: ganglia's sha1 cert to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214670 (https://phabricator.wikimedia.org/T100825) [18:49:35] 6operations, 6Labs, 7Tracking, 7database: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1321591 (10chasemp) [18:49:52] (03CR) 10RobH: [C: 04-1] "please note this initial patchset is ONLY the certificate change, NOT" [puppet] - 10https://gerrit.wikimedia.org/r/214670 (https://phabricator.wikimedia.org/T100825) (owner: 10RobH) [18:51:24] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321608 (10RobH) [18:57:34] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:59:11] 6operations, 7HTTPS, 5Patch-For-Review: replace ganglia's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100825#1321619 (10RobH) [18:59:19] 6operations, 7HTTPS, 
5Patch-For-Review: replace ganglia's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100825#1321584 (10RobH) [19:01:27] 6operations, 7HTTPS: replace git's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100827#1321624 (10RobH) 3NEW a:3RobH [19:05:02] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail [19:06:02] (03PS1) 10RobH: git.wikimedia.org.crt sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214673 (https://phabricator.wikimedia.org/T100827) [19:07:36] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321642 (10RobH) [19:13:05] 6operations, 7HTTPS: replace icinga's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100830#1321659 (10RobH) 3NEW [19:13:35] (03PS1) 10RobH: icinga.wikimedia.org cert sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214674 (https://phabricator.wikimedia.org/T100830) [19:14:47] (03CR) 10RobH: [C: 04-1] "DO NOT MERGE THIS PATCHSET without updating it with the associated" [puppet] - 10https://gerrit.wikimedia.org/r/214674 (https://phabricator.wikimedia.org/T100830) (owner: 10RobH) [19:16:47] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321670 (10RobH) [19:17:23] paravoid: possibly our webproxy? 
[19:18:11] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#753287 (10RobH) [19:22:23] paravoid: no, it's cloudflare/globalsign [19:22:31] it's not complete, just higher failure rate than normal [19:22:40] 6operations, 7HTTPS: replace librenms's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100831#1321685 (10RobH) 3NEW [19:22:59] (03PS1) 10RobH: replace librenms's sha1 cert with sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214676 (https://phabricator.wikimedia.org/T100831) [19:23:30] (03CR) 10RobH: [C: 04-1] "DO NOT MERGE UNLESS YOU APPEND CONFIGURATION CHANGES" [puppet] - 10https://gerrit.wikimedia.org/r/214676 (https://phabricator.wikimedia.org/T100831) (owner: 10RobH) [19:23:43] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:24:45] the OCSP warnings at are the 4h-old mark, CRIT at the 8h-old mark, and then in the common case the data becomes too old at the 12h mark for actual client usage [19:24:56] (and it's retrying once every 2 hours) [19:28:13] 6operations, 7HTTPS: replace lists.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100832#1321709 (10RobH) 3NEW [19:28:33] (03PS1) 10RobH: lists.wikimedia.org certificate sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214680 (https://phabricator.wikimedia.org/T100832) [19:29:31] (03CR) 10RobH: [C: 04-1] "DO NOT MERGE until the configuration changes are appended to this" [puppet] - 10https://gerrit.wikimedia.org/r/214680 (https://phabricator.wikimedia.org/T100832) (owner: 10RobH) [19:29:54] robh: may be a good thing to attack on Tuesday? 
[19:30:10] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321718 (10RobH) [19:30:15] JohnFLewis: thats my plan yep [19:30:23] since i'll already have the window [19:31:01] 6operations, 7HTTPS, 5Patch-For-Review: replace lists.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100832#1321709 (10RobH) [19:31:03] 6operations, 3Roadmap, 7notice: Mailing list maintenance window - 2015-06-02 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T100711#1321724 (10RobH) [19:34:41] (03PS2) 10Dzahn: wikitech.wikimeida.org certificate sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214666 (https://phabricator.wikimedia.org/T92709) (owner: 10RobH) [19:35:16] (03PS3) 10Dzahn: wikitech.wikimedia.org certificate sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214666 (https://phabricator.wikimedia.org/T92709) (owner: 10RobH) [19:36:05] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321732 (10RobH) [19:37:34] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#777907 (10RobH) [19:38:52] (03PS1) 10BBlack: double OCSP fetch rate to cope with upstream error rate [puppet] - 10https://gerrit.wikimedia.org/r/214685 [19:39:56] (03CR) 10BBlack: [C: 032] double OCSP fetch rate to cope with upstream error rate [puppet] - 10https://gerrit.wikimedia.org/r/214685 (owner: 10BBlack) [19:43:06] 6operations, 7HTTPS: replace tendril.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100835#1321758 (10RobH) 3NEW [19:43:25] (03PS1) 10RobH: replace tendril.wikimedia.org's sha1 cert with sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214692 (https://phabricator.wikimedia.org/T100835) [19:43:45] (03CR) 10RobH: [C: 04-1] "DO NOT MERGE THIS PATCHSET until it has been appended with the" [puppet] - 10https://gerrit.wikimedia.org/r/214692 
(https://phabricator.wikimedia.org/T100835) (owner: 10RobH) [19:44:21] 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321770 (10RobH) [19:50:30] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6WMF-NDA: On labs, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1321787 (10hashar) 3NEW [19:50:30] (03PS14) 10BBlack: sslcert: generate chained certs automatically [puppet] - 10https://gerrit.wikimedia.org/r/197341 (owner: 10Faidon Liambotis) [19:52:08] (03PS2) 10RobH: replace tendril.wikimedia.org's sha1 cert with sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214692 (https://phabricator.wikimedia.org/T100835) [19:52:19] (03PS1) 10John F. Lewis: pywikipedia->pywikibot in mailman [puppet] - 10https://gerrit.wikimedia.org/r/214694 (https://phabricator.wikimedia.org/T100707) [19:58:38] (03PS2) 10Dzahn: icinga.wikimedia.org cert sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214674 (https://phabricator.wikimedia.org/T100830) (owner: 10RobH) [20:04:33] 6operations: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1321824 (10Dzahn) [20:08:23] 6operations, 7HTTPS: Ganglia server doesn't send intermediary certificates - https://phabricator.wikimedia.org/T72326#1321832 (10Dzahn) this should be fixed once we merge https://gerrit.wikimedia.org/r/197341 [20:08:40] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6WMF-NDA: On labs, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1321833 (10BBlack) Note for tool labs, there's already some motd stuff in puppet like this: https://github.com/wikimedia/o... 
[20:14:43] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [20:15:03] PROBLEM - puppet last run on labstore1001 is CRITICAL Puppet last ran 5 hours ago [20:15:54] (03PS1) 10Faidon Liambotis: Remove ssh::hostkeys-collect from mira [puppet] - 10https://gerrit.wikimedia.org/r/214699 [20:15:57] (03CR) 10Faidon Liambotis: [C: 032] Remove ssh::hostkeys-collect from mira [puppet] - 10https://gerrit.wikimedia.org/r/214699 (owner: 10Faidon Liambotis) [20:16:06] (03CR) 10Faidon Liambotis: [V: 032] Remove ssh::hostkeys-collect from mira [puppet] - 10https://gerrit.wikimedia.org/r/214699 (owner: 10Faidon Liambotis) [20:16:43] RECOVERY - puppet last run on labstore1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:18:54] 6operations, 10Traffic, 7HTTPS: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#1321852 (10Dzahn) p:5Normal>3High [20:20:33] PROBLEM - puppet last run on mw2211 is CRITICAL puppet fail [20:20:55] the spike of 500 are database related [20:20:55] my english is crap [20:21:07] 6operations, 6Labs, 7database: Santitize recent wikis: wikimania 2016 and cn.wikimedia.org at labs dbs - https://phabricator.wikimedia.org/T100441#1321864 (10scfc) @jcrespo: I don't know if you mean that, but the views in `enwiki_p` & Co. visible from Labs are maintained with `maintain-replicas/maintain-repl... 
[20:21:10] (03PS1) 10Faidon Liambotis: strongswan: remove temporary hack for sysvinit [puppet] - 10https://gerrit.wikimedia.org/r/214701 [20:22:35] bblack, jgage ^ [20:23:01] 6operations, 3Roadmap, 7notice, 7user-notice: Mailing list maintenance window - 2015-06-02 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T100711#1318965 (10gpaumier) [20:24:48] (03CR) 10Gage: [C: 031] strongswan: remove temporary hack for sysvinit [puppet] - 10https://gerrit.wikimedia.org/r/214701 (owner: 10Faidon Liambotis) [20:25:05] jgage: are we on 5.2.1-6 or later? [20:25:12] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [20:27:02] 10Ops-Access-Requests, 6operations, 6Release-Engineering, 5Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1321934 (10hashar) p:5High>3Normal Bringing back to normal priority. No clue why I bumped it :-] Task is stalled pending per RobH just above. [20:28:40] yeah we are apparently [20:28:51] (03CR) 10BBlack: [C: 031] strongswan: remove temporary hack for sysvinit [puppet] - 10https://gerrit.wikimedia.org/r/214701 (owner: 10Faidon Liambotis) [20:29:30] dbstore1002 [20:29:30] MariaDB Slave Lag: s7 [20:29:32] CRITICAL 2015-05-29 20:27:44 0d 19h 44m 2s 3/3 CRITICAL slave_sql_lag Seconds_Behind_Master: 210359 [20:29:36] ^ ?? 
[20:30:19] we had a bunch of mysql can't connect all day but I assumed it to be part of the regular spam [20:37:22] (03PS1) 10BBlack: allow robots to use RL on domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214705 [20:39:03] (03CR) 10Ori.livneh: [C: 032] allow robots to use RL on domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214705 (owner: 10BBlack) [20:39:09] (03PS1) 10Tim Landscheidt: Tools: Add database alias for wikimania2016wiki [puppet] - 10https://gerrit.wikimedia.org/r/214718 (https://phabricator.wikimedia.org/T96638) [20:39:22] RECOVERY - puppet last run on mw2211 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:40:26] (03Merged) 10jenkins-bot: allow robots to use RL on domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214705 (owner: 10BBlack) [20:42:56] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6WMF-NDA: On labs, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1321967 (10scfc) [20:43:00] (03CR) 10Faidon Liambotis: [C: 032] strongswan: remove temporary hack for sysvinit [puppet] - 10https://gerrit.wikimedia.org/r/214701 (owner: 10Faidon Liambotis) [20:43:04] !log ori Synchronized robots.txt: I7b321b62d: allow robots to use RL on domains (duration: 00m 14s) [20:43:08] Logged the message, Master [20:43:53] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6WMF-NDA: On labs, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1321974 (10yuvipanda) Title should probably refer to beta cluster. 
[20:46:14] international house of pain [20:46:20] and mischan [20:52:19] jgage: jump around [20:54:34] nice [20:57:16] (03PS1) 10Ori.livneh: Log a 20s sample of memcached usage to a file once a day [puppet] - 10https://gerrit.wikimedia.org/r/214762 [20:57:33] paravoid: (if you're still around) ^ [20:57:56] (03CR) 10jenkins-bot: [V: 04-1] Log a 20s sample of memcached usage to a file once a day [puppet] - 10https://gerrit.wikimedia.org/r/214762 (owner: 10Ori.livneh) [20:59:00] (03PS2) 10Ori.livneh: Log a 20s sample of memcached usage to a file once a day [puppet] - 10https://gerrit.wikimedia.org/r/214762 [21:03:32] (03PS3) 10Ori.livneh: Log a 20s sample of memcached usage to a file once a day [puppet] - 10https://gerrit.wikimedia.org/r/214762 [21:05:03] (03PS1) 10Odder: Add another IP address for Santiago edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214765 (https://phabricator.wikimedia.org/T100051) [21:05:09] (03PS1) 10Yuvipanda: tools: Stop bigbrother attempting to look at webservice jobs [puppet] - 10https://gerrit.wikimedia.org/r/214766 [21:07:36] (03CR) 10BBlack: "Won't the webservice lines fall down to the final else for unsupported command error now? or are they gone from the input?" [puppet] - 10https://gerrit.wikimedia.org/r/214766 (owner: 10Yuvipanda) [21:08:16] (03CR) 10Yuvipanda: "bah, they will. 
Good catch :)" [puppet] - 10https://gerrit.wikimedia.org/r/214766 (owner: 10Yuvipanda) [21:11:58] (03PS2) 10Yuvipanda: tools: Stop bigbrother attempting to look at webservice jobs [puppet] - 10https://gerrit.wikimedia.org/r/214766 [21:12:04] bblack: thanks :) [21:12:11] bblack: ^ should fix them maybe [21:14:24] (03CR) 10BBlack: [C: 031] tools: Stop bigbrother attempting to look at webservice jobs [puppet] - 10https://gerrit.wikimedia.org/r/214766 (owner: 10Yuvipanda) [21:20:15] (03CR) 10Dzahn: [C: 032] add "docs" as alias for "doc" in wm.org/mw.org [dns] - 10https://gerrit.wikimedia.org/r/214416 (https://phabricator.wikimedia.org/T100349) (owner: 10Dzahn) [21:21:02] (03CR) 10Ori.livneh: [C: 032] Add another IP address for Santiago edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214765 (https://phabricator.wikimedia.org/T100051) (owner: 10Odder) [21:21:08] (03Merged) 10jenkins-bot: Add another IP address for Santiago edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214765 (https://phabricator.wikimedia.org/T100051) (owner: 10Odder) [21:21:45] (03CR) 10Dzahn: [C: 032] redirect "docs" to doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/214418 (https://phabricator.wikimedia.org/T100349) (owner: 10Dzahn) [21:21:47] !log ori Synchronized wmf-config/throttle.php: Ife45684c5: Add another IP address for Santiago edit-a-thon (duration: 00m 13s) [21:21:50] Logged the message, Master [21:31:32] 6operations, 6Phabricator, 7database: Add Story points (from Sprint Extension) - https://phabricator.wikimedia.org/T100846#1322108 (10JAufrecht) 3NEW a:3chasemp [21:36:23] PROBLEM - puppet last run on ms-be1018 is CRITICAL Puppet has 1 failures [21:38:00] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1322133 (10Dzahn) a:3Dzahn [21:38:24] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - 
https://phabricator.wikimedia.org/T85141#939806 (10Dzahn) done ! [21:38:31] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1322135 (10Dzahn) done ! http://dumps.wikimedia.org/other/bugzilla/ [21:38:46] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Bugzilla HTML static version and database dump - https://phabricator.wikimedia.org/T1198#1322143 (10Dzahn) [21:39:18] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322146 (10Dzahn) [21:39:22] 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Bugzilla HTML static version and database dump - https://phabricator.wikimedia.org/T1198#1322144 (10Dzahn) 5Open>3Resolved both blocker tasks are resolved, so this should be resolved too [21:39:34] (03PS1) 10John F. Lewis: add bugzilla to dumps.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/214770 [21:39:40] mutante: ^^ [21:39:56] puppet overrode what was on index.html :) [21:40:37] oh, the index page, very true. thanks [21:40:52] needs a
,, sec [21:41:57] (03PS2) 10Dzahn: add bugzilla to dumps.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/214770 (owner: 10John F. Lewis) [21:42:31] (03CR) 10Dzahn: [C: 032] add bugzilla to dumps.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/214770 (owner: 10John F. Lewis) [21:44:36] 6operations, 6Phabricator, 7database: Add Story points (from Sprint Extension) - https://phabricator.wikimedia.org/T100846#1322165 (10chasemp) @mmodell do you know where story points are stored? [21:45:38] (03PS1) 10Odder: Sysops to add users to import group on muiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) [21:47:19] (03PS2) 10Odder: Sysops to add users to import group on maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) [21:47:44] (03PS3) 10Alex Monk: Sysops to add users to import group on muiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) (owner: 10Odder) [21:48:56] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322184 (10Dzahn) [21:49:48] (03PS4) 10Odder: Sysops to add users to import group on maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) [21:50:29] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1182562 (10Dzahn) T95266 - i don't know if anyone actually still wants to mail users who have not migrated so i did not close it as rejected, but i did remove it... 
[21:50:59] 6operations, 10Wikimedia-Bugzilla: analyze Bugzilla access logs - https://phabricator.wikimedia.org/T86859#1322191 (10Dzahn) 5Open>3Resolved a:3Dzahn [21:51:05] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322193 (10Dzahn) [21:51:24] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [21:55:42] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322202 (10Dzahn) [21:58:16] 6operations, 3Roadmap, 7notice, 7user-notice: Mailing list maintenance window - 2015-06-02 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T100711#1322204 (10RobH) [21:59:27] 6operations, 7HTTPS, 5Patch-For-Review: replace lists.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100832#1322209 (10RobH) a:3RobH I'll be handling the implementation of this during my mailing list maintenance window scheduled on T100711 [21:59:40] 6operations, 7HTTPS: replace lists.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100832#1322213 (10RobH) [22:00:34] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322218 (10Dzahn) [22:02:27] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322229 (10Dzahn) T95267 - removed as a blocker because the dump exists now which makes it possible to build one without needing old-bz [22:02:45] 6operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1322231 (10RobH) So then it seems this is an ideal use case for a public IP based ganeti VM, correct? 
(If so, we can create a request ticket per the instructions on: https://wikitech.wikimedia.org/wiki/Operations_requests#Virtual... [22:03:57] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322232 (10Dzahn) a:3Dzahn [22:04:53] 6operations, 7HTTPS: replace lists.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100832#1322233 (10RobH) p:5High>3Normal [22:05:06] 6operations, 5Patch-For-Review: Alias docs.wikimedia.org to doc.wikimedia.org - https://phabricator.wikimedia.org/T100349#1322234 (10Dzahn) resolved. these links are now redirects: https://docs.wikimedia.org/ https://docs.mediawiki.org/ [22:05:18] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1322235 (10RobH) p:5High>3Normal [22:05:22] 6operations, 5Patch-For-Review: Alias docs.wikimedia.org to doc.wikimedia.org - https://phabricator.wikimedia.org/T100349#1322236 (10Dzahn) 5Open>3Resolved [22:09:46] (03PS1) 10Ori.livneh: people.wikimedia.org: HTTPS only [puppet] - 10https://gerrit.wikimedia.org/r/214773 [22:09:51] ^ mutante [22:10:52] PROBLEM - puppet last run on mw2066 is CRITICAL Puppet has 1 failures [22:11:32] ori: technically totally +1, but.. i kind of vaguely remember there was a reason for that [22:12:00] hmm.. there was a ticket that was about switching them all [22:16:38] I'm going to do an elasticsearch upgrade on the logstash cluster. there will probably be some icinga alerts as a result [22:20:51] bd808: Schedule a downtime? :P [22:20:54] i'll do it [22:21:11] 1001-1003 are done [22:21:21] the next 3 will probably take longer [22:21:28] hm ok, no alerts so far :) [22:21:39] but i'll schedule it anyway [22:21:46] the first 3 were client only nodes [22:21:51] ah yeah [22:21:57] how many hours shall i set it for? 
[22:22:10] I *hope* only about 2 [22:22:26] otherwise it will cut into my drinking time [22:22:28] k [22:23:04] Elasticsearch. So slow to upgrade it makes you healthier! [22:23:21] Oh I'll just stay up later ;) [22:23:33] that scotch isn't going to drink itself [22:23:55] ok, downtime scheduled [22:24:51] i set it for 3 hours because sysadmin [22:25:08] I can't wait for 1.6.x where we can mark the indexes as "sealed" and avoid this dumb resync business [22:25:26] oh that sounds nice [22:25:44] 6operations, 10Deployment-Systems, 10wikitech.wikimedia.org: Merge as many configuration hacks in wikitech.php configuration file as possible into InitialiseSettings.php - https://phabricator.wikimedia.org/T75939#1322286 (10Krenair) [22:25:52] (03CR) 10Dzahn: [C: 031] "technically correct. i thought we had some reason not do it with this service from the beginning, but i can't remember or find it, so let'" [puppet] - 10https://gerrit.wikimedia.org/r/214773 (owner: 10Ori.livneh) [22:26:04] ori: +1 , i can't remember why not, so yes [22:26:34] Question for maybe-ops -- how do I get permission to edit this page? https://meta.wikimedia.org/wiki/Www.wikipedia.org_template [22:26:50] Re: https://phabricator.wikimedia.org/T100673 [22:27:21] jgage: https://github.com/elastic/elasticsearch/issues/10032 [22:27:53] RECOVERY - puppet last run on mw2066 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [22:28:05] earldouglas: https://meta.wikimedia.org/wiki/Meta:Requests_for_adminship [22:28:30] thumbs-up.jpg [22:28:33] bd808: neat! [22:28:37] earldouglas, should be possible to do from your staff account? [22:28:57] I only see meta admin edits there [22:29:06] 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#1322290 (10Krenair) [22:29:15] MaxSem: nope [22:29:24] you could also copy it, edit and ask an admin to publish [22:29:25] err, I blame Philippe! 
:P [22:29:40] bd808: I prefer that to becoming admin myself. [22:29:40] earldouglas, does your staff account have the uber-staff-flag-thing? [22:29:44] no, it's not a good idea to give staff flag to all WMF users [22:29:57] createAndPromote.php ;) [22:29:57] +1 I don't really want admin. :] [22:30:03] if not and you need to perform admin actions, see philippe/CA [22:30:19] earldouglas, also, bits is about to die so EL might suddenly stop woring [22:30:23] ..or ask to be admin on meta, why not [22:30:32] * earldouglas shrugs [22:30:46] bits won't die [22:30:54] uh? [22:30:55] is there are known problem with connection to wikimedia / esams right know? [22:30:57] it'll get folded into the text cache role [22:31:01] earldouglas: i would say depends how often you have to edit it [22:31:11] but old links have to not totally break for the foreseeable [22:31:13] *future [22:31:29] yep, but the host itself is going down sometimes, ori? [22:31:31] so we won't be actively using it to serve stuff but it'll still be supported [22:31:51] mutante: ideally, exactly one time. [22:31:53] the bits varnishes will get decommissioned and repurposed, but the bits hostname will by then be served by the text varnishes [22:32:14] se4598, I can ping hooft okay? [22:32:34] se4598: what are you seeing? [22:33:16] can't connect (tracert: 17 * * 72 ms text-lb.esams.wikimedia.org [91.198.174.192]) [22:33:24] earldouglas: then i would go with the "copy, edit offline, send to philippe/ any admin" [22:33:47] and the sudden drop here makes me think there is something: https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=LVS+loadbalancers+esams&m=cpu_report&s=by+name&mc=2&g=network_report [22:34:01] yes, that looks suspicious [22:34:03] bblack: [22:34:07] mutante: Roger, thanks. [22:34:11] se4598: can you run mtr? 
[22:34:25] ( https://www.bitwizard.nl/mtr/ ) [22:35:59] normal tracert: http://pastebin.com/EELaLwJU [22:36:11] can be external/ some carrier thing [22:37:15] ok, works for me again, and the ganglia graph also shows pre-drop activity :) [22:37:52] sounds like a peering session flapped [22:39:09] hm no mail from librenms about a flap, but it did mail about some port saturations 9 minutes ago [22:40:59] the node missing from my trace while I had this issue vs now is: wikimedia-ic-129908-adm-b3.c.telia.net [213.248.93.86] [22:47:10] 6operations, 10Wikimedia-General-or-Unknown, 7Documentation: Add a wiki on wikitech is out of date, incomplete - https://phabricator.wikimedia.org/T87588#1322345 (10Krenair) >>! In T87588#1282471, @demon wrote: > It's still wrong as it tells you to add wikis to dblists prior to running addwiki. You can't do... [22:52:21] 6operations, 7HHVM, 5Patch-For-Review: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1322376 (10bd808) Changing the HHVM default connection timeout doesn't seem to have had a measurable effect of the error rate. @faidon said on irc that he remembered seeing simil... [22:53:48] jgage: the cluster recovery is going really slow :/ At this rate the 3 hours will probably only get us through the upgrade of 1004 [22:54:19] ok, how about i just set it for 12 hours or so [22:54:31] works for me [22:54:36] or 24? 
doesn't matter much [22:54:43] bd808, whatever - just start drinking by the keyboard:P [22:54:59] if you haven't already [22:55:09] jgage, 12h means it might awaken somebody tonight [22:55:22] good point [22:55:23] it's 5 minutes till beer o'clock in my timezone [22:55:45] yeah set it for 24 hrs if you're cool with that [22:56:12] a yellow logstash cluster is no big deal and nobody should get bothered about it on the weekend [22:56:39] operations, WMF-Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#1322392 (Dzahn) [22:56:56] ok, updated. and i'll keep an eye on it. [22:58:38] (CR) Dereckson: [C: 1] Sysops to add users to import group on maiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) (owner: Odder) [23:03:30] operations, WMF-Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#1322396 (Dzahn) done. -optoutresearch: aripstra, dchen, tbeasley +# Opt out Research RT7871, T86551, T100860 +optoutresearch: aripstra, dchen, dkrysiak [23:04:16] operations, WMF-Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#1322398 (Dzahn) Open>Resolved [23:48:51] (PS11) Alex Monk: Create Wikipedia Konkani [mediawiki-config] - https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: Dzahn) [23:50:26] (CR) Alex Monk: "(Let's leave wikiversions.json to the merger to update, it changes too often)" [mediawiki-config] - https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: Dzahn) [23:53:24] (PS1) Alex Monk: Optimise project logos added since I8c9a6a56 [mediawiki-config] - https://gerrit.wikimedia.org/r/214783 [23:55:30] (CR) Ori.livneh: [C: 2] Optimise project logos added since I8c9a6a56 [mediawiki-config] - https://gerrit.wikimedia.org/r/214783 
(owner: Alex Monk) [23:55:36] (Merged) jenkins-bot: Optimise project logos added since I8c9a6a56 [mediawiki-config] - https://gerrit.wikimedia.org/r/214783 (owner: Alex Monk) [23:56:27] !log ori Synchronized w/static/images/project-logos: Ic62747f37: Optimise project logos added since I8c9a6a56 (duration: 00m 13s) [23:56:31] Logged the message, Master [23:58:30] (PS4) Ori.livneh: hhvm: add memory leak isolation scripts [puppet] - https://gerrit.wikimedia.org/r/212187 [23:58:41] (CR) Ori.livneh: [C: 2 V: 2] hhvm: add memory leak isolation scripts [puppet] - https://gerrit.wikimedia.org/r/212187 (owner: Ori.livneh)