[00:06:52] <logmsgbot>	 !log ori Synchronized php-1.26wmf7/extensions/Echo/includes/DiffParser.php: 41d27c4a26: Update Echo for cherry-picks (duration: 00m 13s)
[00:06:56] <morebots>	 Logged the message, Master
[00:07:07] <logmsgbot>	 !log ori Synchronized php-1.26wmf7/includes/diff/UnifiedDiffFormatter.php: d95cac90c7: Make the output of UnifiedDiffFormatter match diff -u (duration: 00m 14s)
[00:07:11] <morebots>	 Logged the message, Master
[00:33:15] <wikibugs>	 operations: raid1-lvm recipe broken for jessie, sets up available LVM space as swap - https://phabricator.wikimedia.org/T100636#1319422 (Dzahn) achernar: 951 acamar:    951 baham:   0 cobalt:  n/a  - doesn't exist except mgmt lead:  951 lithium: 951 polonium: 951 rhodium: 951 argon: 951 bast4001: 7628 copper:...
[00:39:31] <wikibugs>	 operations: raid1-lvm recipe broken for jessie, sets up available LVM space as swap - https://phabricator.wikimedia.org/T100636#1319432 (Dzahn) It's a bit strange. For example i installed "subra" and "suhail" in codfw, so i know it's not that long ago and they are ok.   A git log on the "raid1-lvm.cfg" shows...
[00:46:49] <grrrit-wm>	 (CR) BryanDavis: "Fix by _joe_ in I45ddfd4c0ec63b4feeae19d9a42f7a870f34d451. Followup for non-canary servers in I526107099be5c9b9093110b94a7c3ec9856fdb3c." [puppet] - https://gerrit.wikimedia.org/r/211155 (https://phabricator.wikimedia.org/T98489) (owner: BryanDavis)
[00:50:57] <wikibugs>	 operations, Phabricator, Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1319454 (csteipp) After Dzahn dropped the bugs_deleted table, I think this looks ok now.  There are still a lot of full emails around, but I think those are all from places where th...
[00:56:39] <grrrit-wm>	 (CR) Dzahn: [C: 2] access: add dbrant to researchers [puppet] - https://gerrit.wikimedia.org/r/213970 (https://phabricator.wikimedia.org/T99798) (owner: Dzahn)
[01:00:11] <wikibugs>	 Ops-Access-Requests, operations, Patch-For-Review: Requesting addition to researchers group on stat1003 - https://phabricator.wikimedia.org/T99798#1319459 (Dzahn) alright, thanks. has approval and the waiting period is over.  merged. ran puppet on stat1003.  --  [stat1003:~] $ id dbrant uid=4910(dbrant...
[01:00:26] <wikibugs>	 Ops-Access-Requests, operations, Patch-For-Review: Requesting addition to researchers group on stat1003 - https://phabricator.wikimedia.org/T99798#1319460 (Dzahn) Open>Resolved
[01:05:13] <wikibugs>	 operations, Patch-For-Review: Alias docs.wikimedia.org to doc.wikimedia.org - https://phabricator.wikimedia.org/T100349#1319479 (Dzahn) p:Triage>Low
[02:31:07] <logmsgbot>	 !log l10nupdate Synchronized php-1.26wmf7/cache/l10n: (no message) (duration: 06m 54s)
[02:31:18] <morebots>	 Logged the message, Master
[02:36:13] <logmsgbot>	 !log LocalisationUpdate completed (1.26wmf7) at 2015-05-29 02:35:10+00:00
[02:36:20] <morebots>	 Logged the message, Master
[02:54:41] <wikibugs>	 operations, network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1319531 (BBlack) >>! In T100690#1318724, @Dzahn wrote: > or did you mean to automatically add this in base and stop doing it on individual nodes?  yes :)
[03:01:59] <logmsgbot>	 !log l10nupdate Synchronized php-1.26wmf8/cache/l10n: (no message) (duration: 10m 08s)
[03:02:06] <morebots>	 Logged the message, Master
[03:07:55] <wikibugs>	 operations, network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1319553 (BBlack) Just for reference, I looked into what facter returns for the `$ipaddress` fact, and it appears to be based on whatever it sees first in ifconfig output which isn't in the 127/...
[03:09:18] <logmsgbot>	 !log LocalisationUpdate completed (1.26wmf8) at 2015-05-29 03:08:15+00:00
[03:09:23] <morebots>	 Logged the message, Master
[03:57:05] <icinga-wm>	 PROBLEM - puppet last run on ganeti2003 is CRITICAL puppet fail
[04:13:56] <icinga-wm>	 RECOVERY - puppet last run on ganeti2003 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[04:46:04] <kart_>	 Krenair: oops. logged out now.
[04:46:13] <kart_>	 Krenair: did it create any issue?
[04:51:22] <grrrit-wm>	 (CR) Ori.livneh: hhvm: add memory leak isolation scripts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/212187 (owner: Ori.livneh)
[04:51:39] <grrrit-wm>	 (PS3) Ori.livneh: hhvm: add memory leak isolation scripts [puppet] - https://gerrit.wikimedia.org/r/212187
[05:08:55] <apergos>	 ok kids, I'm going to regen all the salt keys on prod now.  if something goes awry it could be a half hour til we have new ones.
[05:19:45] <apergos>	 regen is in process. 
[05:23:37] <icinga-wm>	 PROBLEM - puppet last run on cp4007 is CRITICAL puppet fail
[05:24:26] <wikibugs>	 operations, Epic, Wikimedia-Mailing-lists: Rename <s>all</s> some mailing lists with -l suffixes to get rid of that suffix - https://phabricator.wikimedia.org/T99138#1319714 (Dzahn)
[05:25:54] <apergos>	 keys regened, now waiting for them to show up on the master to be accepted
[05:30:17] <wikibugs>	 operations, Patch-For-Review: Alias docs.wikimedia.org to doc.wikimedia.org - https://phabricator.wikimedia.org/T100349#1319725 (Dzahn) a:Dzahn
[05:30:46] <icinga-wm>	 PROBLEM - puppet last run on tin is CRITICAL Puppet has 1 failures
[05:32:12] <wikibugs>	 Ops-Access-Requests, operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1319728 (Dzahn) Open>stalled
[05:32:50] <apergos>	 well it looks like we're going to be waiting the half hour.  bah. not that things broke but just that the minions don't want to check back in without a kick
[05:35:45] <apergos>	 heh as soon as I say that they start coming in. so prolly done in 2 minutes :-D
[05:40:15] <icinga-wm>	 PROBLEM - puppet last run on mw1113 is CRITICAL Puppet has 1 failures
[05:40:36] <icinga-wm>	 RECOVERY - puppet last run on cp4007 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[05:40:56] <icinga-wm>	 PROBLEM - puppet last run on mw1104 is CRITICAL Puppet has 1 failures
[05:41:43] <apergos>	 waiting to make sure all the minions are happily reconnected
[05:41:51] <ori>	 apergos: !log
[05:42:01] <apergos>	 I shall when complete
[05:42:12] <apergos>	 ori
[05:42:22] <ori>	 :)
[05:42:30] <apergos>	 prolly about 2 more minutes
[05:47:36] <icinga-wm>	 RECOVERY - puppet last run on tin is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[05:56:56] <icinga-wm>	 RECOVERY - puppet last run on mw1113 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[05:57:35] <icinga-wm>	 RECOVERY - puppet last run on mw1104 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:03:31] <apergos>	 oops, of course I got lost in testing
[06:03:47] <apergos>	 !log salt keys regenerated on all production hosts (minions, not master key)
[06:03:51] <morebots>	 Logged the message, Master
[06:12:47] <logmsgbot>	 !log ori Synchronized php-1.26wmf8/includes/deferred/SiteStatsUpdate.php: Icc12c07ab: Update context stats in SiteStatsUpdate (duration: 00m 14s)
[06:12:50] <morebots>	 Logged the message, Master
[06:13:00] <logmsgbot>	 !log ori Synchronized php-1.26wmf7/includes/deferred/SiteStatsUpdate.php: Icc12c07ab: Update context stats in SiteStatsUpdate (duration: 00m 13s)
[06:13:04] <morebots>	 Logged the message, Master
[06:29:46] <icinga-wm>	 PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 2 failures
[06:29:55] <icinga-wm>	 PROBLEM - puppet last run on mw1046 is CRITICAL Puppet has 2 failures
[06:29:55] <icinga-wm>	 PROBLEM - puppet last run on elastic1022 is CRITICAL puppet fail
[06:30:16] <icinga-wm>	 PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures
[06:30:46] <icinga-wm>	 PROBLEM - puppet last run on lvs2004 is CRITICAL Puppet has 2 failures
[06:31:08] <grrrit-wm>	 (PS1) Ori.livneh: carbon-c-relay: blackhole stddev and sum_sq [puppet] - https://gerrit.wikimedia.org/r/214576
[06:31:15] <icinga-wm>	 PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:31:15] <icinga-wm>	 PROBLEM - puppet last run on mw2145 is CRITICAL Puppet has 1 failures
[06:31:26] <icinga-wm>	 PROBLEM - puppet last run on multatuli is CRITICAL puppet fail
[06:31:27] <icinga-wm>	 PROBLEM - puppet last run on labvirt1003 is CRITICAL Puppet has 2 failures
[06:31:33] <ori>	 good morning puppetmaster
[06:32:46] <icinga-wm>	 PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 2 failures
[06:33:46] <icinga-wm>	 PROBLEM - puppet last run on mw1025 is CRITICAL Puppet has 3 failures
[06:34:17] <icinga-wm>	 PROBLEM - puppet last run on mw1123 is CRITICAL Puppet has 1 failures
[06:34:45] <icinga-wm>	 PROBLEM - puppet last run on mw2079 is CRITICAL Puppet has 1 failures
[06:35:07] <icinga-wm>	 PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 1 failures
[06:35:45] <icinga-wm>	 PROBLEM - puppet last run on mw1118 is CRITICAL Puppet has 1 failures
[06:44:08] <wikibugs>	 operations: salt broken after the upgrade - https://phabricator.wikimedia.org/T100502#1319755 (ArielGlenn) After salt key regeneration, I get consistently good results with -b 100 cmd.run uptime   (no timeout = default timeout of 5 seconds per batch).  I'd suggest -b 50 for anything that does real work, and s...
[06:44:56] <icinga-wm>	 RECOVERY - puppet last run on mw1046 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:46:26] <icinga-wm>	 RECOVERY - puppet last run on mw2145 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:46:35] <icinga-wm>	 RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:47:06] <icinga-wm>	 RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:56] <icinga-wm>	 RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:16] <icinga-wm>	 RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:52:48] <logmsgbot>	 !log LocalisationUpdate ResourceLoader cache refresh completed at Fri May 29 06:51:45 UTC 2015 (duration 51m 44s)
[06:52:52] <morebots>	 Logged the message, Master
[07:06:46] <icinga-wm>	 RECOVERY - puppet last run on labvirt1003 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures
[07:07:06] <icinga-wm>	 RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 15 seconds ago with 0 failures
[07:07:25] <icinga-wm>	 RECOVERY - puppet last run on mw1025 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:07:37] <icinga-wm>	 RECOVERY - puppet last run on mw1118 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures
[07:07:46] <icinga-wm>	 RECOVERY - puppet last run on lvs2004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:07:56] <icinga-wm>	 RECOVERY - puppet last run on mw1123 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:08:06] <icinga-wm>	 RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[07:08:16] <icinga-wm>	 RECOVERY - puppet last run on mw2079 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:08:27] <icinga-wm>	 RECOVERY - puppet last run on elastic1022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:24:51] <wikibugs>	 Blocked-on-Operations, operations, Maps, Scrum-of-Scrums, hardware-requests: Eqiad Spare allocation: 1 hardware access request for OSM Maps project - https://phabricator.wikimedia.org/T97638#1319924 (akosiaris) Open>Resolved yurik confirmed access on IRC and at T100548. Resolving this.
[07:35:34] <Steinsplitter>	 andre__: there is a complaint on otrs about your profile picture. can you please attribute it?
[07:42:42] <petan>	 fuck otrs
[07:42:46] <petan>	 er. hi :D
[07:49:52] <wikibugs>	 operations, network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1319959 (BBlack) Thinking through options how we could discover the primary interface more elegantly:  1. We could use ipresolve($fqdn) and match that against the set of $ipaddress_INTF facts,...
[07:50:06] <wikibugs>	 operations, network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1319960 (BBlack)
[08:09:12] <wikibugs>	 operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1319968 (akosiaris) @Robh, no the process is not that demanding, neither in CPU cycles or Disk I/O.  For disk I/O, @Dzahn has added an I/O check in icinga that up to now only triggers on bacula backing up the machine, not during n...
[08:19:09] <andre__>	 Steinsplitter: (does not sound like an operations topic?): It does not require attribution so I don't plan to do that.
[08:19:36] <andre__>	 Steinsplitter: for the record, I commented on that a while ago already on https://www.mediawiki.org/wiki/User_talk:AKlapper_%28WMF%29#Your_profile_picture_on_Phabricator
[08:20:17] <andre__>	 Steinsplitter, if I somehow misinterpret the requirements that I link to there, please feel free to correct me on that talk discussion and I'm happy to attribute if really required.
[08:22:55] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: -1] Add varnish request stats diamond collector (6 comments) [puppet] - https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: Ottomata)
[08:33:08] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 1] "minor nit but LGTM otherwise" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/214261 (owner: Ottomata)
[08:36:16] <grrrit-wm>	 (PS6) Filippo Giunchedi: add varnishstatsd [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh)
[08:36:33] <Steinsplitter>	 andre__: thanks for the link to mw.
[08:36:41] <andre__>	 heh, sure :)
[08:39:35] <grrrit-wm>	 (CR) Filippo Giunchedi: [C: 1] add varnishstatsd [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh)
[08:41:03] <grrrit-wm>	 (CR) Filippo Giunchedi: "personally I'm fine with the change but audience should be wider if we're removing stats e.g. phab" [puppet] - https://gerrit.wikimedia.org/r/214576 (owner: Ori.livneh)
[08:42:02] <grrrit-wm>	 (CR) Mobrovac: [C: -1] CX: Log to logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: KartikMistry)
[08:46:36] <icinga-wm>	 PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 856.488270475
[09:11:18] <wikibugs>	 operations, Wikimedia-Mailing-lists: Rename pywikipedia-l to pywikibot - https://phabricator.wikimedia.org/T100707#1320035 (Multichill) Oh wait, what? We're renaming one of the lists? We actually have 4 lists (https://lists.wikimedia.org/mailman/listinfo) with 3 of them still using the old naming scheme....
[09:12:31] <wikibugs>	 operations, pywikibot-core, Wikimedia-Mailing-lists: Rename pywikipedia-l to pywikibot - https://phabricator.wikimedia.org/T100707#1320038 (Multichill)
[09:13:26] <wikibugs>	 operations, pywikibot-core, Wikimedia-Mailing-lists: Rename pywikipedia-l to pywikibot - https://phabricator.wikimedia.org/T100707#1320041 (JohnLewis) We can rename all four and regarding the announcement I for some reason though ladsgroup may poked the list(s). No worries, I'll send an email shortly...
[09:13:52] <grrrit-wm>	 (CR) Giuseppe Lavagetto: [C: -1] "Without even looking at the code (which I'm sure is ok), I don't like the idea that we lose a priori any indication of which varnish serve" [puppet] - https://gerrit.wikimedia.org/r/214147 (owner: Ori.livneh)
[09:16:39] <wikibugs>	 operations, pywikibot-core, Wikimedia-Mailing-lists: Rename pywikipedia list prefixes to pywikibot  - https://phabricator.wikimedia.org/T100707#1320048 (JohnLewis)
[09:24:39] <godog>	 apergos: I might have upset the salt master by running salt-run jobs.active (not sure I did, just letting you know)
[09:39:08] <wikibugs>	 operations: raid1-lvm recipe broken for jessie, sets up available LVM space as swap - https://phabricator.wikimedia.org/T100636#1320063 (faidon) >>! In T100636#1319432, @Dzahn wrote: > It's a bit strange. For example i installed "subra" and "suhail" in codfw, so i know it's not that long ago and they are ok....
[09:39:33] <apergos>	 godog: if you did it's already happy again
[09:42:58] <godog>	 cool
[09:49:44] <wikibugs>	 operations, network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1320076 (faidon) Note how add_ip6_mapped defaults to using interfaces[0], which is correctly set on all hosts but dataset1001. dataset1001 is set to eth2 because that's the 10G port, but I thin...
[10:09:47] <wikibugs>	 Ops-Access-Requests, operations, Release-Engineering, Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1320153 (hashar) p:Normal>High
[10:11:41] <wikibugs>	 operations, Continuous-Integration-Config: Suggestion: disable autoloader_layout checks in our jenkins puppet-lint - https://phabricator.wikimedia.org/T1289#1320156 (hashar)
[10:11:51] <wikibugs>	 operations, Continuous-Integration-Config: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#1320157 (hashar)
[10:12:34] <wikibugs>	 operations, Continuous-Integration-Config: Suggestion: disable autoloader_layout checks in our jenkins puppet-lint - https://phabricator.wikimedia.org/T1289#22553 (hashar) This is up to #operations to adjust the `/.puppet-lint.rc` file in operations/puppet.git.
[10:12:41] <wikibugs>	 operations, Continuous-Integration-Config: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#768414 (hashar) This is up to #operations to adjust the `/.puppet-lint.rc` file in operations/puppet.git.
[10:13:08] <wikibugs>	 operations, Continuous-Integration-Config: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#1320163 (hashar) p:Normal>Low
[10:14:05] <wikibugs>	 operations, Continuous-Integration-Config: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#768414 (hashar)
[10:14:07] <wikibugs>	 operations, Continuous-Integration-Config: Suggestion: disable autoloader_layout checks in our jenkins puppet-lint - https://phabricator.wikimedia.org/T1289#22553 (hashar)
[10:17:34] <wikibugs>	 operations, Continuous-Integration-Config: Add --no-autoloader_layout-check to operations-puppet-puppetlint-lenient - https://phabricator.wikimedia.org/T75117#1320176 (hashar) The lint check is currently disabled in `/.puppet-lint.rc`  I don't think it can be ignored for a specific file hierarchy. So one...
[10:32:05] <wikibugs>	 operations, Release-Engineering, Performance: performance testing environment - https://phabricator.wikimedia.org/T67394#1320211 (hashar)
[10:32:42] <wikibugs>	 operations, Release-Engineering, Performance: performance testing environment - https://phabricator.wikimedia.org/T67394#1320215 (hashar) Open>stalled >>! In T282#1262238, @greg wrote: > Setting to Stalled, it's probably something that will come up again, but you're right, not on the plan for now.
[10:33:34] <wikibugs>	 operations, Continuous-Integration-Infrastructure, Graphite, Upstream, Zuul: Let us customize Zuul metrics reported to statsd - https://phabricator.wikimedia.org/T1369#1320220 (hashar)
[10:36:57] <wikibugs>	 operations: Backport & test firmware-linux 0.44 - https://phabricator.wikimedia.org/T100771#1320225 (faidon) NEW
[10:38:38] <wikibugs>	 operations: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1320242 (faidon) NEW
[10:38:51] <wikibugs>	 operations: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1320249 (faidon)
[10:38:53] <wikibugs>	 operations: Backport and include linux-tools-3.19 to our jessie repository - https://phabricator.wikimedia.org/T100216#1320250 (faidon)
[10:39:06] <paravoid>	 moritzm: ^^ :)
[10:42:16] <moritzm>	 paravoid: the switch in d-i is done in gerrit, I wanted to hold back the push until I have perf ready, but I can go ahead with it earlier
[10:42:25] <moritzm>	 I'll claim the Phab tasks
[10:43:10] <wikibugs>	 operations: Backport & test firmware-linux 0.44 - https://phabricator.wikimedia.org/T100771#1320253 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[10:43:16] <paravoid>	 yeah, I added a blocked by too, so that sounds sane to me
[10:43:34] <paravoid>	 just installed a host yesterday and realized it needed a reboot for 3.19 :)
[10:44:26] <wikibugs>	 operations: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1320258 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[10:44:34] <wikibugs>	 operations: raid1-lvm recipe broken for jessie, sets up available LVM space as swap - https://phabricator.wikimedia.org/T100636#1320260 (faidon) Ubuntu's partman-auto-lvm changelog for trusty [[ http://changelogs.ubuntu.com/changelogs/pool/main/p/partman-auto-lvm/partman-auto-lvm_51ubuntu1/changelog | mention...
[10:45:37] <wikibugs>	 operations: LVM recipes broken for jessie, set up all remaining LVM space as swap - https://phabricator.wikimedia.org/T100636#1320261 (faidon)
[10:45:45] <paravoid>	 godog: ^ :)
[10:47:07] <godog>	 indeed
[10:55:34] <wikibugs>	 operations, Wikimedia-Logstash, Elasticsearch, Monitoring: Icinga monitoring for elasticsearch doesn't notice OOM conditions - https://phabricator.wikimedia.org/T76090#1320280 (hashar)
[10:55:47] <wikibugs>	 operations, Wikimedia-Logstash, Elasticsearch, Monitoring: Icinga monitoring for elasticsearch doesn't notice OOM conditions - https://phabricator.wikimedia.org/T76090#789335 (hashar) Moving that monitoring task from #releng to #ops
[10:55:53] <wikibugs>	 operations: LVM recipes broken for jessie, set up all remaining LVM space as swap - https://phabricator.wikimedia.org/T100636#1320283 (fgiunchedi) on the debian side similar issues are reported as [[ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=517935 | #517935 ]] or [[ https://bugs.debian.org/cgi-bin/bu...
[10:57:53] <wikibugs>	 operations, Traffic: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1320303 (faidon) >>! In T98003#1318539, @hashar wrote: > The operations-dns-lint job runs on Jenkins slaves in prod (gallium and lanthanum) and is one of the last job still running there. I tried earlie...
[11:00:42] <wikibugs>	 operations, Icinga: Icinga: expose hosts in maintenance state so we can prevent scap from running on them - https://phabricator.wikimedia.org/T100777#1320316 (hashar) NEW
[11:02:44] <wikibugs>	 operations, Icinga: Icinga: expose hosts in maintenance state so we can prevent scap from running on them - https://phabricator.wikimedia.org/T100777#1320333 (faidon) That doesn't sound like a very good idea. It should be the other way around: we set hosts under maintenance somewhere else (e.g. etcd ;)) an...
[11:08:44] <grrrit-wm>	 (PS1) Faidon Liambotis: authdns: small fix for the Ganglia gdnsd plugin [puppet] - https://gerrit.wikimedia.org/r/214591
[11:08:57] <grrrit-wm>	 (PS12) Giuseppe Lavagetto: confd: create module [puppet] - https://gerrit.wikimedia.org/r/208399 (https://phabricator.wikimedia.org/T97974)
[11:09:16] <grrrit-wm>	 (CR) Faidon Liambotis: [C: 2] authdns: small fix for the Ganglia gdnsd plugin [puppet] - https://gerrit.wikimedia.org/r/214591 (owner: Faidon Liambotis)
[11:29:30] <wikibugs>	 operations, MediaWiki-Logging, Release-Engineering, HHVM: SlowTimer logs should go to their own location, instead of hhvm.log - https://phabricator.wikimedia.org/T94855#1320406 (hashar) Open>Resolved a:hashar The hhvm SlowTimer errors are still written to hhvm.log.  We got logstash now thou...
[11:41:02] <paravoid>	 !log redirecting ns0 traffic to baham (= ns1) in preparation for rubidium upgrade
[11:41:05] <morebots>	 Logged the message, Master
[11:41:06] <wikibugs>	 operations, Wikimedia-log-errors: mw1150 spams "memcached error for key" since May 29 3:00am UTC - https://phabricator.wikimedia.org/T100780#1320416 (hashar) NEW
[11:45:36] <_joe_>	 !log restart nutcracker on mw1150
[11:45:40] <morebots>	 Logged the message, Master
[11:47:34] <grrrit-wm>	 (PS1) Faidon Liambotis: autoinstall: change rubidium's recipe to raid1-lvm [puppet] - https://gerrit.wikimedia.org/r/214593
[11:48:07] <grrrit-wm>	 (CR) Faidon Liambotis: [C: 2 V: 2] autoinstall: change rubidium's recipe to raid1-lvm [puppet] - https://gerrit.wikimedia.org/r/214593 (owner: Faidon Liambotis)
[11:48:17] <wikibugs>	 operations, Wikimedia-log-errors: mw1150 spams "memcached error for key" since May 29 3:00am UTC - https://phabricator.wikimedia.org/T100780#1320424 (Joe) All servers have been ejected around 3 AM UTC and never recovered. We can probably monitor this kind of problems, and maybe also try to pin down a bit b...
[11:48:37] <wikibugs>	 operations, Wikimedia-log-errors: mw1150 spams "memcached error for key" since May 29 3:00am UTC - https://phabricator.wikimedia.org/T100780#1320426 (Joe) Open>Resolved a:Joe
[11:49:09] <_joe_>	 hashar: it's basically a duplicate of the former ticket I named there, so resolving.
[11:49:23] <hashar>	 sure
[11:49:28] <hashar>	 just making sure something is filed :D
[11:53:25] <icinga-wm>	 PROBLEM - Host rubidium is DOWN: CRITICAL - Host Unreachable (208.80.154.40)
[11:53:33] <paravoid>	 !log reimaging rubidium
[11:53:37] <morebots>	 Logged the message, Master
[11:55:16] <icinga-wm>	 PROBLEM - Host ns0-v6 is DOWN: PING CRITICAL - Packet loss = 100%
[11:55:45] <icinga-wm>	 RECOVERY - Host rubidium is UP: PING OK - Packet loss = 0%, RTA = 1.24 ms
[11:56:17] <_joe_>	 hashar: if you see that happening again let me know, I may have some more time to understand what's the situation a bit better
[11:56:42] <hashar>	 I found out another dashboard at https://logstash.wikimedia.org/#/dashboard/elasticsearch/memcached-serious 
[11:57:00] <hashar>	 do we have any Icinga check that relies on logstash?
[11:59:06] <icinga-wm>	 PROBLEM - RAID on rubidium is CRITICAL: Connection refused by host
[11:59:06] <icinga-wm>	 PROBLEM - dhclient process on rubidium is CRITICAL: Connection refused by host
[11:59:06] <icinga-wm>	 PROBLEM - configured eth on rubidium is CRITICAL: Connection refused by host
[11:59:16] <icinga-wm>	 PROBLEM - Disk space on rubidium is CRITICAL: Connection refused by host
[11:59:45] <icinga-wm>	 PROBLEM - puppet last run on rubidium is CRITICAL: Connection refused by host
[12:00:06] <icinga-wm>	 PROBLEM - salt-minion processes on rubidium is CRITICAL: Connection refused by host
[12:00:15] <icinga-wm>	 PROBLEM - Auth DNS on rubidium is CRITICAL - Plugin timed out while executing system call
[12:00:46] <icinga-wm>	 PROBLEM - DPKG on rubidium is CRITICAL: Connection refused by host
[12:01:39] <icinga-wm>	 PROBLEM - puppet last run on mw2055 is CRITICAL Puppet has 1 failures
[12:02:37] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[12:04:08] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[12:09:37] <icinga-wm>	 PROBLEM - puppet last run on cp4003 is CRITICAL puppet fail
[12:15:39] <grrrit-wm>	 (CR) Muehlenhoff: "At least for jessie installations we could rely on systemd-detect-virt (returns "kvm" on a jessie labs instance and "none" on standard har" [puppet] - https://gerrit.wikimedia.org/r/214377 (owner: Alexandros Kosiaris)
[12:18:07] <icinga-wm>	 RECOVERY - puppet last run on mw2055 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[12:20:48] <YuviPanda>	 _joe_: the ipresolve patch works, just tested it. going to merge it now
[12:21:02] <_joe_>	 YuviPanda: yeah seems good to me
[12:21:04] <grrrit-wm>	 (PS5) Yuvipanda: wmflib: Add nameserver parameter to ipresolve function [puppet] - https://gerrit.wikimedia.org/r/212784 (https://phabricator.wikimedia.org/T99833)
[12:21:10] <grrrit-wm>	 (PS2) Alexandros Kosiaris: install-server: Accomodate virtualization [puppet] - https://gerrit.wikimedia.org/r/214377
[12:21:13] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] wmflib: Add nameserver parameter to ipresolve function [puppet] - https://gerrit.wikimedia.org/r/212784 (https://phabricator.wikimedia.org/T99833) (owner: Yuvipanda)
[12:22:06] <grrrit-wm>	 (CR) Alexandros Kosiaris: "I came up with an alternative approach, blending both your proposals together. @Faidon, unfortunately stil not virtio-scsi for ganeti http" [puppet] - https://gerrit.wikimedia.org/r/214377 (owner: Alexandros Kosiaris)
[12:26:28] <icinga-wm>	 RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:32:36] <grrrit-wm>	 (PS2) Yuvipanda: tools: Make redis failover-able [puppet] - https://gerrit.wikimedia.org/r/212792 (https://phabricator.wikimedia.org/T99737)
[12:35:17] <icinga-wm>	 RECOVERY - Host ns0-v6 is UP: PING OK - Packet loss = 0%, RTA = 1.36 ms
[12:36:19] <jynus>	 YuviPanda, FYI: lag on labsdb1 for s1 https://phabricator.wikimedia.org/P701 I am not doing anything about it for now
[12:36:53] <YuviPanda>	 jynus: alright. you should feel free to kill terrible queries on labsdb without prejudice
[12:37:18] <YuviPanda>	 everyone will highly appreciate it :)
[12:37:28] <jynus>	 I will if it continues, but saw on the graphs it is a "common" occurrence
[12:37:31] <wikibugs>	 operations, Traffic: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1320527 (hashar) > Well, time to start actively maintaining it then :) We probably have more jessie hosts than precise nowadays and testing our DNS config in a distribution that is 5 years older than pr...
[12:37:44] <jynus>	 but I think I can fix it long term on configuration
[12:37:50] <YuviPanda>	 oooooh cool ;D
[12:41:40] <grrrit-wm>	 (CR) Yuvipanda: [C: 2] tools: Make redis failover-able [puppet] - https://gerrit.wikimedia.org/r/212792 (https://phabricator.wikimedia.org/T99737) (owner: Yuvipanda)
[12:43:11] <icinga-wm>	 PROBLEM - Host rubidium is DOWN: CRITICAL - Host Unreachable (208.80.154.40)
[12:45:01] <icinga-wm>	 PROBLEM - Host ns0-v6 is DOWN: PING CRITICAL - Packet loss = 100%
[12:46:02] <icinga-wm>	 RECOVERY - Host ns0-v6 is UP: PING OK - Packet loss = 0%, RTA = 2.88 ms
[12:46:12] <icinga-wm>	 RECOVERY - Host rubidium is UP: PING OK - Packet loss = 0%, RTA = 0.79 ms
[12:53:26] <grrrit-wm>	 (PS1) Yuvipanda: toollabs: Specify IPv4 as addresstype for ipresolve [puppet] - https://gerrit.wikimedia.org/r/214604
[12:53:34] <grrrit-wm>	 (CR) Yuvipanda: [C: 2 V: 2] toollabs: Specify IPv4 as addresstype for ipresolve [puppet] - https://gerrit.wikimedia.org/r/214604 (owner: Yuvipanda)
[12:55:00] <wikibugs>	 operations, network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1320554 (BBlack) The $interfaces list is completely arbitrary, much like $ipaddress.  Looking back at add_ip6_mapped, it relies on that sort of magic as well.  So yeah, perhaps a custom fact or...
[13:10:28] <wikibugs>	 operations, discovery-system, services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1320581 (Joe) NEW
[13:11:12] <_joe_>	 paravoid, godog, akosiaris, bblack any input very welcome on ^^. A reasonably fast one if possible :)
[13:12:07] <_joe_>	 (as in: before the end of next week)
[13:14:48] <godog>	 _joe_: ack
[13:17:09] <godog>	 !log roll-restart cassandra on cerium / xenon / praseodymium following java upgrade
[13:17:16] <morebots>	 Logged the message, Master
[13:18:02] <icinga-wm>	 PROBLEM - Host rubidium is DOWN: CRITICAL - Host Unreachable (208.80.154.40)
[13:18:42] <icinga-wm>	 PROBLEM - Host ns0-v6 is DOWN: PING CRITICAL - Packet loss = 100%
[13:18:51] <icinga-wm>	 PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail
[13:19:39] <grrrit-wm>	 (03PS1) 10Yuvipanda: tools: Don't attempt to replicate from master to master [puppet] - 10https://gerrit.wikimedia.org/r/214605 
[13:20:17] <grrrit-wm>	 (03PS10) 10Ottomata: Add varnishlog python module [puppet] - 10https://gerrit.wikimedia.org/r/214261 
[13:21:01] <andrewbogott>	 jynus: I haven’t settled down to work properly yet, but let’s do the holmium db migration in a few minutes if you’re available.
[13:21:21] <grrrit-wm>	 (03CR) 10Ottomata: Add varnishlog python module (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/214261 (owner: 10Ottomata)
[13:21:30] <jynus>	 yep it's ok, let me check that everything is working on my side
[13:21:34] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] tools: Don't attempt to replicate from master to master [puppet] - 10https://gerrit.wikimedia.org/r/214605 (owner: 10Yuvipanda)
[13:21:38] <jynus>	 andrewbogott, ^
[13:22:22] <grrrit-wm>	 (03CR) 10Ottomata: "Just in case you haven't seen this: this one will have hostname reports: https://gerrit.wikimedia.org/r/#/c/212041/" [puppet] - 10https://gerrit.wikimedia.org/r/214147 (owner: 10Ori.livneh)
[13:25:45] <grrrit-wm>	 (03PS1) 10Yuvipanda: tools: Simplify and fix tools-redis master selection [puppet] - 10https://gerrit.wikimedia.org/r/214606 
[13:27:09] <grrrit-wm>	 (03Abandoned) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/212302 (https://phabricator.wikimedia.org/T92693) (owner: 10Jcrespo)
[13:27:22] * aude is having some issues with css styling on test.wikidata
[13:27:26] <grrrit-wm>	 (03CR) 10Yuvipanda: [C: 032] tools: Simplify and fix tools-redis master selection [puppet] - 10https://gerrit.wikimedia.org/r/214606 (owner: 10Yuvipanda)
[13:27:42] <aude>	 would like to touch those files and sync Wikibase stuff on wmf8
[13:27:56] <aude>	 suppose no one is deploying now or minds...
[13:30:01] <logmsgbot>	 !log aude Synchronized php-1.26wmf8/extensions/Wikidata: touch js and css files to try to fix issues on test.wikidata (duration: 00m 26s)
[13:30:05] <morebots>	 Logged the message, Master
[13:31:06] <grrrit-wm>	 (03CR) 10Ottomata: "The diamond collector stuff needs reworked again anyway. The stuff I pushed through yesterday broke something in labs." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/212041 (https://phabricator.wikimedia.org/T83580) (owner: 10Ottomata)
[13:33:42] <grrrit-wm>	 (03PS1) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 
[13:34:05] <andrewbogott>	 jynus: I’m back now, mostly
[13:34:45] <grrrit-wm>	 (03PS2) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 
[13:35:14] <jynus>	 I have abandoned the other patch
[13:35:23] <jynus>	 ^this is the new one
[13:35:29] <jynus>	 but do not +2
[13:35:41] <jynus>	 have to check connectivity
[13:36:55] <icinga-wm>	 RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures
[13:38:56] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: install-server: add WMF5842 back as d-i-test [puppet] - 10https://gerrit.wikimedia.org/r/214608 (https://phabricator.wikimedia.org/T100636) 
[13:40:05] <godog>	 paravoid: ^
[13:40:16] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: install-server: add WMF5842 back as d-i-test [puppet] - 10https://gerrit.wikimedia.org/r/214608 (https://phabricator.wikimedia.org/T100636) 
[13:46:09] <andrewbogott>	 jynus: nothing for me to do except watch and cross my fingers, right?
[13:46:31] <jynus>	 yep, cannot currently access the new db from holmium
[13:46:42] <jynus>	 something may be wrong with the grants
[13:46:53] <jynus>	 checking it now
[13:48:59] <jynus>	 andrewbogott, confirm for me that this is wrong: https://gerrit.wikimedia.org/r/#/c/214607/2/templates/mariadb/production-grants-m5.sql.erb
[13:49:12] <jynus>	 the host should be 20X.XXXXX
[13:49:15] <jynus>	 correct?
[13:49:36] <jynus>	 208.80.154.12
[13:49:56] <andrewbogott>	 yes, I’m pretty sure holmium doesn’t have an internal IP
[13:51:40] <andrewbogott>	 208.80.154.12 is right
[13:52:42] <jynus>	 strange
[13:52:55] <jynus>	 on the current database, it is using live 10.x ips
[13:54:49] <andrewbogott>	 I’m surprised that works
[13:55:12] <jynus>	 oh
[13:55:22] <jynus>	 I think it is because of the proxy
[13:56:20] <jynus>	 yep, that is the ip of the proxy
[13:56:37] <jynus>	 for now, as there are no slaves, we will have to connect directly
[13:57:06] <jynus>	 (on the good side, you will have exclusive use of the server)
[13:57:48] <andrewbogott>	 oh, m5 is just labs stuff?
[13:58:23] <jynus>	 m5 is right now openstack stuff (+pdns and designate)
[13:59:09] <grrrit-wm>	 (03PS3) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 
[14:03:09] <jynus>	 now pdns and pdns_admin work, but not designate
[14:04:45] <icinga-wm>	 RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 4961.00381896
[14:05:21] <jynus>	 that password changed and is not puppetized
[14:05:30] <jynus>	 amending again
[14:06:44] <andrewbogott>	 the designate db password isn’t puppetized?
[14:07:03] <jynus>	 it is
[14:07:17] <jynus>	 but not the user creation on the mariadb side
[14:07:24] <andrewbogott>	 oh, I see
[14:07:51] <jynus>	 just check the last patch and you will see it
[14:08:04] <andrewbogott>	 one thing to keep in mind about all these dbs — periodically openstack will do scripted upgrades from the client side.  So we need a way for the users to make schema changes; it doesn’t have to be allowed all the time as long as it’s easy to switch on and off.
[14:08:09] <andrewbogott>	 Probably you’re on top of that already
[14:08:17] <grrrit-wm>	 (03PS4) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 
[14:10:58] <jynus>	 yep, yep
[14:11:09] <jynus>	 ok, connections and grants tested
[14:11:23] <jynus>	 do a +1
[14:11:27] <jynus>	 if it seems ok
[14:11:48] <jynus>	 and we will go into maintenance mode and deploy 
[14:12:47] <jynus>	 you will know better than me what things may go down, andrewbogott 
[14:13:01] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 031] Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 (owner: 10Jcrespo)
[14:13:09] <andrewbogott>	 yeah, I’m thinking about what to test
[14:14:20] <grrrit-wm>	 (03PS5) 10Jcrespo: Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 
[14:14:53] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Moving pdns and designate databases from m1 to m5 [puppet] - 10https://gerrit.wikimedia.org/r/214607 (owner: 10Jcrespo)
[14:16:58] <andrewbogott>	 shall I do a puppet update on holmium or are you doing that already?
[14:17:18] <jynus>	 I won't be able to set the original server in read-only mode, as it has other stuff
[14:17:23] <jynus>	 but we should be ok
[14:17:38] <jynus>	 not merged in puppet yet, can I?
[14:17:44] <andrewbogott>	 yep, let’s do it
[14:17:57] <jynus>	 !log Moving pdns and designate databases from m1 to m5
[14:18:00] <morebots>	 Logged the message, Master
[14:18:43] <jynus>	 "puppet already in progress"
[14:18:50] <andrewbogott>	 that’s me!  try again
[14:18:54] <jynus>	 :-)
[14:19:57] <jynus>	 I do not see accesses from dns, but could be normal
[14:20:14] <jynus>	 can you do something that may read/write to the db?
[14:20:21] <andrewbogott>	 sure
[14:21:09] <jynus>	 we are looking mainly from "could not connect to mysql" errors
[14:21:12] <jynus>	 *for
[14:21:30] <andrewbogott>	 I just created a new dns entry and it’s working
[14:21:33] <andrewbogott>	 do you see the writes?
[14:21:41] <andrewbogott>	 it should’ve written to both designate and pdns I think
[14:22:16] <jynus>	 I see holmium still connected to the proxy
[14:22:31] <jynus>	 maybe it only gets the conf on service restart?
[14:23:47] <andrewbogott>	 Surprised puppet didn’t do that...
[14:23:49] <andrewbogott>	 shall I restart?
[14:24:27] <jynus>	 on netstat: holmium.wikimedia:41861 dbproxy1001.eqiad:mysql ESTABLISHED
[14:24:51] <jynus>	 if they are persistent connections, I wouldn't be surprised
[14:25:08] <jynus>	 I can drop the connections as an alternative
[14:26:13] <jynus>	 I will kill the mysql connections first and see if the new conf takes effect
[14:26:18] <andrewbogott>	 ok
[14:28:17] <jynus>	 nope, they reconnect to the same host- config has not been applied
[14:29:13] <_joe_>	 jynus, andrewbogott do a netstat on the db server and the client
[14:29:26] <_joe_>	 and find out which process on the client is recreating said connection
[14:29:29] <_joe_>	 *s
[14:30:09] <jynus>	 puppet wasn't on the latest config change, now it is
[14:33:33] <jynus>	 _joe_ that is easy, but I think it requires a restart
[14:33:47] <_joe_>	 probably, yes
[14:33:50] <jynus>	 24777/pdns_server-i
[14:33:57] <jynus>	 30474/python
[14:33:58] <_joe_>	 some services don't pick up config live
[14:34:28] <jynus>	 if only mysql did, it would make my life 10000x easier
[14:34:38] <jynus>	 :-)
[14:35:04] <_joe_>	 eheh, well you have ways to modify a great deal of config on a live mysql server
[14:35:10] <jynus>	 let me try killing mysql again after having run puppet
[14:35:15] <jynus>	 not mysql
[14:35:20] <jynus>	 the mysql connections
[14:35:26] <_joe_>	 yeah I was about to ask :P
[14:35:27] <jynus>	 _joe_, not the important stuff
[14:35:59] <jynus>	 although 5.6 and 5.7 is getting better
[14:36:03] <_joe_>	 jynus: of course, live-reloading the important stuff would be a pain to implement :)
[14:36:14] <jynus>	 dynamic buffer pool size
[14:36:22] <jynus>	 tablespace management
[14:37:15] <jynus>	 ops, we have contact, andrewbogott 
[14:37:26] <andrewbogott>	 jynus: great, want me to run another test?
[14:37:34] <jynus>	 wait
[14:37:38] * andrewbogott waits
[14:37:47] <jynus>	 let me see if there are threads hanging out on the old server
[14:38:13] <jynus>	 yes, they are
[14:38:47] <grrrit-wm>	 (03PS1) 10BBlack: new facts for canonical ipv4 addr/interface [puppet] - 10https://gerrit.wikimedia.org/r/214617 (https://phabricator.wikimedia.org/T100690) 
[14:40:03] <jynus>	 and if I kill them they come back
[14:41:10] <jynus>	 I can force them to fail
[14:42:06] <andrewbogott>	 does that mean there are still services on holmium with the old config?
[14:43:05] <jynus>	 yep
[14:43:10] <jynus>	 30474/python is ok
[14:43:22] <jynus>	 24777/pdns_server-i is not
[14:43:37] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] new facts for canonical ipv4 addr/interface [puppet] - 10https://gerrit.wikimedia.org/r/214617 (https://phabricator.wikimedia.org/T100690) (owner: 10BBlack)
[14:44:05] <jynus>	 we can kill gracefully/restart 24777/pdns_server-i ?
[14:44:13] <andrewbogott>	 sure
[14:44:14] <andrewbogott>	 one second
[14:44:16] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "@Muehnlenhoff, that's d-i. unfortunately systemd-detect-virt is not installed" [puppet] - 10https://gerrit.wikimedia.org/r/214377 (owner: 10Alexandros Kosiaris)
[14:44:33] <andrewbogott>	 the pdns config is still pointing to m1
[14:44:42] <jynus>	 on dick?
[14:44:46] <jynus>	 *disk
[14:44:54] <andrewbogott>	 yes, let me check...
[14:45:22] <andrewbogott>	 I will fix, will take a few minutes
[14:45:57] <jynus>	 maybe the old thread has the file open and it has been overwritten, but not for that thread?
[14:46:22] <jynus>	 or some cache issue?
[14:46:45] <icinga-wm>	 RECOVERY - Host ns0-v6 is UP: PING OK - Packet loss = 0%, RTA = 2.41 ms
[14:47:24] <jynus>	 no, impossible, I can see it now
[14:47:36] <andrewbogott>	 I think the puppet code is just wrong
[14:48:42] <grrrit-wm>	 (03PS1) 10Andrew Bogott: Switch labs pdns/mysql to the new db host [puppet] - 10https://gerrit.wikimedia.org/r/214618 
[14:48:57] <andrewbogott>	 jynus: ^
[14:49:27] <jynus>	 wow
[14:49:43] <jynus>	 I thought I had grepped m1 everywhere
[14:49:50] <jynus>	 is that recent?
[14:50:14] <andrewbogott>	 I don’t think so…
[14:50:15] <grrrit-wm>	 (03CR) 10Giuseppe Lavagetto: "I assumed this patch was superseding that one, not complementing it. Sorry for the confusion." [puppet] - 10https://gerrit.wikimedia.org/r/214147 (owner: 10Ori.livneh)
[14:50:24] <andrewbogott>	 *shrug* I missed it too.
[14:50:29] <andrewbogott>	 any reason not to merge?
[14:50:30] <ori>	 thanks :)
[14:50:35] <jynus>	 merge, merge
[14:50:54] <grrrit-wm>	 (03CR) 10Andrew Bogott: [C: 032] Switch labs pdns/mysql to the new db host [puppet] - 10https://gerrit.wikimedia.org/r/214618 (owner: 10Andrew Bogott)
[14:51:10] <jynus>	 good news is that this is not causing any outages, as both servers are coexisting
[14:52:11] <paravoid>	 !log re-redirecting ns0 traffic back to rubidium
[14:52:14] <morebots>	 Logged the message, Master
[14:52:41] <andrewbogott>	 jynus: puppet claims that the service refreshed… do things look right now?
[14:53:16] <jynus>	 yep
[14:53:22] <jynus>	 everything is on db1009 now
[14:53:35] <jynus>	 let me do an additional test
[14:54:10] <jynus>	 which is blocking connections to pdns, pdns_admin and designate to the old db
[14:54:15] <icinga-wm>	 RECOVERY - puppetmaster backend https on rhodium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 1.642 second response time
[14:54:20] <jynus>	 and we can call it a day
[14:58:15] <jynus>	 ok,locked account on the old server
[14:58:30] <jynus>	 replication stopped
[14:58:43] <andrewbogott>	 cool.  One last test...
[15:00:17] <jynus>	 you may have had a small window of time where users could not log in
[15:00:29] <jynus>	 but it should be ok now
[15:01:33] <andrewbogott>	 yeah, created a new instance, dns is working fine for it
[15:03:13] <andrewbogott>	 I am getting ‘DuplicateRecord’ errors when I create new instances.  I’m pretty sure that didn’t happen before.
[15:03:19] <andrewbogott>	 Could that be because of master/master?
[15:03:28] <jynus>	 yep
[15:03:48] <jynus>	 do you have more debug info?
[15:04:07] <grrrit-wm>	 (03PS14) 10Mobrovac: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh)
[15:04:23] <andrewbogott>	 not especially.
[15:04:27] <andrewbogott>	 looking
[15:04:36] <andrewbogott>	 but, also, we could just turn off syncing and see if that fixes it?
[15:04:59] <jynus>	 sync is off
[15:05:13] <andrewbogott>	 here’s a log snippet https://dpaste.de/jwFP
[15:05:43] <jynus>	 that is ok
[15:06:00] <andrewbogott>	 I just created a new entry, got the same exception
[15:06:03] <andrewbogott>	 do you know what’s happening?
[15:06:18] <andrewbogott>	 Looks like there are a couple of those from yesterday as well, so maybe this is unrelated to the switch-over
[15:09:44] <andrewbogott>	 jynus?
[15:10:08] <grrrit-wm>	 (03PS15) 10Mobrovac: mathoid to service::node [puppet] - 10https://gerrit.wikimedia.org/r/167413 (https://phabricator.wikimedia.org/T97124) (owner: 10Ori.livneh)
[15:10:35] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: "overriding like that sounds good to me, it seems easy to miss the fact that echo is two files instead of one, a separate case statement af" [puppet] - 10https://gerrit.wikimedia.org/r/214377 (owner: 10Alexandros Kosiaris)
[15:10:36] <jynus>	 I am doing a hack, andrewbogott to fix it, be it caused by the migration or not
[15:11:04] <andrewbogott>	 jynus: ok — things are working anyway, I’m just concerned that it might be a bug in designate or one of the drivers I wrote
[15:11:33] <jynus>	 lets talk in private
[15:16:30] <akosiaris>	 andrewbogott: q: virt1000 puppetmaster does not have a database backend, does it ?
[15:16:41] <akosiaris>	 no exported resources, no storedconfigs, no nothing
[15:16:49] <andrewbogott>	 I don’t think it does
[15:17:13] <akosiaris>	 ok thanks. there will be a couple of PS on your plate for review soon
[15:17:21] <akosiaris>	 heads up ;-)
[15:19:07] <logmsgbot>	 !log anomie Synchronized php-1.26wmf8/extensions/ConfirmEdit/: Update ConfirmEdit to fix API breakage [[gerrit:214620]] (duration: 00m 14s)
[15:19:10] <morebots>	 Logged the message, Master
[15:25:47] <jynus>	 andrewbogott, forgot: old db on virt1000 was depuppetized but not deleted
[15:36:07] <andrewbogott>	 jynus: that’s fine, I’m going to rebuild that box entirely sometime soon
[15:36:40] <jynus>	 can I delete the dbs on m1 once the backups are up too?
[15:36:40] * andrewbogott has a migraine, laying low today
[15:36:50] <andrewbogott>	 jynus: Sure, I don’t know why not.
[15:37:00] <jynus>	 thank you, take it easy!
[15:42:16] <grrrit-wm>	 (03CR) 10Jcrespo: "For the record, this patch was incomplete without https://gerrit.wikimedia.org/r/#/c/214618/" [puppet] - 10https://gerrit.wikimedia.org/r/214607 (owner: 10Jcrespo)
[15:55:15] <grrrit-wm>	 (03PS11) 10Ori.livneh: Add varnishlog python module [puppet] - 10https://gerrit.wikimedia.org/r/214261 (owner: 10Ottomata)
[15:55:22] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] Add varnishlog python module [puppet] - 10https://gerrit.wikimedia.org/r/214261 (owner: 10Ottomata)
[15:58:13] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "holding this off" [puppet] - 10https://gerrit.wikimedia.org/r/214608 (https://phabricator.wikimedia.org/T100636) (owner: 10Filippo Giunchedi)
[16:09:54] <grrrit-wm>	 (03PS7) 10Ori.livneh: add varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214147 
[16:10:00] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] add varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214147 (owner: 10Ori.livneh)
[16:10:17] <grrrit-wm>	 (03PS8) 10Ori.livneh: add varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214147 
[16:12:24] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] add varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214147 (owner: 10Ori.livneh)
[16:15:29] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: role::puppet::server::labs Remove unused configuration [puppet] - 10https://gerrit.wikimedia.org/r/214637 
[16:15:31] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: role::puppet::server::labs clean up allow_from [puppet] - 10https://gerrit.wikimedia.org/r/214638 
[16:15:33] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: lint: fully qualify puppetmaster [puppet] - 10https://gerrit.wikimedia.org/r/214639 
[16:15:35] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: Move certmanager hostname configuration to hiera [puppet] - 10https://gerrit.wikimedia.org/r/214640 
[16:15:37] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: Rename role::puppet::server::labs [puppet] - 10https://gerrit.wikimedia.org/r/214641 
[16:15:39] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: Rename role::puppet::self to role::puppetmaster::self [puppet] - 10https://gerrit.wikimedia.org/r/214642 
[16:18:43] <grrrit-wm>	 (03PS1) 10Rush: dumps: create misc dir under other [puppet] - 10https://gerrit.wikimedia.org/r/214643 
[16:20:48] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 032] dumps: create misc dir under other [puppet] - 10https://gerrit.wikimedia.org/r/214643 (owner: 10Rush)
[16:25:41] <wikibugs>	 6operations, 7Wikimedia-log-errors: mw1150 spams "memcached error for key" since May 29 3:00am UTC - https://phabricator.wikimedia.org/T100780#1321181 (10bd808) I'm going to work on getting log event rates into graphite with the hope of using that to set some general "go look at logstash" alerts (T100735) for...
[16:29:12] <jynus>	 ss -t -a dst "*:mysql" | wc -l -> 24654
[16:29:16] <jynus>	 cat /proc/sys/net/ipv4/ip_local_port_range -> 32768	61000
[16:29:19] <jynus>	 mmmm
[16:30:50] <jynus>	 echo 61000-32768 | bc -> 28232
[16:37:23] <icinga-wm>	 PROBLEM - puppet last run on oxygen is CRITICAL Puppet has 1 failures
[16:38:59] <jynus>	 netstat -s -> 1723817 connections reset due to early user close
[16:39:11] <grrrit-wm>	 (03PS1) 10Ori.livneh: Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 
[16:39:36] <ori>	 ^ ottomata, godog
[16:39:52] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 (owner: 10Ori.livneh)
[16:40:16] <grrrit-wm>	 (03PS2) 10Ori.livneh: Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 
[16:40:21] <bblack>	 why does one host have 24K connections outbound to mysql? :)
[16:40:46] <godog>	 I think we're supposed to play sad_trombone.wav when jenkins -1s
[16:40:56] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 (owner: 10Ori.livneh)
[16:42:37] <jynus>	 bblack, ss -t -a dst "*:mysql" | grep TIME-WAIT | wc -l -> 22154
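The figures jynus quotes can be put side by side with a little arithmetic. The following is a minimal Python sketch, assuming only the numbers pasted in the log (port range 32768 to 61000, 24654 sockets towards mysql, 22154 of them in TIME-WAIT); the function name `port_headroom` is illustrative and does not come from the conversation. It follows the same `high - low` subtraction as the `bc` line, so an inclusive count of the range would be one higher.

```python
# Figures as quoted in the log; nothing here is measured independently.
PORT_RANGE_LOW = 32768      # first value of ip_local_port_range
PORT_RANGE_HIGH = 61000     # second value of ip_local_port_range
MYSQL_SOCKETS = 24654       # ss -t -a dst "*:mysql" | wc -l
TIME_WAIT_SOCKETS = 22154   # same command, filtered on TIME-WAIT


def port_headroom(low, high, in_use):
    """Return (available, remaining, fraction_used) for a local port range.

    Uses high - low, matching the bc calculation in the log.
    """
    available = high - low
    remaining = available - in_use
    return available, remaining, in_use / available


available, remaining, used = port_headroom(
    PORT_RANGE_LOW, PORT_RANGE_HIGH, MYSQL_SOCKETS
)
time_wait_share = TIME_WAIT_SOCKETS / MYSQL_SOCKETS

print(f"ports in range:       {available}")
print(f"ports left over:      {remaining}")
print(f"share of range used:  {used:.1%}")
print(f"share in TIME-WAIT:   {time_wait_share:.1%}")
```

The comparison only shows how close the socket count sits to the size of the range; whether the range is actually the limiting factor is the open question in the discussion that follows.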
[16:42:47] <grrrit-wm>	 (03PS3) 10Ori.livneh: Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 
[16:43:33] <jynus>	 maybe we are wrong for T98489 ?
[16:43:46] <ori>	 godog: i wrapped the new config with "if $::hostname == 'cp1048'", so that if there is some annoying puppet bug it doesn't cause puppet failures across all varnishes
[16:44:06] <ori>	 it's currently impossible to test this on labs, the labs varnishes are really behind and some recent puppet patches don't apply there
[16:44:11] <jynus>	 it is not 1 host, it is ALL hosts
[16:44:18] <ori>	 like _joe_ / bblack's backend retry thing
[16:44:38] <jynus>	 well, all mw* I mean
[16:44:49] <ori>	 jynus: is this new?
[16:45:20] <ottomata>	 ori:  why do you need a local statsd?
[16:45:31] <ori>	 ottomata: i don't
[16:45:34] <ottomata>	 oh i see, you don't
[16:45:34] <ottomata>	 worry
[16:45:36] <ottomata>	 sorry
[16:45:39] <ottomata>	 you are just reusing the class
[16:45:39] <ori>	 ori
[16:45:39] <ottomata>	 hmm
[16:45:52] <jynus>	 ori, I am digging right now, cannot say- it is not a problem, I am trying to debug the mentioned ticket
[16:46:23] <jynus>	 but running out of local ports could be a clue?
[16:46:28] <godog>	 ori: yup I'll take a look in '15
[16:46:39] <bblack>	 if it's all systems, then local_port_range isn't an issue
[16:46:58] <jynus>	 well, all systems are creating the same error
[16:47:05] <bblack>	 ok
[16:47:25] <jynus>	 logstash, filter by dberror
[16:47:30] <bblack>	 what I mean is, how on earth can the processes on one system need 24K outbound mysql connections
[16:47:40] <jynus>	 they are closing connections
[16:48:04] <jynus>	 even if it is not the reason, we should reduce time_wait time or disable it
[16:48:31] <bblack>	 disabling time_wait or reducing it drastically is usually a bad idea, usually there's a better way to fix the issue
[16:48:38] <ori>	 yeah, i'm looking, so is bd808
[16:48:58] <bd808>	 the good logstash dashboard for this is -- https://logstash.wikimedia.org/#/dashboard/elasticsearch/wfLogDBError
[16:49:29] <bd808>	 the errors seem to be spread across mw hosts and db hosts pretty evenly
[16:49:30] <bblack>	 that timeout change should have reduced the number of disconnect/reconnects, not increased them, but yeah who knows...
[16:50:06] <jynus>	 I see no change
[16:50:12] <bd808>	 the nominal error rate is the same for the last 24 hours (minus normal traffic gradient)
[16:50:13] <icinga-wm>	 PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx
[16:50:13] <icinga-wm>	 PROBLEM - check_payments_wiki on payments1001 is CRITICAL: Connection refused
[16:50:13] <icinga-wm>	 PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures
[16:50:21] <jynus>	 so maybe client-related, not server-related
[16:50:30] <jynus>	 bd808, I agree
[16:50:44] <jynus>	 I was looking for an alternative explanation
[16:51:28] <jynus>	 it is know to be worse on higher load, which could explain it (but I have no proof)
[16:51:37] <jynus>	 *known
[16:51:41] <_joe_>	 bd808: can you look at the canaries?
[16:51:48] <_joe_>	 are they also having the same error?
[16:52:18] <jynus>	 oh, so it has not been applied system-wide?
[16:52:26] <_joe_>	 I think so, yes
[16:52:27] <bd808>	 _joe_: I think we have the same config everywhere now
[16:52:30] <_joe_>	 ok
[16:53:00] <bd808>	 https://gerrit.wikimedia.org/r/#/c/214295/ is merged
[16:53:50] <ori>	 it's not uniform: https://dpaste.de/mneu/raw
[16:54:02] <_joe_>	 so, looking at the hhvm code I wasn't so sure that hhvm.mysql.connect_timeout would solve something
[16:54:32] <icinga-wm>	 RECOVERY - puppet last run on oxygen is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:54:33] <jynus>	 ori, that is nice- is there any correlation
[16:54:49] <_joe_>	 ori: try to correlate that with rows/racks maybe :)
[16:54:58] <jynus>	 ha ha
[16:55:01] <paravoid>	 I've done this before
[16:55:03] <_joe_>	 I have friends waiting for me at the pub
[16:55:12] <icinga-wm>	 PROBLEM - check_puppetrun on barium is CRITICAL puppet fail
[16:55:12] <icinga-wm>	 PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx
[16:55:12] <icinga-wm>	 PROBLEM - check_payments_wiki on payments1001 is CRITICAL: Connection refused
[16:55:12] <icinga-wm>	 PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures
[16:55:27] <paravoid>	 we used to have DB connections errors when row D had just two uplinks
[16:55:29] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 (owner: 10Ori.livneh)
[16:55:50] <paravoid>	 correlating this to rack/row wasn't trivial, but doable and useful
[16:56:01] <_joe_>	 paravoid: the errors seem to concentrate on the newer appservers
[16:56:02] <jynus>	 so it was ip level error?
[16:56:17] <paravoid>	 _joe_: are they weighted the same?
[16:56:21] <_joe_>	 but as just stated, I'm bolting out :)
[16:56:38] <_joe_>	 paravoid: not all of the servers, no
[16:57:07] <_joe_>	 paravoid: the one listed here, some are API, some are normal appservers I guess
[16:57:13] <paravoid>	 jynus: well at the time it was related, although that wasn't properly explainable either (the packet loss was tiny, shouldn't have affected php->mysql all that much)
[16:57:14] <jynus>	 I would expect more http connections -> more errors, but that would be very normal
[16:57:19] <_joe_>	 and they're all from the group with higher weight
[16:57:46] <_joe_>	 jynus: I suppose it's a superposition of the two, actually
[16:57:53] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] Provision varnishstatsd [puppet] - 10https://gerrit.wikimedia.org/r/214645 (owner: 10Ori.livneh)
[16:58:07] <paravoid>	 try counting the cross-row traffic
[16:58:07] <_joe_>	 but well, see you on monday!
[16:58:17] <ori>	 take care _joe_
[16:58:25] <ori>	 have a good weekend
[16:58:26] <jynus>	 what we know now is that it is not an application level error on the server side, connections fail at tcp level
[16:58:30] <paravoid>	 ori: you really should move some of your bash aliases to a global space
[16:58:33] <_joe_>	 you too :)
[16:58:42] <jynus>	 same, _joe_ 
[16:59:00] <paravoid>	 ori: some of them are awesome :)
[16:59:09] <ori>	 yeah could do
[16:59:36] <jynus>	 netstat -s is meaningless for me, but probably because the error rate is low
[17:00:12] <icinga-wm>	 PROBLEM - check_puppetrun on barium is CRITICAL puppet fail
[17:00:12] <icinga-wm>	 PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx
[17:00:13] <icinga-wm>	 PROBLEM - check_payments_wiki on payments1001 is CRITICAL: Connection refused
[17:00:13] <icinga-wm>	 PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures
[17:00:29] <jynus>	 I suppose the way to go should be tcpdump and wait
[17:00:43] <bblack>	 all of the hosts in ori's histogram list are in rows C & D
[17:00:55] <paravoid>	 it's more complicated than that
[17:00:59] <bblack>	 but I don't know what total set of hosts that comes from, in that sense.  maybe all possible affected are there
[17:01:02] <paravoid>	 you have to account for where the dbs are
[17:01:34] <jynus>	 maybe crunching all combinations of clients and hosts
[17:01:49] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "@Filippo, I thought about that approach too. Then I realized that both would have the exact same case check:" [puppet] - 10https://gerrit.wikimedia.org/r/214377 (owner: 10Alexandros Kosiaris)
[17:02:09] <ori>	 godog: applied on cp1048, metrics should show up in graphite shortly
[17:02:16] <godog>	 ori: cool!
[17:02:43] <ori>	 gah
[17:02:51] <ori>	 the metric key prefix is not being honored
[17:02:54] * ori fixes
[17:03:06] <jynus>	 I will do some statistics when I have time, which probably will not be soon :-)
[17:03:39] <grrrit-wm>	 (03PS1) 10BBlack: Revert "new facts for canonical ipv4 addr/interface" [puppet] - 10https://gerrit.wikimedia.org/r/214648 
[17:03:43] <grrrit-wm>	 (03PS2) 10BBlack: Revert "new facts for canonical ipv4 addr/interface" [puppet] - 10https://gerrit.wikimedia.org/r/214648 
[17:03:49] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] Revert "new facts for canonical ipv4 addr/interface" [puppet] - 10https://gerrit.wikimedia.org/r/214648 (owner: 10BBlack)
[17:03:51] <godog>	 ori: which prefix?
[17:04:03] <ori>	 godog: it should include the dc name
[17:04:14] <ori>	 but it's not, because the systemd file is not passing the argument to the service
[17:04:20] <ori>	 so it's defaulting to the python script's internal default
[17:04:42] <ori>	 'varnish.backends.XXX'
[17:04:45] <godog>	 ah ok
[17:04:51] <godog>	 looks good otherwise
[17:05:12] <icinga-wm>	 PROBLEM - check_puppetrun on barium is CRITICAL puppet fail
[17:05:13] <icinga-wm>	 PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx
[17:05:13] <icinga-wm>	 PROBLEM - check_payments_wiki on payments1001 is CRITICAL: Connection refused
[17:05:13] <icinga-wm>	 PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures
[17:05:31] <godog>	 17:05:29.051947 IP cp1048.eqiad.wmnet.33463 > graphite1001.eqiad.wmnet.8125: UDP, length 1397
[17:08:04] <grrrit-wm>	 (03PS1) 10Ori.livneh: varnishstatsd: honor key prefix parameter [puppet] - 10https://gerrit.wikimedia.org/r/214650 
[17:08:08] <ori>	 ^ godog
[17:10:12] <icinga-wm>	 PROBLEM - check_puppetrun on barium is CRITICAL puppet fail
[17:10:13] <icinga-wm>	 PROBLEM - check_nginx on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name nginx
[17:10:13] <icinga-wm>	 PROBLEM - check_payments_wiki on payments1001 is CRITICAL: Connection refused
[17:10:13] <icinga-wm>	 PROBLEM - check_puppetrun on payments1001 is CRITICAL Puppet has 1 failures
[17:10:47] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] varnishstatsd: honor key prefix parameter [puppet] - 10https://gerrit.wikimedia.org/r/214650 (owner: 10Ori.livneh)
[17:11:02] <ori>	 thanks :)
[17:11:06] <grrrit-wm>	 (03PS2) 10Ori.livneh: varnishstatsd: honor key prefix parameter [puppet] - 10https://gerrit.wikimedia.org/r/214650 
[17:11:15] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] varnishstatsd: honor key prefix parameter [puppet] - 10https://gerrit.wikimedia.org/r/214650 (owner: 10Ori.livneh)
[17:11:57] <godog>	 ori: I'll wipe the old hierarchy
[17:12:21] <ori>	 godog: thanks. don't forget graphite2001
[17:12:32] <godog>	 yup, that too
[17:14:09] <ori>	 much appreciated
[17:15:12] <icinga-wm>	 PROBLEM - check_puppetrun on barium is CRITICAL puppet fail
[17:15:12] <icinga-wm>	 RECOVERY - check_nginx on payments1001 is OK: PROCS OK: 49 processes with command name nginx
[17:15:13] <icinga-wm>	 RECOVERY - check_payments_wiki on payments1001 is OK: HTTP OK: HTTP/1.1 200 OK - 249 bytes in 0.029 second response time
[17:15:13] <icinga-wm>	 RECOVERY - check_puppetrun on payments1001 is OK Puppet is currently enabled, last run 177 seconds ago with 0 failures
[17:15:21] <grrrit-wm>	 (03PS1) 10Ori.livneh: Apply ::varnish::logging::statsd on all varnishes, not just cp1048 [puppet] - 10https://gerrit.wikimedia.org/r/214651 
[17:15:49] <robh>	 mutante: ^ seems like it may be clearing, partially at least
[17:15:53] <godog>	 ori: /var/lib/carbon/whisper/_varnish/eqiad/backends_/
[17:16:20] <ori>	 godog: is that after the change or before as well?
[17:16:57] <mutante>	 robh: that was me fixing it
[17:17:02] <mutante>	 well manually
[17:17:05] <ori>	 after, looks like
[17:17:07] <mutante>	 still have to check if it's in puppet
[17:17:12] <ori>	 doh
[17:17:21] <robh>	 heh
[17:18:17] <godog>	 ori: sad_trombone.wav
[17:18:51] <mutante>	 !log fix client_max_body_size syntax error in nginx config of payments1001
[17:18:58] <morebots>	 Logged the message, Master
[17:19:39] <ori>	 godog: does systemd treat quoting differently than bash?
[17:19:47] <ori>	 i think it does
[17:19:54] <ori>	 the " are being taken literally
[17:20:12] <icinga-wm>	 PROBLEM - check_puppetrun on barium is CRITICAL puppet fail
[17:20:13] <icinga-wm>	 PROBLEM - check_puppetrun on samarium is CRITICAL puppet fail
[17:21:02] <bd808>	 so what data are we wanting for the db problem? (mw host, db server) tuples?
[17:22:31] <godog>	 ori: yep I think that's it, quotes in ps
[17:24:16] <grrrit-wm>	 (03PS1) 10Ori.livneh: varnishstatsd: fix argument quoting in systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/214655 
[17:25:11] <jynus>	 bd808, not sure, maybe each combination of host creating the error and the ip of the mysql shown in the error message
[17:25:12] <icinga-wm>	 PROBLEM - check_puppetrun on barium is CRITICAL puppet fail
[17:25:12] <icinga-wm>	 PROBLEM - check_puppetrun on samarium is CRITICAL puppet fail
[17:27:06] <jynus>	 oh, no, I see there is a specific field for db_server
[17:28:19] <bd808>	 I'll try to use my new logstash cli tool to make something useful -- https://github.com/bd808/ggml
[17:29:08] <ori>	 bd808: can you puppetize that? it looks useful as hell
[17:29:26] <jynus>	 yep, looks fancy
[17:29:55] <bd808>	 I was going to try and figure out how to make a golang deb but haven't gotten to it
[17:30:12] <icinga-wm>	 PROBLEM - check_puppetrun on barium is CRITICAL puppet fail
[17:30:12] <icinga-wm>	 PROBLEM - check_puppetrun on samarium is CRITICAL puppet fail
[17:31:16] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] varnishstatsd: fix argument quoting in systemd unit file [puppet] - 10https://gerrit.wikimedia.org/r/214655 (owner: 10Ori.livneh)
[17:31:43] <godog>	 ori: merged, looks good
[17:32:19] <grrrit-wm>	 (03PS2) 10Ori.livneh: Apply ::varnish::logging::statsd on all varnishes, not just cp1048 [puppet] - 10https://gerrit.wikimedia.org/r/214651 
[17:34:39] <ori>	 godog: last step ^
[17:35:05] <godog>	 ori: I think we can also wait until monday for full rollout, it isn't a lot of metrics but potentially >10k/s (metrics, fewer packets of course)
[17:35:12] <icinga-wm>	 PROBLEM - check_puppetrun on barium is CRITICAL puppet fail
[17:35:12] <icinga-wm>	 PROBLEM - check_puppetrun on samarium is CRITICAL puppet fail
[17:36:09] <mutante>	 ^ those are an additional issue in FR that Jeff is on now
[17:36:17] <ori>	 godog: I am impatient as usual but I am OK with waiting
[17:36:54] <ori>	 that also gives me a chance to play with the data from cp1048 a little over the weekend and tweak the script if i need to
[17:37:19] <godog>	 ori: hehe just to avoid sabotaging ourselves
[17:37:44] <ori>	 godog: monday is april fools, tho :P
[17:37:58] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 04-1] "On hold until Monday, April 1." [puppet] - 10https://gerrit.wikimedia.org/r/214651 (owner: 10Ori.livneh)
[17:38:32] <mutante>	 that will take a while
[17:38:35] <mutante>	 it's already almost June
[17:38:44] <ori>	 what the fuck
[17:38:47] <ori>	 i am insane
[17:38:50] <godog>	 ori: yeah the april's fool will be "graphite doesn't suck"
[17:38:53] <ori>	 please /clear right now
[17:39:01] <ori>	 and forget the last 10 lines or so ever happened
[17:39:07] <ori>	 june 1st, not april 1st
[17:39:26] <godog>	 haha
[17:39:41] <jynus>	 brain fart
[17:39:47] <grrrit-wm>	 (03CR) 10Ori.livneh: "That's June 1st." [puppet] - 10https://gerrit.wikimedia.org/r/214651 (owner: 10Ori.livneh)
[17:40:11] <jynus>	 or maybe he meant April 1, next year, you slacker!
[17:40:12] <icinga-wm>	 RECOVERY - check_puppetrun on barium is OK Puppet is currently enabled, last run 152 seconds ago with 0 failures
[17:40:12] <icinga-wm>	 RECOVERY - check_puppetrun on samarium is OK Puppet is currently enabled, last run 218 seconds ago with 0 failures
[17:40:19] <godog>	 ori: very nice how little cpu varnishstatsd uses
[17:40:39] <ori>	 godog: varnishapi.so does all the heavy lifting
[17:41:17] <godog>	 hehe yeah I guess as long as the interpreter doesn't see any of that it isn't a big deal
[17:41:30] <paravoid>	 hahaha ori
[17:41:41] <ori>	 I SAID /clear DAMN IT
[17:42:30] <ori>	 the only thing that's weird is that i would have expected other http methods to show up by now
[17:42:36] <ori>	 oh nevermind
[17:42:39] <ori>	 i guess it's an upload varnish
[17:42:45] <ori>	 not too many POSTs
[17:42:57] <paravoid>	 none legitimate ones
[17:43:10] <ori>	 and PURGEs don't result in a backend connection, and hence aren't measured
[17:43:35] <bd808>	 here's the raw host -> db server pairs for the last hour -- https://phabricator.wikimedia.org/P702
[17:44:28] <paravoid>	 bd808: make that | uniq -c, convert mw* hostnames to IPs, drop the last byte on both src and dst IP
[17:44:48] <ori>	 godog: in that case, would you mind if i applied it on one additional varnish, a text varnish?
[17:44:53] <bd808>	 ah. vlan to vlan
[17:44:59] <paravoid>	 yup!
[17:45:13] <bd808>	 I think I can do that
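A minimal sketch of the aggregation paravoid describes, assuming the mw* hostnames have already been resolved to IPv4 addresses: drop the last octet of each source and destination address, then count the resulting subnet pairs the way `uniq -c` would. The function names and sample addresses are hypothetical and are not taken from P702.

```python
from collections import Counter


def subnet_prefix(ip):
    """Drop the last octet of a dotted-quad IPv4 address."""
    return ip.rsplit(".", 1)[0]


def count_subnet_pairs(pairs):
    """Count (source subnet, destination subnet) pairs."""
    return Counter(
        (subnet_prefix(src), subnet_prefix(dst)) for src, dst in pairs
    )


# Hypothetical (mw host IP, db server IP) pairs, for illustration only.
sample_pairs = [
    ("10.64.32.10", "10.64.32.20"),
    ("10.64.32.11", "10.64.32.21"),
    ("10.64.0.5", "10.64.16.7"),
]

subnet_counts = count_subnet_pairs(sample_pairs)
for (src, dst), n in subnet_counts.most_common():
    print(n, src, "->", dst)
```

Raw counts like these are skewed when hosts are not spread evenly across subnets, so they would still need normalizing by the number of hosts on each side.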
[17:45:16] <godog>	 ori: WFM
[17:46:30] <wikibugs>	 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1321314 (10chasemp) 5Open>3declined We decided to create a public dump of "safe" data instead  http://dumps.wikimedia.org/other/misc/phabricator_public.dump
[17:46:47] <wikibugs>	 6operations, 6Phabricator, 7database: Phabricator database access for Joel Aufrecht - https://phabricator.wikimedia.org/T99295#1321318 (10chasemp)
[17:47:52] <paravoid>	 cp1048 has a varnishkafka alert
[17:47:56] <paravoid>	 WARNING: 11.11% of data above the warning threshold [0.0] 
[17:47:57] <grrrit-wm>	 (03PS1) 10Ori.livneh: Apply varnishstatsd on cp1066 (text varnish) as well [puppet] - 10https://gerrit.wikimedia.org/r/214657 
[17:48:17] <ori>	 paravoid: hasn't that been flapping for some days now?
[17:48:26] <paravoid>	 s/days/months/ :)
[17:49:33] <ori>	 ganglia is super slow for some reason
[17:51:31] <bd808>	 paravoid: https://phabricator.wikimedia.org/P703
[17:52:40] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] Apply varnishstatsd on cp1066 (text varnish) as well [puppet] - 10https://gerrit.wikimedia.org/r/214657 (owner: 10Ori.livneh)
[17:54:33] <wikibugs>	 6operations, 7HHVM, 5Patch-For-Review: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1321350 (10bd808) Maybe we have some cross vlan communication issues:  {P703}
[17:57:37] <jynus>	 https://phabricator.wikimedia.org/P704
[17:57:46] <paravoid>	 bd808: the servers aren't balanced per row (neither mw* nor db*) so you should account for that as well :/
[17:58:22] <bd808>	 hmmm... count of uniq ips per subnet is needed then?
[17:58:34] <bd808>	 ugly
[17:58:58] <paravoid>	 450 for same 10.64.32 -> 10.64.32, in any case
[18:00:31] <bd808>	 is there a better way to count the hosts than trying to grok the regex patterns in site.pp?
[18:02:43] <jynus>	 but I get 2323 combinations out of 4994 possible 
[18:02:51] <jynus>	 bd808, salt, did it before
[18:03:29] <jynus>	 but not all db* hosts are production hosts
[18:11:15] <wikibugs>	 6operations, 6Commons, 10MediaWiki-Database, 6Multimedia, 7Wikimedia-log-errors: internal_api_error_DBQueryError: Database query error while (mass) deleting file over api - https://phabricator.wikimedia.org/T98706#1321408 (10Umherirrender)
[18:11:51] <mutante>	 which role installs the apache on the puppetmaster?
[18:12:08] <mutante>	 not puppetmaster::frontend?
[18:13:02] <jynus>	 I have a great correlation
[18:13:08] <jynus>	 using tendril
[18:13:30] <jynus>	 top db offenders are also top QPS
[18:14:51] <jynus>	 the opposite is also true
[18:16:00] <JohnFLewis>	 mutante: puppetmaster::passenger looks like it installs apache
[18:17:34] <grrrit-wm>	 (03PS1) 10RobH: blog.wikimedia.org sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214659 
[18:17:49] <mutante>	 JohnFLewis: :) ah, tx
[18:18:13] <godog>	 jynus: cool! not sure I got the last part, what's opposite?
[18:18:37] <bd808>	 low qps -> low error rate?
[18:19:56] <jynus>	 less QPS -> less errors
[18:20:00] <grrrit-wm>	 (03CR) 10RobH: [C: 032] blog.wikimedia.org sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214659 (owner: 10RobH)
[18:20:09] <mutante>	 JohnFLewis: alright, it does, and the site config is there, cool, just that the relevant config is in apache2.conf not sites-enabled/foo
[18:20:09] <jynus>	 fewer if you will
[18:24:16] <grrrit-wm>	 (03PS1) 10Ori.livneh: varnishstatsd: coerce BackendXID to unsigned [puppet] - 10https://gerrit.wikimedia.org/r/214660 
[18:24:22] <ori>	 paravoid: ^ this may amuse you
[18:24:46] <paravoid>	 jynus: well that makes sense doesn't it?
[18:25:28] <paravoid>	 the ratio makes more sense as a metric
[18:25:36] <paravoid>	 if you're looking from a db perspective
[18:25:39] <grrrit-wm>	 (03PS2) 10Ori.livneh: varnishstatsd: coerce BackendXID to unsigned [puppet] - 10https://gerrit.wikimedia.org/r/214660 
[18:25:57] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] varnishstatsd: coerce BackendXID to unsigned [puppet] - 10https://gerrit.wikimedia.org/r/214660 (owner: 10Ori.livneh)
[18:27:03] <jynus>	 paravoid, agreed
[18:27:06] <ori>	 ottomata: ^ be aware, in case this bites you
[18:27:10] <ori>	 (see commit message)
[18:27:20] <paravoid>	 ori: wth really
[18:27:33] <paravoid>	 "Work around that by coercing BackendXID to a signed integer." -- s/signed/unsigned/?
[18:27:43] <ori>	 right
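A sketch of what coercing a signed value to unsigned can look like in Python, assuming the ID is a 32-bit quantity that shows up as a negative number once it passes the signed range. The 32-bit width and the function name are assumptions for illustration; the actual fix is whatever https://gerrit.wikimedia.org/r/214660 contains.

```python
def to_unsigned32(value):
    """Reinterpret a (possibly negative) 32-bit integer as unsigned.

    Assumes a 32-bit width; masking keeps the low 32 bits, which maps
    negative two's-complement values onto 2**31 .. 2**32 - 1.
    """
    return value & 0xFFFFFFFF


for xid in (5, -1, -2147483648):
    print(xid, "->", to_unsigned32(xid))
```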
[18:28:31] <grrrit-wm>	 (03CR) 10Ori.livneh: "(unsigned, that is)" [puppet] - 10https://gerrit.wikimedia.org/r/214660 (owner: 10Ori.livneh)
[18:28:47] <jynus>	 problem with ratio is that we do not have real time data
[18:30:04] <icinga-wm>	 PROBLEM - puppet last run on db1059 is CRITICAL Puppet has 1 failures
[18:31:20] <jynus>	 ^not really
[18:31:44] <icinga-wm>	 RECOVERY - puppet last run on db1059 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[18:32:59] <wikibugs>	 6operations, 7network: asw2-a5-eqiad.mgmt.eqiad.wmnet xe-0/0/36 reporting errors - https://phabricator.wikimedia.org/T100820#1321485 (10fgiunchedi)
[18:37:37] <grrrit-wm>	 (03PS1) 10RobH: wikitech.wikimeida.org certificate sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214666 (https://phabricator.wikimedia.org/T92709) 
[18:38:21] <wikibugs>	 6operations, 10wikitech.wikimedia.org, 7HTTPS, 5Patch-For-Review: wikitech.wikimedia.org SSL certificate considered "outdated security" in Chrome - https://phabricator.wikimedia.org/T92709#1321521 (10RobH) once the above patchset is merged live and wikitech is using the sha256, please assign this task to m...
[18:38:24] <wikibugs>	 6operations, 6Labs, 7database: Santitize recent wikis: wikimania 2016 and cn.wikimedia.org at labs dbs - https://phabricator.wikimedia.org/T100441#1321522 (10Krenair)
[18:39:02] <icinga-wm>	 PROBLEM - puppet last run on sca1001 is CRITICAL puppet fail
[18:39:30] <grrrit-wm>	 (03CR) 10RobH: [C: 04-1] "still needs the config changes for wikitech done in the SAME patchset or it will break things." [puppet] - 10https://gerrit.wikimedia.org/r/214666 (https://phabricator.wikimedia.org/T92709) (owner: 10RobH)
[18:44:07] <paravoid>	 jynus: dbstore1002 error
[18:44:15] <paravoid>	 bblack: plenty of ocsp warnings
[18:44:30] <paravoid>	 jynus: error = replag
[18:44:56] <wikibugs>	 6operations: meta ticket - open up RT permissions on domains queue - https://phabricator.wikimedia.org/T83378#1321559 (10Dzahn)
[18:45:05] <wikibugs>	 6operations, 10Wikimedia-Mailing-lists: mailman emails taking long time for delivery, getting stuck in sodium - https://phabricator.wikimedia.org/T61731#1321562 (10chasemp) 5Open>3declined a:3chasemp Since this is nonspecific and a few months since anyone was bitten enough to update I am closing for now....
[18:45:10] <wikibugs>	 6operations: meta ticket - open up RT permissions on domains queue - https://phabricator.wikimedia.org/T83378#1321565 (10Dzahn) 5Invalid>3Resolved
[18:46:55] <wikibugs>	 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321574 (10RobH)
[18:47:25] <wikibugs>	 6operations, 7database: Drop database table "optin_survey" from Wikimedia wikis - https://phabricator.wikimedia.org/T54934#1321576 (10chasemp)
[18:47:32] <wikibugs>	 6operations, 7database: Drop *_old database tables from Wikimedia wikis - https://phabricator.wikimedia.org/T54932#1321578 (10chasemp)
[18:47:43] <wikibugs>	 6operations, 7database: Drop database table "namespaces" from Wikimedia wikis - https://phabricator.wikimedia.org/T54929#1321580 (10chasemp)
[18:47:55] <wikibugs>	 6operations, 7database: Drop database table "hashs" from Wikimedia wikis - https://phabricator.wikimedia.org/T54927#1321582 (10chasemp)
[18:48:30] <wikibugs>	 6operations, 7HTTPS: replace ganglia's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100825#1321584 (10RobH) 3NEW
[18:49:33] <grrrit-wm>	 (03PS1) 10RobH: ganglia's sha1 cert to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214670 (https://phabricator.wikimedia.org/T100825) 
[18:49:35] <wikibugs>	 6operations, 6Labs, 7Tracking, 7database: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1321591 (10chasemp)
[18:49:52] <grrrit-wm>	 (03CR) 10RobH: [C: 04-1] "please note this initial patchset is ONLY the certificate change, NOT" [puppet] - 10https://gerrit.wikimedia.org/r/214670 (https://phabricator.wikimedia.org/T100825) (owner: 10RobH)
[18:51:24] <wikibugs>	 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321608 (10RobH)
[18:57:34] <icinga-wm>	 RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:59:11] <wikibugs>	 6operations, 7HTTPS, 5Patch-For-Review: replace ganglia's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100825#1321619 (10RobH)
[18:59:19] <wikibugs>	 6operations, 7HTTPS, 5Patch-For-Review: replace ganglia's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100825#1321584 (10RobH)
[19:01:27] <wikibugs>	 6operations, 7HTTPS: replace git's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100827#1321624 (10RobH) 3NEW a:3RobH
[19:05:02] <icinga-wm>	 PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail
[19:06:02] <grrrit-wm>	 (03PS1) 10RobH: git.wikimedia.org.crt sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214673 (https://phabricator.wikimedia.org/T100827) 
[19:07:36] <wikibugs>	 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321642 (10RobH)
[19:13:05] <wikibugs>	 6operations, 7HTTPS: replace icinga's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100830#1321659 (10RobH) 3NEW
[19:13:35] <grrrit-wm>	 (03PS1) 10RobH: icinga.wikimedia.org cert sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214674 (https://phabricator.wikimedia.org/T100830) 
[19:14:47] <grrrit-wm>	 (03CR) 10RobH: [C: 04-1] "DO NOT MERGE THIS PATCHSET without updating it with the associated" [puppet] - 10https://gerrit.wikimedia.org/r/214674 (https://phabricator.wikimedia.org/T100830) (owner: 10RobH)
[19:16:47] <wikibugs>	 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321670 (10RobH)
[19:17:23] <bblack>	 paravoid: possibly our webproxy?
[19:18:11] <wikibugs>	 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#753287 (10RobH)
[19:22:23] <bblack>	 paravoid: no, it's cloudflare/globalsign
[19:22:31] <bblack>	 it's not complete, just higher failure rate than normal
[19:22:40] <wikibugs>	 6operations, 7HTTPS: replace librenms's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100831#1321685 (10RobH) 3NEW
[19:22:59] <grrrit-wm>	 (03PS1) 10RobH: replace librenms's sha1 cert with sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214676 (https://phabricator.wikimedia.org/T100831) 
[19:23:30] <grrrit-wm>	 (03CR) 10RobH: [C: 04-1] "DO NOT MERGE UNLESS YOU APPEND CONFIGURATION CHANGES" [puppet] - 10https://gerrit.wikimedia.org/r/214676 (https://phabricator.wikimedia.org/T100831) (owner: 10RobH)
[19:23:43] <icinga-wm>	 RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[19:24:45] <bblack>	 the OCSP warnings are at the 4h-old mark, CRIT at the 8h-old mark, and then in the common case the data becomes too old at the 12h mark for actual client usage
[19:24:56] <bblack>	 (and it's retrying once every 2 hours)
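The thresholds bblack lists can be sketched as a small classifier. The function name and the returned labels are hypothetical; only the 4h, 8h and 12h marks come from the discussion, and the real check may treat the boundaries differently.

```python
WARN_HOURS = 4    # warning once the stapled data is 4h old
CRIT_HOURS = 8    # critical at 8h
STALE_HOURS = 12  # in the common case, too old for client use at 12h


def ocsp_status(age_hours):
    """Map the age of the OCSP data (in hours) to a status label."""
    if age_hours >= STALE_HOURS:
        return "UNUSABLE"
    if age_hours >= CRIT_HOURS:
        return "CRITICAL"
    if age_hours >= WARN_HOURS:
        return "WARNING"
    return "OK"


for age in (1, 4, 9, 12):
    print(age, ocsp_status(age))
```

With a retry every 2 hours, a WARNING at the 4h mark means roughly two consecutive fetches have already failed.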
[19:28:13] <wikibugs>	 6operations, 7HTTPS: replace lists.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100832#1321709 (10RobH) 3NEW
[19:28:33] <grrrit-wm>	 (03PS1) 10RobH: lists.wikimedia.org certificate sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214680 (https://phabricator.wikimedia.org/T100832) 
[19:29:31] <grrrit-wm>	 (03CR) 10RobH: [C: 04-1] "DO NOT MERGE until the configuration changes are appended to this" [puppet] - 10https://gerrit.wikimedia.org/r/214680 (https://phabricator.wikimedia.org/T100832) (owner: 10RobH)
[19:29:54] <JohnFLewis>	 robh: may be a good thing to attack on Tuesday?
[19:30:10] <wikibugs>	 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321718 (10RobH)
[19:30:15] <robh>	 JohnFLewis: thats my plan yep
[19:30:23] <robh>	 since i'll already have the window
[19:31:01] <wikibugs>	 6operations, 7HTTPS, 5Patch-For-Review: replace lists.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100832#1321709 (10RobH)
[19:31:03] <wikibugs>	 6operations, 3Roadmap, 7notice: Mailing list maintenance window - 2015-06-02 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T100711#1321724 (10RobH)
[19:34:41] <grrrit-wm>	 (03PS2) 10Dzahn: wikitech.wikimeida.org certificate sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214666 (https://phabricator.wikimedia.org/T92709) (owner: 10RobH)
[19:35:16] <grrrit-wm>	 (03PS3) 10Dzahn: wikitech.wikimedia.org certificate sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214666 (https://phabricator.wikimedia.org/T92709) (owner: 10RobH)
[19:36:05] <wikibugs>	 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321732 (10RobH)
[19:37:34] <wikibugs>	 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#777907 (10RobH)
[19:38:52] <grrrit-wm>	 (03PS1) 10BBlack: double OCSP fetch rate to cope with upstream error rate [puppet] - 10https://gerrit.wikimedia.org/r/214685 
[19:39:56] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] double OCSP fetch rate to cope with upstream error rate [puppet] - 10https://gerrit.wikimedia.org/r/214685 (owner: 10BBlack)
[19:43:06] <wikibugs>	 6operations, 7HTTPS: replace tendril.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100835#1321758 (10RobH) 3NEW
[19:43:25] <grrrit-wm>	 (03PS1) 10RobH: replace tendril.wikimedia.org's sha1 cert with sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214692 (https://phabricator.wikimedia.org/T100835) 
[19:43:45] <grrrit-wm>	 (03CR) 10RobH: [C: 04-1] "DO NOT MERGE THIS PATCHSET until it has been appended with the" [puppet] - 10https://gerrit.wikimedia.org/r/214692 (https://phabricator.wikimedia.org/T100835) (owner: 10RobH)
[19:44:21] <wikibugs>	 6operations, 7HTTPS: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#1321770 (10RobH)
[19:50:30] <wikibugs>	 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6WMF-NDA: On labs, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1321787 (10hashar) 3NEW
[19:50:30] <grrrit-wm>	 (03PS14) 10BBlack: sslcert: generate chained certs automatically [puppet] - 10https://gerrit.wikimedia.org/r/197341 (owner: 10Faidon Liambotis)
[19:52:08] <grrrit-wm>	 (03PS2) 10RobH: replace tendril.wikimedia.org's sha1 cert with sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214692 (https://phabricator.wikimedia.org/T100835) 
[19:52:19] <grrrit-wm>	 (03PS1) 10John F. Lewis: pywikipedia->pywikibot in mailman [puppet] - 10https://gerrit.wikimedia.org/r/214694 (https://phabricator.wikimedia.org/T100707) 
[19:58:38] <grrrit-wm>	 (03PS2) 10Dzahn: icinga.wikimedia.org cert sha1 to sha256 [puppet] - 10https://gerrit.wikimedia.org/r/214674 (https://phabricator.wikimedia.org/T100830) (owner: 10RobH)
[20:04:33] <wikibugs>	 6operations: Use SSLCertificateChainFile in Gerrit Apache configuration - https://phabricator.wikimedia.org/T82319#1321824 (10Dzahn)
[20:08:23] <wikibugs>	 6operations, 7HTTPS: Ganglia server doesn't send intermediary certificates - https://phabricator.wikimedia.org/T72326#1321832 (10Dzahn) this should be fixed once we merge https://gerrit.wikimedia.org/r/197341
[20:08:40] <wikibugs>	 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6WMF-NDA: On labs, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1321833 (10BBlack) Note for tool labs, there's already some motd stuff in puppet like this:  https://github.com/wikimedia/o...
[20:14:43] <icinga-wm>	 PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[20:15:03] <icinga-wm>	 PROBLEM - puppet last run on labstore1001 is CRITICAL Puppet last ran 5 hours ago
[20:15:54] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: Remove ssh::hostkeys-collect from mira [puppet] - 10https://gerrit.wikimedia.org/r/214699 
[20:15:57] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] Remove ssh::hostkeys-collect from mira [puppet] - 10https://gerrit.wikimedia.org/r/214699 (owner: 10Faidon Liambotis)
[20:16:06] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [V: 032] Remove ssh::hostkeys-collect from mira [puppet] - 10https://gerrit.wikimedia.org/r/214699 (owner: 10Faidon Liambotis)
[20:16:43] <icinga-wm>	 RECOVERY - puppet last run on labstore1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:18:54] <wikibugs>	 6operations, 10Traffic, 7HTTPS: Make OCSP Stapling support more generic and robust - https://phabricator.wikimedia.org/T93927#1321852 (10Dzahn) p:5Normal>3High
[20:20:33] <icinga-wm>	 PROBLEM - puppet last run on mw2211 is CRITICAL puppet fail
[20:20:55] <hashar>	 the spike of 500 are database related
[20:20:55] <hashar>	 my english is crap
[20:21:07] <wikibugs>	 6operations, 6Labs, 7database: Santitize recent wikis: wikimania 2016 and cn.wikimedia.org at labs dbs - https://phabricator.wikimedia.org/T100441#1321864 (10scfc) @jcrespo: I don't know if you mean that, but the views in `enwiki_p` & Co. visible from Labs are maintained with `maintain-replicas/maintain-repl...
[20:21:10] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: strongswan: remove temporary hack for sysvinit [puppet] - 10https://gerrit.wikimedia.org/r/214701 
[20:22:35] <paravoid>	 bblack, jgage ^
[20:23:01] <wikibugs>	 6operations, 3Roadmap, 7notice, 7user-notice: Mailing list maintenance window - 2015-06-02 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T100711#1318965 (10gpaumier)
[20:24:48] <grrrit-wm>	 (03CR) 10Gage: [C: 031] strongswan: remove temporary hack for sysvinit [puppet] - 10https://gerrit.wikimedia.org/r/214701 (owner: 10Faidon Liambotis)
[20:25:05] <bblack>	 jgage: are we on 5.2.1-6 or later?
[20:25:12] <icinga-wm>	 RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[20:27:02] <wikibugs>	 10Ops-Access-Requests, 6operations, 6Release-Engineering, 5Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1321934 (10hashar) p:5High>3Normal Bringing back to normal priority. No clue why I bumped it :-]  Task is stalled pending per RobH just above.
[20:28:40] <bblack>	 yeah we are apparently
[20:28:51] <grrrit-wm>	 (03CR) 10BBlack: [C: 031] strongswan: remove temporary hack for sysvinit [puppet] - 10https://gerrit.wikimedia.org/r/214701 (owner: 10Faidon Liambotis)
[20:29:30] <bblack>	 dbstore1002
[20:29:30] <bblack>	 MariaDB Slave Lag: s7
[20:29:32] <bblack>	 CRITICAL	2015-05-29 20:27:44	0d 19h 44m 2s	3/3	CRITICAL slave_sql_lag Seconds_Behind_Master: 210359
[20:29:36] <bblack>	 ^ ??
[20:30:19] <hashar>	 we had a bunch of mysql can't connect all day but I assumed it to be part of the regular spam
[20:37:22] <grrrit-wm>	 (03PS1) 10BBlack: allow robots to use RL on domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214705 
[20:39:03] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] allow robots to use RL on domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214705 (owner: 10BBlack)
[20:39:09] <grrrit-wm>	 (03PS1) 10Tim Landscheidt: Tools: Add database alias for wikimania2016wiki [puppet] - 10https://gerrit.wikimedia.org/r/214718 (https://phabricator.wikimedia.org/T96638) 
[20:39:22] <icinga-wm>	 RECOVERY - puppet last run on mw2211 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:40:26] <grrrit-wm>	 (03Merged) 10jenkins-bot: allow robots to use RL on domains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214705 (owner: 10BBlack)
[20:42:56] <wikibugs>	 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6WMF-NDA: On labs, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1321967 (10scfc)
[20:43:00] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 032] strongswan: remove temporary hack for sysvinit [puppet] - 10https://gerrit.wikimedia.org/r/214701 (owner: 10Faidon Liambotis)
[20:43:04] <logmsgbot>	 !log ori Synchronized robots.txt: I7b321b62d: allow robots to use RL on domains (duration: 00m 14s)
[20:43:08] <morebots>	 Logged the message, Master
[20:43:53] <wikibugs>	 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6WMF-NDA: On labs, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1321974 (10yuvipanda) Title should probably refer to beta cluster.
[20:46:14] <jgage>	 international house of pain
[20:46:20] <jgage>	 and mischan
[20:52:19] <mutante>	 jgage: jump around
[20:54:34] <jgage>	 nice
[20:57:16] <grrrit-wm>	 (03PS1) 10Ori.livneh: Log a 20s sample of memcached usage to a file once a day [puppet] - 10https://gerrit.wikimedia.org/r/214762 
[20:57:33] <ori>	 paravoid: (if you're still around) ^
[20:57:56] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Log a 20s sample of memcached usage to a file once a day [puppet] - 10https://gerrit.wikimedia.org/r/214762 (owner: 10Ori.livneh)
[20:59:00] <grrrit-wm>	 (03PS2) 10Ori.livneh: Log a 20s sample of memcached usage to a file once a day [puppet] - 10https://gerrit.wikimedia.org/r/214762 
[21:03:32] <grrrit-wm>	 (03PS3) 10Ori.livneh: Log a 20s sample of memcached usage to a file once a day [puppet] - 10https://gerrit.wikimedia.org/r/214762 
[21:05:03] <grrrit-wm>	 (03PS1) 10Odder: Add another IP address for Santiago edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214765 (https://phabricator.wikimedia.org/T100051) 
[21:05:09] <grrrit-wm>	 (03PS1) 10Yuvipanda: tools: Stop bigbrother attempting to look at webservice jobs [puppet] - 10https://gerrit.wikimedia.org/r/214766 
[21:07:36] <grrrit-wm>	 (03CR) 10BBlack: "Won't the webservice lines fall down to the final else for unsupported command error now? or are they gone from the input?" [puppet] - 10https://gerrit.wikimedia.org/r/214766 (owner: 10Yuvipanda)
[21:08:16] <grrrit-wm>	 (03CR) 10Yuvipanda: "bah, they will. Good catch :)" [puppet] - 10https://gerrit.wikimedia.org/r/214766 (owner: 10Yuvipanda)
[21:11:58] <grrrit-wm>	 (03PS2) 10Yuvipanda: tools: Stop bigbrother attempting to look at webservice jobs [puppet] - 10https://gerrit.wikimedia.org/r/214766 
[21:12:04] <YuviPanda>	 bblack: thanks :) 
[21:12:11] <YuviPanda>	 bblack: ^ should fix them maybe
[21:14:24] <grrrit-wm>	 (03CR) 10BBlack: [C: 031] tools: Stop bigbrother attempting to look at webservice jobs [puppet] - 10https://gerrit.wikimedia.org/r/214766 (owner: 10Yuvipanda)
[21:20:15] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] add "docs" as alias for "doc" in wm.org/mw.org [dns] - 10https://gerrit.wikimedia.org/r/214416 (https://phabricator.wikimedia.org/T100349) (owner: 10Dzahn)
[21:21:02] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] Add another IP address for Santiago edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214765 (https://phabricator.wikimedia.org/T100051) (owner: 10Odder)
[21:21:08] <grrrit-wm>	 (03Merged) 10jenkins-bot: Add another IP address for Santiago edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214765 (https://phabricator.wikimedia.org/T100051) (owner: 10Odder)
[21:21:45] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] redirect "docs" to doc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/214418 (https://phabricator.wikimedia.org/T100349) (owner: 10Dzahn)
[21:21:47] <logmsgbot>	 !log ori Synchronized wmf-config/throttle.php: Ife45684c5: Add another IP address for Santiago edit-a-thon (duration: 00m 13s)
[21:21:50] <morebots>	 Logged the message, Master
[21:31:32] <wikibugs>	 6operations, 6Phabricator, 7database: Add Story points (from Sprint Extension)  - https://phabricator.wikimedia.org/T100846#1322108 (10JAufrecht) 3NEW a:3chasemp
[21:36:23] <icinga-wm>	 PROBLEM - puppet last run on ms-be1018 is CRITICAL Puppet has 1 failures
[21:38:00] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1322133 (10Dzahn) a:3Dzahn
[21:38:24] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#939806 (10Dzahn) done !
[21:38:31] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Sanitise a Bugzilla database dump - https://phabricator.wikimedia.org/T85141#1322135 (10Dzahn) done !  http://dumps.wikimedia.org/other/bugzilla/
[21:38:46] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Bugzilla HTML static version and database dump - https://phabricator.wikimedia.org/T1198#1322143 (10Dzahn)
[21:39:18] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322146 (10Dzahn)
[21:39:22] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla: Bugzilla HTML static version and database dump - https://phabricator.wikimedia.org/T1198#1322144 (10Dzahn) 5Open>3Resolved both blocker tasks are resolved, so this should be resolved too
[21:39:34] <grrrit-wm>	 (03PS1) 10John F. Lewis: add bugzilla to dumps.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/214770 
[21:39:40] <JohnFLewis>	 mutante: ^^
[21:39:56] <JohnFLewis>	 puppet overrode what was on index.html :)
[21:40:37] <mutante>	 oh, the index page, very true. thanks
[21:40:52] <mutante>	 needs a <br /> ,, sec
[21:41:57] <grrrit-wm>	 (03PS2) 10Dzahn: add bugzilla to dumps.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/214770 (owner: 10John F. Lewis)
[21:42:31] <grrrit-wm>	 (03CR) 10Dzahn: [C: 032] add bugzilla to dumps.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/214770 (owner: 10John F. Lewis)
[21:44:36] <wikibugs>	 6operations, 6Phabricator, 7database: Add Story points (from Sprint Extension) - https://phabricator.wikimedia.org/T100846#1322165 (10chasemp) @mmodell do you know where story points are stored?
[21:45:38] <grrrit-wm>	 (03PS1) 10Odder: Sysops to add users to import group on muiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) 
[21:47:19] <grrrit-wm>	 (03PS2) 10Odder: Sysops to add users to import group on maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) 
[21:47:44] <grrrit-wm>	 (03PS3) 10Alex Monk: Sysops to add users to import group on muiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) (owner: 10Odder)
[21:48:56] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322184 (10Dzahn)
[21:49:48] <grrrit-wm>	 (03PS4) 10Odder: Sysops to add users to import group on maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) 
[21:50:29] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1182562 (10Dzahn) T95266 - i don't know if anyone actually still wants to mail users who have not migrated so i did not close it as rejected, but i did remove it...
[21:50:59] <wikibugs>	 6operations, 10Wikimedia-Bugzilla: analyze Bugzilla access logs - https://phabricator.wikimedia.org/T86859#1322191 (10Dzahn) 5Open>3Resolved a:3Dzahn
[21:51:05] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322193 (10Dzahn)
[21:51:24] <icinga-wm>	 RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[21:55:42] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322202 (10Dzahn)
[21:58:16] <wikibugs>	 6operations, 3Roadmap, 7notice, 7user-notice: Mailing list maintenance window - 2015-06-02 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T100711#1322204 (10RobH)
[21:59:27] <wikibugs>	 6operations, 7HTTPS, 5Patch-For-Review: replace lists.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100832#1322209 (10RobH) a:3RobH I'll be handling the implementation of this during my mailing list maintenance window scheduled on T100711
[21:59:40] <wikibugs>	 6operations, 7HTTPS: replace lists.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100832#1322213 (10RobH)
[22:00:34] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322218 (10Dzahn)
[22:02:27] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322229 (10Dzahn) T95267 - removed as a blocker because the dump exists now which makes it possible to build one without needing old-bz
[22:02:45] <wikibugs>	 6operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1322231 (10RobH) So then it seems this is an ideal use case for a public IP based ganeti VM, correct?  (If so, we can create a request ticket per the instructions on:   https://wikitech.wikimedia.org/wiki/Operations_requests#Virtual...
[22:03:57] <wikibugs>	 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1322232 (10Dzahn) a:3Dzahn
[22:04:53] <wikibugs>	 6operations, 7HTTPS: replace lists.wikimedia.org's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100832#1322233 (10RobH) p:5High>3Normal
[22:05:06] <wikibugs>	 6operations, 5Patch-For-Review: Alias docs.wikimedia.org to doc.wikimedia.org - https://phabricator.wikimedia.org/T100349#1322234 (10Dzahn) resolved. these links are now redirects:  https://docs.wikimedia.org/ https://docs.mediawiki.org/
[22:05:18] <wikibugs>	 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1322235 (10RobH) p:5High>3Normal
[22:05:22] <wikibugs>	 6operations, 5Patch-For-Review: Alias docs.wikimedia.org to doc.wikimedia.org - https://phabricator.wikimedia.org/T100349#1322236 (10Dzahn) 5Open>3Resolved
[22:09:46] <grrrit-wm>	 (03PS1) 10Ori.livneh: people.wikimedia.org: HTTPS only [puppet] - 10https://gerrit.wikimedia.org/r/214773 
[22:09:51] <ori>	 ^ mutante
[22:10:52] <icinga-wm>	 PROBLEM - puppet last run on mw2066 is CRITICAL Puppet has 1 failures
[22:11:32] <mutante>	 ori: technically totally +1, but.. i kind of vaguely remember there was a reason for that
[22:12:00] <mutante>	 hmm.. there was a ticket that was about switching them all
[22:16:38] <bd808>	 I'm going to do an elasticsearch upgrade on the logstash cluster. there will probably be some icinga alerts as a result
[22:20:51] <hoo>	 bd808: Schedule a downtime? :P
[22:20:54] <jgage>	 i'll do it
[22:21:11] <bd808>	 1001-1003 are done
[22:21:21] <bd808>	 the next 3 will probably take longer
[22:21:28] <jgage>	 hm ok, no alerts so far :)
[22:21:39] <jgage>	 but i'll schedule it anyway
[22:21:46] <bd808>	 the first 3 were client only nodes
[22:21:51] <jgage>	 ah yeah
[22:21:57] <jgage>	 how many hours shall i set it for?
[22:22:10] <bd808>	 I *hope* only about 2
[22:22:26] <bd808>	 otherwise it will cut into my drinking time
[22:22:28] <jgage>	 k
[22:23:04] <MaxSem>	 Elasticsearch. So slow to upgrade it makes you healthier!
[22:23:21] <bd808>	 Oh I'll just stay up later ;)
[22:23:33] <bd808>	 that scotch isn't going to drink itself
[22:23:55] <jgage>	 ok, downtime scheduled
[22:24:51] <jgage>	 i set it for 3 hours because sysadmin
[22:25:08] <bd808>	 I can't wait for 1.6.x where we can mark the indexes as "sealed" and avoid this dumb resync business
[22:25:26] <jgage>	 oh that sounds nice
[22:25:44] <wikibugs>	 6operations, 10Deployment-Systems, 10wikitech.wikimedia.org: Merge as many configuration hacks in wikitech.php configuration file as possible into InitialiseSettings.php - https://phabricator.wikimedia.org/T75939#1322286 (10Krenair)
[22:25:52] <grrrit-wm>	 (03CR) 10Dzahn: [C: 031] "technically correct. i thought we had some reason not to do it with this service from the beginning, but i can't remember or find it, so let'" [puppet] - 10https://gerrit.wikimedia.org/r/214773 (owner: 10Ori.livneh)
[22:26:04] <mutante>	 ori: +1 , i can't remember why not, so yes
[22:26:34] <earldouglas>	 Question for maybe-ops -- how do I get permission to edit this page? https://meta.wikimedia.org/wiki/Www.wikipedia.org_template
[22:26:50] <earldouglas>	 Re: https://phabricator.wikimedia.org/T100673
[22:27:21] <bd808>	 jgage: https://github.com/elastic/elasticsearch/issues/10032
[22:27:53] <icinga-wm>	 RECOVERY - puppet last run on mw2066 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:28:05] <mutante>	 earldouglas: https://meta.wikimedia.org/wiki/Meta:Requests_for_adminship
[22:28:30] <earldouglas>	 thumbs-up.jpg
[22:28:33] <jgage>	 bd808: neat!
[22:28:37] <MaxSem>	 earldouglas, should be possible to do from your staff account?
[22:28:57] <bd808>	 I only see meta admin edits there
[22:29:06] <wikibugs>	 6operations, 6Labs, 10wikitech.wikimedia.org: Rename specific account in LDAP, Wikitech, Gerrit and Phabricator - https://phabricator.wikimedia.org/T85913#1322290 (10Krenair)
[22:29:15] <earldouglas>	 MaxSem: nope
[22:29:24] <bd808>	 you could also copy it, edit and ask an admin to publish
[22:29:25] <MaxSem>	 err, I blame Philippe! :P
[22:29:40] <earldouglas>	 bd808: I prefer that to becoming admin myself.
[22:29:40] <Krenair>	 earldouglas, does your staff account have the uber-staff-flag-thing?
[22:29:44] <mutante>	 no, it's not a good idea to give staff flag to all WMF users
[22:29:57] <bd808>	 createAndPromote.php ;)
[22:29:57] <earldouglas>	 +1 I don't really want admin. :]
[22:30:03] <Krenair>	 if not and you need to perform admin actions, see philippe/CA
[22:30:19] <MaxSem>	 earldouglas, also, bits is about to die so EL might suddenly stop working
[22:30:23] <mutante>	 ..or ask to be admin on meta, why not
[22:30:32] * earldouglas shrugs
[22:30:46] <ori>	 bits won't die
[22:30:54] <MaxSem>	 uh?
[22:30:55] <se4598>	 is there a known problem with connection to wikimedia / esams right now?
[22:30:57] <ori>	 it'll get folded into the text cache role
[22:31:01] <mutante>	 earldouglas: i would say depends how often you have to edit it
[22:31:11] <ori>	 but old links have to not totally break for the foreseeable
[22:31:13] <ori>	 *future
[22:31:29] <MaxSem>	 yep, but the host itself is going down sometime, ori?
[22:31:31] <ori>	 so we won't be actively using it to serve stuff but it'll still be supported
[22:31:51] <earldouglas>	 mutante: ideally, exactly one time.
[22:31:53] <ori>	 the bits varnishes will get decommissioned and repurposed, but the bits hostname will by then be served by the text varnishes
[22:32:14] <Krenair>	 se4598, I can ping hooft okay?
[22:32:34] <ori>	 se4598: what are you seeing?
[22:33:16] <se4598>	 can't connect (tracert:  17     *        *       72 ms  text-lb.esams.wikimedia.org [91.198.174.192])
[22:33:24] <mutante>	 earldouglas: then i would go with the "copy, edit offline, send to philippe/ any admin"
[22:33:47] <se4598>	 and the sudden drop here makes me think there is something: https://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=LVS+loadbalancers+esams&m=cpu_report&s=by+name&mc=2&g=network_report
[22:34:01] <ori>	 yes, that looks suspicious
[22:34:03] <ori>	 bblack: 
[22:34:07] <earldouglas>	 mutante: Roger, thanks.
[22:34:11] <ori>	 se4598: can you run mtr?
[22:34:25] <ori>	 ( https://www.bitwizard.nl/mtr/ )
[22:35:59] <se4598>	 normal tracert: http://pastebin.com/EELaLwJU
[22:36:11] <se4598>	 can be external/ some carrier thing
[22:37:15] <se4598>	 ok, works for me again, and the ganglia graph also shows pre-drop activity :)
[22:37:52] <jgage>	 sounds like a peering session flapped
[22:39:09] <jgage>	 hm no mail from librenms about a flap, but it did mail about some port saturations 9 minutes ago
[22:40:59] <se4598>	 the node missing from my trace while I had this issue vs now is:  wikimedia-ic-129908-adm-b3.c.telia.net [213.248.93.86]
[22:47:10] <wikibugs>	 6operations, 10Wikimedia-General-or-Unknown, 7Documentation: Add a wiki on wikitech is out of date, incomplete - https://phabricator.wikimedia.org/T87588#1322345 (10Krenair) >>! In T87588#1282471, @demon wrote: > It's still wrong as it tells you to add wikis to dblists prior to running addwiki. You can't do...
[22:52:21] <wikibugs>	 6operations, 7HHVM, 5Patch-For-Review: investigate HHVM mysqlExtension::ConnectTimeout - https://phabricator.wikimedia.org/T98489#1322376 (10bd808) Changing the HHVM default connection timeout doesn't seem to have had a measurable effect on the error rate.  @faidon said on irc that he remembered seeing simil...
[22:53:48] <bd808>	 jgage: the cluster recovery is going really slow :/ At this rate the 3 hours will probably only get us through the upgrade of 1004
[22:54:19] <jgage>	 ok, how about i just set it for 12 hours or so
[22:54:31] <bd808>	 works for me
[22:54:36] <jgage>	 or 24? doesn't matter much
[22:54:43] <MaxSem>	 bd808, whatever - just start drinking by the keyboard:P
[22:54:59] <jgage>	 if you haven't already
[22:55:09] <MaxSem>	 jgage, 12h means it might awaken somebody tonight
[22:55:22] <jgage>	 good point
[22:55:23] <bd808>	 it's 5 minutes till beer o'clock in my timezone
[22:55:45] <bd808>	 yeah set it for 24 hrs if you're cool with that
[22:56:12] <bd808>	 a yellow logstash cluster is no big deal and nobody should get bothered about it on the weekend
[22:56:39] <wikibugs>	 6operations, 6WMF-Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#1322392 (10Dzahn)
[22:56:56] <jgage>	 ok, updated. and i'll keep an eye on it.
[22:58:38] <grrrit-wm>	 (03CR) 10Dereckson: [C: 031] Sysops to add users to import group on maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) (owner: 10Odder)
[23:03:30] <wikibugs>	 6operations, 6WMF-Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#1322396 (10Dzahn) done.  -optoutresearch: aripstra, dchen, tbeasley +# Opt out Research RT7871, T86551, T100860 +optoutresearch: aripstra, dchen, dkrysiak
[23:04:16] <wikibugs>	 6operations, 6WMF-Design-Research: Edit optoutresearch@ mailing list recipients - https://phabricator.wikimedia.org/T100860#1322398 (10Dzahn) 5Open>3Resolved
[23:48:51] <grrrit-wm>	 (03PS11) 10Alex Monk: Create Wikipedia Konkani [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn)
[23:50:26] <grrrit-wm>	 (03CR) 10Alex Monk: "(Let's leave wikiversions.json to the merger to update, it changes too often)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206300 (https://phabricator.wikimedia.org/T96468) (owner: 10Dzahn)
[23:53:24] <grrrit-wm>	 (03PS1) 10Alex Monk: Optimise project logos added since I8c9a6a56 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214783 
[23:55:30] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] Optimise project logos added since I8c9a6a56 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214783 (owner: 10Alex Monk)
[23:55:36] <grrrit-wm>	 (03Merged) 10jenkins-bot: Optimise project logos added since I8c9a6a56 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214783 (owner: 10Alex Monk)
[23:56:27] <logmsgbot>	 !log ori Synchronized w/static/images/project-logos: Ic62747f37: Optimise project logos added since I8c9a6a56 (duration: 00m 13s)
[23:56:31] <morebots>	 Logged the message, Master
[23:58:30] <grrrit-wm>	 (03PS4) 10Ori.livneh: hhvm: add memory leak isolation scripts [puppet] - 10https://gerrit.wikimedia.org/r/212187 
[23:58:41] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm: add memory leak isolation scripts [puppet] - 10https://gerrit.wikimedia.org/r/212187 (owner: 10Ori.livneh)