[00:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151230T0000). [00:00:54] nothing to swat [00:02:51] andrewbogott: so....40 minutes later [00:03:06] I think 10k is ok I guess but we need to figure out why we are thrashing it [00:03:17] as long as there aren't more one-time assignment ops than this [00:03:26] some truncated query seems livable to track down [00:03:34] yeah [00:03:59] I’ll log a bug about that [00:04:19] and then I’m going to try to get CI working again, and then quit for the day. Tomorrow is untangle-all-those-overlapping-accounts day [00:04:29] aka ‘wmf holiday' [00:04:54] well, can you try to help untangle grid engine with us [00:05:10] we should call releng I guess for CI things if it's critical but atm gridengine is dead and no clue [00:05:16] 6operations, 10Wikimedia-Apache-configuration, 7HHVM: Transition to HHVM broke old links to wiki.phtml - https://phabricator.wikimedia.org/T122629#1909365 (10MaxSem) 3NEW [00:06:22] yeah, I’ll try to catch up [00:08:21] !log restarting nova-compute on labvirt1002 [00:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:10:22] 6operations, 10Wikimedia-Apache-configuration, 7HHVM: Transition to HHVM broke old links to wiki.phtml - https://phabricator.wikimedia.org/T122629#1909388 (10Krenair) Yeah, IMHO these should be permanent redirects [00:13:58] 6operations, 10Wikimedia-Apache-configuration, 7HHVM: Transition to HHVM broke old links to wiki.phtml - https://phabricator.wikimedia.org/T122629#1909389 (10Rillke) https://www.google.com/?q=wiki.phtml+site:commons.wikimedia.org approx. 1.010 results. 
[00:15:14] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [00:15:53] (03PS4) 10Andrew Bogott: openldap: Drop the novaadmin query limit from 'unlimited' [puppet] - 10https://gerrit.wikimedia.org/r/261588 [00:18:07] (03CR) 10Andrew Bogott: [C: 032] openldap: Drop the novaadmin query limit from 'unlimited' [puppet] - 10https://gerrit.wikimedia.org/r/261588 (owner: 10Andrew Bogott) [00:19:01] chasemp, YuviPanda, I’m going to enable puppet on the ldap boxes, which will result in #comment lines being added to a config which will result in restarts of ldap. Just so you know :) [00:19:13] andrewbogott: can we not do that now? [00:19:21] sure :) [00:19:25] I just hate leaving puppet off [00:19:35] once gridengine works :) [00:36:08] Don't know if this is related at all to the failures you're dealing with - [00:36:12] https://www.irccloud.com/pastebin/nKBIDhMt/ [00:37:23] YuviPanda: ^ no hurry and all that, let me know whenever [00:38:14] madhuvishy: yeah, unrelated. should clear up after an apt-get hopefully [00:38:26] mm hmm, okay [00:39:52] PROBLEM - SSH on technetium is CRITICAL: Server answer [00:41:26] okay, this keeps coming up [00:41:32] What does 'Server answer' mean? [00:42:02] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [00:43:41] (03PS1) 10Mark Bergsma: Enable RPS on eth0 on labstores [puppet] - 10https://gerrit.wikimedia.org/r/261598 [00:46:28] (03PS2) 10Mark Bergsma: Enable RPS on eth0 on labstores [puppet] - 10https://gerrit.wikimedia.org/r/261598 [01:14:46] Krenair: looks like it may be a truncated attempt to output something more -- https://github.com/nagios-plugins/nagios-plugins/blob/master/plugins/check_ssh.c#L234 [01:15:05] Maybe the bot is splitting on : ? 
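If the relay bot really does split plugin output on ':', the truncation is easy to reproduce with plain parameter expansion. The message text below is an illustration, not check_ssh's actual output (which, per the linked check_ssh.c, continues after "Server answer" with more detail):

```shell
# A plugin line can carry extra detail after a colon (the banner text here
# is an assumption, e.g. an unexpected SSH server greeting):
full='Server answer: 220 unexpected banner'

# A bot that keeps only the text before the first ':' leaves just the
# prefix, matching the bare "Server answer" seen in channel:
echo "${full%%:*}"    # prints: Server answer
```

`${full%%:*}` strips the longest suffix starting at the first colon, which is exactly the behavior a naive split-on-colon would produce.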
[01:24:49] thanks bd808 [01:24:52] maybe [01:38:47] PROBLEM - puppet last run on mw2007 is CRITICAL: CRITICAL: puppet fail [01:41:42] (03PS1) 10Yuvipanda: gridengine: Add berkley db commandline utilities to master [puppet] - 10https://gerrit.wikimedia.org/r/261607 [01:42:58] (03PS2) 10Yuvipanda: gridengine: Add berkley db commandline utilities to master [puppet] - 10https://gerrit.wikimedia.org/r/261607 [01:43:24] (03CR) 10coren: [C: 031] "You hope you never have to use 'em, they are godsent when you do." [puppet] - 10https://gerrit.wikimedia.org/r/261607 (owner: 10Yuvipanda) [01:46:19] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [01:48:03] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 973165 bytes in 12.728 second response time [01:48:28] PROBLEM - SSH on technetium is CRITICAL: Server answer [01:48:38] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [24.0] [01:53:53] (03CR) 10Yuvipanda: [C: 032] gridengine: Add berkley db commandline utilities to master [puppet] - 10https://gerrit.wikimedia.org/r/261607 (owner: 10Yuvipanda) [01:55:32] nfs is-fine-ish [01:55:35] we're doing recovery [01:56:37] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [02:04:21] PROBLEM - NFS read/writeable on labs instances on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:05:35] RECOVERY - NFS read/writeable on labs instances on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.046 second response time [02:06:23] RECOVERY - puppet last run on mw2007 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [02:13:30] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[02:14:19] !log restbase: canary deploy of 7db8e216 (small bug fix & a security fix) to restbase1001 [02:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:14:30] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [02:15:30] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0] [02:17:17] !log restbase: starting full deploy of 7db8e216 (small bug fix & a security fix) to production cluster [02:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:25:20] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 09m 50s) [02:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:26:49] !log restbase: finished full deploy of 7db8e216 (small bug fix & a security fix) to production cluster [02:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:32:16] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Dec 30 02:32:15 UTC 2015 (duration 6m 55s) [02:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:52:13] PROBLEM - tools-home on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 383 bytes in 0.006 second response time [02:56:37] (03PS1) 10Dzahn: toollabs: disable paging for tools-home/NFS [puppet] - 10https://gerrit.wikimedia.org/r/261610 [02:58:11] PROBLEM - SSH on technetium is CRITICAL: Server answer [03:00:23] (03CR) 10Yuvipanda: [C: 031] toollabs: disable paging for tools-home/NFS [puppet] - 10https://gerrit.wikimedia.org/r/261610 (owner: 10Dzahn) [03:00:58] (03CR) 10Dzahn: [C: 032] toollabs: disable paging for tools-home/NFS [puppet] - 10https://gerrit.wikimedia.org/r/261610 (owner: 10Dzahn) [03:02:05] RECOVERY - Unmerged changes on repository puppet on 
palladium is OK: No changes to merge. [03:02:55] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [03:03:56] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [03:24:04] PROBLEM - SSH on technetium is CRITICAL: Server answer [03:24:48] RECOVERY - tools-home on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 958126 bytes in 4.488 second response time [03:28:14] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [03:49:51] PROBLEM - SSH on technetium is CRITICAL: Server answer [03:51:52] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [03:58:01] PROBLEM - SSH on technetium is CRITICAL: Server answer [04:04:43] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [04:44:20] PROBLEM - SSH on technetium is CRITICAL: Server answer [04:48:29] RECOVERY - SSH on technetium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [04:53:30] RECOVERY - NTP on technetium is OK: NTP OK: Offset -0.004868984222 secs [04:53:50] RECOVERY - dhclient process on technetium is OK: PROCS OK: 0 processes with command name dhclient [04:54:10] RECOVERY - RAID on technetium is OK: OK: no RAID installed [04:54:21] RECOVERY - salt-minion processes on technetium is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [04:54:39] RECOVERY - DPKG on technetium is OK: All packages OK [04:54:59] RECOVERY - Check size of conntrack table on technetium is OK: OK: nf_conntrack is 0 % full [04:56:10] RECOVERY - puppet last run on technetium is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [06:31:18] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:58] PROBLEM - puppet last run on cp4016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:58] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:17] PROBLEM 
- puppet last run on mw2023 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:18] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:59] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:59] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:58] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:58] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:17] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:27] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [06:35:27] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: puppet fail [06:56:58] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:07] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:57:18] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:57:19] RECOVERY - puppet last run on cp4016 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:57:39] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:48] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:48] RECOVERY - puppet last run on mw2023 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:49] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [06:58:07] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:18] RECOVERY - puppet last run on cp2013 is OK: 
OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:27] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:28] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:01:43] !log setting dbstore1001 to read_only, converting ruwiki.recentchanges back to InnoDB [08:01:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:06:54] there is some kind of interaction between toku, database dumps (and multisource replication?), that makes an insertion there fail the first time, then it succeeds after being read, and auto-generates a duplicate key error [09:05:16] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 631 [09:15:16] RECOVERY - check_mysql on db1008 is OK: Uptime: 751102 Threads: 86 Questions: 30429633 Slow queries: 9095 Opens: 58689 Flush tables: 2 Open tables: 418 Queries per second avg: 40.513 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [09:43:56] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1909731 (10ArielGlenn) salt updated on deployment-prep except for deployment-restbase01 which is running sid. I haven't built sid packages and don't plan to. After the update, one host is no... 
[10:00:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 616 [10:20:18] RECOVERY - check_mysql on db1008 is OK: Uptime: 755002 Threads: 92 Questions: 30508861 Slow queries: 9247 Opens: 58692 Flush tables: 2 Open tables: 419 Queries per second avg: 40.408 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 60 [10:30:18] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 660 [10:35:10] RECOVERY - check_mysql on db1008 is OK: Uptime: 755902 Threads: 93 Questions: 30532287 Slow queries: 9269 Opens: 58692 Flush tables: 2 Open tables: 419 Queries per second avg: 40.391 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [11:21:19] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [11:46:43] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [11:52:10] (03PS1) 10Nemo bis: [English Planet] Add Greg Sabino Mullane [puppet] - 10https://gerrit.wikimedia.org/r/261626 [11:53:27] (03PS2) 10Nemo bis: [English Planet] Add Greg Sabino Mullane [puppet] - 10https://gerrit.wikimedia.org/r/261626 [12:18:42] 6operations, 10Salt: Move salt master to separate host from puppet master - https://phabricator.wikimedia.org/T115287#1909799 (10ArielGlenn) wikidata-stats.wikidata-dev.eqiad.wmflabs: Minion did not return. [No response] tools-worker-1002.tools.eqiad.wmflabs: Minion did not return. [No response] >>... [12:29:36] (03CR) 10Faidon Liambotis: [C: 031] "Yeah, this makes sense. Note that labstores have 1Gbps NICs (bnx2) and we've never tried RPS/RSS there before." [puppet] - 10https://gerrit.wikimedia.org/r/261598 (owner: 10Mark Bergsma) [12:37:42] (03CR) 10Mark Bergsma: "Yeah, alternatively we could try running irqbalance indeed." 
[puppet] - 10https://gerrit.wikimedia.org/r/261598 (owner: 10Mark Bergsma) [12:45:36] 6operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#1909805 (10Aklapper) >>! In T109810#1572602, @Jalexander wrote: > but let me check with the lawyers first. @JAlexander: Did that happen? Any outcome? [12:49:42] 6operations, 7Easy, 5Patch-For-Review: server admin log should include year in date (again) - https://phabricator.wikimedia.org/T85803#1909811 (10Aklapper) @Elee: Any news here? Are you still working on this (as you're set as assignee)? [12:54:11] (03PS3) 10Mark Bergsma: Enable RPS on eth0 on labstores [puppet] - 10https://gerrit.wikimedia.org/r/261598 [12:55:45] (03CR) 10Mark Bergsma: Enable RPS on eth0 on labstores (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/261598 (owner: 10Mark Bergsma) [13:14:23] !log labstore1001: apt-get install irqbalance [13:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:07] (03CR) 10Mark Bergsma: "I've just installed irqbalance, which is by no means optimal but better than nothing for now." [puppet] - 10https://gerrit.wikimedia.org/r/261598 (owner: 10Mark Bergsma) [13:18:48] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: puppet fail [13:27:56] so all that really did is move all network interrupts off cpu#0 to a different cpu [13:28:02] which is only very marginally better ;) [13:44:44] (03CR) 10coren: [C: 031] "Very much sane." 
[puppet] - 10https://gerrit.wikimedia.org/r/261598 (owner: 10Mark Bergsma) [13:45:19] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:06:28] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures [14:25:05] (03CR) 10Luke081515: [C: 031] Enable global AbuseFilter at French Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257868 (https://phabricator.wikimedia.org/T120568) (owner: 10Glaisher) [14:34:37] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:45:08] PROBLEM - Outgoing network saturation on labstore1001 is CRITICAL: CRITICAL: 13.64% of data above the critical threshold [100000000.0] [14:54:52] Coren, mark, is ^ your patch? [14:55:31] andrewbogott: I've only +1'ed it - I don't think Mark merged yet. [14:55:45] oh, good point. [14:56:03] andrewbogott: But me and valhallasw`cloud are looking at a lot of huge gridengine logs that live on NFS, it may be our fault. 
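For context on the RPS/irqbalance discussion above: the kernel-side mechanism behind "Enable RPS on eth0" can be sketched as below. This is the standard sysfs interface from the kernel's network scaling documentation, not the contents of the Gerrit change, and the interface name and CPU mask are assumptions:

```shell
# Receive Packet Steering distributes softirq packet processing across
# CPUs by writing a hex CPU bitmask into each RX queue's rps_cpus file.
# Values below are illustrative only.
IFACE=eth0
MASK=fe   # CPUs 1-7; leaves CPU 0 free (cf. "move interrupts off cpu#0")
for q in /sys/class/net/"$IFACE"/queues/rx-*/rps_cpus; do
    echo "$MASK" > "$q"
done
```

irqbalance, by contrast, only redistributes hardware interrupts between CPUs; RPS additionally spreads the per-packet processing work, which is why a 1Gbps single-queue NIC (bnx2) can still benefit from it.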
[14:57:25] !log restarting puppet on serpens; openldap will restart but config should not change [14:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:02:47] RECOVERY - Outgoing network saturation on labstore1001 is OK: OK: Less than 10.00% above the threshold [75000000.0] [15:05:47] !log restarting puppet on seaborgium; openldap will restart but config should not change [15:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:27] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1909891 (10Ottomata) [15:17:25] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1591622 (10Ottomata) [15:32:57] akosiaris: Hi Alex, do you know how to model dependencies between mediawiki extensions in jenkins? In https://gerrit.wikimedia.org/r/#/c/259167/7 we add Wikidata support for math. But Jenkins seems to be unaware of Wikidata. [15:42:24] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1909963 (10Ottomata) [15:43:39] 6operations, 10Traffic, 6Zero, 5Patch-For-Review: Merge mobile cache into text cache - https://phabricator.wikimedia.org/T109286#1591624 (10Ottomata) Ok, I think we are ready on the Analytics side. We'll need to do some things right after this change is made, so some planning is in order over in https://p... [16:05:01] jouncebot, next [16:05:01] In 0 hour(s) and 54 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151230T1700) [16:07:03] _joe_, can we do https://gerrit.wikimedia.org/r/260593 ? 
[16:08:10] there's a) a freeze, and b) people are supposed to be off until monday, so what do you think :) [16:18:56] I wasn't sure how much that affected the puppet changes [17:00:04] _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151230T1700). [17:12:08] Coren: would you expect a ‘find /srv’ on labstore to complete… ever? [17:12:11] I’m starting to wonder [17:12:28] andrewbogott: Err, yes, very many hours later. [17:12:54] andrewbogott: /srv has hundreds of millions of files [17:12:57] What I need is a way to find all homedirs for a given user [17:13:13] my current plan is find . -type d -name "home" > allhomes.txt [17:13:20] andrewbogott: Heh. Much simpler: [17:13:24] and then grep allhomes for the actual user’s home (to avoid running find ever again) [17:13:49] cd srv;echo */*/home/ [17:14:13] Glob is much smarter about it, because it knows to not try to traverse [17:14:16] it’ll always be that exact depth? [17:14:37] Yeah, because /srv/project/foo/home or /srv/other/foo/home [17:14:40] hm... [17:14:43] ok, that’s easier then :) [17:16:26] hm, that ‘echo’ succeeds whether or not there’s a match [17:16:28] weird [17:19:46] PROBLEM - Host mr1-codfw.oob is DOWN: CRITICAL - Network Unreachable (mr1-codfw.oob.wikimedia.org) [17:20:56] PROBLEM - Router interfaces on mr1-codfw is CRITICAL: CRITICAL: host 208.80.153.196, interfaces up: 33, down: 1, dormant: 0, excluded: 0, unused: 0BRge-0/0/7: down - Transit: CyrusOne OOB (IP-000008-01) {#1099} [1Gbps Cu]BR [17:21:03] andrewbogott: Not sure what you mean? globs will give you all the matches you want, or expand to themselves when there are none. That's what globs do. :-) [17:21:45] right, it’s just funny that if the glob doesn’t match anything it echoes the pattern instead. 
[17:21:49] # echo */*/home/lissacoffey [17:21:49] */*/home/lissacoffey [17:22:01] not a problem, just surprises me [17:22:28] That's why code that relies on going through a glob always has a test. Like: [17:23:01] for home in */*/home/the_user;do if [ -d "$home" ]; then do_something_to $home; fi; done [17:23:33] In your case, I expect do_something_to is akin to chown the_user $home :-) [17:30:09] Coren: before I break things, mind a quick look at /srv/chownhome.sh ? [17:30:27] andrewbogott: You'll have to paste it for me. [17:30:31] dpaste* [17:30:33] ah, yes, sorry :) [17:30:46] https://dpaste.de/XNZM [17:31:53] andrewbogott: Intended to run on labstore*? [17:32:08] yeah, on labstore1001 for each of the affected users [17:32:11] (there are lots :( ) [17:32:17] https://dpaste.de/HeRV [17:32:34] Removed the unneeded echo and added the much-important useldap [17:32:46] Otherwise, chown has no idea who you're talking about. [17:33:05] Aw, I like the echo [17:33:33] oh, I see [17:34:01] Wouldn't have broken things in this specific case, but never trust expansion of filenames; they always contain spaces at the worst of times. :-) [17:34:24] The glob expands correctly to tokens. [17:35:07] yep, ok [17:35:40] (Also, every use of $homedir should be quoted to "$homedir" for that reason, but again this is not an issue in this specific case). [17:36:27] ok, here goes... [17:37:50] well, that was a letdown, apparently only 7 of those 150 users had ever logged in [17:39:26] Heh. "Sorry the task ended up being fairly easy?" :-) [17:46:30] andrewbogott: the task can probably also be un-'security'-ed? [17:54:06] valhallasw`cloud: I don’t know. Publishing the task will encourage people to go looking for overlaps. I don’t think there are any, but nevertheless... 
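Putting the pieces of the exchange above together, the glob-plus-existence-test pattern looks roughly like this. The directory layout is a toy mock-up for illustration (on labstore the real depth is /srv/&lt;share&gt;/&lt;project&gt;/home/), and the action is an echo rather than the actual chown:

```shell
# Toy layout mimicking the fixed-depth homedir structure under /srv.
base=$(mktemp -d)
mkdir -p "$base/project/tools/home/lissacoffey" \
         "$base/others/maps/home/lissacoffey"
cd "$base"

# A fixed-depth glob never traverses deeper than it spells out, so it is
# vastly cheaper than find(1) over hundreds of millions of files. But an
# unmatched glob expands to the pattern itself (in bash, `shopt -s
# nullglob` changes that), so every candidate needs a -d test:
for home in */*/home/lissacoffey; do
    if [ -d "$home" ]; then
        # Quote "$home" everywhere: filenames can contain spaces.
        echo "would chown: $home"
    fi
done
```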
[17:54:19] (Of course, now I’m talking about it in a public channel…) [17:54:46] yeah, and I just posted a followup task without security tag [17:55:34] I dunno, there's more effective ways to wreak havoc in labs than this ;-) [17:56:59] valhallasw`cloud: yeah, ok. [18:16:17] RECOVERY - Router interfaces on mr1-codfw is OK: OK: host 208.80.153.196, interfaces up: 35, down: 0, dormant: 0, excluded: 0, unused: 0 [18:20:27] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 34.78 ms [18:39:57] PROBLEM - HHVM rendering on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:40:37] PROBLEM - Apache HTTP on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:18] PROBLEM - SSH on mw1123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:41:18] PROBLEM - salt-minion processes on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:41:37] PROBLEM - HHVM processes on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:41:58] PROBLEM - configured eth on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:08] PROBLEM - Disk space on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:29] PROBLEM - RAID on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:38] PROBLEM - nutcracker port on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:48] PROBLEM - nutcracker process on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:42:58] PROBLEM - Check size of conntrack table on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:08] PROBLEM - puppet last run on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:43:17] PROBLEM - DPKG on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:46:27] PROBLEM - dhclient process on mw1123 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:46:45] (03PS1) 10Halfak: Sets ORES redis cache_maxmemory => '2G' [puppet] - 10https://gerrit.wikimedia.org/r/261642 [18:47:50] (03CR) 10Halfak: "See https://phabricator.wikimedia.org/T122666" [puppet] - 10https://gerrit.wikimedia.org/r/261642 (owner: 10Halfak) [18:50:57] RECOVERY - nutcracker port on mw1123 is OK: TCP OK - 0.000 second response time on port 11212 [18:50:58] RECOVERY - nutcracker process on mw1123 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:51:08] RECOVERY - Check size of conntrack table on mw1123 is OK: OK: nf_conntrack is 0 % full [18:51:27] RECOVERY - DPKG on mw1123 is OK: All packages OK [18:51:28] RECOVERY - SSH on mw1123 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [18:51:28] RECOVERY - salt-minion processes on mw1123 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:51:47] RECOVERY - HHVM processes on mw1123 is OK: PROCS OK: 6 processes with command name hhvm [18:52:08] RECOVERY - configured eth on mw1123 is OK: OK - interfaces up [18:52:18] RECOVERY - Disk space on mw1123 is OK: DISK OK [18:52:28] RECOVERY - dhclient process on mw1123 is OK: PROCS OK: 0 processes with command name dhclient [18:52:47] RECOVERY - RAID on mw1123 is OK: OK: no RAID installed [19:06:48] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:09:03] 6operations, 10Traffic: Varnish apparently unconditionally varies on cookie value - https://phabricator.wikimedia.org/T122673#1910240 (10GWicke) 3NEW [20:09:19] 6operations, 10Traffic: Varnish apparently unconditionally varies on cookie value - https://phabricator.wikimedia.org/T122673#1910250 (10GWicke) [20:10:00] 6operations, 10Traffic: Varnish apparently unconditionally varies on cookie value - https://phabricator.wikimedia.org/T122673#1910240 (10GWicke) [20:10:57] 6operations, 10Traffic: Varnish apparently unconditionally varies on session cookies - 
https://phabricator.wikimedia.org/T122673#1910254 (10GWicke) [20:16:27] 6operations, 10Traffic: Varnish apparently unconditionally varies on session cookies - https://phabricator.wikimedia.org/T122673#1910265 (10GWicke) This vary behavior seems to be hardcoded in [evaluate_cookie](https://github.com/wikimedia/operations-puppet/blob/650721dba65c57ac6edc77ff2a55f155a78ba32d/template... [20:27:46] (03PS1) 10GWicke: Varnish: Don't disable caching for authenticated REST API requests [puppet] - 10https://gerrit.wikimedia.org/r/261662 (https://phabricator.wikimedia.org/T122673) [20:28:37] 6operations: Translate extension seemingly broken / partially installed - https://phabricator.wikimedia.org/T122675#1910285 (10coren) 3NEW [20:28:44] (03PS2) 10GWicke: Varnish: Don't disable caching for authenticated REST API requests [puppet] - 10https://gerrit.wikimedia.org/r/261662 (https://phabricator.wikimedia.org/T122673) [20:28:49] 6operations: Translate extension seemingly broken / partially installed on wikimedia2017wiki - https://phabricator.wikimedia.org/T122675#1910292 (10coren) [20:33:10] 6operations, 10Traffic, 5Patch-For-Review: Varnish apparently unconditionally varies on session cookies - https://phabricator.wikimedia.org/T122673#1910297 (10GWicke) The patch above adds another exception that prevents the no-cache override from applying to /api/rest_v1/. It's not really a complete solution... [20:34:23] 6operations, 10Traffic, 5Patch-For-Review, 7Performance: Varnish apparently unconditionally varies on session cookies - https://phabricator.wikimedia.org/T122673#1910298 (10GWicke) [20:34:24] (03PS1) 10Yuvipanda: redis: Set vm_overcommit = 1 for all redises [puppet] - 10https://gerrit.wikimedia.org/r/261663 [20:34:36] ori: ^ are you around? [20:34:46] yes [20:34:48] what's up? [20:35:33] ori: see patch. 
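Background on the vm_overcommit patch just submitted: Redis persistence works by fork()ing a copy-on-write child that writes the dataset out, and under the kernel's default overcommit accounting that fork can be refused for a large dataset. A sketch of the underlying sysctl (standard Linux semantics; the actual rollout is via puppet and the sysctl.d file name below is an assumption):

```shell
# vm.overcommit_memory: 0 = heuristic, 1 = always allow, 2 = strict.
# Under 0 or 2, fork() of a large Redis process can fail with ENOMEM,
# even though copy-on-write means the child touches very little memory.
cat /proc/sys/vm/overcommit_memory   # inspect the current policy

# Apply at runtime (needs root):
sysctl -w vm.overcommit_memory=1
# Persist across reboots:
echo 'vm.overcommit_memory = 1' > /etc/sysctl.d/60-redis-overcommit.conf
```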
[20:35:45] the ores redis just started puking because of lack of vm_overcommit [20:35:55] then I realized we'll have to add that to literally all our roles [20:36:09] (03PS3) 10GWicke: Varnish: Don't disable caching for authenticated REST API requests [puppet] - 10https://gerrit.wikimedia.org/r/261662 (https://phabricator.wikimedia.org/T122673) [20:36:11] since they all use persistence [20:36:27] (03CR) 10Ori.livneh: [C: 031] redis: Set vm_overcommit = 1 for all redises [puppet] - 10https://gerrit.wikimedia.org/r/261663 (owner: 10Yuvipanda) [20:36:37] ori: thanks [20:37:38] (03CR) 10Yuvipanda: [C: 032] redis: Set vm_overcommit = 1 for all redises [puppet] - 10https://gerrit.wikimedia.org/r/261663 (owner: 10Yuvipanda) [20:45:24] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Test multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1910301 (10Eevans) [21:22:34] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 1 failures [21:29:16] (03PS1) 10Ori.livneh: redis: small lint fix [puppet] - 10https://gerrit.wikimedia.org/r/261724 [21:29:30] (03CR) 10Ori.livneh: [C: 032 V: 032] redis: small lint fix [puppet] - 10https://gerrit.wikimedia.org/r/261724 (owner: 10Ori.livneh) [21:50:14] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:02:22] o/ apergos [22:02:41] Do you know where I would report an issue with pagecounts dumps? [22:02:52] E.g. pagecounts-20150509-060000.gz has compression errors [22:05:28] halfak: there is a report already about invalid bzip2 [22:06:05] Hmm.. This is gzip [22:08:39] halfak: everything goes in phab :) [22:09:49] Reedy, yes, but what would be the project. Who owns that? [22:10:31] Datasets-General-or-Unknown [22:11:20] Sure it's not Dumps-Generation? [22:12:54] It's not an xml/mysql dump is it? [22:14:11] Nope. It's a pageview dump. Also hosted on dumps.wikimedia.org. 
[22:14:53] Use General then [22:15:07] Aha! It's Datasets-Webstatscollector [22:18:06] andre__: Should a #phabricator upstream task contain the project phabricator? [22:18:12] or should I remove it? [22:33:58] Luke081515: Depends on each specific task. No "general" rule. [22:34:40] Luke081515: if we won't solve it / work around it's upstream only. If we might have a local "solution" it might be both. If it's just our config it's only #phabricator. [22:35:02] ok, thanks [22:48:25] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: puppet fail [22:49:32] at midnight o clock I admit I was out [23:02:00] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1910616 (10Krenair) a:5Betacommand>3Krenair ```krenair@tin:~$ mwscript eval.php enwiki > echo ExternalStore::insertToDefault( gzdeflate( "SYSADMIN NOTE: Text of this r... [23:03:23] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1910618 (10Krenair) I guess we should update rev_len (currently 78946) and rev_sha1 (currently blank) as well? [23:16:19] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [23:20:38] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1910638 (10Betacommand) Yeah, sorry for the delay in getting back to this, I have a dump from a few months after this, but it doesn't look like the revision is in it.
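On the pagecounts file with compression errors reported above: `gzip -t` tests archive integrity without writing the decompressed data anywhere, which is a cheap way to confirm (or rule out) such a report before filing it. A sketch against a deliberately truncated file:

```shell
workdir=$(mktemp -d) && cd "$workdir"

# Build a small gzip file, then truncate it to simulate corruption.
printf 'pagecounts test data\n' | gzip > sample.gz
head -c 10 sample.gz > truncated.gz

# -t exits non-zero on a damaged archive:
gzip -t sample.gz && echo 'sample.gz: OK'
gzip -t truncated.gz 2>/dev/null || echo 'truncated.gz: compression errors'
```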