[00:05:47] 7Blocked-on-Operations, 7Varnish: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1854453 (10ori) [00:08:47] (03Abandoned) 10Ricordisamoa: Don't match Phabricator task IDs inside URLs [puppet] - 10https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: 10Ricordisamoa) [00:09:07] (03PS1) 10Dzahn: zuul: move roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/257039 [00:10:01] (03CR) 10Ricordisamoa: "not able to review, sorry" [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE)) [00:11:33] (03PS2) 10Dzahn: zuul: move roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/257039 [00:18:11] mutante: I have direct push access, so for easy stuff like that, I can just commit directly ;) [00:19:06] Reedy: :) it seems the rules are pretty much cleaned up already [00:19:23] the remaining things (ubuntu,apt etc.. ) stay [00:19:39] what we could maybe do is add more redirect rules like [00:19:44] wikipedia.com -> wikipedia.org [00:20:02] would that fix the cert error for httpseverywhere users [00:20:16] compare to the existing enwp.org rule [00:21:05] also: TIL there is frwp.org too [00:22:11] (03CR) 10Ori.livneh: [C: 032] maps: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257031 (owner: 10Ori.livneh) [00:22:57] mutante: We could, yeah [00:23:06] I presume they're not getting parked [00:23:14] And we're also not buying HTTPS certs either? [00:24:00] Reedy: wikipedia.com wont get parked i think [00:24:04] it's too good [00:24:09] heh [00:24:12] but let me ask in the meeting we have soon [00:24:19] (03PS1) 10Ori.livneh: Remove redis::ganglia; incompatible with multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/257042 [00:24:19] It's not like it's one of our squatted domains [00:24:26] yea, there are layers to this [00:24:31] from total crap to kind of good to good [00:24:41] typo domains [00:24:49] real project names in other country TLDs [00:25:01] i think wikipedia.com might be the most common one that isnt a real project URL [00:25:09] people just type .com for everything [00:25:27] yeah [00:25:32] there is even a key just for .com on my phone keyboard [00:25:37] Which begs the question if we should have HTTPS cert for it [00:26:04] yes, i think this specific one should probably be added to the cert [00:26:18] but that's exactly the question.. what's the policy [00:26:28] also,,letsencrypt or not and when [00:26:29] :) [00:26:37] Should we file a ticket for wp.com? [00:26:43] that looks like wordpress [00:26:44] lol [00:27:08] Reedy: https://phabricator.wikimedia.org/T42998 [00:27:35] heh, yes, WP is wordpress [00:28:18] Reedy: btw, we have open tickets about blog.wm.org and https-only /mixed content that are not solved and blocked by wordpress.com [00:28:30] while i dont see it in httpseverywhere anymore at all [00:28:44] It probably got removed by someone in a big cleanup [00:28:50] Maybe it'll be fixed when they rewrite WP in node [00:28:55] (03CR) 10Ori.livneh: [C: 032] Remove redis::ganglia; incompatible with multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/257042 (owner: 10Ori.livneh) [00:29:27] Reedy: https://phabricator.wikimedia.org/T105905#1563190 [00:30:24] "Heard back from Automattic last week and it turns out that contrary to the previous discussion, the search/replace feature of WP-CLI is actually disabled on VIP because it can cause database issues. 
They are looking into alternative options." [00:30:44] so .. the feature that is needed is not there .. because we are VIP [00:30:55] and "cause database issues" .. wut :p [00:31:25] that's a nice kind of VIP where WP-CLI stuff is disabled [00:31:49] I suspect it's because it's doing a massive update, looking for text in big unindexed text columns [00:32:05] database issues? or people do stupid things and break their dbs issues? >.> [00:33:19] (03PS1) 10Ori.livneh: role::ci::slave::browsertests: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257043 [00:33:42] p858snake: :) i guess in the end it's the same [00:34:04] (03CR) 10Ori.livneh: [C: 032 V: 032] role::ci::slave::browsertests: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257043 (owner: 10Ori.livneh) [00:36:27] 6operations, 6Labs, 10Labs-Infrastructure: add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486#1854628 (10Dzahn) 3NEW [00:36:47] 6operations, 6Labs, 10Labs-Infrastructure, 7HTTPS: add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486#1854641 (10Dzahn) [00:37:09] 6operations, 6Labs, 10Labs-Infrastructure, 7HTTPS: add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486#1854628 (10Dzahn) [00:41:11] andrewbogott: re: apache: or maybe we block it with ferm? (if it's needed for puppetmaster but only from local) [00:42:06] i also removed that default site before with apache::site { '000-default' .. ensure => absent before afair [00:42:59] i'll leave comments on gerrit , laters! [00:44:17] (03CR) 10Dzahn: "if this is needed for puppetmaster but only connections from local, then maybe base::firewall and a ferm rule is the way to go" [puppet] - 10https://gerrit.wikimedia.org/r/257034 (https://phabricator.wikimedia.org/T120449) (owner: 10Andrew Bogott) [00:45:09] (03PS1) 10Reedy: Disable PasswordCannotBePopular for sysop and bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257044 [00:45:20] (03CR) 10Dzahn: "to remove the apache site i have used: apache::site { '000-default'... ensure => absent before" [puppet] - 10https://gerrit.wikimedia.org/r/257034 (https://phabricator.wikimedia.org/T120449) (owner: 10Andrew Bogott) [00:47:49] (03CR) 10CSteipp: [C: 031] Disable PasswordCannotBePopular for sysop and bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257044 (owner: 10Reedy) [00:48:07] (03CR) 10Reedy: "Needs merging before 1.27.0-wmf.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257044 (owner: 10Reedy) [01:45:54] (03CR) 10Krinkle: "Possibly fixme" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/257043 (owner: 10Ori.livneh) [01:46:39] ori: can you verify ^ I'm out on mobile. 
These auto deploy on ci in labs [02:03:20] Reedy: https://gerrit.wikimedia.org/r/#/c/257033/ [02:24:49] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.7) (duration: 09m 59s) [02:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:00:13] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures [03:27:24] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:59:21] (03PS2) 10Ori.livneh: Disable accept filters for HTTP on canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/256968 (https://phabricator.wikimedia.org/T119372) [03:59:58] (03PS3) 10Ori.livneh: Disable accept filters for HTTP on canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/256968 (https://phabricator.wikimedia.org/T119372) [06:09:08] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Dec 5 06:09:07 UTC 2015 (duration 3h 44m 18s) [06:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:21:18] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 332 [06:30:44] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:02] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:03] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:13] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:20] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 24 [06:31:54] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:54] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:03] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:13] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:23] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:43] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:43] PROBLEM - puppet last run on mw1149 is CRITICAL: CRITICAL: Puppet has 1 failures [06:57:04] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:26:52] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [07:26:54] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [07:27:22] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:33] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:43] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:44] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:14] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: puppet fail [07:28:23] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:23] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [07:29:13] RECOVERY - 
puppet last run on mw1149 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [07:29:22] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:41:38] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 326 [07:45:37] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 13 [07:51:27] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 339 [07:55:27] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 2 [07:56:03] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [07:57:51] <_joe_> /win 33 [07:58:04] <_joe_> win 33 [07:58:11] * YuviPanda loses _joe_ [07:58:17] good morning, _joe_ [07:58:38] <_joe_> heya [07:58:42] <_joe_> just got paged [07:58:47] <_joe_> but I'm going away [07:58:56] kk go away [07:59:00] page seems ot be just flapping [07:59:01] <_joe_> I'm with myparents after months, ttyl [07:59:05] <_joe_> yup [07:59:09] <_joe_> bye [07:59:21] _joe_: enjoy your weekend! [08:01:36] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 354 [08:09:17] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 6 [08:09:37] what on earth is going on ? [08:13:19] akosiaris: it's been flapping for a while [08:15:32] 'db1019' => 0, # 1.4TB 64GB, watchlist, recentchanges, contributions, logpager [08:15:48] so, commons ? [08:19:23] akosiaris: yes, it's commons, some queries are slow-ish but then they eventually get done [08:19:27] SELECT /* SpecialRecentChangesLinked [08:19:47] visible on https://tendril.wikimedia.org/report/slow_queries?host=db1019&hours=1 [08:20:49] the box is in very heavy iowait [08:21:00] so it gets critical for a moment and then recovers again when done [08:24:15] the iowait started increasing on 05:15 UTC [08:24:43] and it's around 20%, whereas before that is was around 3% [08:24:57] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 371 [08:25:07] and it's not gonna go away [08:26:08] I think there is nothing wrong with the slave... [08:26:35] it's just trying to catchup and can't due to all the load [08:29:25] hey jynus [08:30:25] jynus: so, recap up to now: db1019 in heavy IOwait, probably because of trying to catchup to the master. it's commons btw [08:30:55] slow queries up to now are mosty SELECT /* SpecialRecentChangesLinked::doMainQuery on db1019 [08:32:31] db1040s innodb checkpoint age is low btw https://tendril.wikimedia.org/host/view/db1040.eqiad.wmnet/3306 [08:35:08] hi [08:35:21] hey [08:35:31] any idea what's causing it? [08:36:04] it's commons [08:36:19] https://doc.wikimedia.org/mediawiki-core/master/php/SpecialRecentchangeslinked_8php_source.html [08:36:54] so.. https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1040&hours=1 [08:37:09] are those select queries normal ? 
[08:37:34] SELECT * FROM (SELECT selected_date AS datestring FROM (SELECT ADDDATE('1970-01-01', t4.i*10000 + t3.i*1000 + t2.i*100 + t1.i*10 + t0.i) selected_date from (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) t0, (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SE [08:37:41] what is that thing ? [08:37:46] and with no comment [08:38:04] that is research, ignore dbstore [08:38:09] oh [08:38:12] damn family tree [08:38:13] sorry [08:38:40] there is an issue with replication s4 wide, it is only hitting db1019 harder [08:39:14] huh? how did you tell? the rest have 0 lag .. [08:39:26] not really [08:39:42] so tendril is lying ? [08:39:58] no, it is just showing the current state [08:40:48] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 411 [08:40:59] the others are faster to catch up [08:41:05] ok [08:41:47] I am not saying db1019 is not badly, I am trying to see the original cause [08:42:07] threads_connected and aborted clients both show a pattern while this has been going on, on 1019 [08:42:18] could be effect rather than cause, though [08:42:26] that is normal, when lagging, mediawiki kills threads [08:43:45] ah I see 1042 has a similar-looking increase in rep lag, it's just not enough to threshold and alert [08:45:07] that is what I meant [08:45:49] when 2 slaves have lag at the same time [08:46:01] it is usually due ot something common [08:46:11] (master updates, etc) [08:51:24] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [08:58:27] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 58 [08:59:07] !log offlined db1019 megacli disk 32:11 [08:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:03:33] PROBLEM - RAID on db1019 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [09:11:46] (03CR) 10BBlack: [C: 031] Disable accept filters for HTTP on canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/256968 (https://phabricator.wikimedia.org/T119372) (owner: 10Ori.livneh) [09:12:07] bblack: GO TO SLEEP [09:13:59] 6operations, 10ops-eqiad: db1019 failing disk (degraded RAID) - https://phabricator.wikimedia.org/T120511#1855232 (10jcrespo) 3NEW [09:15:44] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1855240 (10BBlack) We can try the mediawiki-config revert during one of the Monday SWATs I think ( https://gerrit.wikimedia.or... [09:15:54] I just had to touch a few things from email :P [09:15:55] nite! 
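For context on the db1019 disk handling logged above ([08:59:07] "offlined db1019 megacli disk 32:11"): on an LSI controller the usual sequence is to identify the failing physical drive and take it offline with MegaCli before swapping it. A minimal sketch, not necessarily the exact commands run here — it assumes the MegaCli64 binary name (Debian packages often install it as plain megacli), adapter 0, and the enclosure:slot 32:11 from the log entry:

    # list physical drives; look for "Firmware state: Failed/Predictive Failure" and media error counts
    sudo MegaCli64 -PDList -aALL | grep -E 'Enclosure Device ID|Slot Number|Firmware state|Media Error'
    # show virtual drive status; it reads "Degraded" once a member drive is offline (matches the 09:03 RAID alert)
    sudo MegaCli64 -LDInfo -Lall -aALL | grep -E 'State|Number Of Drives'
    # take the failing drive offline ahead of replacement
    sudo MegaCli64 -PDOffline -PhysDrv '[32:11]' -a0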
[09:16:32] dbstore1002 could be anything unrelated [09:16:40] not prioritary now [09:16:49] will documment the raid issue first [09:17:03] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [09:43:15] I've documented https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Caused_by_hardware [10:41:14] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on port 9042 [11:00:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:00:23] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [1000.0] [11:06:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:06:23] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:07:02] thanks jy nus, read and noted [11:12:09] (03PS3) 10Reedy: Add jobqueue-labs.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254917 [11:12:16] (03CR) 10Reedy: [C: 032] Add jobqueue-labs.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254917 (owner: 10Reedy) [11:12:37] (03Merged) 10jenkins-bot: Add jobqueue-labs.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254917 (owner: 10Reedy) [11:13:25] !log reedy@tin Synchronized docroot and w: Add jobqueue-labs to noc (duration: 00m 28s) [11:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:22:08] !log reedy@tin Synchronized php-1.27.0-wmf.7/extensions/WikimediaMaintenance/refreshMessageBlobs.php: Less waiting for slaves (duration: 00m 28s) [11:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:23:53] !log reedy@tin Purged l10n cache for 1.27.0-wmf.5 [11:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:32:57] (03PS2) 10Reedy: Disable PasswordCannotBePopular for sysop and bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257044 [11:33:39] (03CR) 10Reedy: [C: 032] Disable PasswordCannotBePopular for sysop and bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257044 (owner: 10Reedy) [11:34:00] (03Merged) 10jenkins-bot: Disable PasswordCannotBePopular for sysop and bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257044 (owner: 10Reedy) [11:35:01] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Disable common password password policy to come in wmf.8 (duration: 00m 28s) [11:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:21] (03CR) 10Paladox: "Hi how do you download in the repos in phabricator I doin't see an option." 
[puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [12:57:33] PROBLEM - puppet last run on mw2033 is CRITICAL: CRITICAL: puppet fail [13:23:53] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail [13:25:04] RECOVERY - puppet last run on mw2033 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:51:15] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:19:45] 6operations, 10DBA: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#1855498 (10Nemo_bis) [14:19:50] 6operations, 10DBA, 6WMF-Legal, 5Patch-For-Review: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#1855502 (10Nemo_bis) [14:20:00] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1855508 (10Nemo_bis) [15:24:38] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1855565 (10Betacommand) I might have a few tricks for recovering this revision, give me a day or two Ill see what I can do. [17:00:36] Interesting. https://wikitech.m.wikimedia.org/ [17:00:41] Wikimedia.org portal page [17:04:40] 6operations, 10Reading-Web, 7Varnish: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1855709 (10Krinkle) 3NEW [17:08:13] 6operations, 10Reading-Web: [Regression] Unable to browse wikitech.wikimedia.org from mobile device (Apache error) - https://phabricator.wikimedia.org/T120528#1855716 (10Krinkle) 3NEW [17:08:42] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:10:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:10:27] 6operations, 10Reading-Web: [Regression] Unable to browse certain wikitech.wikimedia.org urls from mobile device (Apache error) - https://phabricator.wikimedia.org/T120528#1855725 (10Krinkle) [17:14:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:16:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:33:43] PROBLEM - puppet last run on mw1076 is CRITICAL: CRITICAL: Puppet has 1 failures [17:43:05] (03PS4) 10Ori.livneh: Disable accept filters for HTTP on canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/256968 (https://phabricator.wikimedia.org/T119372) [17:43:14] (03CR) 10Ori.livneh: [C: 032 V: 032] Disable accept filters for HTTP on canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/256968 (https://phabricator.wikimedia.org/T119372) (owner: 10Ori.livneh) [17:58:54] RECOVERY - puppet last run on mw1076 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:04:58] 6operations, 6Security-Team: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532#1855806 (10csteipp) [18:22:28] 6operations, 6Security-Team: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532#1855859 (10ori) MariaDB has [[ https://mariadb.com/kb/en/mariadb/pam-authentication-plugin/ | a free and open source PAM authentication 
module ]] (MySQL's is enterprise-only)... [18:30:47] !log started nodetool decommission on restbase1008 [18:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:02] 7Puppet, 6operations: "Various fixes for ordered_yaml" PR on github - https://phabricator.wikimedia.org/T120533#1855864 (10Reedy) 3NEW a:3ori [19:02:33] (03CR) 10Chad: "There is not an option to download arbitrary zip files from Phabricator. Github is more than welcome to waste their cpu cycles on that." [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [19:03:27] (03CR) 10Chad: "Also, this does nothing to prevent people from still using Gitblit still while it's on.... It just means Gerrit won't link to it." [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [19:24:49] (03PS1) 10Ori.livneh: wmflib: fixes for ordered_yaml [puppet] - 10https://gerrit.wikimedia.org/r/257075 [19:26:26] (03PS2) 10Ori.livneh: wmflib: fixes for ordered_yaml [puppet] - 10https://gerrit.wikimedia.org/r/257075 (https://phabricator.wikimedia.org/T120533) [19:27:48] (03PS4) 10Ori.livneh: Gerrit: use Diffusion for repo browsing (again) [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [19:31:51] 7Puppet, 6operations, 5Patch-For-Review: "Various fixes for ordered_yaml" PR on github - https://phabricator.wikimedia.org/T120533#1855890 (10ori) 5Open>3Resolved [19:32:10] (03PS5) 10Ori.livneh: Gerrit: use Diffusion for repo browsing (again) [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [19:32:16] (03CR) 10Ori.livneh: [C: 032 V: 032] Gerrit: use Diffusion for repo browsing (again) [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [19:34:02] ori: but it's Saturday :p [19:34:14] * ostriches grabs laptop for testing [19:34:54] PROBLEM - salt-minion processes on hafnium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:35:43] PROBLEM - salt-minion processes on wtp1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:35:53] PROBLEM - salt-minion processes on db1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:13] PROBLEM - salt-minion processes on nescio is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:13] PROBLEM - salt-minion processes on rdb1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:14] grr [19:36:24] PROBLEM - salt-minion processes on cp2024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:33] PROBLEM - salt-minion processes on db1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:33] PROBLEM - salt-minion processes on elastic1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:44] PROBLEM - salt-minion processes on cp2006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:44] PROBLEM - salt-minion processes on mw2169 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:38:39] ori: Yay no more %2F 
:P [19:42:03] Hello =) [20:34:39] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1855947 (10Krenair) a:3Betacommand I'll let @Betacommand try to find the revision, but if that doesn't work out I'll insert a new text entry like `SYSADMIN NOTE: Text of... [20:41:34] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: puppet fail [21:09:03] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:12:53] PROBLEM - puppet last run on wtp2014 is CRITICAL: CRITICAL: puppet fail [21:40:23] RECOVERY - puppet last run on wtp2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:57] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1856035 (10Krinkle) > **HTTP/2 is here! Goodbye SPDY? Not quite yet** > There is no need to make a decision between SPDY or HTTP/2. Both are automatically ther... [22:05:20] ori: bblack: CloudFare did the work already to rework the http2 patch in nginx to not remove spdy [22:05:43] Essentially rebasing this https://github.com/nginx/nginx/commit/ee37ff613fe2a746e23040a7a8aba64063123175 without the removal, and making the ssl handshake accept both [22:05:50] They're opensourcing it in the new year they say [22:07:14] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: puppet fail [22:34:34] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [22:39:12] PROBLEM - HHVM rendering on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:34] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:41:04] PROBLEM - nutcracker process on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:41:43] PROBLEM - RAID on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:41:52] PROBLEM - configured eth on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:41:53] PROBLEM - Check size of conntrack table on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:42:33] PROBLEM - DPKG on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:42:43] PROBLEM - SSH on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:42:44] PROBLEM - puppet last run on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:42:53] PROBLEM - HHVM processes on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:47:53] PROBLEM - salt-minion processes on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:47:53] PROBLEM - dhclient process on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:48:03] PROBLEM - Disk space on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:48:12] PROBLEM - nutcracker port on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:57:43] RECOVERY - dhclient process on mw1144 is OK: PROCS OK: 0 processes with command name dhclient [23:01:43] RECOVERY - salt-minion processes on mw1144 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:01:44] RECOVERY - Disk space on mw1144 is OK: DISK OK [23:01:53] RECOVERY - nutcracker port on mw1144 is OK: TCP OK - 0.000 second response time on port 11212 [23:06:03] RECOVERY - SSH on mw1144 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [23:06:03] RECOVERY - DPKG on mw1144 is OK: All packages OK [23:06:23] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 58 minutes ago with 0 failures [23:06:23] RECOVERY - HHVM processes on mw1144 is OK: PROCS OK: 6 processes with command name hhvm [23:06:24] RECOVERY - nutcracker process on mw1144 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [23:07:04] RECOVERY - RAID on mw1144 is OK: OK: no RAID installed [23:07:13] RECOVERY - configured eth on mw1144 is OK: OK - interfaces up [23:07:14] RECOVERY - Check size of conntrack table on mw1144 is OK: OK: nf_conntrack is 0 % full [23:11:43] PROBLEM - nutcracker port on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:12:03] PROBLEM - SSH on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:04] PROBLEM - DPKG on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:12:13] PROBLEM - puppet last run on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:12:23] PROBLEM - HHVM processes on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:13:32] RECOVERY - nutcracker port on mw1144 is OK: TCP OK - 0.000 second response time on port 11212 [23:13:44] RECOVERY - SSH on mw1144 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [23:13:45] RECOVERY - DPKG on mw1144 is OK: All packages OK [23:14:04] RECOVERY - HHVM processes on mw1144 is OK: PROCS OK: 6 processes with command name hhvm [23:14:04] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [23:14:14] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 66360 bytes in 3.900 second response time [23:14:43] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.286 second response time
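Tying back to the HTTP/2 / SPDY thread above (21:49–22:05): whether a TLS terminator offers h2, spdy/3.1, or both is settled during the ALPN handshake, so it can be checked from any client without server access once a patched nginx (such as the CloudFlare rework mentioned) is deployed. A rough sketch, assuming OpenSSL 1.0.2+ for -alpn and a curl built with HTTP/2 support; the hostname is only an example:

    # offer h2, spdy/3.1 and http/1.1 and see which one the server selects
    openssl s_client -connect en.wikipedia.org:443 -alpn 'h2,spdy/3.1,http/1.1' </dev/null 2>/dev/null | grep -i 'ALPN'
    # have curl attempt HTTP/2 and report the protocol version actually negotiated (curl >= 7.50 for %{http_version})
    curl -sI --http2 -o /dev/null -w 'negotiated HTTP version: %{http_version}\n' https://en.wikipedia.org/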