[00:05:47] 7Blocked-on-Operations, 7Varnish: Improve handling of mobile variants in Varnish - https://phabricator.wikimedia.org/T120151#1854453 (10ori) [00:08:47] (03Abandoned) 10Ricordisamoa: Don't match Phabricator task IDs inside URLs [puppet] - 10https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: 10Ricordisamoa) [00:09:07] (03PS1) 10Dzahn: zuul: move roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/257039 [00:10:01] (03CR) 10Ricordisamoa: "not able to review, sorry" [puppet] - 10https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: 10Thiemo Mättig (WMDE)) [00:11:33] (03PS2) 10Dzahn: zuul: move roles into role module [puppet] - 10https://gerrit.wikimedia.org/r/257039 [00:18:11] mutante: I have direct push access, so for easy stuff like that, I can just commit directly ;) [00:19:06] Reedy: :) it seems the rules are pretty much cleaned up already [00:19:23] the remaining things (ubuntu,apt etc.. ) stay [00:19:39] what we could maybe do is add more redirect rules like [00:19:44] wikipedia.com -> wikipedia.org [00:20:02] would that fix the cert error for httpseverywhere users [00:20:16] compare to the existing enwp.org rule [00:21:05] also: TIL there is frwp.org too [00:22:11] (03CR) 10Ori.livneh: [C: 032] maps: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257031 (owner: 10Ori.livneh) [00:22:57] mutante: We could, yeah [00:23:06] I presume they're not getting parked [00:23:14] And we're also not buying HTTPS certs either? [00:24:00] Reedy: wikipedia.com wont get parked i think [00:24:04] it's too good [00:24:09] heh [00:24:12] but let me ask in the meeting we have soon [00:24:19] (03PS1) 10Ori.livneh: Remove redis::ganglia; incompatible with multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/257042 [00:24:19] It's not like it's one of our squatted domains [00:24:26] yea, there are layers to this [00:24:31] from total crap to kind of good to good [00:24:41] typo domains [00:24:49] real project names in other country TLDs [00:25:01] i think wikipedia.com might be the most common one that isnt a real project URL [00:25:09] people just type .com for everything [00:25:27] yeah [00:25:32] there is even a key just for .com on my phone keyboard [00:25:37] Which begs the question if we should have HTTPS cert for it [00:26:04] yes, i think this specific one should probably be added to the cert [00:26:18] but that's exactly the question.. what's the policy [00:26:28] also,,letsencrypt or not and when [00:26:29] :) [00:26:37] Should we file a ticket for wp.com? [00:26:43] that looks like wordpress [00:26:44] lol [00:27:08] Reedy: https://phabricator.wikimedia.org/T42998 [00:27:35] heh, yes, WP is wordpress [00:28:18] Reedy: btw, we have open tickets about blog.wm.org and https-only /mixed content that are not solved and blocked by wordpress.com [00:28:30] while i dont see it in httpseverywhere anymore at all [00:28:44] It probably got removed by someone in a big cleanup [00:28:50] Maybe it'll be fixed when they rewrite WP in node [00:28:55] (03CR) 10Ori.livneh: [C: 032] Remove redis::ganglia; incompatible with multi-instance [puppet] - 10https://gerrit.wikimedia.org/r/257042 (owner: 10Ori.livneh) [00:29:27] Reedy: https://phabricator.wikimedia.org/T105905#1563190 [00:30:24] "Heard back from Automattic last week and it turns out that contrary to the previous discussion, the search/replace feature of WP-CLI is actually disabled on VIP because it can cause database issues. 
They are looking into alternative options." [00:30:44] so .. the feature that is needed is not there .. because we are VIP [00:30:55] and "cause database issues" .. wut :p [00:31:25] that's a nice kind of VIP where WP-CLI stuff is disabled [00:31:49] I suspect it's because it's doing a massive update, looking for text in big unindexed text columns [00:32:05] database issues? or people do stupid things and break their dbs issues? >.> [00:33:19] (03PS1) 10Ori.livneh: role::ci::slave::browsertests: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257043 [00:33:42] p858snake: :) i guess in the end it's the same [00:34:04] (03CR) 10Ori.livneh: [C: 032 V: 032] role::ci::slave::browsertests: migrate to redis::instance [puppet] - 10https://gerrit.wikimedia.org/r/257043 (owner: 10Ori.livneh) [00:36:27] 6operations, 6Labs, 10Labs-Infrastructure: add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486#1854628 (10Dzahn) 3NEW [00:36:47] 6operations, 6Labs, 10Labs-Infrastructure, 7HTTPS: add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486#1854641 (10Dzahn) [00:37:09] 6operations, 6Labs, 10Labs-Infrastructure, 7HTTPS: add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486#1854628 (10Dzahn) [00:41:11] andrewbogott: re: apache: or maybe we block it with ferm? (if it's needed for puppetmaster but only from local) [00:42:06] i also removed that default site before with apache::site { '000-default' .. ensure => absent before afair [00:42:59] i'll leave comments on gerrit , laters! [00:44:17] (03CR) 10Dzahn: "if this is needed for puppetmaster but only connections from local, then maybe base::firewall and a ferm rule is the way to go" [puppet] - 10https://gerrit.wikimedia.org/r/257034 (https://phabricator.wikimedia.org/T120449) (owner: 10Andrew Bogott) [00:45:09] (03PS1) 10Reedy: Disable PasswordCannotBePopular for sysop and bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257044 [00:45:20] (03CR) 10Dzahn: "to remove the apache site i have used: apache::site { '000-default'... ensure => absent before" [puppet] - 10https://gerrit.wikimedia.org/r/257034 (https://phabricator.wikimedia.org/T120449) (owner: 10Andrew Bogott) [00:47:49] (03CR) 10CSteipp: [C: 031] Disable PasswordCannotBePopular for sysop and bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257044 (owner: 10Reedy) [00:48:07] (03CR) 10Reedy: "Needs merging before 1.27.0-wmf.8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257044 (owner: 10Reedy) [01:45:54] (03CR) 10Krinkle: "Possibly fixme" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/257043 (owner: 10Ori.livneh) [01:46:39] ori: can you verify ^ I'm out on mobile. 
These auto deploy on ci in labs [02:03:20] Reedy: https://gerrit.wikimedia.org/r/#/c/257033/ [02:24:49] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.7) (duration: 09m 59s) [02:24:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:00:13] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures [03:27:24] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:59:21] (03PS2) 10Ori.livneh: Disable accept filters for HTTP on canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/256968 (https://phabricator.wikimedia.org/T119372) [03:59:58] (03PS3) 10Ori.livneh: Disable accept filters for HTTP on canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/256968 (https://phabricator.wikimedia.org/T119372) [06:09:08] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Dec 5 06:09:07 UTC 2015 (duration 3h 44m 18s) [06:09:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:21:18] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 332 [06:30:44] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:02] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:03] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: Puppet has 3 failures [06:31:13] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:20] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 24 [06:31:54] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:54] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:03] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:13] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 2 failures [06:32:23] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:43] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:43] PROBLEM - puppet last run on mw1149 is CRITICAL: CRITICAL: Puppet has 1 failures [06:57:04] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:26:52] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [07:26:54] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [07:27:22] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:33] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:43] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:27:44] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:14] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: puppet fail [07:28:23] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:28:23] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [07:29:13] RECOVERY - 
puppet last run on mw1149 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [07:29:22] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:41:38] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 326 [07:45:37] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 13 [07:51:27] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 339 [07:55:27] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 2 [07:56:03] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [07:57:51] <_joe_> /win 33 [07:58:04] <_joe_> win 33 [07:58:11] * YuviPanda loses _joe_ [07:58:17] good morning, _joe_ [07:58:38] <_joe_> heya [07:58:42] <_joe_> just got paged [07:58:47] <_joe_> but I'm going away [07:58:56] kk go away [07:59:00] page seems ot be just flapping [07:59:01] <_joe_> I'm with myparents after months, ttyl [07:59:05] <_joe_> yup [07:59:09] <_joe_> bye [07:59:21] _joe_: enjoy your weekend! [08:01:36] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 354 [08:09:17] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 6 [08:09:37] what on earth is going on ? [08:13:19] akosiaris: it's been flapping for a while [08:15:32] 'db1019' => 0, # 1.4TB 64GB, watchlist, recentchanges, contributions, logpager [08:15:48] so, commons ? [08:19:23] akosiaris: yes, it's commons, some queries are slow-ish but then they eventually get done [08:19:27] SELECT /* SpecialRecentChangesLinked [08:19:47] visible on https://tendril.wikimedia.org/report/slow_queries?host=db1019&hours=1 [08:20:49] the box is in very heavy iowait [08:21:00] so it gets critical for a moment and then recovers again when done [08:24:15] the iowait started increasing on 05:15 UTC [08:24:43] and it's around 20%, whereas before that is was around 3% [08:24:57] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 371 [08:25:07] and it's not gonna go away [08:26:08] I think there is nothing wrong with the slave... [08:26:35] it's just trying to catchup and can't due to all the load [08:29:25] hey jynus [08:30:25] jynus: so, recap up to now: db1019 in heavy IOwait, probably because of trying to catchup to the master. it's commons btw [08:30:55] slow queries up to now are mosty SELECT /* SpecialRecentChangesLinked::doMainQuery on db1019 [08:32:31] db1040s innodb checkpoint age is low btw https://tendril.wikimedia.org/host/view/db1040.eqiad.wmnet/3306 [08:35:08] hi [08:35:21] hey [08:35:31] any idea what's causing it? [08:36:04] it's commons [08:36:19] https://doc.wikimedia.org/mediawiki-core/master/php/SpecialRecentchangeslinked_8php_source.html [08:36:54] so.. https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1040&hours=1 [08:37:09] are those select queries normal ? 
[08:37:34] SELECT * FROM (SELECT selected_date AS datestring FROM (SELECT ADDDATE('1970-01-01', t4.i*10000 + t3.i*1000 + t2.i*100 + t1.i*10 + t0.i) selected_date from (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SELECT 8 UNION SELECT 9) t0, (SELECT 0 i UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4 UNION SELECT 5 UNION SELECT 6 UNION SELECT 7 UNION SE [08:37:41] what is that thing ? [08:37:46] and with no comment [08:38:04] that is research, ignore dbstore [08:38:09] oh [08:38:12] damn family tree [08:38:13] sorry [08:38:40] there is an issue with replication s4 wide, it is only hitting db1019 harder [08:39:14] huh? how did you tell? the rest have 0 lag .. [08:39:26] not really [08:39:42] so tendril is lying ? [08:39:58] no, it is just showing the current state [08:40:48] PROBLEM - MariaDB Slave Lag: s4 on db1019 is CRITICAL: CRITICAL slave_sql_lag Seconds_Behind_Master: 411 [08:40:59] the others are faster to catch up [08:41:05] ok [08:41:47] I am not saying db1019 is not badly, I am trying to see the original cause [08:42:07] threads_connected and aborted clients both show a pattern while this has been going on, on 1019 [08:42:18] could be effect rather than cause, though [08:42:26] that is normal, when lagging, mediawiki kills threads [08:43:45] ah I see 1042 has a similar-looking increase in rep lag, it's just not enough to threshold and alert [08:45:07] that is what I meant [08:45:49] when 2 slaves have lag at the same time [08:46:01] it is usually due ot something common [08:46:11] (master updates, etc) [08:51:24] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [08:58:27] RECOVERY - MariaDB Slave Lag: s4 on db1019 is OK: OK slave_sql_lag Seconds_Behind_Master: 58 [08:59:07] !log offlined db1019 megacli disk 32:11 [08:59:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:03:33] PROBLEM - RAID on db1019 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) [09:11:46] (03CR) 10BBlack: [C: 031] Disable accept filters for HTTP on canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/256968 (https://phabricator.wikimedia.org/T119372) (owner: 10Ori.livneh) [09:12:07] bblack: GO TO SLEEP [09:13:59] 6operations, 10ops-eqiad: db1019 failing disk (degraded RAID) - https://phabricator.wikimedia.org/T120511#1855232 (10jcrespo) 3NEW [09:15:44] 6operations, 6Commons, 10MediaWiki-File-management, 6Multimedia, and 2 others: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038#1855240 (10BBlack) We can try the mediawiki-config revert during one of the Monday SWATs I think ( https://gerrit.wikimedia.or... [09:15:54] I just had to touch a few things from email :P [09:15:55] nite! 
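For context on the db1019 disk handling logged above ([08:59:07] "offlined db1019 megacli disk 32:11"): on an LSI controller the usual sequence is to identify the failing physical drive and take it offline with MegaCli before swapping it. A minimal sketch, not necessarily the exact commands run here — it assumes the MegaCli64 binary name (Debian packages often install it as plain megacli), adapter 0, and the enclosure:slot 32:11 from the log entry:

    # list physical drives; look for "Firmware state: Failed/Predictive Failure" and media error counts
    sudo MegaCli64 -PDList -aALL | grep -E 'Enclosure Device ID|Slot Number|Firmware state|Media Error'
    # show virtual drive status; it reads "Degraded" once a member drive is offline (matches the 09:03 RAID alert)
    sudo MegaCli64 -LDInfo -Lall -aALL | grep -E 'State|Number Of Drives'
    # take the failing drive offline ahead of replacement
    sudo MegaCli64 -PDOffline -PhysDrv '[32:11]' -a0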
[09:16:32] dbstore1002 could be anything unrelated [09:16:40] not prioritary now [09:16:49] will documment the raid issue first [09:17:03] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [09:43:15] I've documented https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Caused_by_hardware [10:41:14] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on port 9042 [11:00:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [11:00:23] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [1000.0] [11:06:14] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:06:23] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [11:07:02] thanks jy nus, read and noted [11:12:09] (03PS3) 10Reedy: Add jobqueue-labs.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254917 [11:12:16] (03CR) 10Reedy: [C: 032] Add jobqueue-labs.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254917 (owner: 10Reedy) [11:12:37] (03Merged) 10jenkins-bot: Add jobqueue-labs.php to noc [mediawiki-config] - 10https://gerrit.wikimedia.org/r/254917 (owner: 10Reedy) [11:13:25] !log reedy@tin Synchronized docroot and w: Add jobqueue-labs to noc (duration: 00m 28s) [11:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:22:08] !log reedy@tin Synchronized php-1.27.0-wmf.7/extensions/WikimediaMaintenance/refreshMessageBlobs.php: Less waiting for slaves (duration: 00m 28s) [11:22:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:23:53] !log reedy@tin Purged l10n cache for 1.27.0-wmf.5 [11:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:32:57] (03PS2) 10Reedy: Disable PasswordCannotBePopular for sysop and bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257044 [11:33:39] (03CR) 10Reedy: [C: 032] Disable PasswordCannotBePopular for sysop and bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257044 (owner: 10Reedy) [11:34:00] (03Merged) 10jenkins-bot: Disable PasswordCannotBePopular for sysop and bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/257044 (owner: 10Reedy) [11:35:01] !log reedy@tin Synchronized wmf-config/CommonSettings.php: Disable common password password policy to come in wmf.8 (duration: 00m 28s) [11:35:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:41:21] (03CR) 10Paladox: "Hi how do you download in the repos in phabricator I doin't see an option." 
[puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [12:57:33] PROBLEM - puppet last run on mw2033 is CRITICAL: CRITICAL: puppet fail [13:23:53] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: puppet fail [13:25:04] RECOVERY - puppet last run on mw2033 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [13:51:15] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [14:19:45] 6operations, 10DBA: Multiple pages with no revisions - https://phabricator.wikimedia.org/T112282#1855498 (10Nemo_bis) [14:19:50] 6operations, 10DBA, 6WMF-Legal, 5Patch-For-Review: dbtree loads third party resources (from jquery.com and google.com) - https://phabricator.wikimedia.org/T96499#1855502 (10Nemo_bis) [14:20:00] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1855508 (10Nemo_bis) [15:24:38] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1855565 (10Betacommand) I might have a few tricks for recovering this revision, give me a day or two Ill see what I can do. [17:00:36] Interesting. https://wikitech.m.wikimedia.org/ [17:00:41] Wikimedia.org portal page [17:04:40] 6operations, 10Reading-Web, 7Varnish: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#1855709 (10Krinkle) 3NEW [17:08:13] 6operations, 10Reading-Web: [Regression] Unable to browse wikitech.wikimedia.org from mobile device (Apache error) - https://phabricator.wikimedia.org/T120528#1855716 (10Krinkle) 3NEW [17:08:42] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [17:10:14] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [17:10:27] 6operations, 10Reading-Web: [Regression] Unable to browse certain wikitech.wikimedia.org urls from mobile device (Apache error) - https://phabricator.wikimedia.org/T120528#1855725 (10Krinkle) [17:14:12] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:16:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:33:43] PROBLEM - puppet last run on mw1076 is CRITICAL: CRITICAL: Puppet has 1 failures [17:43:05] (03PS4) 10Ori.livneh: Disable accept filters for HTTP on canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/256968 (https://phabricator.wikimedia.org/T119372) [17:43:14] (03CR) 10Ori.livneh: [C: 032 V: 032] Disable accept filters for HTTP on canary app servers [puppet] - 10https://gerrit.wikimedia.org/r/256968 (https://phabricator.wikimedia.org/T119372) (owner: 10Ori.livneh) [17:58:54] RECOVERY - puppet last run on mw1076 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [18:04:58] 6operations, 6Security-Team: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532#1855806 (10csteipp) [18:22:28] 6operations, 6Security-Team: Use user-specific passwords for accessing EventLogging database - https://phabricator.wikimedia.org/T120532#1855859 (10ori) MariaDB has [[ https://mariadb.com/kb/en/mariadb/pam-authentication-plugin/ | a free and open source PAM authentication 
module ]] (MySQL's is enterprise-only)... [18:30:47] !log started nodetool decommission on restbase1008 [18:30:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:02] 7Puppet, 6operations: "Various fixes for ordered_yaml" PR on github - https://phabricator.wikimedia.org/T120533#1855864 (10Reedy) 3NEW a:3ori [19:02:33] (03CR) 10Chad: "There is not an option to download arbitrary zip files from Phabricator. Github is more than welcome to waste their cpu cycles on that." [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [19:03:27] (03CR) 10Chad: "Also, this does nothing to prevent people from still using Gitblit still while it's on.... It just means Gerrit won't link to it." [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [19:24:49] (03PS1) 10Ori.livneh: wmflib: fixes for ordered_yaml [puppet] - 10https://gerrit.wikimedia.org/r/257075 [19:26:26] (03PS2) 10Ori.livneh: wmflib: fixes for ordered_yaml [puppet] - 10https://gerrit.wikimedia.org/r/257075 (https://phabricator.wikimedia.org/T120533) [19:27:48] (03PS4) 10Ori.livneh: Gerrit: use Diffusion for repo browsing (again) [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [19:31:51] 7Puppet, 6operations, 5Patch-For-Review: "Various fixes for ordered_yaml" PR on github - https://phabricator.wikimedia.org/T120533#1855890 (10ori) 5Open>3Resolved [19:32:10] (03PS5) 10Ori.livneh: Gerrit: use Diffusion for repo browsing (again) [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [19:32:16] (03CR) 10Ori.livneh: [C: 032 V: 032] Gerrit: use Diffusion for repo browsing (again) [puppet] - 10https://gerrit.wikimedia.org/r/256605 (https://phabricator.wikimedia.org/T110607) (owner: 10Chad) [19:34:02] ori: but it's Saturday :p [19:34:14] * ostriches grabs laptop for testing [19:34:54] PROBLEM - salt-minion processes on hafnium is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:35:43] PROBLEM - salt-minion processes on wtp1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:35:53] PROBLEM - salt-minion processes on db1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:13] PROBLEM - salt-minion processes on nescio is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:13] PROBLEM - salt-minion processes on rdb1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:14] grr [19:36:24] PROBLEM - salt-minion processes on cp2024 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:33] PROBLEM - salt-minion processes on db1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:33] PROBLEM - salt-minion processes on elastic1031 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:44] PROBLEM - salt-minion processes on cp2006 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:36:44] PROBLEM - salt-minion processes on mw2169 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [19:38:39] ori: Yay no more %2F 
:P [19:42:03] Hello =) [20:34:39] 6operations, 10DBA: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1855947 (10Krenair) a:3Betacommand I'll let @Betacommand try to find the revision, but if that doesn't work out I'll insert a new text entry like `SYSADMIN NOTE: Text of... [20:41:34] PROBLEM - puppet last run on mw2211 is CRITICAL: CRITICAL: puppet fail [21:09:03] RECOVERY - puppet last run on mw2211 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:12:53] PROBLEM - puppet last run on wtp2014 is CRITICAL: CRITICAL: puppet fail [21:40:23] RECOVERY - puppet last run on wtp2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:49:57] 6operations, 6Performance-Team, 10Traffic: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#1856035 (10Krinkle) > **HTTP/2 is here! Goodbye SPDY? Not quite yet** > There is no need to make a decision between SPDY or HTTP/2. Both are automatically ther... [22:05:20] ori: bblack: CloudFare did the work already to rework the http2 patch in nginx to not remove spdy [22:05:43] Essentially rebasing this https://github.com/nginx/nginx/commit/ee37ff613fe2a746e23040a7a8aba64063123175 without the removal, and making the ssl handshake accept both [22:05:50] They're opensourcing it in the new year they say [22:07:14] PROBLEM - puppet last run on mw2204 is CRITICAL: CRITICAL: puppet fail [22:34:34] RECOVERY - puppet last run on mw2204 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [22:39:12] PROBLEM - HHVM rendering on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:34] PROBLEM - Apache HTTP on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:41:04] PROBLEM - nutcracker process on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:41:43] PROBLEM - RAID on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:41:52] PROBLEM - configured eth on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:41:53] PROBLEM - Check size of conntrack table on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:42:33] PROBLEM - DPKG on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:42:43] PROBLEM - SSH on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:42:44] PROBLEM - puppet last run on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:42:53] PROBLEM - HHVM processes on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:47:53] PROBLEM - salt-minion processes on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:47:53] PROBLEM - dhclient process on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:48:03] PROBLEM - Disk space on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:48:12] PROBLEM - nutcracker port on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:57:43] RECOVERY - dhclient process on mw1144 is OK: PROCS OK: 0 processes with command name dhclient [23:01:43] RECOVERY - salt-minion processes on mw1144 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [23:01:44] RECOVERY - Disk space on mw1144 is OK: DISK OK [23:01:53] RECOVERY - nutcracker port on mw1144 is OK: TCP OK - 0.000 second response time on port 11212 [23:06:03] RECOVERY - SSH on mw1144 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [23:06:03] RECOVERY - DPKG on mw1144 is OK: All packages OK [23:06:23] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 58 minutes ago with 0 failures [23:06:23] RECOVERY - HHVM processes on mw1144 is OK: PROCS OK: 6 processes with command name hhvm [23:06:24] RECOVERY - nutcracker process on mw1144 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [23:07:04] RECOVERY - RAID on mw1144 is OK: OK: no RAID installed [23:07:13] RECOVERY - configured eth on mw1144 is OK: OK - interfaces up [23:07:14] RECOVERY - Check size of conntrack table on mw1144 is OK: OK: nf_conntrack is 0 % full [23:11:43] PROBLEM - nutcracker port on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:12:03] PROBLEM - SSH on mw1144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:12:04] PROBLEM - DPKG on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:12:13] PROBLEM - puppet last run on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:12:23] PROBLEM - HHVM processes on mw1144 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [23:13:32] RECOVERY - nutcracker port on mw1144 is OK: TCP OK - 0.000 second response time on port 11212 [23:13:44] RECOVERY - SSH on mw1144 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [23:13:45] RECOVERY - DPKG on mw1144 is OK: All packages OK [23:14:04] RECOVERY - HHVM processes on mw1144 is OK: PROCS OK: 6 processes with command name hhvm [23:14:04] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [23:14:14] RECOVERY - HHVM rendering on mw1144 is OK: HTTP OK: HTTP/1.1 200 OK - 66360 bytes in 3.900 second response time [23:14:43] RECOVERY - Apache HTTP on mw1144 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.286 second response time
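Tying back to the HTTP/2 / SPDY thread above (21:49–22:05): whether a TLS terminator offers h2, spdy/3.1, or both is settled during the ALPN handshake, so it can be checked from any client without server access once a patched nginx (such as the CloudFlare rework mentioned) is deployed. A rough sketch, assuming OpenSSL 1.0.2+ for -alpn and a curl built with HTTP/2 support; the hostname is only an example:

    # offer h2, spdy/3.1 and http/1.1 and see which one the server selects
    openssl s_client -connect en.wikipedia.org:443 -alpn 'h2,spdy/3.1,http/1.1' </dev/null 2>/dev/null | grep -i 'ALPN'
    # have curl attempt HTTP/2 and report the protocol version actually negotiated (curl >= 7.50 for %{http_version})
    curl -sI --http2 -o /dev/null -w 'negotiated HTTP version: %{http_version}\n' https://en.wikipedia.org/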