[00:35:13] (PS1) Tim Starling: Do a 301 redirect for wiki requests to URLs starting with /? [puppet] - https://gerrit.wikimedia.org/r/411522
[00:36:13] (CR) Tim Starling: "Untested" [puppet] - https://gerrit.wikimedia.org/r/411522 (owner: Tim Starling)
[01:02:01] (PS1) Andrew Bogott: labweb horizon: switch config files to version 'ocata' [puppet] - https://gerrit.wikimedia.org/r/411545
[01:02:09] (PS2) Andrew Bogott: labweb horizon: switch config files to version 'ocata' [puppet] - https://gerrit.wikimedia.org/r/411545
[01:02:42] (CR) Andrew Bogott: [C: 2] labweb horizon: switch config files to version 'ocata' [puppet] - https://gerrit.wikimedia.org/r/411545 (owner: Andrew Bogott)
[01:16:22] (PS1) Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506)
[01:17:02] (CR) jerkins-bot: [V: -1] labweb horizon: share memcached among labwebs [puppet] - https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) (owner: Andrew Bogott)
[01:23:42] (PS2) Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506)
[01:24:47] (CR) jerkins-bot: [V: -1] labweb horizon: share memcached among labwebs [puppet] - https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) (owner: Andrew Bogott)
[01:25:38] (PS3) Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506)
[01:25:47] (CR) jerkins-bot: [V: -1] labweb horizon: share memcached among labwebs [puppet] - https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506) (owner: Andrew Bogott)
[01:40:01] (PS4) Andrew Bogott: labweb horizon: share memcached among labwebs [puppet] - https://gerrit.wikimedia.org/r/411546 (https://phabricator.wikimedia.org/T187506)
[02:15:36] PROBLEM - Disk space on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:17:46] PROBLEM - puppet last run on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:17:55] PROBLEM - configured eth on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:18:06] PROBLEM - DPKG on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:18:06] PROBLEM - MD RAID on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:18:25] PROBLEM - Check systemd state on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:18:26] PROBLEM - dhclient process on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:18:45] PROBLEM - Check the NTP synchronisation status of timesyncd on stat1005 is CRITICAL: Return code of 255 is out of bounds
[02:21:10] (PS1) Krinkle: webperf: Add some commments to navtiming test cases [puppet] - https://gerrit.wikimedia.org/r/411547
[02:25:25] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational
[02:25:35] RECOVERY - dhclient process on stat1005 is OK: PROCS OK: 0 processes with command name dhclient
[02:25:55] RECOVERY - configured eth on stat1005 is OK: OK - interfaces up
[02:26:15] RECOVERY - DPKG on stat1005 is OK: All packages OK
[02:26:15] RECOVERY - MD RAID on stat1005 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[02:26:26] PROBLEM - puppet last run on lvs4007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:27:46] RECOVERY - puppet last run on stat1005 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[02:48:45] RECOVERY - Check the NTP synchronisation status of timesyncd on stat1005 is OK: OK: synced at Sat 2018-02-17 02:48:38 UTC.
[02:51:26] RECOVERY - puppet last run on lvs4007 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[03:12:40] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.17 (duration: 04m 32s)
[03:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:15:20] !log demon@tin Pruned MediaWiki: 1.31.0-wmf.20 [keeping static files] (duration: 01m 17s)
[03:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:26:15] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 857.55 seconds
[03:58:19] (CR) Legoktm: [C: 1] Remove redundant wgTemplateSandboxEditNamespaces addition [mediawiki-config] - https://gerrit.wikimedia.org/r/363531 (owner: Legoktm)
[04:04:16] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 164.11 seconds
[10:29:56] Operations: Remove dpatrick from security@ - https://phabricator.wikimedia.org/T187615#3980810 (Reedy)
[10:47:06] Operations: Remove dpatrick from security@ - https://phabricator.wikimedia.org/T187615#3980852 (Reedy)
[12:45:58] (CR) Hashar: [C: 1] "Chad wrote:" [puppet] - https://gerrit.wikimedia.org/r/411211 (owner: Muehlenhoff)
[14:58:26] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 35.27, 33.92, 32.23
[16:30:04] Puppet, cloud-services-team (Kanban): role::puppet::self referenced in puppet_ssldir.rb - https://phabricator.wikimedia.org/T187622#3981036 (Andrew)
[16:37:43] Operations, ops-eqiad, DBA, hardware-requests, Patch-For-Review: Decommission db1043 - https://phabricator.wikimedia.org/T187542#3981046 (jcrespo) The usage of the tafs is ok. Note the substitution host is already online and in production, and the old hosts set as spare. What we wanted to to...
[16:42:13] (PS1) Andrew Bogott: remove role::toollabs::puppetmaster and toollabs::puppetmaster [puppet] - https://gerrit.wikimedia.org/r/411614 (https://phabricator.wikimedia.org/T182810)
[16:42:15] (PS1) Andrew Bogott: remove role::puppet::self [puppet] - https://gerrit.wikimedia.org/r/411615 (https://phabricator.wikimedia.org/T182810)
[16:42:17] (PS1) Andrew Bogott: remove 'puppet' module [puppet] - https://gerrit.wikimedia.org/r/411616 (https://phabricator.wikimedia.org/T182810)
[16:57:40] enwiki query traffic increased by 20% very quickly a few hours ago, that is a >20K queries per second increase in a few hours, and 10x the number of writes
[17:00:46] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 8.18, 13.93, 23.30
[17:03:21] it started around 14:54
[17:31:55] Operations, Ops-Access-Requests: Requesting shell access and access to groups 'analytics-privatedata-users' and 'researchers' for katielin (katie) - https://phabricator.wikimedia.org/T187623#3981063 (katielin)
[17:33:38] !log restarting apache on phab1001 to clear deadlocked workers. refs T182832
[17:33:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:37:17] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3981076 (mmodell) @elukey, @dzahn: do you think that...
[17:37:54] Operations, Phabricator, Release-Engineering-Team, User-Elukey: Phabricator down due to "Failed to `proc_open()`: proc_open() expects parameter 2 to be array" - https://phabricator.wikimedia.org/T186620#3981078 (mmodell)
[17:37:59] Operations, Phabricator, Patch-For-Review, Release-Engineering-Team (Kanban), User-Elukey: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832#3981080 (mmodell)
[17:42:25] PROBLEM - Disk space on rhenium is CRITICAL: DISK CRITICAL - free space: / 1720 MB (3% inode=96%)
[17:59:11] Operations, Cloud-VPS, cloud-services-team, hardware-requests: eqiad: (4) systems for CirrusSearch Elasticssearch replica service - https://phabricator.wikimedia.org/T187627#3981122 (bd808)
[18:01:01] Operations, Cloud-VPS, cloud-services-team, hardware-requests: eqiad: (4) systems for CirrusSearch Elasticssearch replica service - https://phabricator.wikimedia.org/T187627#3981122 (bd808) @robh you may be able to find an email thread titled "FY17/18: Putting a live copy of CirrusSearch data in...
[19:16:25] PROBLEM - HHVM rendering on mw2128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:17:15] RECOVERY - HHVM rendering on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 74327 bytes in 0.254 second response time
[19:42:25] PROBLEM - Nginx local proxy to apache on mw2123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:43:16] RECOVERY - Nginx local proxy to apache on mw2123 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.184 second response time
[22:11:55] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 33.83, 33.42, 32.03
[22:30:05] PROBLEM - puppet last run on stat1005 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[cdh::hadoop::directory /user/spark]
[22:44:26] PROBLEM - High CPU load on API appserver on mw1345 is CRITICAL: CRITICAL - load average: 51.74, 49.96, 48.19
[23:01:35] PROBLEM - High CPU load on API appserver on mw1345 is CRITICAL: CRITICAL - load average: 52.54, 48.63, 48.20
[23:32:20] Operations, DBA, MediaWiki-General-or-Unknown, MW-1.31-release-notes (WMF-deploy-2018-02-20 (1.31.0-wmf.22)), and 2 others: Regularly purge expired temporary userrights from DB tables - https://phabricator.wikimedia.org/T176754#3981254 (MarcoAurelio) My advice is to do this off-SWAT. Talk to @gre...
[23:32:36] PROBLEM - High CPU load on API appserver on mw1345 is CRITICAL: CRITICAL - load average: 49.77, 48.03, 48.04
[23:37:26] PROBLEM - Apache HTTP on mw2130 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:38:16] RECOVERY - Apache HTTP on mw2130 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.121 second response time