[01:51:05] (PS1) Alex Monk: openstack: remove unused volume class, update default version [puppet] - https://gerrit.wikimedia.org/r/311304
[01:56:24] (PS1) Alex Monk: openstack: Add basic monitoring for HTTP services [puppet] - https://gerrit.wikimedia.org/r/311306 (https://phabricator.wikimedia.org/T42022)
[01:57:35] (CR) jenkins-bot: [V: -1] openstack: Add basic monitoring for HTTP services [puppet] - https://gerrit.wikimedia.org/r/311306 (https://phabricator.wikimedia.org/T42022) (owner: Alex Monk)
[02:01:32] (PS2) Alex Monk: openstack: Add basic monitoring for HTTP services [puppet] - https://gerrit.wikimedia.org/r/311306 (https://phabricator.wikimedia.org/T42022)
[02:15:24] (PS1) Alex Monk: openstack: mitaka files/templates: fix puppet header to give correct path [puppet] - https://gerrit.wikimedia.org/r/311309
[02:26:51] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.18) (duration: 10m 18s)
[02:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:49] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Sep 18 02:32:49 UTC 2016 (duration 5m 58s)
[02:32:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:37:14] PROBLEM - puppet last run on carbon is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[acme-setup-acme-apt]
[03:13:13] PROBLEM - puppet last run on db2010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:18:54] PROBLEM - puppet last run on achernar is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:37:44] RECOVERY - puppet last run on db2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:45:56] RECOVERY - puppet last run on achernar is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[03:53:24] PROBLEM - puppet last run on analytics1031 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/lib/nagios/plugins/check_raid]
[04:17:56] RECOVERY - puppet last run on analytics1031 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:31:06] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/ganglia/conf.d/apache_status.pyconf]
[04:49:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[04:49:56] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[04:50:17] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[04:52:16] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[04:55:26] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:59:48] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0]
[05:35:27] PROBLEM - puppet last run on mc2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:39:06] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [1000.0]
[05:41:35] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:41:54] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[05:46:35] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[05:58:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0]
[06:02:29] RECOVERY - puppet last run on mc2008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:07:54] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:11:09] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:13:59] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[06:26:07] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:32:37] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:38:19] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:43:11] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[06:52:41] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Not Available - 531 bytes in 0.036 second response time
[07:00:39] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [1000.0]
[07:12:39] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3677 bytes in 0.025 second response time
[07:17:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:30:11] PROBLEM - puppet last run on mw2217 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:35:13] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[07:37:36] PROBLEM - puppet last run on mw2121 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:44:58] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:52:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:57:14] RECOVERY - puppet last run on mw2217 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:02:25] RECOVERY - puppet last run on mw2121 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[08:12:17] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
[08:14:28] PROBLEM - puppet last run on elastic1019 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/field.sh]
[08:22:10] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[08:29:42] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [1000.0]
[08:29:53] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [1000.0]
[08:38:03] Operations, Operations-Software-Development: Evaluation of automation/orchestration tools - https://phabricator.wikimedia.org/T143306#2646272 (Volans)
[08:39:15] RECOVERY - puppet last run on elastic1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:44:36] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:46:55] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[08:49:13] Operations, Operations-Software-Development: Evaluation of automation/orchestration tools - https://phabricator.wikimedia.org/T143306#2646296 (Volans) I'd tend to exclude also: * Ansible because through their Yaml configuration files is not possible to achieve our use cases and their [[http://docs.ansibl...
[08:56:16] PROBLEM - puppet last run on cp3031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:59:17] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[09:14:25] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[09:21:07] RECOVERY - puppet last run on cp3031 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[09:26:28] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[09:28:36] PROBLEM - puppet last run on cp1073 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:30:16] <_joe_> !log varnish-backend-restart on cp1048
[09:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:35:26] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[09:47:54] <_joe_> !log varnish-backend-restart on cp1063
[09:47:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:48:17] PROBLEM - puppet last run on db2039 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:48:38] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[09:53:18] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[09:55:38] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[09:55:49] PROBLEM - puppet last run on cp1063 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnish]
[09:55:57] PROBLEM - puppet last run on cp1099 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Service[varnish]
[09:56:08] <_joe_> ?
[09:56:09] PROBLEM - Varnishkafka log producer on cp1063 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[09:58:58] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[09:59:53] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:01:20] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[10:02:15] <_joe_> all the RAID errors are from swift cluster overloading
[10:02:31] <_joe_> which is expected as we had to restart a few eqiad backends
[10:03:39] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:03:40] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:11:01] RECOVERY - puppet last run on cp1063 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:11:12] RECOVERY - Varnishkafka log producer on cp1063 is OK: PROCS OK: 1 process with command name varnishkafka
[10:15:42] RECOVERY - puppet last run on db2039 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:20:53] RECOVERY - puppet last run on cp1099 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:24:04] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[10:28:17] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[10:28:35] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[10:30:56] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[10:31:11] (PS1) Ema: varnish-backend-restart: fix service invocation [puppet] - https://gerrit.wikimedia.org/r/311326
[10:35:24] (CR) Ema: [C: 2] varnish-backend-restart: fix service invocation [puppet] - https://gerrit.wikimedia.org/r/311326 (owner: Ema)
[10:35:48] PROBLEM - HP RAID on ms-be1023 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[10:36:08] PROBLEM - HP RAID on ms-be1024 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[10:37:47] PROBLEM - puppet last run on cp2009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate_varnishkafka_webrequest_gmond_pyconf]
[10:40:09] RECOVERY - puppet last run on cp2009 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[10:40:40] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [1000.0]
[10:41:13] (PS1) Urbanecm: Add WT namespace alias to NS_PROJECT in mywiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/311327 (https://phabricator.wikimedia.org/T140998)
[10:41:29] PROBLEM - HP RAID on ms-be1025 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[10:47:41] RECOVERY - puppet last run on cp1073 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[10:48:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[10:49:20] !log repooling varnish on cp1074
[10:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:49:39] !log repooling varnish on cp1073
[10:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:50:11] !log repooling varnish on cp1072
[10:50:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:50:43] !log repooling varnish on cp1071
[10:50:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:51:20] RECOVERY - HP RAID on ms-be1025 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[10:52:36] !log repooling varnish on cp1064
[10:52:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:53:28] !log repooling varnish on cp1062
[10:53:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:54:12] !log repooling varnish on cp1050
[10:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:58:23] RECOVERY - HP RAID on ms-be1024 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[10:58:25] !log varnish-backend restart on cp3044
[10:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:59:59] !log varnish-backend restart on cp3037
[11:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:00:34] RECOVERY - HP RAID on ms-be1023 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[11:03:11] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[11:08:04] PROBLEM - Varnishkafka log producer on cp3044 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[11:10:34] RECOVERY - Varnishkafka log producer on cp3044 is OK: PROCS OK: 1 process with command name varnishkafka
[11:17:45] !log repooling varnish-be in codfw
[11:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:17:54] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[11:31:28] PROBLEM - puppet last run on elastic2007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:35:37] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[11:56:17] RECOVERY - puppet last run on elastic2007 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures
[13:05:51] (PS1) BBlack: cache_upload: increase FE size limit to 2MB [puppet] - https://gerrit.wikimedia.org/r/311330
[13:06:44] (CR) BBlack: [C: 2 V: 2] cache_upload: increase FE size limit to 2MB [puppet] - https://gerrit.wikimedia.org/r/311330 (owner: BBlack)
[13:14:49] PROBLEM - Varnishkafka log producer on cp1064 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[13:19:51] RECOVERY - Varnishkafka log producer on cp1064 is OK: PROCS OK: 1 process with command name varnishkafka
[13:29:03] !log disabling puppet on cp1074, to experiment with vhtcpd regex filter
[13:29:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:51:39] (PS1) BBlack: cache_upload: vhtcpd host regex filter [puppet] - https://gerrit.wikimedia.org/r/311332
[13:53:29] (CR) jenkins-bot: [V: -1] cache_upload: vhtcpd host regex filter [puppet] - https://gerrit.wikimedia.org/r/311332 (owner: BBlack)
[13:55:05] (PS2) BBlack: cache_upload: vhtcpd host regex filter [puppet] - https://gerrit.wikimedia.org/r/311332
[13:57:33] (CR) BBlack: [C: 2] cache_upload: vhtcpd host regex filter [puppet] - https://gerrit.wikimedia.org/r/311332 (owner: BBlack)
[14:07:53] (PS1) BBlack: htcppurger: quoting bugfix for host_regex [puppet] - https://gerrit.wikimedia.org/r/311333
[14:08:10] (CR) BBlack: [C: 2 V: 2] htcppurger: quoting bugfix for host_regex [puppet] - https://gerrit.wikimedia.org/r/311333 (owner: BBlack)
[14:17:18] !log restarting varnish backend on cp1073 (503 LRU_Fail pattern, has been up a few days...)
[14:17:19] PROBLEM - puppet last run on cp2025 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/varnish-backend-restart]
[14:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:18:09] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [1000.0]
[14:18:19] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[14:30:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:30:56] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[14:39:57] RECOVERY - puppet last run on cp2025 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[14:42:44] !log restarting upload varnish backend: cp2022
[14:42:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:52:12] !log restarting upload varnish backend: cp1049
[14:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:00:23] PROBLEM - Varnishkafka log producer on cp1072 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[15:06:37] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[15:09:07] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[15:12:38] RECOVERY - Varnishkafka log producer on cp1072 is OK: PROCS OK: 1 process with command name varnishkafka
[15:13:00] !log restarting upload varnish backend: cp2005
[15:13:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:13:29] PROBLEM - Varnishkafka log producer on cp1071 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[15:15:59] RECOVERY - Varnishkafka log producer on cp1071 is OK: PROCS OK: 1 process with command name varnishkafka
[15:22:30] (PS1) BBlack: cache_upload: FE size limit 1MB [puppet] - https://gerrit.wikimedia.org/r/311336
[15:23:03] (CR) BBlack: [C: 2 V: 2] cache_upload: FE size limit 1MB [puppet] - https://gerrit.wikimedia.org/r/311336 (owner: BBlack)
[15:29:21] (PS4) BBlack: cache_upload: one-hit-wonder experiment, hit/2+ [puppet] - https://gerrit.wikimedia.org/r/308995 (https://phabricator.wikimedia.org/T144187)
[15:29:44] (PS5) BBlack: cache_upload: two-hit-wonder experiment, hit/2+ [puppet] - https://gerrit.wikimedia.org/r/308995 (https://phabricator.wikimedia.org/T144187)
[15:43:11] !log restarting upload varnish backend: cp2017
[15:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:05:39] PROBLEM - puppet last run on rdb2004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:07:44] PROBLEM - puppet last run on db2053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:33:01] RECOVERY - puppet last run on rdb2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:35:05] RECOVERY - puppet last run on db2053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[16:35:54] !log restarting upload varnish backend: cp2011
[16:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:48:31] (CR) BBlack: [C: 2] cache_upload: two-hit-wonder experiment, hit/2+ [puppet] - https://gerrit.wikimedia.org/r/308995 (https://phabricator.wikimedia.org/T144187) (owner: BBlack)
[16:59:04] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[17:01:54] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[17:06:08] !log restart upload varnish backend: cp2026
[17:06:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:21:40] PROBLEM - HP RAID on ms-be1026 is CRITICAL: CHECK_NRPE: Socket timeout after 40 seconds.
[17:23:05] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:24:06] RECOVERY - HP RAID on ms-be1026 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[17:32:13] !log restart upload varnish backend: cp2020
[17:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:32:47] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:42:08] !log restart upload varnish backend: cp1071
[17:42:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:47:58] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[17:48:57] PROBLEM - Varnishkafka log producer on cp1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[17:51:31] RECOVERY - Varnishkafka log producer on cp1062 is OK: PROCS OK: 1 process with command name varnishkafka
[17:57:19] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[17:58:42] !log restart upload varnish backend: cp2008
[17:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:03:14] @seen odder
[18:03:14] Steinsplitter: Last time I saw odder they were quitting the network with reason: Quit: leaving N/A at 8/21/2016 10:28:08 AM (28d7h35m6s ago)
[18:05:54] Guys, we are having thumbnail issues on Commons... 5/200 (2.5%) of thumbnails displayed on category pages and file histories are "Missing"... see #wikimedia-tech
[18:06:17] !log restart upload varnish backend: cp1050 (already in LRU_Fail)
[18:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:07:18] (That is 120px thumbnails)
[18:23:30] (PS1) BBlack: fix another possible netmapper-1.3+v4 FE crash [puppet] - https://gerrit.wikimedia.org/r/311338
[18:24:11] (CR) BBlack: [C: 2 V: 2] fix another possible netmapper-1.3+v4 FE crash [puppet] - https://gerrit.wikimedia.org/r/311338 (owner: BBlack)
[18:26:48] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[18:29:28] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:48:39] PROBLEM - Varnishkafka log producer on cp3049 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[18:49:38] Operations, Commons, Multimedia: Deploy a PHP and HHVM patch (Exif values retrieved incorrectly if they appear before IFD) - https://phabricator.wikimedia.org/T140419#2646724 (Aklapper)
[19:00:18] Operations, DNS, Domains, Traffic, and 2 others: Point wikipedia.in to 180.179.52.130 instead of URL forward - https://phabricator.wikimedia.org/T144508#2646737 (Aklapper) @Naveenpf: Not **here** in T144508. This task is only about pointing wikipedia.in to 180.179.52.130. This task is **not** abo...
[19:03:29] RECOVERY - Varnishkafka log producer on cp3049 is OK: PROCS OK: 1 process with command name varnishkafka
[19:28:15] !log restart up
[19:28:15] All status_type @ All / upload (non-PURGE)
[19:28:15] 13:3014:0014:3015:0015:3016:0016:3017:0017:3018:0018:3019:00025 K50 K75 K100 K125 Krate per second
[19:28:18] get109.8 Kpost962head598options64connect0put0trace0delete0
[19:28:20] All status_type @ All / upload (PURGE)
[19:28:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:28:23] 13:3014:0014:3015:0015:3016:0016:3017:0017:3018:0018:3019:00050 K100 K150 K200 Krate per second
[19:28:26] purge84.0 K
[19:28:28] bleh
[19:28:57] !log restart upload backend: cp1064 (already in LRU_Fail, caught early)
[19:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:30:44] Operations, Performance-Team, Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2646791 (Gilles) That sucks, it's not a temp file I deliberately create, it seems to be something ffmpeg creates on its own... Why didn't "timeout" kill it after a mi...
[19:32:50] Operations, Performance-Team, Thumbor: thumbor imagemagick filling up /tmp on thumbor1002 - https://phabricator.wikimedia.org/T145878#2646807 (Gilles) Looking at the error mediawiki returns when trying to render the same thumbnail, the symptoms are identical to the case found in T145612
[19:33:49] PROBLEM - Varnishkafka log producer on cp1050 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[19:38:50] RECOVERY - Varnishkafka log producer on cp1050 is OK: PROCS OK: 1 process with command name varnishkafka
[19:58:36] !log restart upload backend: cp1074 (stats indicate LRU_Fail imminent)
[19:58:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:22:14] !log restart upload backend: cp3039
[20:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:32:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[20:37:21] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:37:48] !log restart upload backend: cp3036
[20:37:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[21:00:12] PROBLEM - Varnishkafka log producer on cp3034 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[21:02:43] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:03:53] PROBLEM - Varnishkafka log producer on cp1062 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[21:12:38] RECOVERY - Varnishkafka log producer on cp3034 is OK: PROCS OK: 1 process with command name varnishkafka
[21:21:10] RECOVERY - Varnishkafka log producer on cp1062 is OK: PROCS OK: 1 process with command name varnishkafka
[21:27:27] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:10:03] PROBLEM - puppet last run on mw2239 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:23:02] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/spice-html5/spice_sec_auto.html]
[22:35:16] RECOVERY - puppet last run on mw2239 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[22:45:45] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures