[00:04:06] PROBLEM - jmxtrans on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar
[00:04:16] PROBLEM - jmxtrans on analytics1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar
[00:04:45] PROBLEM - jmxtrans on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar
[00:05:35] PROBLEM - jmxtrans on analytics1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar
[00:05:48] (PS1) Ori.livneh: Roll out varnishxcps from I08679839 to all varnishes [puppet] - https://gerrit.wikimedia.org/r/221317
[00:05:57] bd808: are analytics logs flowing into logstash? is this related to the work you're doing?
[00:06:38] I think there are hadoop logs in logstash, but nothing I've done, no
[00:09:26] RECOVERY - jmxtrans on analytics1022 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[00:15:45] (PS2) Ori.livneh: Add varnishxcps from I08679839 to 2layer role; still restricted to cp1048 [puppet] - https://gerrit.wikimedia.org/r/221317
[00:16:07] RECOVERY - jmxtrans on analytics1018 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[00:16:45] RECOVERY - jmxtrans on analytics1012 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[00:17:02] (CR) Ori.livneh: [C: 2] Add varnishxcps from I08679839 to 2layer role; still restricted to cp1048 [puppet] - https://gerrit.wikimedia.org/r/221317 (owner: Ori.livneh)
[00:17:06] RECOVERY - jmxtrans on analytics1021 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[00:19:50] (PS1) Ori.livneh: Remove $::hostname == 'cp1048' gate from varnishxcps invocation [puppet] - https://gerrit.wikimedia.org/r/221320
[00:22:22] (CR) Ori.livneh: [C: 2] Remove $::hostname == 'cp1048' gate from varnishxcps invocation [puppet] - https://gerrit.wikimedia.org/r/221320 (owner: Ori.livneh)
[00:28:00] operations, Discovery, Elasticsearch: logstash partman recipe huge root partition - https://phabricator.wikimedia.org/T104035#1406421 (fgiunchedi) NEW
[00:30:15] operations, network: networking: adjust ACLs to allow analytics clusters to talk to new ganglia aggregator - https://phabricator.wikimedia.org/T104036#1406430 (Dzahn) NEW
[00:30:51] operations, network: networking: adjust ACLs to allow analytics clusters to talk to new ganglia aggregator - https://phabricator.wikimedia.org/T104036#1406438 (Dzahn)
[00:30:54] operations, Monitoring, Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1406437 (Dzahn)
[00:31:37] operations, Analytics-Engineering, network: networking: adjust ACLs to allow analytics clusters to talk to new ganglia aggregator - https://phabricator.wikimedia.org/T104036#1406430 (Dzahn)
[00:36:33] operations, Analytics-Engineering, network: networking: adjust ACLs to allow analytics clusters to talk to new ganglia aggregator - https://phabricator.wikimedia.org/T104036#1406453 (Dzahn) p:Triage>Normal
[00:40:46] operations, RESTBase, hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1406458 (GWicke) Resolved>Open
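The jmxtrans PROCS alerts at the top of this stretch match the output format of the stock Nagios check_procs plugin. A minimal sketch of the kind of invocation behind them (the exact thresholds used in the WMF puppet repo are an assumption here):

    # CRITICAL unless exactly one java process with jmxtrans-all.jar in its args is running
    /usr/lib/nagios/plugins/check_procs -c 1:1 -C java \
        --ereg-argument-array='-jar.+jmxtrans-all.jar'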
[00:47:31] operations, Traffic: Deploy infra ganeti cluster @ ulsfo - https://phabricator.wikimedia.org/T96852#1406463 (BBlack) I think at least one of the reasons for the 3 hosts idea was that if one underlying ganeti box died, we could still have 2x instances of various types up and running with some redundancy wh...
[00:48:46] operations, Traffic: Deploy infra ganeti cluster @ ulsfo - https://phabricator.wikimedia.org/T96852#1406466 (Dzahn) and we need one server with class { 'ganglia_new::monitor::aggregator': } per site. we usually put it on install servers in other sites, so for new setups it would be "installx000"
[00:53:46] RECOVERY - Cassandra CQL query interface on xenon is OK: TCP OK - 0.001 second response time on port 9042
[01:00:24] operations, Traffic: Deploy infra ganeti cluster @ ulsfo - https://phabricator.wikimedia.org/T96852#1406476 (BBlack) Of course, another option here is that we can just do a blender role that's built for cache pop infrastructure and mixes up a bunch of these things on a few bare hosts, too. Gets tricky fo...
[01:04:15] (PS2) Dzahn: static-bugzilla: additional redirects [puppet] - https://gerrit.wikimedia.org/r/220164 (https://phabricator.wikimedia.org/T103425)
[01:09:36] PROBLEM - puppet last run on mw2030 is CRITICAL puppet fail
[01:28:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0]
[01:29:16] RECOVERY - puppet last run on mw2030 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[01:37:13] (PS3) Dzahn: static-bugzilla: additional redirect for number URLs [puppet] - https://gerrit.wikimedia.org/r/220164 (https://phabricator.wikimedia.org/T103425)
[01:38:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[01:39:17] (CR) Dzahn: [C: 2] "thanks to #httpd pointing out the problem (one character! :)" [puppet] - https://gerrit.wikimedia.org/r/220164 (https://phabricator.wikimedia.org/T103425) (owner: Dzahn)
[01:42:14] operations, Wikimedia-Bugzilla, Patch-For-Review: old-bugzilla redirects broken - https://phabricator.wikimedia.org/T103425#1406569 (Dzahn) 18:27 < thumbs> mutante: change %1 to $1. :) [terbium:~] $ apache-fast-test staticbz zirconium.wikimedia.org testing 6 urls on 1 servers, totalling 6 requests sp...
[01:42:39] operations, Wikimedia-Bugzilla, Patch-For-Review: old-bugzilla redirects broken - https://phabricator.wikimedia.org/T103425#1406573 (Dzahn) Open>Resolved
[01:42:42] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1406574 (Dzahn)
[01:43:12] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1406575 (Dzahn) Open>Resolved all subtasks resolved
[01:43:15] operations, Wikimedia-Bugzilla, Patch-For-Review: old-bugzilla redirects broken - https://phabricator.wikimedia.org/T103425#1406577 (Legoktm) Awesome, thanks!
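For context on the one-character fix above: in Apache mod_rewrite, $1 backreferences a capture group in the RewriteRule pattern itself, while %1 backreferences the preceding RewriteCond, so a redirect written with %1 here silently dropped the captured bug number. The repaired redirect can be spot-checked the way apache-fast-test does in bulk, one URL at a time (hostname and URL shape are assumptions):

    # Expect a 301/302 whose Location still carries the bug number through
    curl -sI 'http://old-bugzilla.wikimedia.org/show_bug.cgi?id=12345' \
        | grep -Ei '^(HTTP|Location)'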
[01:52:35] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[01:56:06] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60560 bytes in 8.592 second response time
[02:01:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[02:17:26] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60560 bytes in 0.476 second response time
[02:20:40] !log l10nupdate Synchronized php-1.26wmf11/cache/l10n: (no message) (duration: 05m 46s)
[02:20:49] Logged the message, Master
[02:23:40] !log LocalisationUpdate completed (1.26wmf11) at 2015-06-27 02:23:40+00:00
[02:23:47] Logged the message, Master
[02:41:56] PROBLEM - nutcracker port on silver is CRITICAL - Socket timeout after 2 seconds
[02:43:46] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[02:47:46] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 7, number_of_data_nodes: 3
[02:48:06] RECOVERY - ElasticSearch health check for shards on logstash1006 is OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 7, number_of_data_nodes: 3
[02:48:36] RECOVERY - ElasticSearch health check for shards on logstash1004 is OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 7, number_of_data_nodes: 3
[02:48:36] RECOVERY - ElasticSearch health check for shards on logstash1005 is OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 7, number_of_data_nodes: 3
[02:48:45] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 7, number_of_data_nodes: 3
[02:48:47] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 7, number_of_data_nodes: 3
[03:02:26] PROBLEM - puppet last run on mw2035 is CRITICAL puppet fail
[03:20:26] RECOVERY - puppet last run on mw2035 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures
[03:20:37] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[03:24:05] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60553 bytes in 0.133 second response time
[04:37:25] PROBLEM - puppet last run on ganeti2002 is CRITICAL puppet fail
[04:44:15] PROBLEM - puppet last run on mw1083 is CRITICAL Puppet has 1 failures
[04:53:26] RECOVERY - puppet last run on ganeti2002 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures
[04:58:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Jun 27 04:58:46 UTC 2015 (duration 58m 45s)
[04:58:53] Logged the message, Master
[05:00:16] RECOVERY - puppet last run on mw1083 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:16:25] PROBLEM - puppet last run on db2063 is CRITICAL puppet fail
[05:32:35] RECOVERY - puppet last run on db2063 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:06:25] PROBLEM - RAID on logstash1003 is CRITICAL 1 failed LD(s) (Degraded)
[06:15:06] PROBLEM - puppet last run on logstash1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:07] operations, Wikimedia-Bugzilla, Patch-For-Review: Show an error message when trying to view dynamic pages like buglist.cgi in static bugzilla - https://phabricator.wikimedia.org/T102579#1406755 (Dzahn)
[06:15:11] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1406754 (Dzahn)
[06:15:15] PROBLEM - configured eth on logstash1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:26] PROBLEM - dhclient process on logstash1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:42] operations, Wikimedia-Bugzilla, Patch-For-Review: Don't 404 when trying to view dynamic pages like buglist.cgi in static bugzilla - https://phabricator.wikimedia.org/T102579#1406756 (Dzahn)
[06:15:46] PROBLEM - salt-minion processes on logstash1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:46] PROBLEM - Disk space on logstash1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:47] PROBLEM - DPKG on logstash1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
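When every NRPE-backed check on a host starts timing out at once, as on logstash1003 here minutes after its RAID went degraded, the agent or the host itself is usually wedged (for example, stalled on I/O) rather than the individual services. One way to probe from the monitoring side (a sketch; the remote command name is an assumption):

    # A timeout here implicates the NRPE agent/host, not any single check
    /usr/lib/nagios/plugins/check_nrpe -H logstash1003.eqiad.wmnet -t 10 -c check_raid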
[06:18:47] (PS1) Dzahn: static-bugzilla: add bug number comments [puppet] - https://gerrit.wikimedia.org/r/221351
[06:19:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[06:22:16] RECOVERY - puppet last run on logstash1003 is OK Puppet is currently enabled, last run 21 minutes ago with 0 failures
[06:22:25] RECOVERY - configured eth on logstash1003 is OK - interfaces up
[06:22:35] RECOVERY - dhclient process on logstash1003 is OK: PROCS OK: 0 processes with command name dhclient
[06:22:36] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60551 bytes in 0.262 second response time
[06:22:56] RECOVERY - salt-minion processes on logstash1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:22:56] RECOVERY - Disk space on logstash1003 is OK: DISK OK
[06:22:57] RECOVERY - DPKG on logstash1003 is OK: All packages OK
[06:27:36] PROBLEM - puppet last run on cp3010 is CRITICAL puppet fail
[06:31:35] PROBLEM - puppet last run on cp2003 is CRITICAL Puppet has 1 failures
[06:31:55] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 1 failures
[06:31:56] PROBLEM - puppet last run on cp3037 is CRITICAL Puppet has 1 failures
[06:31:56] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures
[06:32:36] PROBLEM - puppet last run on wtp2008 is CRITICAL Puppet has 1 failures
[06:32:36] PROBLEM - puppet last run on mw2121 is CRITICAL Puppet has 1 failures
[06:32:37] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures
[06:33:35] PROBLEM - puppet last run on labcontrol2001 is CRITICAL Puppet has 1 failures
[06:34:45] PROBLEM - puppet last run on mw1092 is CRITICAL Puppet has 1 failures
[06:35:05] PROBLEM - puppet last run on ms-fe2003 is CRITICAL Puppet has 1 failures
[06:35:16] PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures
[06:36:17] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 1 failures
[06:37:06] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures
[06:44:06] PROBLEM - Disk space on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:44:15] PROBLEM - DPKG on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:44:16] PROBLEM - SSH on graphite1002 is CRITICAL: Server answer
[06:45:07] RECOVERY - puppet last run on mw2121 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:45:35] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:45:46] RECOVERY - Disk space on graphite1002 is OK: DISK OK
[06:45:56] RECOVERY - DPKG on graphite1002 is OK: All packages OK
[06:46:05] RECOVERY - SSH on graphite1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:46:17] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:17] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:17] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:56] RECOVERY - puppet last run on wtp2008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:57] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:06] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:47:16] RECOVERY - puppet last run on mw1092 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:37] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:37] RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:47:56] RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:47:56] RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:56] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:50:17] PROBLEM - salt-minion processes on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:50:25] PROBLEM - configured eth on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:50:56] PROBLEM - dhclient process on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:51:06] PROBLEM - Disk space on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:51:16] PROBLEM - DPKG on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:51:25] PROBLEM - SSH on graphite1002 is CRITICAL: Server answer
[06:51:36] PROBLEM - puppet last run on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:52:17] PROBLEM - RAID on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[07:18:57] PROBLEM - Host graphite1002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:19:36] RECOVERY - Disk space on graphite1002 is OK: DISK OK
[07:19:47] RECOVERY - Host graphite1002 is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms
[07:19:47] RECOVERY - DPKG on graphite1002 is OK: All packages OK
[07:19:55] RECOVERY - SSH on graphite1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:20:15] RECOVERY - puppet last run on graphite1002 is OK Puppet is currently enabled, last run 56 minutes ago with 0 failures
[07:20:48] RECOVERY - salt-minion processes on graphite1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:20:48] RECOVERY - configured eth on graphite1002 is OK - interfaces up
[07:21:06] RECOVERY - RAID on graphite1002 is OK optimal, 2 logical, 4 physical
[07:21:18] RECOVERY - dhclient process on graphite1002 is OK: PROCS OK: 0 processes with command name dhclient
[08:28:53] (PS1) Alexandros Kosiaris: lvs: split monitors to respective files [puppet] - https://gerrit.wikimedia.org/r/221356
[08:28:55] (PS1) Alexandros Kosiaris: Merge all lvs::monitor_service manifests into one [puppet] - https://gerrit.wikimedia.org/r/221357
[09:00:41] (PS1) Alexandros Kosiaris: Merge lvs::hashes [puppet] - https://gerrit.wikimedia.org/r/221361
[09:07:25] (PS2) Alexandros Kosiaris: Merge lvs::hashes [puppet] - https://gerrit.wikimedia.org/r/221361
[09:26:57] (PS1) Alexandros Kosiaris: Remove more unused lvs::monitor manifests [puppet] - https://gerrit.wikimedia.org/r/221363
[09:31:58] (PS2) Alexandros Kosiaris: Merge all lvs::monitor_service manifests into one [puppet] - https://gerrit.wikimedia.org/r/221357
[09:32:00] (PS2) Alexandros Kosiaris: Remove more unused lvs::monitor manifests [puppet] - https://gerrit.wikimedia.org/r/221363
[09:32:02] (PS3) Alexandros Kosiaris: Merge lvs::hashes [puppet] - https://gerrit.wikimedia.org/r/221361
[09:34:56] PROBLEM - Disk space on analytics1022 is CRITICAL: DISK CRITICAL - free space: / 1064 MB (3% inode=94%)
[09:40:59] (CR) Ori.livneh: [C: 1] Add legacy bits.wm.o support to text-lb VCL [puppet] - https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) (owner: BBlack)
[09:56:43] operations, Traffic, HTTPS, HTTPS-by-default: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1406876 (10Chmarkine) stats.wikimedia.org doesn't redirect http to https. It has mixed content (T93702). Do we need to fix that first?
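Chmarkine's question above is easy to answer empirically per service: a misc-web host that enforces the redirect should answer plain HTTP with a 301 pointing at https://. A quick check (stats.wikimedia.org is the example from the task):

    # Look for a 301 plus an https:// Location; an HSTS header would show up here too
    curl -sI 'http://stats.wikimedia.org/' \
        | grep -Ei '^(HTTP|Location|Strict-Transport-Security)'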
[10:12:06] (Abandoned) Chmarkine: Enable HSTS on racktables with max-age=7days [puppet] - https://gerrit.wikimedia.org/r/195444 (https://phabricator.wikimedia.org/T40516) (owner: Chmarkine)
[10:44:45] PROBLEM - puppet last run on wtp2014 is CRITICAL Puppet has 1 failures
[11:00:55] RECOVERY - puppet last run on wtp2014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:49:36] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[11:50:56] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 5.52 ms
[12:14:25] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[12:25:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[12:26:15] PROBLEM - Disk space on analytics1018 is CRITICAL: DISK CRITICAL - free space: / 1060 MB (3% inode=94%)
[12:31:07] (PS4) BBlack: Add legacy bits.wm.o support to text-lb VCL [puppet] - https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448)
[12:31:51] (CR) BBlack: "PS4 was a manual rebase (including a relevant update: rewrite_proxy_urls moved into common VCL)." [puppet] - https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) (owner: BBlack)
[12:34:56] PROBLEM - Disk space on analytics1018 is CRITICAL: DISK CRITICAL - free space: / 1015 MB (3% inode=94%)
[13:55:41] (PS1) BBlack: increase ssl_session_timeout to 15m [puppet] - https://gerrit.wikimedia.org/r/221375
[14:06:57] (PS1) BBlack: enable SPDY header compression [puppet] - https://gerrit.wikimedia.org/r/221376
[14:11:22] (CR) BBlack: [C: 2] increase ssl_session_timeout to 15m [puppet] - https://gerrit.wikimedia.org/r/221375 (owner: BBlack)
[14:21:08] operations, Traffic, Wikimedia-DNS, Patch-For-Review, Pybal: pybal DNS lookup issues causing outage risks - https://phabricator.wikimedia.org/T103921#1407066 (BBlack) This morning I went ahead and manually applied the lvs1004 fixes on lvs1001 as well, just to reduce the depool/repool insanity th...
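A longer ssl_session_timeout (the 15m change above) only helps if clients actually resume sessions within the window. openssl's -reconnect flag makes s_client reconnect five times with the same session, which makes resumption easy to observe (the target host here is an example):

    # The follow-up connections should report "Reused" rather than "New"
    openssl s_client -connect text-lb.eqiad.wikimedia.org:443 -reconnect \
        </dev/null 2>/dev/null | grep -E '^(New|Reused)'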
[14:46:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 38.46% of data above the critical threshold [500.0]
[14:53:11] (PS1) BBlack: comment out login-lb.$dcname from DNS [dns] - https://gerrit.wikimedia.org/r/221378
[14:53:46] PROBLEM - Disk space on analytics1012 is CRITICAL: DISK CRITICAL - free space: / 1060 MB (3% inode=94%)
[14:58:56] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[15:45:57] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[15:51:06] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 0.300 second response time
[15:58:17] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[15:59:57] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60569 bytes in 4.958 second response time
[16:14:17] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[16:19:36] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 1.705 second response time
[16:25:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[16:26:15] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100%
[16:27:27] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[16:28:16] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 89.42 ms
[16:28:36] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 4.924 second response time
[16:28:36] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 88.10 ms
[16:41:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[16:44:27] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 1.535 second response time
[16:48:39] (PS2) BBlack: Delete login-lb/login-addrs from DNS [dns] - https://gerrit.wikimedia.org/r/221378
[16:53:46] (PS1) BBlack: Delete loginlb from LVS configuration [puppet] - https://gerrit.wikimedia.org/r/221380
[17:05:56] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[17:09:25] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 0.289 second response time
[17:16:35] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[17:18:07] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 0.298 second response time
[17:27:16] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[17:28:39] gitblit just really doesn't want to live, does it?
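The git.wikimedia.org flapping above is the HTTP check tripping its 10-second socket timeout whenever gitblit is slow to render the front page; note the OK responses already range from 0.1s to almost 8s. What the monitor sees can be reproduced with curl (a sketch):

    # Anything at or above ~10s here is what icinga reports as a socket timeout
    curl -so /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' https://git.wikimedia.org/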
[17:30:45] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 4.961 second response time
[18:04:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[18:09:46] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 4.854 second response time
[18:22:25] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[18:24:05] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 3.981 second response time
[18:36:26] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[18:41:46] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60530 bytes in 0.170 second response time
[18:50:45] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[18:54:15] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60553 bytes in 7.306 second response time
[20:08:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[20:10:55] (PS1) Alex Monk: Set wikidata's logo specifically for wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/221405 (https://phabricator.wikimedia.org/T54214)
[20:11:58] !log Delegated full access to Google Webmaster Tools for myself (olivneh@).
[20:12:05] Logged the message, Master
[20:20:55] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60578 bytes in 0.405 second response time
[20:26:25] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[20:31:26] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60557 bytes in 0.334 second response time
[20:38:46] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[20:42:16] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60557 bytes in 0.482 second response time
[22:18:12] operations, Analytics, Traffic: Provide summary of MediaWiki downloads - https://phabricator.wikimedia.org/T104010#1407695 (MarkAHershberger) Initially, I'm only interested in releases.w.o. Those are the most straight-forward. Later, yes, getting git checkouts would be interesting and useful.
[22:23:16] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[22:24:22] operations, Analytics, Traffic: Provide summary of MediaWiki downloads - https://phabricator.wikimedia.org/T104010#1407697 (Peachey88)
[22:28:27] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60535 bytes in 0.423 second response time
[22:47:37] PROBLEM - RAID on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:47:37] PROBLEM - puppet last run on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:47:46] PROBLEM - Disk space on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:47:46] PROBLEM - dhclient process on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:48:17] PROBLEM - DPKG on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:49:26] PROBLEM - SSH on graphite1002 is CRITICAL: Server answer
[22:50:46] PROBLEM - configured eth on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:50:46] PROBLEM - salt-minion processes on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:19:37] RECOVERY - RAID on graphite1002 is OK optimal, 2 logical, 4 physical
[23:19:37] RECOVERY - puppet last run on graphite1002 is OK Puppet is currently enabled, last run 36 minutes ago with 0 failures
[23:19:47] RECOVERY - SSH on graphite1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[23:19:48] RECOVERY - Disk space on graphite1002 is OK: DISK OK
[23:19:56] RECOVERY - dhclient process on graphite1002 is OK: PROCS OK: 0 processes with command name dhclient
[23:20:26] RECOVERY - DPKG on graphite1002 is OK: All packages OK
[23:21:06] RECOVERY - configured eth on graphite1002 is OK - interfaces up
[23:21:06] RECOVERY - salt-minion processes on graphite1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:23:45] PROBLEM - puppet last run on mw1058 is CRITICAL Puppet has 1 failures
[23:26:57] (PS1) BBlack: update-ocsp: require proxy argument [puppet] - https://gerrit.wikimedia.org/r/221422
[23:26:59] (PS1) BBlack: update-ocsp: support multi-cert fetches [puppet] - https://gerrit.wikimedia.org/r/221423
[23:27:26] PROBLEM - DPKG on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:28:06] PROBLEM - configured eth on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:28:06] PROBLEM - salt-minion processes on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:28:27] PROBLEM - RAID on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:28:36] PROBLEM - puppet last run on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:28:37] PROBLEM - SSH on graphite1002 is CRITICAL: Server answer
[23:28:39] PROBLEM - Disk space on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:28:46] PROBLEM - dhclient process on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:30:35] !log Deleted corrupt shards on logstash1004 and logstash1005. Recovery in process
[23:30:42] Logged the message, Master
[23:31:41] godog: TIL that you can rm a corrupt shard on disk while elasticsearch is still running and then force allocation to start building a clean replica
[23:39:36] RECOVERY - puppet last run on mw1058 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:47:20] operations, Traffic, HTTPS, HTTPS-by-default, Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1407738 (JanZerebecki) >>! In T86654#1405484, @BBlack wrote: > On security issues, I lean towards thinking that ECDSA is better than RSA, and that while t...
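On bd808's TIL above: after removing a corrupt shard's directory from the data path, the Elasticsearch 1.x API of this era can be told to build a clean copy via the cluster reroute endpoint. A hedged sketch, with placeholder index, shard, and node names:

    # Force allocation of the (now missing) replica onto a chosen node
    curl -XPOST 'http://logstash1004:9200/_cluster/reroute' -d '{
      "commands": [
        { "allocate": {
            "index": "logstash-2015.06.27",
            "shard": 3,
            "node": "logstash1004",
            "allow_primary": false
        }}
      ]
    }'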