[00:04:06] PROBLEM - jmxtrans on analytics1022 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar
[00:04:16] PROBLEM - jmxtrans on analytics1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar
[00:04:45] PROBLEM - jmxtrans on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar
[00:05:35] PROBLEM - jmxtrans on analytics1018 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, regex args -jar.+jmxtrans-all.jar
[00:05:48] (PS1) Ori.livneh: Roll out varnishxcps from I08679839 to all varnishes [puppet] - https://gerrit.wikimedia.org/r/221317
[00:05:57] bd808: are analytics logs flowing into logstash? is this related to the work you're doing?
[00:06:38] I think there are hadoop logs in logstash, but nothing I've done, no
[00:09:26] RECOVERY - jmxtrans on analytics1022 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[00:15:45] (PS2) Ori.livneh: Add varnishxcps from I08679839 to 2layer role; still restricted to cp1048 [puppet] - https://gerrit.wikimedia.org/r/221317
[00:16:07] RECOVERY - jmxtrans on analytics1018 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[00:16:45] RECOVERY - jmxtrans on analytics1012 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[00:17:02] (CR) Ori.livneh: [C: 2] Add varnishxcps from I08679839 to 2layer role; still restricted to cp1048 [puppet] - https://gerrit.wikimedia.org/r/221317 (owner: Ori.livneh)
[00:17:06] RECOVERY - jmxtrans on analytics1021 is OK: PROCS OK: 1 process with command name java, regex args -jar.+jmxtrans-all.jar
[00:19:50] (PS1) Ori.livneh: Remove $::hostname == 'cp1048' gate from varnishxcps invocation [puppet] - https://gerrit.wikimedia.org/r/221320
[00:22:22] (CR) Ori.livneh: [C: 2] Remove $::hostname == 'cp1048' gate from varnishxcps invocation [puppet] - https://gerrit.wikimedia.org/r/221320 (owner: Ori.livneh)
[00:28:00] operations, Discovery, Elasticsearch: logstash partman recipe huge root partition - https://phabricator.wikimedia.org/T104035#1406421 (fgiunchedi) NEW
[00:30:15] operations, network: networking: adjust ACLs to allow analytics clusters to talk to new ganglia aggregator - https://phabricator.wikimedia.org/T104036#1406430 (Dzahn) NEW
[00:30:51] operations, network: networking: adjust ACLs to allow analytics clusters to talk to new ganglia aggregator - https://phabricator.wikimedia.org/T104036#1406438 (Dzahn)
[00:30:54] operations, Monitoring, Patch-For-Review: remove ganglia(old), replace with ganglia_new - https://phabricator.wikimedia.org/T93776#1406437 (Dzahn)
[00:31:37] operations, Analytics-Engineering, network: networking: adjust ACLs to allow analytics clusters to talk to new ganglia aggregator - https://phabricator.wikimedia.org/T104036#1406430 (Dzahn)
[00:36:33] operations, Analytics-Engineering, network: networking: adjust ACLs to allow analytics clusters to talk to new ganglia aggregator - https://phabricator.wikimedia.org/T104036#1406453 (Dzahn) p:Triage>Normal
[00:40:46] operations, RESTBase, hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1406458 (GWicke) Resolved>Open
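The jmxtrans PROCS alerts at the top of this stretch match the output format of the stock Nagios check_procs plugin. A minimal sketch of the kind of invocation behind them (the exact thresholds used in the WMF puppet repo are an assumption here):

    # CRITICAL unless exactly one java process with jmxtrans-all.jar in its args is running
    /usr/lib/nagios/plugins/check_procs -c 1:1 -C java \
        --ereg-argument-array='-jar.+jmxtrans-all.jar'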
[00:47:31] operations, Traffic: Deploy infra ganeti cluster @ ulsfo - https://phabricator.wikimedia.org/T96852#1406463 (BBlack) I think at least one of the reasons for the 3 hosts idea was that if one underlying ganeti box died, we could still have 2x instances of various types up and running with some redundancy wh...
[00:48:46] operations, Traffic: Deploy infra ganeti cluster @ ulsfo - https://phabricator.wikimedia.org/T96852#1406466 (Dzahn) and we need one server with class { 'ganglia_new::monitor::aggregator': } per site. we usually put it on install servers in other sites, so for new setups it would be "installx000"
[00:53:46] RECOVERY - Cassandra CQL query interface on xenon is OK: TCP OK - 0.001 second response time on port 9042
[01:00:24] operations, Traffic: Deploy infra ganeti cluster @ ulsfo - https://phabricator.wikimedia.org/T96852#1406476 (BBlack) Of course, another option here is that we can just do a blender role that's built for cache pop infrastructure and mixes up a bunch of these things on a few bare hosts, too. Gets tricky fo...
[01:04:15] (PS2) Dzahn: static-bugzilla: additional redirects [puppet] - https://gerrit.wikimedia.org/r/220164 (https://phabricator.wikimedia.org/T103425)
[01:09:36] PROBLEM - puppet last run on mw2030 is CRITICAL puppet fail
[01:28:06] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0]
[01:29:16] RECOVERY - puppet last run on mw2030 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[01:37:13] (PS3) Dzahn: static-bugzilla: additional redirect for number URLs [puppet] - https://gerrit.wikimedia.org/r/220164 (https://phabricator.wikimedia.org/T103425)
[01:38:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[01:39:17] (CR) Dzahn: [C: 2] "thanks to #httpd pointing out the problem (one character! :)" [puppet] - https://gerrit.wikimedia.org/r/220164 (https://phabricator.wikimedia.org/T103425) (owner: Dzahn)
[01:42:14] operations, Wikimedia-Bugzilla, Patch-For-Review: old-bugzilla redirects broken - https://phabricator.wikimedia.org/T103425#1406569 (Dzahn) 18:27 < thumbs> mutante: change %1 to $1. :) [terbium:~] $ apache-fast-test staticbz zirconium.wikimedia.org testing 6 urls on 1 servers, totalling 6 requests sp...
[01:42:39] operations, Wikimedia-Bugzilla, Patch-For-Review: old-bugzilla redirects broken - https://phabricator.wikimedia.org/T103425#1406573 (Dzahn) Open>Resolved
[01:42:42] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1406574 (Dzahn)
[01:43:12] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1406575 (Dzahn) Open>Resolved all subtasks resolved
[01:43:15] operations, Wikimedia-Bugzilla, Patch-For-Review: old-bugzilla redirects broken - https://phabricator.wikimedia.org/T103425#1406577 (Legoktm) Awesome, thanks!
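For context on the one-character fix above: in Apache mod_rewrite, $1 backreferences a capture group in the RewriteRule pattern itself, while %1 backreferences the preceding RewriteCond, so a redirect written with %1 here silently dropped the captured bug number. The repaired redirect can be spot-checked the way apache-fast-test does in bulk, one URL at a time (hostname and URL shape are assumptions):

    # Expect a 301/302 whose Location still carries the bug number through
    curl -sI 'http://old-bugzilla.wikimedia.org/show_bug.cgi?id=12345' \
        | grep -Ei '^(HTTP|Location)'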
[01:52:35] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[01:56:06] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60560 bytes in 8.592 second response time
[02:01:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[02:17:26] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60560 bytes in 0.476 second response time
[02:20:40] !log l10nupdate Synchronized php-1.26wmf11/cache/l10n: (no message) (duration: 05m 46s)
[02:20:49] Logged the message, Master
[02:23:40] !log LocalisationUpdate completed (1.26wmf11) at 2015-06-27 02:23:40+00:00
[02:23:47] Logged the message, Master
[02:41:56] PROBLEM - nutcracker port on silver is CRITICAL - Socket timeout after 2 seconds
[02:43:46] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[02:47:46] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 7, number_of_data_nodes: 3
[02:48:06] RECOVERY - ElasticSearch health check for shards on logstash1006 is OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 7, number_of_data_nodes: 3
[02:48:36] RECOVERY - ElasticSearch health check for shards on logstash1004 is OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 7, number_of_data_nodes: 3
[02:48:36] RECOVERY - ElasticSearch health check for shards on logstash1005 is OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 7, number_of_data_nodes: 3
[02:48:45] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 7, number_of_data_nodes: 3
[02:48:47] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 5, timed_out: False, active_primary_shards: 41, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 111, initializing_shards: 7, number_of_data_nodes: 3
[03:02:26] PROBLEM - puppet last run on mw2035 is CRITICAL puppet fail
[03:20:26] RECOVERY - puppet last run on mw2035 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures
[03:20:37] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[03:24:05] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60553 bytes in 0.133 second response time
[04:37:25] PROBLEM - puppet last run on ganeti2002 is CRITICAL puppet fail
[04:44:15] PROBLEM - puppet last run on mw1083 is CRITICAL Puppet has 1 failures
[04:53:26] RECOVERY - puppet last run on ganeti2002 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures
[04:58:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Sat Jun 27 04:58:46 UTC 2015 (duration 58m 45s)
[04:58:53] Logged the message, Master
[05:00:16] RECOVERY - puppet last run on mw1083 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:16:25] PROBLEM - puppet last run on db2063 is CRITICAL puppet fail
[05:32:35] RECOVERY - puppet last run on db2063 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:06:25] PROBLEM - RAID on logstash1003 is CRITICAL 1 failed LD(s) (Degraded)
[06:15:06] PROBLEM - puppet last run on logstash1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:07] operations, Wikimedia-Bugzilla, Patch-For-Review: Show an error message when trying to view dynamic pages like buglist.cgi in static bugzilla - https://phabricator.wikimedia.org/T102579#1406755 (Dzahn)
[06:15:11] operations, Phabricator, Wikimedia-Bugzilla, Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1406754 (Dzahn)
[06:15:15] PROBLEM - configured eth on logstash1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:26] PROBLEM - dhclient process on logstash1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:42] operations, Wikimedia-Bugzilla, Patch-For-Review: Don't 404 when trying to view dynamic pages like buglist.cgi in static bugzilla - https://phabricator.wikimedia.org/T102579#1406756 (Dzahn)
[06:15:46] PROBLEM - salt-minion processes on logstash1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:46] PROBLEM - Disk space on logstash1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[06:15:47] PROBLEM - DPKG on logstash1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
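When every NRPE-backed check on a host starts timing out at once, as on logstash1003 here minutes after its RAID went degraded, the agent or the host itself is usually wedged (for example, stalled on I/O) rather than the individual services. One way to probe from the monitoring side (a sketch; the remote command name is an assumption):

    # A timeout here implicates the NRPE agent/host, not any single check
    /usr/lib/nagios/plugins/check_nrpe -H logstash1003.eqiad.wmnet -t 10 -c check_raid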
[06:18:47] (PS1) Dzahn: static-bugzilla: add bug number comments [puppet] - https://gerrit.wikimedia.org/r/221351
[06:19:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[06:22:16] RECOVERY - puppet last run on logstash1003 is OK Puppet is currently enabled, last run 21 minutes ago with 0 failures
[06:22:25] RECOVERY - configured eth on logstash1003 is OK - interfaces up
[06:22:35] RECOVERY - dhclient process on logstash1003 is OK: PROCS OK: 0 processes with command name dhclient
[06:22:36] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60551 bytes in 0.262 second response time
[06:22:56] RECOVERY - salt-minion processes on logstash1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[06:22:56] RECOVERY - Disk space on logstash1003 is OK: DISK OK
[06:22:57] RECOVERY - DPKG on logstash1003 is OK: All packages OK
[06:27:36] PROBLEM - puppet last run on cp3010 is CRITICAL puppet fail
[06:31:35] PROBLEM - puppet last run on cp2003 is CRITICAL Puppet has 1 failures
[06:31:55] PROBLEM - puppet last run on cp4003 is CRITICAL Puppet has 1 failures
[06:31:56] PROBLEM - puppet last run on cp3037 is CRITICAL Puppet has 1 failures
[06:31:56] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures
[06:32:36] PROBLEM - puppet last run on wtp2008 is CRITICAL Puppet has 1 failures
[06:32:36] PROBLEM - puppet last run on mw2121 is CRITICAL Puppet has 1 failures
[06:32:37] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures
[06:33:35] PROBLEM - puppet last run on labcontrol2001 is CRITICAL Puppet has 1 failures
[06:34:45] PROBLEM - puppet last run on mw1092 is CRITICAL Puppet has 1 failures
[06:35:05] PROBLEM - puppet last run on ms-fe2003 is CRITICAL Puppet has 1 failures
[06:35:16] PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 1 failures
[06:36:17] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 1 failures
[06:37:06] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures
[06:44:06] PROBLEM - Disk space on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:44:15] PROBLEM - DPKG on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:44:16] PROBLEM - SSH on graphite1002 is CRITICAL: Server answer
[06:45:07] RECOVERY - puppet last run on mw2121 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:45:35] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:45:46] RECOVERY - Disk space on graphite1002 is OK: DISK OK
[06:45:56] RECOVERY - DPKG on graphite1002 is OK: All packages OK
[06:46:05] RECOVERY - SSH on graphite1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[06:46:17] RECOVERY - puppet last run on cp4003 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:17] RECOVERY - puppet last run on cp3037 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:17] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:56] RECOVERY - puppet last run on wtp2008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:57] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:06] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:47:16] RECOVERY - puppet last run on mw1092 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:37] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:37] RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:47:56] RECOVERY - puppet last run on labcontrol2001 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:47:56] RECOVERY - puppet last run on mw2173 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:56] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:50:17] PROBLEM - salt-minion processes on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:50:25] PROBLEM - configured eth on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:50:56] PROBLEM - dhclient process on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:51:06] PROBLEM - Disk space on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:51:16] PROBLEM - DPKG on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:51:25] PROBLEM - SSH on graphite1002 is CRITICAL: Server answer
[06:51:36] PROBLEM - puppet last run on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[06:52:17] PROBLEM - RAID on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[07:18:57] PROBLEM - Host graphite1002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:19:36] RECOVERY - Disk space on graphite1002 is OK: DISK OK
[07:19:47] RECOVERY - Host graphite1002 is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms
[07:19:47] RECOVERY - DPKG on graphite1002 is OK: All packages OK
[07:19:55] RECOVERY - SSH on graphite1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[07:20:15] RECOVERY - puppet last run on graphite1002 is OK Puppet is currently enabled, last run 56 minutes ago with 0 failures
[07:20:48] RECOVERY - salt-minion processes on graphite1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[07:20:48] RECOVERY - configured eth on graphite1002 is OK - interfaces up
[07:21:06] RECOVERY - RAID on graphite1002 is OK optimal, 2 logical, 4 physical
[07:21:18] RECOVERY - dhclient process on graphite1002 is OK: PROCS OK: 0 processes with command name dhclient
[08:28:53] (PS1) Alexandros Kosiaris: lvs: split monitors to respective files [puppet] - https://gerrit.wikimedia.org/r/221356
[08:28:55] (PS1) Alexandros Kosiaris: Merge all lvs::monitor_service manifests into one [puppet] - https://gerrit.wikimedia.org/r/221357
[09:00:41] (PS1) Alexandros Kosiaris: Merge lvs::hashes [puppet] - https://gerrit.wikimedia.org/r/221361
[09:07:25] (PS2) Alexandros Kosiaris: Merge lvs::hashes [puppet] - https://gerrit.wikimedia.org/r/221361
[09:26:57] (PS1) Alexandros Kosiaris: Remove more unused lvs::monitor manifests [puppet] - https://gerrit.wikimedia.org/r/221363
[09:31:58] (PS2) Alexandros Kosiaris: Merge all lvs::monitor_service manifests into one [puppet] - https://gerrit.wikimedia.org/r/221357
[09:32:00] (PS2) Alexandros Kosiaris: Remove more unused lvs::monitor manifests [puppet] - https://gerrit.wikimedia.org/r/221363
[09:32:02] (PS3) Alexandros Kosiaris: Merge lvs::hashes [puppet] - https://gerrit.wikimedia.org/r/221361
[09:34:56] PROBLEM - Disk space on analytics1022 is CRITICAL: DISK CRITICAL - free space: / 1064 MB (3% inode=94%)
[09:40:59] (CR) Ori.livneh: [C: 1] Add legacy bits.wm.o support to text-lb VCL [puppet] - https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) (owner: BBlack)
[09:56:43] operations, Traffic, HTTPS, HTTPS-by-default: let all services on misc-web enforce http->https redirects - https://phabricator.wikimedia.org/T103919#1406876 (10Chmarkine) stats.wikimedia.org doesn't redirect http to https. It has mixed content (T93702). Do we need to fix that first?
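Chmarkine's question above is easy to answer empirically per service: a misc-web host that enforces the redirect should answer plain HTTP with a 301 pointing at https://. A quick check (stats.wikimedia.org is the example from the task):

    # Look for a 301 plus an https:// Location; an HSTS header would show up here too
    curl -sI 'http://stats.wikimedia.org/' \
        | grep -Ei '^(HTTP|Location|Strict-Transport-Security)'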
[10:12:06] (Abandoned) Chmarkine: Enable HSTS on racktables with max-age=7days [puppet] - https://gerrit.wikimedia.org/r/195444 (https://phabricator.wikimedia.org/T40516) (owner: Chmarkine)
[10:44:45] PROBLEM - puppet last run on wtp2014 is CRITICAL Puppet has 1 failures
[11:00:55] RECOVERY - puppet last run on wtp2014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:49:36] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[11:50:56] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 5.52 ms
[12:14:25] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[12:25:06] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[12:26:15] PROBLEM - Disk space on analytics1018 is CRITICAL: DISK CRITICAL - free space: / 1060 MB (3% inode=94%)
[12:31:07] (PS4) BBlack: Add legacy bits.wm.o support to text-lb VCL [puppet] - https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448)
[12:31:51] (CR) BBlack: "PS4 was a manual rebase (including a relevant update: rewrite_proxy_urls moved into common VCL)." [puppet] - https://gerrit.wikimedia.org/r/215624 (https://phabricator.wikimedia.org/T95448) (owner: BBlack)
[12:34:56] PROBLEM - Disk space on analytics1018 is CRITICAL: DISK CRITICAL - free space: / 1015 MB (3% inode=94%)
[13:55:41] (PS1) BBlack: increase ssl_session_timeout to 15m [puppet] - https://gerrit.wikimedia.org/r/221375
[14:06:57] (PS1) BBlack: enable SPDY header compression [puppet] - https://gerrit.wikimedia.org/r/221376
[14:11:22] (CR) BBlack: [C: 2] increase ssl_session_timeout to 15m [puppet] - https://gerrit.wikimedia.org/r/221375 (owner: BBlack)
[14:21:08] operations, Traffic, Wikimedia-DNS, Patch-For-Review, Pybal: pybal DNS lookup issues causing outage risks - https://phabricator.wikimedia.org/T103921#1407066 (BBlack) This morning I went ahead and manually applied the lvs1004 fixes on lvs1001 as well, just to reduce the depool/repool insanity th...
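A longer ssl_session_timeout (the 15m change above) only helps if clients actually resume sessions within the window. openssl's -reconnect flag makes s_client reconnect five times with the same session, which makes resumption easy to observe (the target host here is an example):

    # The follow-up connections should report "Reused" rather than "New"
    openssl s_client -connect text-lb.eqiad.wikimedia.org:443 -reconnect \
        </dev/null 2>/dev/null | grep -E '^(New|Reused)'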
[14:46:36] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 38.46% of data above the critical threshold [500.0]
[14:53:11] (PS1) BBlack: comment out login-lb.$dcname from DNS [dns] - https://gerrit.wikimedia.org/r/221378
[14:53:46] PROBLEM - Disk space on analytics1012 is CRITICAL: DISK CRITICAL - free space: / 1060 MB (3% inode=94%)
[14:58:56] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[15:45:57] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[15:51:06] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 0.300 second response time
[15:58:17] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[15:59:57] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60569 bytes in 4.958 second response time
[16:14:17] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[16:19:36] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 1.705 second response time
[16:25:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[16:26:15] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100%
[16:27:27] PROBLEM - Host ns2-v4 is DOWN: PING CRITICAL - Packet loss = 100%
[16:28:16] RECOVERY - Host eeden is UP: PING OK - Packet loss = 0%, RTA = 89.42 ms
[16:28:36] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 4.924 second response time
[16:28:36] RECOVERY - Host ns2-v4 is UP: PING OK - Packet loss = 0%, RTA = 88.10 ms
[16:41:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[16:44:27] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 1.535 second response time
[16:48:39] (PS2) BBlack: Delete login-lb/login-addrs from DNS [dns] - https://gerrit.wikimedia.org/r/221378
[16:53:46] (PS1) BBlack: Delete loginlb from LVS configuration [puppet] - https://gerrit.wikimedia.org/r/221380
[17:05:56] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[17:09:25] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 0.289 second response time
[17:16:35] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[17:18:07] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 0.298 second response time
[17:27:16] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[17:28:39] gitblit just really doesn't want to live, does it?
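The git.wikimedia.org flapping above is the HTTP check tripping its 10-second socket timeout whenever gitblit is slow to render the front page; note the OK responses already range from 0.1s to almost 8s. What the monitor sees can be reproduced with curl (a sketch):

    # Anything at or above ~10s here is what icinga reports as a socket timeout
    curl -so /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' https://git.wikimedia.org/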
[17:30:45] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 4.961 second response time
[18:04:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[18:09:46] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 4.854 second response time
[18:22:25] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[18:24:05] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60562 bytes in 3.981 second response time
[18:36:26] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[18:41:46] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60530 bytes in 0.170 second response time
[18:50:45] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[18:54:15] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60553 bytes in 7.306 second response time
[20:08:36] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[20:10:55] (PS1) Alex Monk: Set wikidata's logo specifically for wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/221405 (https://phabricator.wikimedia.org/T54214)
[20:11:58] !log Delegated full access to Google Webmaster Tools for myself (olivneh@).
[20:12:05] Logged the message, Master
[20:20:55] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60578 bytes in 0.405 second response time
[20:26:25] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[20:31:26] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60557 bytes in 0.334 second response time
[20:38:46] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[20:42:16] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60557 bytes in 0.482 second response time
[22:18:12] operations, Analytics, Traffic: Provide summary of MediaWiki downloads - https://phabricator.wikimedia.org/T104010#1407695 (MarkAHershberger) Initially, I'm only interested in releases.w.o. Those are the most straight-forward. Later, yes, getting git checkouts would be interesting and useful.
[22:23:16] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[22:24:22] operations, Analytics, Traffic: Provide summary of MediaWiki downloads - https://phabricator.wikimedia.org/T104010#1407697 (Peachey88)
[22:28:27] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60535 bytes in 0.423 second response time
[22:47:37] PROBLEM - RAID on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:47:37] PROBLEM - puppet last run on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:47:46] PROBLEM - Disk space on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:47:46] PROBLEM - dhclient process on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:48:17] PROBLEM - DPKG on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:49:26] PROBLEM - SSH on graphite1002 is CRITICAL: Server answer
[22:50:46] PROBLEM - configured eth on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[22:50:46] PROBLEM - salt-minion processes on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:19:37] RECOVERY - RAID on graphite1002 is OK optimal, 2 logical, 4 physical
[23:19:37] RECOVERY - puppet last run on graphite1002 is OK Puppet is currently enabled, last run 36 minutes ago with 0 failures
[23:19:47] RECOVERY - SSH on graphite1002 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0)
[23:19:48] RECOVERY - Disk space on graphite1002 is OK: DISK OK
[23:19:56] RECOVERY - dhclient process on graphite1002 is OK: PROCS OK: 0 processes with command name dhclient
[23:20:26] RECOVERY - DPKG on graphite1002 is OK: All packages OK
[23:21:06] RECOVERY - configured eth on graphite1002 is OK - interfaces up
[23:21:06] RECOVERY - salt-minion processes on graphite1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[23:23:45] PROBLEM - puppet last run on mw1058 is CRITICAL Puppet has 1 failures
[23:26:57] (PS1) BBlack: update-ocsp: require proxy argument [puppet] - https://gerrit.wikimedia.org/r/221422
[23:26:59] (PS1) BBlack: update-ocsp: support multi-cert fetches [puppet] - https://gerrit.wikimedia.org/r/221423
[23:27:26] PROBLEM - DPKG on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:28:06] PROBLEM - configured eth on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:28:06] PROBLEM - salt-minion processes on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:28:27] PROBLEM - RAID on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:28:36] PROBLEM - puppet last run on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:28:37] PROBLEM - SSH on graphite1002 is CRITICAL: Server answer
[23:28:39] PROBLEM - Disk space on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:28:46] PROBLEM - dhclient process on graphite1002 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake.
[23:30:35] !log Deleted corrupt shards on logstash1004 and logstash1005. Recovery in process
[23:30:42] Logged the message, Master
[23:31:41] godog: TIL that you can rm a corrupt shard on disk while elasticsearch is still running and then force allocation to start building a clean replica
[23:39:36] RECOVERY - puppet last run on mw1058 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:47:20] operations, Traffic, HTTPS, HTTPS-by-default, Patch-For-Review: Switch to ECDSA hybrid certificates - https://phabricator.wikimedia.org/T86654#1407738 (JanZerebecki) >>! In T86654#1405484, @BBlack wrote: > On security issues, I lean towards thinking that ECDSA is better than RSA, and that while t...
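On bd808's TIL above: after removing a corrupt shard's directory from the data path, the Elasticsearch 1.x API of this era can be told to build a clean copy via the cluster reroute endpoint. A hedged sketch, with placeholder index, shard, and node names:

    # Force allocation of the (now missing) replica onto a chosen node
    curl -XPOST 'http://logstash1004:9200/_cluster/reroute' -d '{
      "commands": [
        { "allocate": {
            "index": "logstash-2015.06.27",
            "shard": 3,
            "node": "logstash1004",
            "allow_primary": false
        }}
      ]
    }'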