[00:07:00] (CR) Paladox: phabricator: write my.cnf for db access into each admin home dir (1 comment) [puppet] - https://gerrit.wikimedia.org/r/551268 (https://phabricator.wikimedia.org/T238425) (owner: Dzahn)
[00:14:05] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 42.78 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:14:35] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 25.67 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:15:49] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 74.88 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:16:19] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 116.1 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:26:09] Operations, SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494 (Rxy) p:Triage→Normal
[00:27:17] Operations, SRE-Access-Requests: Requesting access to LogStash for rxy - https://phabricator.wikimedia.org/T239494 (Rxy) p:Normal→Triage
[00:43:25] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 0.9917 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[00:46:55] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.0375 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[01:11:00] (PS3) DannyS712: Enable partial blocks on eswiki and scowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/553431 (https://phabricator.wikimedia.org/T239370)
[01:21:50] (PS1) Zoranzoki21: Add throttle rule for cawiki workshop on 2019-12-02 [mediawiki-config] - https://gerrit.wikimedia.org/r/553787 (https://phabricator.wikimedia.org/T239465)
[04:09:23] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=bacula site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:37:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:13:59] PROBLEM - Host cp3057 is DOWN: PING CRITICAL - Packet loss = 100%
[07:02:03] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Marostegui) ` 06:13:59 <+icinga-wm> PROBLEM - Host cp3057 is DOWN: PING CRITICAL - Packet loss = 100 ` Could be another case of R440 going down?
[07:28:41] Operations, Traffic: cp3057 crashed - https://phabricator.wikimedia.org/T239502 (Vgutierrez)
[07:30:32] !log depool and powercycle cp3057 - T239502
[07:30:38] !log vgutierrez@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp3057.esams.wmnet
[07:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:30:39] T239502: cp3057 crashed - https://phabricator.wikimedia.org/T239502
[07:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:11] RECOVERY - Host cp3057 is UP: PING OK - Packet loss = 0%, RTA = 83.39 ms
[07:38:01] Operations, Traffic: cp3057 crashed - https://phabricator.wikimedia.org/T239502 (Vgutierrez) Open→Resolved p:Triage→Normal a:Vgutierrez As with other occurrences of T238305, nothing on the console, nothing on the SEL and nothing weird on the logs.
[07:38:03] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[07:38:21] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez)
[07:40:03] Operations, Traffic: servers freeze across the caching cluster - https://phabricator.wikimedia.org/T238305 (Vgutierrez) >>! In T238305#5690539, @BBlack wrote: > It was observed earlier in the traffic meeting that we're fairly certain that none of our R440 hosts have had this problem more than once, so th...
[07:40:22] !log repooling cp3057 - T239502
[07:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:28] T239502: cp3057 crashed - https://phabricator.wikimedia.org/T239502
[12:06:27] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 50.01 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[12:36:05] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 85.61 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:01:38] (PS1) Alex Monk: deployment-prep: Replace stretch poolcounter with a buster one [mediawiki-config] - https://gerrit.wikimedia.org/r/553806
[14:18:09] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:19:47] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[14:27:55] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 57 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:27:55] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 57.49 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:28:25] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 52.15 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:29:39] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 103.5 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:29:39] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 104 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:30:09] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 92.8 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[14:30:39] looks like traffic randomly increased 30m ago and then went back down
[14:33:15] looks like GETs to text caches
[14:33:50] 13:55-13:59ish
[15:47:27] !log Reset email of SUL user Hayk.arabaget (T239462)
[15:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:33] T239462: Reset the password of user Hayk.arabaget - https://phabricator.wikimedia.org/T239462
[17:34:05] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.729 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[17:37:33] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0.02083 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[17:57:31] Operations, ops-codfw, DC-Ops: Move kafka200[123] to logstash202[012] - https://phabricator.wikimedia.org/T235125 (Volans) Resolved→Open Re-opening as the DNS name of the interfaces attached to those hosts have not been modified in Netbox. Things like: ` IP address: 10.193.1.23/16 Parent: lo...
[18:03:39] Operations, ops-codfw, SRE-swift-storage, decommission, User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (Volans) Resolved→Open `ms-be2013` and `ms-be2014` are marked as `Decommissioning` in Netbox, if they were unracked their status should be changed t...
[18:21:57] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[18:21:57] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[18:22:19] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1866 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[18:23:33] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 28426 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[18:23:35] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28426 bytes in 0.857 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[18:24:03] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: HTTP OK: HTTP/1.1 200 OK - 28547 bytes in 0.335 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[18:25:45] RECOVERY - Wikitech and wt-static content in sync on cloudweb2001-dev is OK: wikitech-static OK - wikitech and wikitech-static in sync (85946 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[18:30:01] Operations, ops-codfw, decommission: Decommission db2061.codfw.wmnet - https://phabricator.wikimedia.org/T238526 (Volans) Resolved→Open Netbox status is currently `Decommissioning`, if the host has been unracked it should be `Offline`.
[18:30:03] Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Volans)
[18:32:08] Operations, ops-codfw: Decommission old mw2231/WMF6435 replaced with WMF6403 - https://phabricator.wikimedia.org/T232126 (Volans) Resolved→Open It seems that Netbox's ip address has not been updated and still reports `graphite2002` in the DNS name, see https://netbox.wikimedia.org/ipam/ip-address...
[18:40:53] Operations, ops-codfw, decommission: Decommission db2061.codfw.wmnet - https://phabricator.wikimedia.org/T238526 (Papaul) Open→Resolved
[18:40:55] Operations, DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (Papaul)
[18:44:20] Operations, ops-codfw, SRE-swift-storage, decommission, User-fgiunchedi: decom ms-be201[345] - https://phabricator.wikimedia.org/T221068 (Papaul) Open→Resolved
[18:48:16] (PS1) Jhedden: tools: add qdisc node collector to tools bastion [puppet] - https://gerrit.wikimedia.org/r/553815
[18:50:09] (CR) Jhedden: "Hoping this will provide more information on the recent bastion load utilization" [puppet] - https://gerrit.wikimedia.org/r/553815 (owner: Jhedden)
[18:52:52] Operations, ops-codfw: Decommission old mw2231/WMF6435 replaced with WMF6403 - https://phabricator.wikimedia.org/T232126 (Papaul) @Volans do we have a template in place when a server name changes from X to Y to also update DNS in netbox. The last steps that I recalled were 1- change switch port descri...
[18:58:31] Operations, ops-codfw: Decommission old mw2231/WMF6435 replaced with WMF6403 - https://phabricator.wikimedia.org/T232126 (Volans) @Papaul given we're setting the DNS name of the ip address in Netbox, that one too needs to be updated, see the links above: ` IP: 10.193.2.251/16 Assignment: mw2231 (mgmt) DN...
[18:59:55] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 1.113 ge 0.5 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:03:25] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)0.5 ge (W)0.1 ge 0 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:09:49] Operations, ops-codfw: Decommission old mw2231/WMF6435 replaced with WMF6403 - https://phabricator.wikimedia.org/T232126 (Papaul) @volans confirm that 10.193.1.118 is indeed the mgmt IP address for mw2231 and 10.193.2.251 is no longer in use. Thanks
[20:10:43] RECOVERY - Wikitech and wt-static content in sync on labweb1002 is OK: wikitech-static OK - wikitech and wikitech-static in sync (85946 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[20:10:45] RECOVERY - Wikitech and wt-static content in sync on labweb1001 is OK: wikitech-static OK - wikitech and wikitech-static in sync (85946 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[20:26:51] PROBLEM - cassandra-b SSL 10.192.48.122:7001 on restbase2017 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[20:26:55] PROBLEM - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is CRITICAL: connect to address 10.192.48.122 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[20:27:03] PROBLEM - cassandra-b service on restbase2017 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:27:09] PROBLEM - cassandra-b CQL 10.192.32.111:9042 on restbase2016 is CRITICAL: connect to address 10.192.32.111 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[20:27:17] PROBLEM - cassandra-b SSL 10.192.32.111:7001 on restbase2016 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[20:27:21] PROBLEM - cassandra-c CQL 10.192.32.154:9042 on restbase2011 is CRITICAL: connect to address 10.192.32.154 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[20:27:23] PROBLEM - Check systemd state on restbase2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:27:27] PROBLEM - Check systemd state on restbase2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:27:37] PROBLEM - cassandra-c SSL 10.192.32.154:7001 on restbase2011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662
[20:27:45] PROBLEM - cassandra-c service on restbase2011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:28:09] PROBLEM - Check systemd state on restbase2011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:28:13] PROBLEM - cassandra-b service on restbase2016 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:29:11] RECOVERY - Check systemd state on restbase2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:29:57] RECOVERY - cassandra-b service on restbase2016 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:30:41] RECOVERY - cassandra-b CQL 10.192.32.111:9042 on restbase2016 is OK: TCP OK - 0.036 second response time on 10.192.32.111 port 9042 https://phabricator.wikimedia.org/T93886
[20:30:47] RECOVERY - cassandra-b SSL 10.192.32.111:7001 on restbase2016 is OK: SSL OK - Certificate restbase2016-b valid until 2020-11-29 09:26:15 +0000 (expires in 364 days) https://phabricator.wikimedia.org/T120662
[20:41:05] RECOVERY - cassandra-b service on restbase2017 is OK: OK - cassandra-b is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:41:23] RECOVERY - Check systemd state on restbase2017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:42:35] RECOVERY - cassandra-b SSL 10.192.48.122:7001 on restbase2017 is OK: SSL OK - Certificate restbase2017-b valid until 2020-11-29 09:26:18 +0000 (expires in 364 days) https://phabricator.wikimedia.org/T120662
[20:42:41] RECOVERY - cassandra-b CQL 10.192.48.122:9042 on restbase2017 is OK: TCP OK - 0.036 second response time on 10.192.48.122 port 9042 https://phabricator.wikimedia.org/T93886
[20:47:01] RECOVERY - cassandra-c service on restbase2011 is OK: OK - cassandra-c is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:47:23] RECOVERY - Check systemd state on restbase2011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:48:17] RECOVERY - cassandra-c CQL 10.192.32.154:9042 on restbase2011 is OK: TCP OK - 0.039 second response time on 10.192.32.154 port 9042 https://phabricator.wikimedia.org/T93886
[20:48:35] RECOVERY - cassandra-c SSL 10.192.32.154:7001 on restbase2011 is OK: SSL OK - Certificate restbase2011-c valid until 2020-06-24 13:02:00 +0000 (expires in 206 days) https://phabricator.wikimedia.org/T120662
[22:30:13] PROBLEM - Disk space on netflow2001 is CRITICAL: DISK CRITICAL - free space: / 302 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops