[00:04:03] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:25] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:57] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:04:59] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:33] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:22] 10Operations, 10netops: every Sunday at 00:00 UTC, logrotate fails on netflow hosts - https://phabricator.wikimedia.org/T257128 (10CDanis) [00:20:27] 10Operations, 10netops: every Sunday at 00:00 UTC, logrotate fails on netflow hosts - https://phabricator.wikimedia.org/T257128 (10CDanis) p:05Triage→03Low [03:31:39] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 201.4 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [04:51:23] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 62.03 
https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [04:53:50] (03PS1) 10VulpesVulpes825: Don't index NS_USER on hywiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609559 (https://phabricator.wikimedia.org/T257112) [05:17:53] PROBLEM - Query Service HTTP Port on wdqs1007 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [06:03:43] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={2,3} site=eqiad topic={rsyslog-info,rsyslog-notice,udp_localhost-info} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+ [06:03:43] r-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [06:12:57] PROBLEM - Too many messages in kafka logging-eqiad on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash-codfw instance=kafkamon1001 job=burrow partition={0,1,2,3,4,5} site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqia [06:12:57] var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [06:20:25] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200705T0700) [07:08:19] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:09:05] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 133, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:15:41] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:16:27] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 135, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:29:30] (03PS1) 10Amire80: Remove englishwikisource.tumblr.com from Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/609564 [07:34:58] (03PS1) 10Amire80: Remove three entries from the Russian Planet [puppet] - 10https://gerrit.wikimedia.org/r/609565 [07:50:58] (03PS2) 10RhinosF1: Remove englishwikisource.tumblr.com from Planet Wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/609564 (owner: 10Amire80) [07:51:36] (03CR) 10RhinosF1: [C: 03+1] "Fixed the grammar because I'm insane and it was annoying me leaving it. 
Otherwise, +1" [puppet] - 10https://gerrit.wikimedia.org/r/609564 (owner: 10Amire80) [07:55:30] (03CR) 10Amire80: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/609564 (owner: 10Amire80) [08:07:47] (03CR) 10Ammarpad: Don't index NS_USER on hywiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609559 (https://phabricator.wikimedia.org/T257112) (owner: 10VulpesVulpes825) [09:07:43] (03PS1) 10Joal: Bump AQS druid snapshot to 2020-06 [puppet] - 10https://gerrit.wikimedia.org/r/609566 [09:39:37] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 102.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [10:23:37] (03CR) 10MarcoAurelio: [C: 04-1] "I think this needs changing on https://meta.wikimedia.org/wiki/Interwiki_map, not here." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/577740 (https://phabricator.wikimedia.org/T227053) (owner: 10Fomafix) [10:51:45] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 73.22 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [11:22:53] PROBLEM - Host cr3-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [11:24:18] here but not near my laptop [11:24:34] <_joe_> I'm around [11:24:36] checking dashboards [11:24:47] PROBLEM - Host re0.cr3-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [11:25:18] <_joe_> eqsin is unreachable [11:25:27] <_joe_> I'll write a patch to depool it [11:25:39] is it? 
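The Icinga checks throughout this log embed their thresholds in the alert text, e.g. `(C)100 gt (W)80 gt 73.22`: critical above 100, warning above 80, current value 73.22. A minimal sketch of that classification logic, using the elastic1052 GC-rate values seen in this log (function name hypothetical):

```python
def classify(value, warn=80.0, crit=100.0):
    """Classify a metric against 'gt' warning/critical thresholds,
    mirroring the '(C)crit gt (W)warn gt value' alert format."""
    if value > crit:
        return "CRITICAL"
    if value > warn:
        return "WARNING"
    return "OK"

print(classify(201.4))  # CRITICAL: 201.4 gt 100
print(classify(102.7))  # CRITICAL: 102.7 gt 100
print(classify(73.22))  # OK: (C)100 gt (W)80 gt 73.22
```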
[11:25:47] <_joe_> it's unreachable from eqiad [11:26:02] <_joe_> icinga has all cp5* timing out checks [11:26:29]  [11:26:41] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:27:03] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 74, down: 3, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:27:10] that's the new router, but it's just the one [11:27:29] in theory one of the two should be enough [11:27:30] PROBLEM - OSPF status on mr1-eqsin is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:27:30] <_joe_> they have now recovered [11:27:48] <_joe_> but we had alerts on 9 cp5* hosts [11:28:01] <_joe_> so yes, now everything seems ok [11:28:18] (on my phone btw) [11:28:51] PROBLEM - Host cr3-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [11:29:03] I barely have any signal for the next 2h (in the mountains) [11:29:44] <_joe_> yeah I browse just fine on eqsin for the record [11:30:57] looks like even the mgmt interface went down (from the backlog) [11:31:27] <_joe_> yes [11:31:40] up/here reading backlog [11:31:52] <_joe_> I will prepare a patch to depool eqsin in case we lose the second router [11:32:02] <_joe_> but I don't think this warrants depooling the site as of now [11:32:13] logging off for now but LMK if I can help [11:35:08] (03PS1) 10Giuseppe Lavagetto: Depool eqsin [dns] - 10https://gerrit.wikimedia.org/r/609571 [11:35:57] <_joe_> paravoid: do you think there is something we can try to do now? 
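_joe_'s reasoning above — prepare the depool patch now, but only merge it if the second eqsin router also fails, since one router can carry the site — amounts to a simple decision rule. This is a hypothetical illustration of that reasoning only; the actual depool is the DNS change in the patch referenced above:

```python
def should_depool(routers_up: int) -> bool:
    """Hypothetical rule: one healthy router can carry the site,
    so only depool when no router at the site is reachable."""
    return routers_up == 0

# cr3-eqsin down but cr2-eqsin still up: keep eqsin pooled
print(should_depool(1))  # False
print(should_depool(0))  # True
```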
[11:36:06] <_joe_> oh you're on the phone too, heh [11:36:57] next steps will be to look at the console server, see if there is anything there [11:38:09] if yes, get the logs, etc., and most likely open a JTAC case [11:38:39] if not, probably depool eqsin and powercycle it by turning the PDU ports off/on [11:41:10] <_joe_> https://gerrit.wikimedia.org/r/c/operations/dns/+/609571/ [11:42:23] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 63 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:44:09] don't have access to the console server, unfortunately [11:48:13] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [11:52:39] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [12:03:35] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:09:23] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 569 (alerts on 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:29:33] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 52.88 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [13:48:55] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:08:53] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:09:13] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:45] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:14:45] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:39] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:03] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [14:44:11] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:08:09] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:41] PROBLEM - Check systemd state on wdqs1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:38] (03PS1) 10Thiemo Kreuz (WMDE): [POC] Convert all Wikipedia logos to (true) grayscale [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609584 [15:31:33] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:33:22] (03PS2) 10Thiemo Kreuz (WMDE): [POC] Convert all Wikipedia logos to (true) grayscale [mediawiki-config] - 10https://gerrit.wikimedia.org/r/609584 (https://phabricator.wikimedia.org/T252108) [15:37:25] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:51:25] (03CR) 10Privacybatm: "In this new patch set, I used a file to write the checksum instead of a pipe. This change resolves the above stuck issue. 
The benchmark va" [software/transferpy] - 10https://gerrit.wikimedia.org/r/608640 (https://phabricator.wikimedia.org/T254979) (owner: 10Privacybatm) [15:56:16] !log restart blazegraph + updater on wdqs1007 and depool to allow catching up on lag [15:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:13] PROBLEM - WDQS high update lag on wdqs1007 is CRITICAL: 1.524e+05 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [15:57:49] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [15:57:55] RECOVERY - Check systemd state on wdqs1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:38] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1007 is CRITICAL: 1.524e+05 ge 4.32e+04 Gehel catching up on lag after restart - The acknowledgement expires at: 2020-07-06 15:58:16. 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:01:22] !log restart elastic-psi on elastic1052 (high GC rate) [16:01:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:21] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 53.9 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37 [16:27:03] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:28:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:54:43] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 48.3 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:58:23] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:20:11] (03PS1) 10Andrew Bogott: profile::wmcs::novaproxy: add use_wmflabs_root option [puppet] - 10https://gerrit.wikimedia.org/r/609586 (https://phabricator.wikimedia.org/T256276) [18:20:13] (03PS1) 10Andrew Bogott: profile::wmcs::novaproxy: add acme_certname [puppet] - 10https://gerrit.wikimedia.org/r/609587 (https://phabricator.wikimedia.org/T256276) [19:10:28] 10Operations, 10netops: Investigate Junos vmhost snapshot - https://phabricator.wikimedia.org/T257153 (10ayounsi) p:05Triage→03High [19:16:28] 10Operations, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10ayounsi) p:05Triage→03High [19:34:10] 10Operations, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10ayounsi) Opened JTAC case 2020-0705-0136 [19:43:47] 10Operations, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10ayounsi) Note that there are no mentions of hard drives in Juniper's MX204 [[ https://www.juniper.net/documentation/en_US/release-independent/junos/information-products/pathway-pages/mx-series/mx204/mx204-hw-guide.... 
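The Varnish traffic-drop alert earlier (`48.3 le 60`) compares current request volume at eqsin against the volume 30 minutes before, expressed as a percentage, and goes critical when that ratio falls to the threshold or below. A minimal sketch of that comparison (names and request counts hypothetical):

```python
def traffic_pct_of_past(now_reqs: float, past_reqs: float) -> float:
    """Current traffic as a percentage of traffic 30 minutes ago."""
    return 100.0 * now_reqs / past_reqs

def is_drop_critical(now_reqs: float, past_reqs: float,
                     threshold: float = 60.0) -> bool:
    """Fires when traffic drops to <= threshold percent of the
    earlier value, as in '48.3 le 60'."""
    return traffic_pct_of_past(now_reqs, past_reqs) <= threshold

print(is_drop_critical(483.0, 1000.0))  # True: 48.3 le 60
print(is_drop_critical(900.0, 1000.0))  # False: 90.0 gt 60
```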
[20:45:27] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 51 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:51:17] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:01:53] 10Operations, 10netops: cr3-eqsin disk 1 failure - https://phabricator.wikimedia.org/T257154 (10ayounsi) JTAC asked for some more troubleshooting commands, the interesting one is: ` root> show vmhost hardware Compute cluster: rainier-re-cc Compute node: rainier-re-cn Hardware invento... [21:20:33] !log Enable puppet on gerrit1002 (gerrit-test) again to let it catch up again [21:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:18] (03PS1) 10QChris: Bump gerrit.war to Gerrit 3.2.2-102-g3bbb138e13 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/609590 (https://phabricator.wikimedia.org/T256550) [21:24:22] (03PS1) 10QChris: Bump zuul.jar to master-0-g7accc67 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/609591 (https://phabricator.wikimedia.org/T256550) [21:27:12] (03CR) 10QChris: [V: 03+2 C: 03+2] Bump gerrit.war to Gerrit 3.2.2-102-g3bbb138e13 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/609590 (https://phabricator.wikimedia.org/T256550) (owner: 10QChris) [21:27:42] (03CR) 10QChris: [V: 03+2 C: 03+2] Bump zuul.jar to master-0-g7accc67 [software/gerrit] (deploy/wmf/stable-3.2) - 10https://gerrit.wikimedia.org/r/609591 (https://phabricator.wikimedia.org/T256550) (owner: 10QChris) [21:32:21] !log qchris@deploy1001 Started deploy 
[gerrit/gerrit@fbd0684]: Bump gerrit to 3.2.2-102-g3bbb138e13 and zuul plugin to master-0-g7accc67 [21:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:29] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@fbd0684]: Bump gerrit to 3.2.2-102-g3bbb138e13 and zuul plugin to master-0-g7accc67 (duration: 00m 08s) [21:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:57] !log Restarting gerrit on gerrit1002 to pick up new wars and jars. [21:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:34] !log qchris@deploy1001 Started deploy [gerrit/gerrit@fbd0684]: Bump gerrit to 3.2.2-102-g3bbb138e13, zuul plugin to master-0-g7accc67, and gitiles to v3.2.2-1-g00c5ca0-with-0e3b533 on gerrit2001 [21:45:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:45:44] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@fbd0684]: Bump gerrit to 3.2.2-102-g3bbb138e13, zuul plugin to master-0-g7accc67, and gitiles to v3.2.2-1-g00c5ca0-with-0e3b533 on gerrit2001 (duration: 00m 10s) [21:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:46:21] !log Restarting gerrit on gerrit2001 to pick up new war and jars. 
[21:46:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:34] !log qchris@deploy1001 Started deploy [gerrit/gerrit@fbd0684]: Bump gerrit to 3.2.2-102-g3bbb138e13, zuul plugin to master-0-g7accc67, and gitiles to v3.2.2-1-g00c5ca0-with-0e3b533 on gerrit1001 [21:50:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:42] !log qchris@deploy1001 Finished deploy [gerrit/gerrit@fbd0684]: Bump gerrit to 3.2.2-102-g3bbb138e13, zuul plugin to master-0-g7accc67, and gitiles to v3.2.2-1-g00c5ca0-with-0e3b533 on gerrit1001 (duration: 00m 07s) [21:50:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:48] !log Restarting gerrit on gerrit1001 to pick up new war and jars. [21:50:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:58] Gerrit is up, is properly connected to phabricator, allows clones, pushes, voting, and integration with Zuul works too. [22:20:29] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 62 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:32:09] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 50 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:41:43] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 58 probes of 569 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [22:47:35] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 49 probes of 569 (alerts on 
50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
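The recurring RIPE Atlas checks in this log count how many of the 569 measurement probes fail to reach eqsin, with "alerts on 50": the log reports 50 failed probes as OK and 51 as CRITICAL, so the check appears to fire once failures exceed the threshold. A minimal sketch of that logic (function name hypothetical):

```python
def atlas_status(failed_probes: int, alert_on: int = 50) -> str:
    """RIPE Atlas reachability check: CRITICAL once the number of
    failing probes exceeds the 'alerts on' threshold."""
    return "CRITICAL" if failed_probes > alert_on else "OK"

print(atlas_status(63))  # CRITICAL - failed 63 probes (alerts on 50)
print(atlas_status(50))  # OK - 50 failed is still within threshold
print(atlas_status(49))  # OK
```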