[00:34:15] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:35:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[00:37:31] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[00:37:31] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[00:37:50] Operations, Gerrit, Release-Engineering-Team, Wikimedia Design Style Guide, Patch-For-Review: Automatic pickup of Gerrit clone master doesn't happen (due to git-lfs not installed on production misc) - https://phabricator.wikimedia.org/T235677 (mmodell) >>! In T235677#5607624, @Volker_E wrote:...
[02:15:39] (CR) Dzahn: "> Patch Set 2:" [puppet] - https://gerrit.wikimedia.org/r/544077 (https://phabricator.wikimedia.org/T180641) (owner: Dzahn)
[02:46:17] ACKNOWLEDGEMENT - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn esams decomed
[04:33:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[04:35:29] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:03:28] Operations, ops-esams: Setup new access switches - https://phabricator.wikimedia.org/T184065 (ayounsi) Open→Resolved a:ayounsi This is done.
[10:03:30] Operations, ops-esams: Repurpose csw2-oe14/15 and lab-ex4200 as msw - https://phabricator.wikimedia.org/T215991 (ayounsi)
[10:03:32] Operations, ops-esams: Prepare racks OE14, OE15 and OE16 with new infrastructure - https://phabricator.wikimedia.org/T184064 (ayounsi)
[10:23:05] PROBLEM - Host re0.cr3-esams is DOWN: PING CRITICAL - Packet loss = 100%
[10:36:18] Critical Alert for device cr3-esams.wikimedia.org - Juniper alarm active
[10:52:22] ACKNOWLEDGEMENT - Host asw-esams is DOWN: PING CRITICAL - Packet loss = 100% Ayounsi Device decomed and removed from Icinga's puppet. Icinga stuck? - The acknowledgement expires at: 2019-10-28 10:51:25.
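The "Host ... is DOWN" checks above follow the standard ping-plugin pattern: send a fixed number of ICMP probes, compute the loss percentage, and compare it against a critical threshold. A minimal Python sketch of that decision logic, with hypothetical names and thresholds (not the production Icinga plugin, just the logic its "Packet loss = 100%" output implies):

    # Hypothetical sketch of a ping-based host check; a real check
    # would send ICMP probes rather than take the counts as arguments.
    def classify_host(sent: int, received: int, critical_loss_pct: float = 100.0) -> str:
        """Return an Icinga-style status line from ping probe counts."""
        loss_pct = 100.0 * (sent - received) / sent
        if loss_pct >= critical_loss_pct:
            return f"PING CRITICAL - Packet loss = {loss_pct:.0f}%"
        return f"PING OK - Packet loss = {loss_pct:.0f}%"

    # classify_host(10, 0)  -> 'PING CRITICAL - Packet loss = 100%'
    # classify_host(10, 10) -> 'PING OK - Packet loss = 0%'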
[10:52:49] PROBLEM - Juniper alarms on cr3-esams is CRITICAL: JNX_ALARMS CRITICAL - No response from remote host 91.198.174.245 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[10:54:46] PROBLEM - Host cr3-esams is DOWN: PING CRITICAL - Packet loss = 100%
[10:55:19] well, seems like re1 rebooted when re0 was powering on...
[10:55:53] well, when re0 was collecting crashdump data, which seems to restart re0
[10:56:03] why re1 too? who knows
[10:56:05] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:56:49] PROBLEM - OSPF status on cr2-knams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:56:51] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 74, down: 5, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:57:13] PROBLEM - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:57:31] PROBLEM - Host cr3-esams IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[10:57:40] both REs are back on the console side
[10:58:00] ack
[10:58:07] RECOVERY - Host re0.cr3-esams is UP: PING OK - Packet loss = 0%, RTA = 83.82 ms
[10:59:23] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:01:34] !log re0.cr3-esams> request chassis routing-engine master switch
[11:01:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:58] !log restart cr3-esams
[11:31:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:48:15] ?
[11:53:00] (PS1) BBlack: depool esams [dns] - https://gerrit.wikimedia.org/r/546328
[13:12:09] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:13:45] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
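The "OSPF status" alerts above report adjacencies as up/expected ("OSPFv2: 5/6 UP"): when cr3-esams dropped off the network, each neighboring router lost one expected adjacency and went CRITICAL. A sketch of that comparison, with hypothetical names (the real check derives the expected neighbor counts from router configuration):

    # Hypothetical sketch of the OSPF adjacency check: any shortfall
    # against the expected neighbor count is CRITICAL.
    def ospf_status(v2_up: int, v2_total: int, v3_up: int, v3_total: int) -> str:
        summary = f"OSPFv2: {v2_up}/{v2_total} UP : OSPFv3: {v3_up}/{v3_total} UP"
        if v2_up < v2_total or v3_up < v3_total:
            return f"CRITICAL: {summary}"
        return f"OK: {summary}"

    # cr2-eqiad during the outage:
    # ospf_status(5, 6, 5, 6) -> 'CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP'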
[13:48:25] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 38 probes of 470 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[13:54:01] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 27 probes of 470 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:15:24] (PS1) CDanis: ripe-atlas-eqsin-ipv6: raise alert threshold [puppet] - https://gerrit.wikimedia.org/r/546330
[14:28:22] (PS2) CDanis: ripe-atlas-eqsin-ipv6: raise alert threshold [puppet] - https://gerrit.wikimedia.org/r/546330
[14:34:53] (PS3) CDanis: ripe-atlas-eqsin-ipv6: raise alert threshold [puppet] - https://gerrit.wikimedia.org/r/546330
[14:35:43] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 38 probes of 470 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[14:37:22] (CR) CDanis: "PCC looks good: https://puppet-compiler.wmflabs.org/compiler1002/19084/icinga1001.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/546330 (owner: CDanis)
[14:41:19] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 27 probes of 470 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[15:23:45] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 41 probes of 470 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[15:34:57] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 34 probes of 470 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[15:43:47] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 45 probes of 470 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[15:48:48] (CR) CDanis: [C: +2] "This is noisy today and I don't believe it's actionable, so merging. Revert if you disagree!" [puppet] - https://gerrit.wikimedia.org/r/546330 (owner: CDanis)
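The flapping above is pure thresholding: the check alarms when more than a fixed number of RIPE Atlas probes fail to reach eqsin, and the day's failure counts (27 to 46 of 470) straddled the old cutoff of 35. Raising the cutoff to 50, as the merged patch does, keeps the same data on the OK side. A sketch of the logic with hypothetical names (the real threshold lives in puppet, change 546330):

    # Hypothetical sketch of the Atlas probe-failure threshold; the ">"
    # comparison matches the log, where "failed 34 ... (alerts on 35)"
    # recovers but "failed 38 ... (alerts on 35)" is CRITICAL.
    def atlas_status(failed: int, total: int, alert_on: int) -> str:
        state = "CRITICAL" if failed > alert_on else "OK"
        return f"{state} - failed {failed} probes of {total} (alerts on {alert_on})"

    for failed in (27, 38, 46):
        print(atlas_status(failed, 470, 35), "|", atlas_status(failed, 470, 50))
    # The 35-probe threshold flaps on 38 and 46; the 50-probe threshold stays OK.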
[15:53:26] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 46 probes of 470 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[16:22:32] Operations, netops: cr3-esams crash - https://phabricator.wikimedia.org/T236598 (CDanis)
[16:23:42] ACKNOWLEDGEMENT - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP CDanis https://phabricator.wikimedia.org/T236598 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:23:42] ACKNOWLEDGEMENT - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 74, down: 5, dormant: 0, excluded: 0, unused: 0: CDanis https://phabricator.wikimedia.org/T236598 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:23:42] ACKNOWLEDGEMENT - OSPF status on cr2-knams is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP CDanis https://phabricator.wikimedia.org/T236598 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:23:42] ACKNOWLEDGEMENT - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis https://phabricator.wikimedia.org/T236598 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:23:42] ACKNOWLEDGEMENT - OSPF status on mr1-esams is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP CDanis https://phabricator.wikimedia.org/T236598 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:26:48] Operations, netops: cr3-esams crash - https://phabricator.wikimedia.org/T236598 (ayounsi) Opened Juniper case 2019-1026-0004. > Hi, > > earlier today we had re0 on our MX480 chassis crash and be stuck on the "db>" prompt, see attached file (re0-cr3-esams-crash.log) for the data-gathering commands done....
[16:36:33] (PS3) Alex Monk: Swap toolforge proxies to use acme-chief certificates [puppet] - https://gerrit.wikimedia.org/r/545679 (https://phabricator.wikimedia.org/T235252)
[16:37:05] (CR) Alex Monk: "This is puppet.git so I can't merge anyway, but sure, we can wait until Buster if you like." [puppet] - https://gerrit.wikimedia.org/r/545679 (https://phabricator.wikimedia.org/T235252) (owner: Alex Monk)
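For reference, the "Router interfaces" checks acknowledged above tally operational states across a router's interface table (likely gathered via SNMP ifOperStatus) and alarm when any monitored interface is down; here five interfaces on cr2-esams and one on cr2-knams dropped when cr3-esams died. A minimal sketch of that tallying, with hypothetical names:

    # Hypothetical sketch of the "Router interfaces" tally; a real
    # check would walk ifOperStatus over SNMP rather than take a dict.
    from collections import Counter

    def router_interfaces(host: str, oper_status: dict) -> str:
        counts = Counter(oper_status.values())
        up, down = counts.get("up", 0), counts.get("down", 0)
        state = "CRITICAL" if down > 0 else "OK"
        return (f"{state}: host {host}, interfaces up: {up}, down: {down}, "
                f"dormant: {counts.get('dormant', 0)}, excluded: 0, unused: 0")

    # router_interfaces("91.198.174.246",
    #                   {"xe-0/1/3": "down", **{f"ge-0/0/{i}": "up" for i in range(45)}})
    # -> 'CRITICAL: host 91.198.174.246, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0'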
[16:38:43] (CR) Alex Monk: "Also we should probably do I202ded71" [puppet] - https://gerrit.wikimedia.org/r/545679 (https://phabricator.wikimedia.org/T235252) (owner: Alex Monk)
[17:27:05] (PS1) Alex Monk: deployment-prep: Fix ATS upload domain handling [puppet] - https://gerrit.wikimedia.org/r/546332
[17:47:39] (PS1) Ayounsi: Smokeping, remove cr3-esams while it's not working [puppet] - https://gerrit.wikimedia.org/r/546334 (https://phabricator.wikimedia.org/T236598)
[17:48:28] (CR) Ayounsi: [C: +2] Smokeping, remove cr3-esams while it's not working [puppet] - https://gerrit.wikimedia.org/r/546334 (https://phabricator.wikimedia.org/T236598) (owner: Ayounsi)
[20:12:00] PROBLEM - MD RAID on elastic1039 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:12:01] ACKNOWLEDGEMENT - MD RAID on elastic1039 is CRITICAL: CRITICAL: State: degraded, Active: 3, Working: 3, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T236601 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[20:12:04] Operations, ops-eqiad: Degraded RAID on elastic1039 - https://phabricator.wikimedia.org/T236601 (ops-monitoring-bot)
[20:13:14] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:15:13] Operations, DC-Ops, Traffic, decommission: decommission multatuli - https://phabricator.wikimedia.org/T236489 (Papaul)
[21:37:04] PROBLEM - Device not healthy -SMART- on elastic1039 is CRITICAL: cluster=elasticsearch device=sda instance=elastic1039:9100 job=node site=eqiad https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops
[22:07:38] RECOVERY - Device not healthy -SMART- on elastic1039 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1039&var-datasource=eqiad+prometheus/ops
[23:06:12] RECOVERY - Check systemd state on elastic1039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:11:00] PROBLEM - Check systemd state on elastic1039 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
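The elastic1039 thread shows the usual cascade: a failed disk degrades the md array, the RAID event handler auto-files T236601 and acknowledges the alert, and the systemd state check trips because a unit has failed. The systemd check's verdict maps directly onto `systemctl is-system-running`, which prints "running" on a healthy host and "degraded" when one or more units have failed; a minimal sketch of a wrapper (hypothetical, not the production plugin linked in the alert):

    # Hypothetical wrapper around `systemctl is-system-running`:
    # "running" is healthy; anything else (here lumped together for
    # brevity, e.g. "degraded") is treated as CRITICAL.
    import subprocess

    def check_systemd_state() -> str:
        result = subprocess.run(
            ["systemctl", "is-system-running"],
            capture_output=True, text=True,
        )
        state = result.stdout.strip()
        if state == "running":
            return "OK - running: The system is fully operational"
        return f"CRITICAL - {state}: The system is operational but one or more units failed"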