[00:17:06] <icinga-wm_>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:18:54] <icinga-wm_>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:32:50] <wikibugs>	 (CR) Bmansurov: "What's the command you used to get that error message?" [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[00:48:58] <icinga-wm_>	 PROBLEM - puppet last run on dumpsdata1001 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:06:22] <icinga-wm_>	 RECOVERY - puppet last run on dumpsdata1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:46:36] <icinga-wm_>	 PROBLEM - Check systemd state on ms-be1041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:44] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1041 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[02:11:56] <icinga-wm_>	 RECOVERY - Check systemd state on ms-be1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:25:42] <icinga-wm_>	 PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 90.14% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[02:32:48] <icinga-wm_>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:36:28] <icinga-wm_>	 RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:39:32] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1041 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[02:45:30] <icinga-wm_>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:58:14] <icinga-wm_>	 RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:04:52] <icinga-wm_>	 PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:12] <icinga-wm_>	 PROBLEM - Host an-worker1091 is DOWN: PING CRITICAL - Packet loss = 100%
[03:07:14] <icinga-wm_>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:09:00] <icinga-wm_>	 RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:18:00] <icinga-wm_>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:40] <icinga-wm_>	 RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:00] <icinga-wm_>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:02] <icinga-wm_>	 PROBLEM - MegaRAID on pc2007 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:46:03] <icinga-wm_>	 ACKNOWLEDGEMENT - MegaRAID on pc2007 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T255904 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:46:07] <wikibugs>	 Operations, ops-codfw: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (ops-monitoring-bot)
[04:01:20] <icinga-wm_>	 RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:44] <icinga-wm_>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:56] <icinga-wm_>	 RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:44:28] <icinga-wm_>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:30] <icinga-wm_>	 RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:08] <icinga-wm_>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:20] <icinga-wm_>	 RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:30] <icinga-wm_>	 PROBLEM - puppet last run on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[05:38:38] <icinga-wm_>	 PROBLEM - Disk space on Hadoop worker on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:38:42] <icinga-wm_>	 PROBLEM - Check systemd state on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:42] <icinga-wm_>	 PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:39:02] <icinga-wm_>	 PROBLEM - Check size of conntrack table on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[05:40:02] <icinga-wm_>	 PROBLEM - Hadoop DataNode on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:44:24] <icinga-wm_>	 PROBLEM - IPMI Sensor Status on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[05:45:05] <icinga-wm_>	 PROBLEM - dhclient process on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[05:53:24] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[05:57:48] <icinga-wm_>	 PROBLEM - Disk space on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops
[05:57:56] <icinga-wm_>	 PROBLEM - configured eth on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[05:59:52] <icinga-wm_>	 PROBLEM - DPKG on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[06:02:58] <icinga-wm_>	 PROBLEM - Check the NTP synchronisation status of timesyncd on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[06:07:14] <icinga-wm_>	 PROBLEM - MegaRAID on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:51:52] <icinga-wm_>	 PROBLEM - Long running screen/tmux on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[07:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200620T0700)
[07:32:35] <elukey>	 checking an-workers..
[07:36:56] <elukey>	 !log powercycle an-worker1091 - bug soft lock up CPU showed in mgmt console
[07:36:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:38] <icinga-wm_>	 RECOVERY - Host an-worker1091 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[07:42:02] <elukey>	 !log powercycle an-worker1093 - bug soft lock up CPU showed in mgmt console
[07:42:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:56] <icinga-wm_>	 PROBLEM - Host an-worker1093 is DOWN: PING CRITICAL - Packet loss = 100%
[07:45:20] <wikibugs>	 (PS1) VulpesVulpes825: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593)
[07:45:54] <icinga-wm_>	 RECOVERY - Host an-worker1093 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[07:46:02] <icinga-wm_>	 RECOVERY - Hadoop DataNode on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:46:26] <icinga-wm_>	 RECOVERY - Disk space on Hadoop worker on an-worker1093 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:46:30] <icinga-wm_>	 RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:30] <icinga-wm_>	 RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:46:50] <icinga-wm_>	 RECOVERY - Check size of conntrack table on an-worker1093 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[07:47:32] <icinga-wm_>	 RECOVERY - IPMI Sensor Status on an-worker1093 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[07:48:12] <icinga-wm_>	 RECOVERY - dhclient process on an-worker1093 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[07:51:52] <icinga-wm_>	 RECOVERY - puppet last run on an-worker1093 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:55:06] <icinga-wm_>	 RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:56:34] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on an-worker1093 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:59:03] <wikibugs>	 (CR) Majavah: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/606710 (https://phabricator.wikimedia.org/T198673) (owner: Majavah)
[08:01:08] <icinga-wm_>	 RECOVERY - configured eth on an-worker1093 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[08:02:38] <icinga-wm_>	 RECOVERY - Disk space on an-worker1093 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops
[08:03:02] <icinga-wm_>	 RECOVERY - DPKG on an-worker1093 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:06:08] <icinga-wm_>	 RECOVERY - Check the NTP synchronisation status of timesyncd on an-worker1093 is OK: OK: synced at Sat 2020-06-20 08:06:07 UTC. https://wikitech.wikimedia.org/wiki/NTP
[08:38:49] <wikibugs>	 (Abandoned) VulpesVulpes825: Enable DiscussionTools as a beta feature on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/592514 (https://phabricator.wikimedia.org/T251075) (owner: VulpesVulpes825)
[09:30:48] <icinga-wm_>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:36:38] <icinga-wm_>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:02:32] <icinga-wm_>	 PROBLEM - very high load average likely xfs on ms-be2037 is CRITICAL: CRITICAL - load average: 135.55, 110.60, 60.85 https://wikitech.wikimedia.org/wiki/Swift
[10:06:10] <icinga-wm_>	 RECOVERY - very high load average likely xfs on ms-be2037 is OK: OK - load average: 19.58, 73.92, 57.58 https://wikitech.wikimedia.org/wiki/Swift
[10:50:06] <icinga-wm_>	 PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:52:42] <icinga-wm_>	 RECOVERY - Long running screen/tmux on an-worker1093 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[11:15:18] <icinga-wm_>	 RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:30] <icinga-wm_>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 55 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:45:18] <icinga-wm_>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:55:09] <wikibugs>	 (CR) Alexandros Kosiaris: "> What's the command you used to get that error message?" [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[12:44:50] <icinga-wm_>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 56 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:50:40] <icinga-wm_>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:56:19] <wikibugs>	 (PS2) VulpesVulpes825: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593)
[12:58:09] <wikibugs>	 (PS3) VulpesVulpes825: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593)
[13:32:42] <icinga-wm_>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:38:30] <icinga-wm_>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:43:42] <icinga-wm_>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:45:30] <icinga-wm_>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:07:04] <icinga-wm_>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:12:54] <icinga-wm_>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:23:18] <wikibugs>	 Operations, Graphoid, serviceops, Core Platform Team (Icebox), MW-1.35-notes (1.35.0-wmf.34; 2020-05-26): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[14:42:47] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Password reset for wikimediaindia-l mailing list - https://phabricator.wikimedia.org/T255910 (Pcoombe)
[14:47:34] <wikibugs>	 (PS2) Aklapper: Phab: Allow disabling Herald rules [puppet] - https://gerrit.wikimedia.org/r/602951
[14:49:01] <wikibugs>	 Operations, SRE-Access-Requests: Allow AKlapper to disable other people's personal Herald rules in Phabricator - https://phabricator.wikimedia.org/T255914 (Aklapper)
[14:49:14] <wikibugs>	 (PS3) Aklapper: Phab: Allow disabling Herald rules [puppet] - https://gerrit.wikimedia.org/r/602951 (https://phabricator.wikimedia.org/T255914)
[14:54:24] <wikibugs>	 (PS2) QChris: gerrit: Stop setting up a database for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606536
[14:54:26] <wikibugs>	 (PS3) QChris: gerrit: Drop its configuration for draft changes for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606533
[14:54:28] <wikibugs>	 (PS1) QChris: gerrit: Update its-phabricator templates for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606781
[14:54:30] <wikibugs>	 (PS1) QChris: gerrit: Update email templates for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606782
[14:54:32] <wikibugs>	 (PS1) QChris: gerrit: Drop empty unused Git config file [puppet] - https://gerrit.wikimedia.org/r/606783
[14:54:34] <wikibugs>	 (PS1) QChris: gerrit: Enable git protocol v2 on new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606784
[14:56:13] <wikibugs>	 (CR) jerkins-bot: [V: -1] gerrit: Enable git protocol v2 on new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606784 (owner: QChris)
[15:03:42] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Password reset for wikimediaindia-l mailing list - https://phabricator.wikimedia.org/T255910 (Aklapper) Open→Stalled a:anirudhsbh→None Hi @anirudhsbh, thanks for taking the time to report this and welcome to Wikimedia Phabricator! I assume you don't pl...
[15:19:03] <wikibugs>	 (PS4) DannyS712: Remove TranslationNotifications user settings [mediawiki-config] - https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780)
[16:48:12] <icinga-wm_>	 PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[16:52:22] <wikibugs>	 Operations, MediaWiki-Stakeholders-Group, TechCom-RFC, Traffic, and 3 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588 (Akuckartz)
[16:54:04] <icinga-wm_>	 RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 350.59 ms
[16:54:55] <wikibugs>	 (PS1) Krinkle: arclamp: Add support for sample_pop option to arclamp-log [puppet] - https://gerrit.wikimedia.org/r/606789 (https://phabricator.wikimedia.org/T255920)
[16:55:16] <wikibugs>	 (CR) jerkins-bot: [V: -1] arclamp: Add support for sample_pop option to arclamp-log [puppet] - https://gerrit.wikimedia.org/r/606789 (https://phabricator.wikimedia.org/T255920) (owner: Krinkle)
[16:55:47] <wikibugs>	 (PS2) Krinkle: arclamp: Add support for sample_pop option to arclamp-log [puppet] - https://gerrit.wikimedia.org/r/606789 (https://phabricator.wikimedia.org/T255920)
[16:57:02] <wikibugs>	 (PS3) Krinkle: arclamp: Add support for sample_pop option to arclamp-log [puppet] - https://gerrit.wikimedia.org/r/606789 (https://phabricator.wikimedia.org/T255920)
[18:34:16] <wikibugs>	 (PS2) QChris: gerrit: Enable git protocol v2 on new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606784
[19:44:44] <wikibugs>	 (PS1) QChris: gerrit: Allow to use request tracing for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606795
[19:44:46] <wikibugs>	 (PS1) QChris: gerrit: Do not enable the ability to move changes for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606796
[21:08:13] <wikibugs>	 Operations, Security-Team, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (Ladsgroup) I made the archiver work and you can now see it: https://lists-beta.wmflabs.org/hyperkitty/list/test-high-volume@lists...
[21:30:59] <wikibugs>	 (CR) Paladox: [C: +1] gerrit: Allow to use request tracing for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606795 (owner: QChris)
[21:50:41] <wikibugs>	 (CR) DannyS712: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/606783 (owner: QChris)
[21:57:11] <wikibugs>	 (CR) DannyS712: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/606457 (owner: Reedy)
[22:10:18] <wikibugs>	 (CR) QChris: [C: -1] "Since this change has the templates for 2.16, and gerrit again (D'Oh!)" [puppet] - https://gerrit.wikimedia.org/r/473264 (owner: Paladox)
[22:20:42] <wikibugs>	 Operations, Security-Team, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (Tgr) >>! In T52864#6242222, @Ladsgroup wrote: > The only thing is that with disabling gravatar (which we can't enable due to our...
[22:50:54] <icinga-wm_>	 PROBLEM - Host db1088 is DOWN: PING CRITICAL - Packet loss = 100%
[22:54:24] <cdanis>	 uhm
[22:56:25] <logmsgbot>	 !log cdanis@cumin2001 dbctl commit (dc=all): 'db1088 seems to have crashed', diff saved to https://phabricator.wikimedia.org/P11611 and previous config saved to /var/cache/conftool/dbconfig/20200620-225624-cdanis.json
[22:56:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:51] <wikibugs>	 Operations, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (CDanis) p:Triage→High
[23:01:17] <icinga-wm_>	 ACKNOWLEDGEMENT - Host db1088 is DOWN: PING CRITICAL - Packet loss = 100% CDanis https://phabricator.wikimedia.org/T255927
[23:02:44] <icinga-wm_>	 RECOVERY - Host db1088 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[23:06:32] <icinga-wm_>	 PROBLEM - mysqld processes #page on db1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[23:07:18] <icinga-wm_>	 PROBLEM - MariaDB Slave SQL: s6 #page on db1088 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:07:32] <rzl>	 good evening 👋
[23:07:42] <icinga-wm_>	 PROBLEM - MariaDB read only s6 on db1088 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[23:08:06] <icinga-wm_>	 PROBLEM - MariaDB Slave IO: s6 #page on db1088 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:09:01] <XioNoX>	 I'm around if needed
[23:09:06] <rzl>	 cdanis: you already depooled, right?
[23:09:33] <shdubsh>	 o/
[23:11:51] <cdanis>	 yes, already depooled, opened T255927 
[23:11:51] <stashbot>	 T255927: db1088 crashed - https://phabricator.wikimedia.org/T255927
[23:11:59] <cdanis>	 hilariously this only paged because the machine came back up on its own and mysql restarted
[23:14:43] <cdanis>	 I am going to acknowledge these with the same ticket number, investigating the replica's state can wait until Monday
[23:15:12] <icinga-wm_>	 ACKNOWLEDGEMENT - HP RAID on db1088 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 CDanis https://phabricator.wikimedia.org/T255927 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[23:15:12] <icinga-wm_>	 ACKNOWLEDGEMENT - MariaDB Slave IO: s6 #page on db1088 is CRITICAL: CRITICAL slave_io_state could not connect CDanis https://phabricator.wikimedia.org/T255927 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:15:13] <icinga-wm_>	 ACKNOWLEDGEMENT - MariaDB Slave Lag: s6 #page on db1088 is CRITICAL: CRITICAL slave_sql_lag could not connect CDanis https://phabricator.wikimedia.org/T255927 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:15:13] <icinga-wm_>	 ACKNOWLEDGEMENT - MariaDB Slave SQL: s6 #page on db1088 is CRITICAL: CRITICAL slave_sql_state could not connect CDanis https://phabricator.wikimedia.org/T255927 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:15:13] <icinga-wm_>	 ACKNOWLEDGEMENT - MariaDB read only s6 on db1088 is CRITICAL: Could not connect to localhost:3306 CDanis https://phabricator.wikimedia.org/T255927 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[23:15:14] <icinga-wm_>	 ACKNOWLEDGEMENT - mysqld processes #page on db1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld CDanis https://phabricator.wikimedia.org/T255927 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[23:15:31] <rzl>	 sgtm, thanks for getting it
[23:17:13] <rzl>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1592673419159&to=1592695019159&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200
[23:17:24] <rzl>	 appserver latency at the 95th has a delightful shark fin but is recovered to within bounds
[23:18:41] <wikibugs>	 Operations, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (CDanis) Looks like it rebooted by itself (which hilariously was the first thing to make it page), but I'm leaving it depooled.
[23:32:49] <wikibugs>	 (PS1) CDanis: s/slave/replica/ in visible parts of MariaDB alerts [puppet] - https://gerrit.wikimedia.org/r/606801
[23:32:51] <icinga-wm_>	 ACKNOWLEDGEMENT - HP RAID on db1088 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T255928 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[23:32:54] <wikibugs>	 Operations, ops-eqiad: Degraded RAID on db1088 - https://phabricator.wikimedia.org/T255928 (ops-monitoring-bot)