[00:17:06] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:18:54] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:32:50] (CR) Bmansurov: "What's the command you used to get that error message?" [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[00:48:58] PROBLEM - puppet last run on dumpsdata1001 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:06:22] RECOVERY - puppet last run on dumpsdata1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:46:36] PROBLEM - Check systemd state on ms-be1041 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:08:44] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1041 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[02:11:56] RECOVERY - Check systemd state on ms-be1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:25:42] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 90.14% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[02:32:48] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:36:28] RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:39:32] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be1041 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[02:45:30] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:58:14] RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:04:52] PROBLEM - Check systemd state on an-launcher1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:12] PROBLEM - Host an-worker1091 is DOWN: PING CRITICAL - Packet loss = 100%
[03:07:14] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:09:00] RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:18:00] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:40] RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:45:00] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:02] PROBLEM - MegaRAID on pc2007 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:46:03] ACKNOWLEDGEMENT - MegaRAID on pc2007 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T255904 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[03:46:07] Operations, ops-codfw: Degraded RAID on pc2007 - https://phabricator.wikimedia.org/T255904 (ops-monitoring-bot)
[04:01:20] RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:44] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:31:56] RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:44:28] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:30] RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:15:08] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:20] RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:30] PROBLEM - puppet last run on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[05:38:38] PROBLEM - Disk space on Hadoop worker on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:38:42] PROBLEM - Check systemd state on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:38:42] PROBLEM - Hadoop NodeManager on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:39:02] PROBLEM - Check size of conntrack table on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[05:40:02] PROBLEM - Hadoop DataNode on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[05:44:24] PROBLEM - IPMI Sensor Status on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[05:45:05] PROBLEM - dhclient process on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[05:53:24] PROBLEM - Check whether ferm is active by checking the default input chain on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[05:57:48] PROBLEM - Disk space on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops
[05:57:56] PROBLEM - configured eth on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[05:59:52] PROBLEM - DPKG on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[06:02:58] PROBLEM - Check the NTP synchronisation status of timesyncd on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[06:07:14] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[06:51:52] PROBLEM - Long running screen/tmux on an-worker1093 is CRITICAL: connect to address 10.64.53.35 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200620T0700)
[07:32:35] checking an-workers..
[07:36:56] !log powercycle an-worker1091 - bug soft lock up CPU showed in mgmt console
[07:36:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:39:38] RECOVERY - Host an-worker1091 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[07:42:02] !log powercycle an-worker1093 - bug soft lock up CPU showed in mgmt console
[07:42:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:43:56] PROBLEM - Host an-worker1093 is DOWN: PING CRITICAL - Packet loss = 100%
[07:45:20] (PS1) VulpesVulpes825: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593)
[07:45:54] RECOVERY - Host an-worker1093 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[07:46:02] RECOVERY - Hadoop DataNode on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:46:26] RECOVERY - Disk space on Hadoop worker on an-worker1093 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:46:30] RECOVERY - Check systemd state on an-worker1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:46:30] RECOVERY - Hadoop NodeManager on an-worker1093 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[07:46:50] RECOVERY - Check size of conntrack table on an-worker1093 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[07:47:32] RECOVERY - IPMI Sensor Status on an-worker1093 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[07:48:12] RECOVERY - dhclient process on an-worker1093 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[07:51:52] RECOVERY - puppet last run on an-worker1093 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[07:55:06] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[07:56:34] RECOVERY - Check whether ferm is active by checking the default input chain on an-worker1093 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:59:03] (CR) Majavah: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/606710 (https://phabricator.wikimedia.org/T198673) (owner: Majavah)
[08:01:08] RECOVERY - configured eth on an-worker1093 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[08:02:38] RECOVERY - Disk space on an-worker1093 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=an-worker1093&var-datasource=eqiad+prometheus/ops
[08:03:02] RECOVERY - DPKG on an-worker1093 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:06:08] RECOVERY - Check the NTP synchronisation status of timesyncd on an-worker1093 is OK: OK: synced at Sat 2020-06-20 08:06:07 UTC. https://wikitech.wikimedia.org/wiki/NTP
[08:38:49] (Abandoned) VulpesVulpes825: Enable DiscussionTools as a beta feature on zhwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/592514 (https://phabricator.wikimedia.org/T251075) (owner: VulpesVulpes825)
[09:30:48] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[09:36:38] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:02:32] PROBLEM - very high load average likely xfs on ms-be2037 is CRITICAL: CRITICAL - load average: 135.55, 110.60, 60.85 https://wikitech.wikimedia.org/wiki/Swift
[10:06:10] RECOVERY - very high load average likely xfs on ms-be2037 is OK: OK - load average: 19.58, 73.92, 57.58 https://wikitech.wikimedia.org/wiki/Swift
[10:50:06] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:52:42] RECOVERY - Long running screen/tmux on an-worker1093 is OK: OK: No SCREEN or tmux processes detected. https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[11:15:18] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:39:30] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 55 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:45:18] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:55:09] (CR) Alexandros Kosiaris: "> What's the command you used to get that error message?" [deployment-charts] - https://gerrit.wikimedia.org/r/602527 (https://phabricator.wikimedia.org/T241230) (owner: Bmansurov)
[12:44:50] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 56 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:50:40] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:56:19] (PS2) VulpesVulpes825: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593)
[12:58:09] (PS3) VulpesVulpes825: Add zh-hans and zh-hant translation of Module and Module_talk aliases for all Zh Projects [mediawiki-config] - https://gerrit.wikimedia.org/r/606769 (https://phabricator.wikimedia.org/T165593)
[13:32:42] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 51 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:38:30] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 47 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:43:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:45:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:07:04] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 52 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:12:54] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 48 probes of 572 (alerts on 50) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:23:18] Operations, Graphoid, serviceops, Core Platform Team (Icebox), MW-1.35-notes (1.35.0-wmf.34; 2020-05-26): Undeploy graphoid - https://phabricator.wikimedia.org/T242855 (Jseddon)
[14:42:47] Operations, Wikimedia-Mailing-lists: Password reset for wikimediaindia-l mailing list - https://phabricator.wikimedia.org/T255910 (Pcoombe)
[14:47:34] (PS2) Aklapper: Phab: Allow disabling Herald rules [puppet] - https://gerrit.wikimedia.org/r/602951
[14:49:01] Operations, SRE-Access-Requests: Allow AKlapper to disable other people's personal Herald rules in Phabricator - https://phabricator.wikimedia.org/T255914 (Aklapper)
[14:49:14] (PS3) Aklapper: Phab: Allow disabling Herald rules [puppet] - https://gerrit.wikimedia.org/r/602951 (https://phabricator.wikimedia.org/T255914)
[14:54:24] (PS2) QChris: gerrit: Stop setting up a database for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606536
[14:54:26] (PS3) QChris: gerrit: Drop its configuration for draft changes for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606533
[14:54:28] (PS1) QChris: gerrit: Update its-phabricator templates for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606781
[14:54:30] (PS1) QChris: gerrit: Update email templates for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606782
[14:54:32] (PS1) QChris: gerrit: Drop empty unused Git config file [puppet] - https://gerrit.wikimedia.org/r/606783
[14:54:34] (PS1) QChris: gerrit: Enable git protocol v2 on new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606784
[14:56:13] (CR) jerkins-bot: [V: -1] gerrit: Enable git protocol v2 on new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606784 (owner: QChris)
[15:03:42] Operations, Wikimedia-Mailing-lists: Password reset for wikimediaindia-l mailing list - https://phabricator.wikimedia.org/T255910 (Aklapper) Open→Stalled a:anirudhsbh→None Hi @anirudhsbh, thanks for taking the time to report this and welcome to Wikimedia Phabricator! I assume you don't pl...
[15:19:03] (PS4) DannyS712: Remove TranslationNotifications user settings [mediawiki-config] - https://gerrit.wikimedia.org/r/603167 (https://phabricator.wikimedia.org/T144780)
[16:48:12] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[16:52:22] Operations, MediaWiki-Stakeholders-Group, TechCom-RFC, Traffic, and 3 others: RFC: API-driven web front-end - https://phabricator.wikimedia.org/T111588 (Akuckartz)
[16:54:04] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 350.59 ms
[16:54:55] (PS1) Krinkle: arclamp: Add support for sample_pop option to arclamp-log [puppet] - https://gerrit.wikimedia.org/r/606789 (https://phabricator.wikimedia.org/T255920)
[16:55:16] (CR) jerkins-bot: [V: -1] arclamp: Add support for sample_pop option to arclamp-log [puppet] - https://gerrit.wikimedia.org/r/606789 (https://phabricator.wikimedia.org/T255920) (owner: Krinkle)
[16:55:47] (PS2) Krinkle: arclamp: Add support for sample_pop option to arclamp-log [puppet] - https://gerrit.wikimedia.org/r/606789 (https://phabricator.wikimedia.org/T255920)
[16:57:02] (PS3) Krinkle: arclamp: Add support for sample_pop option to arclamp-log [puppet] - https://gerrit.wikimedia.org/r/606789 (https://phabricator.wikimedia.org/T255920)
[18:34:16] (PS2) QChris: gerrit: Enable git protocol v2 on new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606784
[19:44:44] (PS1) QChris: gerrit: Allow to use request tracing for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606795
[19:44:46] (PS1) QChris: gerrit: Do not enable the ability to move changes for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606796
[21:08:13] Operations, Security-Team, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (Ladsgroup) I made the archiver work and you can now see it: https://lists-beta.wmflabs.org/hyperkitty/list/test-high-volume@lists...
[21:30:59] (CR) Paladox: [C: +1] gerrit: Allow to use request tracing for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606795 (owner: QChris)
[21:50:41] (CR) DannyS712: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/606783 (owner: QChris)
[21:57:11] (CR) DannyS712: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/606457 (owner: Reedy)
[22:10:18] (CR) QChris: [C: -1] "Since this change has the templates for 2.16, and gerrit again (D'Oh!)" [puppet] - https://gerrit.wikimedia.org/r/473264 (owner: Paladox)
[22:20:42] Operations, Security-Team, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (Tgr) >>! In T52864#6242222, @Ladsgroup wrote: > The only thing is that with disabling gravatar (which we can't enable due to our...
[22:50:54] PROBLEM - Host db1088 is DOWN: PING CRITICAL - Packet loss = 100%
[22:54:24] uhm
[22:56:25] !log cdanis@cumin2001 dbctl commit (dc=all): 'db1088 seems to have crashed', diff saved to https://phabricator.wikimedia.org/P11611 and previous config saved to /var/cache/conftool/dbconfig/20200620-225624-cdanis.json
[22:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:00:51] Operations, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (CDanis) p:Triage→High
[23:01:17] ACKNOWLEDGEMENT - Host db1088 is DOWN: PING CRITICAL - Packet loss = 100% CDanis https://phabricator.wikimedia.org/T255927
[23:02:44] RECOVERY - Host db1088 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[23:06:32] PROBLEM - mysqld processes #page on db1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[23:07:18] PROBLEM - MariaDB Slave SQL: s6 #page on db1088 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:07:32] good evening 👋
[23:07:42] PROBLEM - MariaDB read only s6 on db1088 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[23:08:06] PROBLEM - MariaDB Slave IO: s6 #page on db1088 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:09:01] I'm around if needed
[23:09:06] cdanis: you already depooled, right?
[23:09:33] o/
[23:11:51] yes, already depooled, opened T255927
[23:11:51] T255927: db1088 crashed - https://phabricator.wikimedia.org/T255927
[23:11:59] hilariously this only paged because the machine came back up on its own and mysql restarted
[23:14:43] I am going to acknowledge these with the same ticket number, investigating the replica's state can wait until Monday
[23:15:12] ACKNOWLEDGEMENT - HP RAID on db1088 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 CDanis https://phabricator.wikimedia.org/T255927 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[23:15:12] ACKNOWLEDGEMENT - MariaDB Slave IO: s6 #page on db1088 is CRITICAL: CRITICAL slave_io_state could not connect CDanis https://phabricator.wikimedia.org/T255927 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:15:13] ACKNOWLEDGEMENT - MariaDB Slave Lag: s6 #page on db1088 is CRITICAL: CRITICAL slave_sql_lag could not connect CDanis https://phabricator.wikimedia.org/T255927 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:15:13] ACKNOWLEDGEMENT - MariaDB Slave SQL: s6 #page on db1088 is CRITICAL: CRITICAL slave_sql_state could not connect CDanis https://phabricator.wikimedia.org/T255927 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[23:15:13] ACKNOWLEDGEMENT - MariaDB read only s6 on db1088 is CRITICAL: Could not connect to localhost:3306 CDanis https://phabricator.wikimedia.org/T255927 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[23:15:14] ACKNOWLEDGEMENT - mysqld processes #page on db1088 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld CDanis https://phabricator.wikimedia.org/T255927 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[23:15:31] sgtm, thanks for getting it
[23:17:13] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&from=1592673419159&to=1592695019159&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200
[23:17:24] appserver latency at the 95th has a delightful shark fin but is recovered to within bounds
[23:18:41] Operations, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (CDanis) Looks like it rebooted by itself (which hilariously was the first thing to make it page), but I'm leaving it depooled.
[23:32:49] (PS1) CDanis: s/slave/replica/ in visible parts of MariaDB alerts [puppet] - https://gerrit.wikimedia.org/r/606801
[23:32:51] ACKNOWLEDGEMENT - HP RAID on db1088 is CRITICAL: CRITICAL: Slot 1: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T255928 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[23:32:54] Operations, ops-eqiad: Degraded RAID on db1088 - https://phabricator.wikimedia.org/T255928 (ops-monitoring-bot)