[00:47:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:49:35] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:51:03] PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 91961672 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:47] RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:04:45] PROBLEM - Check systemd state on mwlog2001 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:21:27] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:23:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:09:49] SRE, ops-eqiad: mc1027.eqiad.wmnet is down, not powering back up - https://phabricator.wikimedia.org/T276415 (jijiki)
[03:19:25] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:19:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:21:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:47:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:49:55] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:37:41] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:38:41] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:46:11] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:47:37] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:17] PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:22:48] !log powercycle elastic2033 - no ssh, no tty available via mgmt
[07:22:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:01] No bootable devices found, lovely
[07:29:25] RECOVERY - Host elastic2033 is UP: PING OK - Packet loss = 0%, RTA = 33.06 ms
[07:32:21] mmm this is a lie
[07:32:52] ops-codfw, Discovery: elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621 (elukey)
[07:34:02] (CR) Majavah: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/683695 (https://phabricator.wikimedia.org/T233134) (owner: Herron)
[07:36:56] (CR) Majavah: [C: +1] "cherry picked on beta to unblock some cfssl work, works fine there" [puppet] - https://gerrit.wikimedia.org/r/683837 (owner: Jbond)
[07:38:27] PROBLEM - Host elastic2033 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:54] ops-codfw, Discovery: elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621 (elukey) I left the host in the System Config panel so it will not keep trying to PXE, so it needs a `power reset` to start investigations :)
[07:40:14] ACKNOWLEDGEMENT - Host elastic2033 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T281621
[07:43:06] (checked health of ES from elastic2030, all green)
[07:52:01] (PS1) Majavah: Add eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684008
[07:52:03] (PS1) Majavah: Use eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684009
[07:52:49] (PS2) Majavah: Add eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684008 (https://phabricator.wikimedia.org/T280908)
[07:52:51] (PS2) Majavah: Use eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684009 (https://phabricator.wikimedia.org/T280908)
[07:53:06] anyone around who could deploy the logo changes for T280908?
[07:53:07] T280908: Change Spanish Wikipedia logo due to its 20th anniversary as of May 1 for one month - https://phabricator.wikimedia.org/T280908
[07:55:06] SRE, Beta-Cluster-Infrastructure: Possible to run writes (e.g. UPDATE) on Beta Cluster replica - https://phabricator.wikimedia.org/T110115 (jcrespo) > I'd propose to just set read_only = 1 by default on all beta database servers This is exactly how we handle production servers, but there is a difference...
[08:02:15] SRE, Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (jcrespo) >>! In T280989#7049886, @Dzahn wrote: > I assumed the issue is the Icinga check can't distinguish between failed and "has not tried yet" because both mean there is no proof of a succeful run. It ac...
[08:53:15] SRE, Beta-Cluster-Infrastructure: Possible to run writes (e.g. UPDATE) on Beta Cluster replica - https://phabricator.wikimedia.org/T110115 (Majavah) >>! In T110115#7050521, @jcrespo wrote: > Is there a way to have effective monitoring on beta No, not really. I don't imagine that being a problem unless i...
[09:12:50] (PS1) Majavah: beta: Use https for swift [puppet] - https://gerrit.wikimedia.org/r/684010 (https://phabricator.wikimedia.org/T277990)
[09:15:14] (PS1) Majavah: beta: Use https for swift [mediawiki-config] - https://gerrit.wikimedia.org/r/684012 (https://phabricator.wikimedia.org/T277990)
[09:30:19] * Majavah re-asks if anyone with mw deployment access is around, for T280908?
[09:36:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:39:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:28:17] RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:59:25] RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:17:31] SRE, Wikimedia-Mailing-lists, ContentSecurityPolicy: icon of https://lists.wikimedia.org/mailman/listinfo/commons-poty is blocked by ContentSecurityPolicy - https://phabricator.wikimedia.org/T281626 (Ladsgroup) Yeah, I saw it but I'm not sure if it's worth fixing. We are migrating all of mailing list...
[12:24:35] (PS1) QChris: Add .gitreview [software/mailman-templates] - https://gerrit.wikimedia.org/r/684022
[12:24:37] (CR) QChris: [V: +2 C: +2] Add .gitreview [software/mailman-templates] - https://gerrit.wikimedia.org/r/684022 (owner: QChris)
[15:10:19] PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:34:21] SRE, Beta-Cluster-Infrastructure: Possible to run writes (e.g. UPDATE) on Beta Cluster replica - https://phabricator.wikimedia.org/T110115 (Krinkle)
[16:34:25] SRE, Beta-Cluster-Infrastructure, Technical-Debt, Tracking-Neverending: Minimize infrastructure differences between Beta Cluster and production - https://phabricator.wikimedia.org/T87220 (Krinkle)
[16:46:19] Majavah: yeah I'll do it
[16:47:31] (CR) Legoktm: [C: +2] Add eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684008 (https://phabricator.wikimedia.org/T280908) (owner: Majavah)
[16:48:10] (Merged) jenkins-bot: Add eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684008 (https://phabricator.wikimedia.org/T280908) (owner: Majavah)
[16:48:56] thanks legoktm
[16:50:05] !log legoktm@deploy1002 Synchronized static/images/project-logos/: Add eswiki 20th anniversary logos (duration: 00m 57s)
[16:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:31] (CR) Legoktm: [C: +2] Use eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684009 (https://phabricator.wikimedia.org/T280908) (owner: Majavah)
[16:51:10] (Merged) jenkins-bot: Use eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684009 (https://phabricator.wikimedia.org/T280908) (owner: Majavah)
[16:53:29] staged on mwdebug1002
[16:54:39] lgtm
[16:56:23] !log legoktm@deploy1002 Synchronized wmf-config/logos.php: Use eswiki 20th anniversary logos (T280908) (duration: 00m 56s)
[16:56:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:31] T280908: Change Spanish Wikipedia logo due to its 20th anniversary as of May 1 for one month - https://phabricator.wikimedia.org/T280908
[16:58:17] !log legoktm@deploy1002 Synchronized logos/config.yaml: Add eswiki 20th anniversary logos (duration: 00m 57s)
[16:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:31] legoktm: thanks! also the script you made for changing the logos made things much easier
[17:07:39] SRE, Wikimedia-Mailing-lists, ContentSecurityPolicy: icon of https://lists.wikimedia.org/mailman/listinfo/commons-poty is blocked by ContentSecurityPolicy - https://phabricator.wikimedia.org/T281626 (Legoktm) Open→Declined >>! In T281626#7050617, @Ladsgroup wrote: > Yeah, I saw it but I'm not...
[17:19:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:24:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:38:33] (PS1) Majavah: P::toolforge::mailrelay: support multiple domains [puppet] - https://gerrit.wikimedia.org/r/684032 (https://phabricator.wikimedia.org/T278109)
[18:43:55] (CR) Majavah: "Untested." [puppet] - https://gerrit.wikimedia.org/r/684032 (https://phabricator.wikimedia.org/T278109) (owner: Majavah)
[18:45:39] (PS2) Majavah: P::toolforge::mailrelay: support multiple domains [puppet] - https://gerrit.wikimedia.org/r/684032 (https://phabricator.wikimedia.org/T278109)
[19:12:48] !log Invalidate password for MaraBot@SUL (T281586)
[19:12:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:47] RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:18:37] (PS1) Majavah: P::mariadb::beta: Set read only by default [puppet] - https://gerrit.wikimedia.org/r/684034 (https://phabricator.wikimedia.org/T110115)
[22:07:13] PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:07:53] PROBLEM - Check systemd state on an-worker1111 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:14:51] RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:15:35] RECOVERY - Check systemd state on an-worker1111 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state