[00:47:11] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:49:35] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:51:03] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 91961672 and 57 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[00:55:47] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:04:45] <icinga-wm>	 PROBLEM - Check systemd state on mwlog2001 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:21:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[02:23:57] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:09:49] <wikibugs>	 SRE, ops-eqiad: mc1027.eqiad.wmnet is down, not powering back up - https://phabricator.wikimedia.org/T276415 (jijiki)
[03:19:25] <icinga-wm>	 PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[03:19:37] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:21:59] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:47:35] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:49:55] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:37:41] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 239, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:38:41] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:46:11] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 145, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:47:37] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 241, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:17] <icinga-wm>	 PROBLEM - SSH on phab2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[07:22:48] <elukey>	 !log powercycle elastic2033 - no ssh, no tty available via mgmt
[07:22:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:01] <elukey>	 No bootable devices found, lovely
[07:29:25] <icinga-wm>	 RECOVERY - Host elastic2033 is UP: PING OK - Packet loss = 0%, RTA = 33.06 ms
[07:32:21] <elukey>	 mmm this is a lie
[07:32:52] <wikibugs>	 ops-codfw, Discovery: elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621 (elukey)
[07:34:02] <wikibugs>	 (CR) Majavah: "> Patch Set 3:" [puppet] - https://gerrit.wikimedia.org/r/683695 (https://phabricator.wikimedia.org/T233134) (owner: Herron)
[07:36:56] <wikibugs>	 (CR) Majavah: [C: +1] "cherry picked on beta to unblock some cfssl work, works fine there" [puppet] - https://gerrit.wikimedia.org/r/683837 (owner: Jbond)
[07:38:27] <icinga-wm>	 PROBLEM - Host elastic2033 is DOWN: PING CRITICAL - Packet loss = 100%
[07:39:54] <wikibugs>	 ops-codfw, Discovery: elastic2033 without bootable devices available - https://phabricator.wikimedia.org/T281621 (elukey) I left the host in the System Config panel so it will not keep trying to PXE, so it needs a `power reset` to start investigations :)
[07:40:14] <icinga-wm>	 ACKNOWLEDGEMENT - Host elastic2033 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T281621
[07:43:06] <elukey>	 (checked health of ES from elastic2030, all green)
[07:52:01] <wikibugs>	 (PS1) Majavah: Add eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684008
[07:52:03] <wikibugs>	 (PS1) Majavah: Use eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684009
[07:52:49] <wikibugs>	 (PS2) Majavah: Add eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684008 (https://phabricator.wikimedia.org/T280908)
[07:52:51] <wikibugs>	 (PS2) Majavah: Use eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684009 (https://phabricator.wikimedia.org/T280908)
[07:53:06] <Majavah>	 anyone around who could deploy the logo changes for T280908?
[07:53:07] <stashbot>	 T280908: Change Spanish Wikipedia logo due to its 20th anniversary as of May 1 for one month - https://phabricator.wikimedia.org/T280908
[07:55:06] <wikibugs>	 SRE, Beta-Cluster-Infrastructure: Possible to run writes (e.g. UPDATE) on Beta Cluster replica - https://phabricator.wikimedia.org/T110115 (jcrespo) > I'd propose to just set read_only = 1 by default on all beta database servers  This is exactly how we handle production servers, but there is a difference...
[08:02:15] <wikibugs>	 SRE, Patch-For-Review: try planet/people on bullseye - https://phabricator.wikimedia.org/T280989 (jcrespo) >>! In T280989#7049886, @Dzahn wrote: > I assumed the issue is the Icinga check can't distinguish between failed and "has not tried yet" because both mean there is no proof of a successful run.  It ac...
[08:53:15] <wikibugs>	 SRE, Beta-Cluster-Infrastructure: Possible to run writes (e.g. UPDATE) on Beta Cluster replica - https://phabricator.wikimedia.org/T110115 (Majavah) >>! In T110115#7050521, @jcrespo wrote: > Is there a way to have effective monitoring on beta  No, not really. I don't imagine that being a problem unless i...
[09:12:50] <wikibugs>	 (PS1) Majavah: beta: Use https for swift [puppet] - https://gerrit.wikimedia.org/r/684010 (https://phabricator.wikimedia.org/T277990)
[09:15:14] <wikibugs>	 (PS1) Majavah: beta: Use https for swift [mediawiki-config] - https://gerrit.wikimedia.org/r/684012 (https://phabricator.wikimedia.org/T277990)
[09:30:19] * Majavah re-asks if anyone with mw deployment access is around, for T280908?
[09:36:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:39:05] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[10:28:17] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1006 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.075 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[10:59:25] <icinga-wm>	 RECOVERY - SSH on phab2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[11:17:31] <wikibugs>	 SRE, Wikimedia-Mailing-lists, ContentSecurityPolicy: icon of https://lists.wikimedia.org/mailman/listinfo/commons-poty is blocked by ContentSecurityPolicy - https://phabricator.wikimedia.org/T281626 (Ladsgroup) Yeah, I saw it but I'm not sure if it's worth fixing. We are migrating all of mailing list...
[12:24:35] <wikibugs>	 (PS1) QChris: Add .gitreview [software/mailman-templates] - https://gerrit.wikimedia.org/r/684022
[12:24:37] <wikibugs>	 (CR) QChris: [V: +2 C: +2] Add .gitreview [software/mailman-templates] - https://gerrit.wikimedia.org/r/684022 (owner: QChris)
[15:10:19] <icinga-wm>	 PROBLEM - SSH on contint2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:34:21] <wikibugs>	 SRE, Beta-Cluster-Infrastructure: Possible to run writes (e.g. UPDATE) on Beta Cluster replica - https://phabricator.wikimedia.org/T110115 (Krinkle)
[16:34:25] <wikibugs>	 SRE, Beta-Cluster-Infrastructure, Technical-Debt, Tracking-Neverending: Minimize infrastructure differences between Beta Cluster and production - https://phabricator.wikimedia.org/T87220 (Krinkle)
[16:46:19] <legoktm>	 Majavah: yeah I'll do it
[16:47:31] <wikibugs>	 (CR) Legoktm: [C: +2] Add eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684008 (https://phabricator.wikimedia.org/T280908) (owner: Majavah)
[16:48:10] <wikibugs>	 (Merged) jenkins-bot: Add eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684008 (https://phabricator.wikimedia.org/T280908) (owner: Majavah)
[16:48:56] <Urbanecm>	 thanks legoktm 
[16:50:05] <logmsgbot>	 !log legoktm@deploy1002 Synchronized static/images/project-logos/: Add eswiki 20th anniversary logos (duration: 00m 57s)
[16:50:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:50:31] <wikibugs>	 (CR) Legoktm: [C: +2] Use eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684009 (https://phabricator.wikimedia.org/T280908) (owner: Majavah)
[16:51:10] <wikibugs>	 (Merged) jenkins-bot: Use eswiki 20th anniversary logos [mediawiki-config] - https://gerrit.wikimedia.org/r/684009 (https://phabricator.wikimedia.org/T280908) (owner: Majavah)
[16:53:29] <legoktm>	 staged on mwdebug1002
[16:54:39] <legoktm>	 lgtm
[16:56:23] <logmsgbot>	 !log legoktm@deploy1002 Synchronized wmf-config/logos.php: Use eswiki 20th anniversary logos (T280908) (duration: 00m 56s)
[16:56:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:56:31] <stashbot>	 T280908: Change Spanish Wikipedia logo due to its 20th anniversary as of May 1 for one month - https://phabricator.wikimedia.org/T280908
[16:58:17] <logmsgbot>	 !log legoktm@deploy1002 Synchronized logos/config.yaml: Add eswiki 20th anniversary logos (duration: 00m 57s)
[16:58:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:31] <Majavah>	 legoktm:  thanks! also the script you made for changing the logos made things much easier
[17:07:39] <wikibugs>	 SRE, Wikimedia-Mailing-lists, ContentSecurityPolicy: icon of https://lists.wikimedia.org/mailman/listinfo/commons-poty is blocked by ContentSecurityPolicy - https://phabricator.wikimedia.org/T281626 (Legoktm) Open→Declined >>! In T281626#7050617, @Ladsgroup wrote: > Yeah, I saw it but I'm not...
[17:19:33] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=netbox_device_statistics site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:24:17] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:38:33] <wikibugs>	 (PS1) Majavah: P::toolforge::mailrelay: support multiple domains [puppet] - https://gerrit.wikimedia.org/r/684032 (https://phabricator.wikimedia.org/T278109)
[18:43:55] <wikibugs>	 (CR) Majavah: "Untested." [puppet] - https://gerrit.wikimedia.org/r/684032 (https://phabricator.wikimedia.org/T278109) (owner: Majavah)
[18:45:39] <wikibugs>	 (PS2) Majavah: P::toolforge::mailrelay: support multiple domains [puppet] - https://gerrit.wikimedia.org/r/684032 (https://phabricator.wikimedia.org/T278109)
[19:12:48] <Urbanecm>	 !log Invalidate password for MaraBot@SUL (T281586)
[19:12:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:15:47] <icinga-wm>	 RECOVERY - SSH on contint2001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:18:37] <wikibugs>	 (PS1) Majavah: P::mariadb::beta: Set read only by default [puppet] - https://gerrit.wikimedia.org/r/684034 (https://phabricator.wikimedia.org/T110115)
[22:07:13] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1111 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:07:53] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1111 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:14:51] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1111 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[22:15:35] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1111 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state