[00:02:29] Operations, serviceops: upgrade planet.wikimedia.org backends to buster - https://phabricator.wikimedia.org/T247651 (Dzahn)
[00:02:38] Operations, serviceops: upgrade planet.wikimedia.org backends to buster - https://phabricator.wikimedia.org/T247651 (Dzahn) a: Dzahn
[00:05:54] Operations, serviceops: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (Dzahn)
[00:07:10] Operations, Release-Engineering-Team, serviceops: replace doc1001.eqiad.wmnet with a buster VM - https://phabricator.wikimedia.org/T247653 (Dzahn)
[00:10:54] Operations, Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (Dzahn)
[00:10:54] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[00:10:56] Operations, Release-Engineering-Team, serviceops: replace doc1001.eqiad.wmnet with a buster VM - https://phabricator.wikimedia.org/T247653 (Dzahn)
[00:10:58] Operations, serviceops: upgrade planet.wikimedia.org backends to buster - https://phabricator.wikimedia.org/T247651 (Dzahn)
[00:11:01] Operations, serviceops: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (Dzahn)
[00:11:04] Operations, serviceops: replace bromine and vega with buster VMs - https://phabricator.wikimedia.org/T247650 (Dzahn)
[00:11:07] Operations, serviceops: upgrade people.wikimedia.org backend to buster - https://phabricator.wikimedia.org/T247649 (Dzahn)
[00:11:10] Operations, serviceops: miscweb1001/2001 - upgrade to buster or decom - https://phabricator.wikimedia.org/T247648 (Dzahn)
[00:11:13] Operations, Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (Dzahn) @Muehlenhoff ^
[00:12:19] Operations, Cloud-VPS (Debian Jessie Deprecation), cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (Bstorm) Ok, labstore1006 is now buster. Failing things back to their steady state.
[00:12:26] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[00:12:58] Operations, Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (Dzahn)
[00:13:00] Operations, Cloud-VPS (Debian Jessie Deprecation), cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (Dzahn)
[00:13:31] (PS1) Bstorm: Revert "dumps-distribution: move all NFS traffic to labstore1007" [puppet] - https://gerrit.wikimedia.org/r/579670 (https://phabricator.wikimedia.org/T224583)
[00:14:11] (PS1) Bstorm: Revert "dumps-distribution: fail over to labstore1007 for dumps.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/579671 (https://phabricator.wikimedia.org/T224583)
[00:14:43] (PS1) Bstorm: Revert "dumps-distribution: switch which host does acme" [puppet] - https://gerrit.wikimedia.org/r/579672 (https://phabricator.wikimedia.org/T224583)
[00:16:36] (CR) Bstorm: [C: +2] Revert "dumps-distribution: move all NFS traffic to labstore1007" [puppet] - https://gerrit.wikimedia.org/r/579670 (https://phabricator.wikimedia.org/T224583) (owner: Bstorm)
[00:17:14] (CR) Bstorm: [C: +2] Revert "dumps-distribution: fail over to labstore1007 for dumps.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/579671 (https://phabricator.wikimedia.org/T224583) (owner: Bstorm)
[00:18:34] (CR) Bstorm: [C: +2] Revert "dumps-distribution: switch which host does acme" [puppet] - https://gerrit.wikimedia.org/r/579672 (https://phabricator.wikimedia.org/T224583) (owner: Bstorm)
[00:19:16] Operations, Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (Dzahn)
[00:19:31] Operations, Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (Dzahn)
[00:19:34] Operations, Thumbor, serviceops, User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (Dzahn)
[00:19:39] Operations, serviceops, Patch-For-Review, Performance-Team (Radar), and 2 others: Upgrade memcached for Debian Stretch/Buster - https://phabricator.wikimedia.org/T213089 (Dzahn)
[00:21:09] Operations, Cloud-VPS (Debian Jessie Deprecation), Patch-For-Review, cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (Bstorm)
[00:23:17] (PS1) Bstorm: Revert "dumps-distribution: set the TTL to 5M for dumps.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/579674 (https://phabricator.wikimedia.org/T224583)
[00:24:05] (PS2) Bstorm: Revert "dumps-distribution: set the TTL to 5M for dumps.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/579674 (https://phabricator.wikimedia.org/T224583)
[00:24:07] (PS1) Dzahn: racktables: remove port 80 firewall hole [puppet] - https://gerrit.wikimedia.org/r/579675
[00:24:50] (CR) Bstorm: [C: +2] Revert "dumps-distribution: set the TTL to 5M for dumps.wikimedia.org" [dns] - https://gerrit.wikimedia.org/r/579674 (https://phabricator.wikimedia.org/T224583) (owner: Bstorm)
[00:25:25] (PS1) Dzahn: iegreview: remove port 80 firewall hole [puppet] - https://gerrit.wikimedia.org/r/579677
[00:26:23] Operations, Epic: Migrate all of production metal to Buster or later - https://phabricator.wikimedia.org/T247045 (Bstorm)
[00:26:25] Operations, Cloud-VPS (Debian Jessie Deprecation), Patch-For-Review, cloud-services-team (Kanban): Migrate labstore1006/1007 to Stretch/Buster - https://phabricator.wikimedia.org/T224583 (Bstorm) Open→Resolved Everything is failed back to how it normally is except it's now buster.
[00:26:27] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Bstorm)
[00:28:21] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Dzahn)
[00:28:49] (PS1) Dzahn: planet: remove port 80 firewall hole [puppet] - https://gerrit.wikimedia.org/r/579678
[00:34:40] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:20] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:59:59] PROBLEM - puppet last run on stat1008 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[01:01:23] Operations: apt config on planet1001 would install systemd from backports - https://phabricator.wikimedia.org/T247592 (Dzahn) on planet1001, `/etc/apt/sources.list` looks like this: ` 1 # deb http://mirrors.wikimedia.org/debian/ stretch main 2 3 ## Wikimedia APT repository 4 # deb http://apt1001.w...
[01:05:11] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:06:19] !log planet1001 - copying /etc/apt/sources.list from planet2001 to planet1001 - apt-get update - apt-get install openssh-server T247592
[01:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:06:25] T247592: apt config on planet1001 would install systemd from backports - https://phabricator.wikimedia.org/T247592
[01:08:00] Operations: apt config on planet1001 would install systemd from backports - https://phabricator.wikimedia.org/T247592 (Dzahn) Open→Resolved Copied the sources.list from 2001 to 1001 and installed newer version of openssh-server and client. libpam-systemd now 232-25+deb9u12 installed and candidate....
[01:13:21] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:15:45] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[01:17:57] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[01:51:33] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:55:43] PROBLEM - Check systemd state on boron is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:55] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:14:41] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:14:47] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:40:29] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:48:11] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:05] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[03:01:23] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[03:11:27] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:19:05] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:07:51] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:23:09] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:46:29] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:54:11] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:56:31] PROBLEM - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[04:58:53] PROBLEM - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[05:01:55] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[05:02:17] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[05:03:53] PROBLEM - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[05:06:59] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:14:45] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:17:19] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:21:49] RECOVERY - Old JVM GC check - cloudelastic1002-cloudelastic-chi-eqiad on cloudelastic1002 is OK: (C)100 gt (W)80 gt 72.2 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[05:22:09] RECOVERY - Old JVM GC check - cloudelastic1001-cloudelastic-chi-eqiad on cloudelastic1001 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[05:24:57] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:27:01] RECOVERY - Old JVM GC check - cloudelastic1003-cloudelastic-chi-eqiad on cloudelastic1003 is OK: (C)100 gt (W)80 gt 72.2 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=cloudelastic-chi-eqiad
[05:37:57] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:40:29] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[06:15:41] (PS1) KartikMistry: apertium-cy-en: Fix FTBFS with apertium >= 3.6 [debs/contenttranslation/apertium-cy-en] - https://gerrit.wikimedia.org/r/579683 (https://phabricator.wikimedia.org/T247585)
[06:21:31] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:29:11] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:31:41] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[06:32:07] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[06:52:09] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:58:18] Operations, Fundraising-Backlog, MediaWiki-Vagrant: Package XDebug 2.7.2 for apt.wikimedia.org - https://phabricator.wikimedia.org/T220406 (Tgr)
[06:58:30] Operations, Fundraising-Backlog, MediaWiki-Vagrant: Package XDebug 2.7.2 for apt.wikimedia.org - https://phabricator.wikimedia.org/T220406 (Tgr) Reframed the task since it seems the PECL route was discarded (probably for the better). OTOH we are about to support PHP 7.4 in MediaWiki, and testing that...
[06:59:57] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:07:35] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:15:17] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:56:03] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:03:43] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:29:35] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 57 probes of 541 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:32:03] !log run systemctl restart systemd-timedated.service on stat1008
[08:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:45] !log run kafka preferred-replica-election on kafka-jumbo1001 - T247561
[08:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:50] T247561: kafka-jumbo1006 network issues - https://phabricator.wikimedia.org/T247561
[08:34:09] Operations, Analytics, DC-Ops, netops: kafka-jumbo1006 and stat1005 network issues - https://phabricator.wikimedia.org/T247561 (elukey)
[08:36:03] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 47 probes of 541 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[08:45:23] PROBLEM - Host stat1008 is DOWN: PING CRITICAL - Packet loss = 100%
[08:47:21] RECOVERY - Host stat1008 is UP: PING OK - Packet loss = 0%, RTA = 0.34 ms
[08:49:18] this was me --^
[08:49:31] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:50:59] RECOVERY - puppet last run on stat1008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:51:46] Operations, Analytics, DC-Ops, netops: kafka-jumbo1006 and stat1005 network issues - https://phabricator.wikimedia.org/T247561 (elukey) @Papaul @Jclark-ctr can we try to move stat1005 to a different switch port again?
[09:30:59] RECOVERY - mediawiki originals uploads -hourly- for codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[09:40:47] RECOVERY - mediawiki originals uploads -hourly- for eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[10:05:01] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 89 probes of 541 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:11:35] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 37 probes of 541 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:31:25] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[10:38:49] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 100 probes of 541 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[10:49:05] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3060 is OK: HTTP OK: HTTP/1.0 200 OK - 22102 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[10:58:23] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 43 probes of 541 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[12:17:03] Operations, Mail, Wikimedia-Mailing-lists: Email to WikimediaUA mailing list from base-w[at]yandex.ru does not get delivered - https://phabricator.wikimedia.org/T247603 (Aklapper)
[14:31:15] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[14:31:47] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[16:02:31] PROBLEM - traffic_server tls process restarted on cp4025 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=ulsfo+prometheus/ops&var-instance=cp4025&var-layer=tls
[16:05:25] PROBLEM - Varnish frontend child restarted on cp4025 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4025&var-datasource=ulsfo+prometheus/ops
[16:39:05] Hi, Zuul started to show these errors in tests: "Permission denied"
[16:39:18] Example: https://integration.wikimedia.org/ci/job/mwgate-node10-docker/93511/
[17:21:06] (PS1) KartikMistry: apertium-en-es: Fix FTBFS with apertium 3.6.1 [debs/contenttranslation/apertium-en-es] - https://gerrit.wikimedia.org/r/579757 (https://phabricator.wikimedia.org/T247585)
[17:28:21] (PS1) CRusnov: puppetdb uservice: Add individual host queries, expand for interface automation [puppet] - https://gerrit.wikimedia.org/r/579758
[17:30:03] (CR) CRusnov: "This has been tested on production PuppetDB using the flask test server for expected queries." [puppet] - https://gerrit.wikimedia.org/r/579758 (owner: CRusnov)
[17:31:14] (PS2) CRusnov: puppetdb uservice: Add individual host queries, expand for interface automation [puppet] - https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153)
[17:31:16] (CR) jerkins-bot: [V: -1] puppetdb uservice: Add individual host queries, expand for interface automation [puppet] - https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov)
[17:33:08] (PS3) CRusnov: puppetdb uservice: Add individual host queries, expand for interface automation [puppet] - https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153)
[17:52:43] > Pausing due to database lag: Waiting for all: 31.35 seconds lagged.
[17:52:46] @wikidata
[17:52:55] 31.35 sec... is it normal? (probably not?)
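The "Pausing due to database lag" message quoted above is MediaWiki's maxlag back-off: well-behaved write clients send a `maxlag` parameter and pause while reported replica lag exceeds it. A minimal sketch of that client-side logic, with hypothetical function names and a maxlag=5 default (not taken from any particular bot framework):

```python
def should_pause(reported_lag_s: float, maxlag_s: float = 5.0) -> bool:
    """Pause writes while reported replica lag exceeds the maxlag threshold."""
    return reported_lag_s > maxlag_s


def backoff_delay(reported_lag_s: float, cap_s: float = 60.0) -> float:
    """Wait roughly as long as the reported lag, bounded between 1 s and a cap."""
    return min(max(reported_lag_s, 1.0), cap_s)
```

Under these assumptions, the 31.35 s lag seen in the log is far above the usual maxlag=5 default, so a bot pauses and retries rather than piling more writes onto lagged replicas.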
[18:11:22] It's not that much
[18:11:28] If there's a bot going on a rampage (like there often is)
[18:12:49] indeed :P
[18:13:21] it went better for moments and now sucking again
[18:15:41] Check RC and probably cry
[18:16:46] I don't see someone flooding RC tho
[19:03:23] (PS1) DannyS712: trwiki: Grant interface admins editprotected & editsemiprotected [mediawiki-config] - https://gerrit.wikimedia.org/r/579772 (https://phabricator.wikimedia.org/T247672)
[19:05:55] (PS2) DannyS712: trwiki: Grant interface admins editprotected & editsemiprotected [mediawiki-config] - https://gerrit.wikimedia.org/r/579772 (https://phabricator.wikimedia.org/T247672)
[19:06:55] (PS3) DannyS712: trwiki: Grant interface admins editprotected & editsemiprotected [mediawiki-config] - https://gerrit.wikimedia.org/r/579772 (https://phabricator.wikimedia.org/T247672)
[19:10:52] hmm?
[19:11:39] ah
[19:13:03] (CR) Urbanecm: [C: -1] trwiki: Grant interface admins editprotected & editsemiprotected (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/579772 (https://phabricator.wikimedia.org/T247672) (owner: DannyS712)
[20:05:51] RECOVERY - Memory correctable errors -EDAC- on mw1239 is OK: (C)4 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops
[20:12:21] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[20:23:42] (CR) DannyS712: trwiki: Grant interface admins editprotected & editsemiprotected (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/579772 (https://phabricator.wikimedia.org/T247672) (owner: DannyS712)
[20:24:27] (PS4) DannyS712: trwiki: Grant interface editors editprotected & editsemiprotected [mediawiki-config] - https://gerrit.wikimedia.org/r/579772 (https://phabricator.wikimedia.org/T247672)
[20:26:06] Operations, Release-Engineering-Team, serviceops: replace doc1001.eqiad.wmnet with a buster VM - https://phabricator.wikimedia.org/T247653 (Dzahn) p: Triage→Medium
[20:26:11] Operations, serviceops: replace backends for releases.wikimedia.org with buster VMs - https://phabricator.wikimedia.org/T247652 (Dzahn) p: Triage→Medium
[20:26:20] Operations: decom racktables? - https://phabricator.wikimedia.org/T247646 (Dzahn) p: Triage→Medium
[20:45:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[20:51:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.219e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[20:59:05] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[21:05:38] (PS1) Alex Monk: Add public replica view for oauth_registered_consumer [puppet] - https://gerrit.wikimedia.org/r/579800
[21:13:02] (CR) Reedy: "Guessing https://github.com/wikimedia/puppet/blob/e0da161490eed41d9b446b825add7bb2747b9ce7/manifests/realm.pp#L209 probably needs removing" [puppet] - https://gerrit.wikimedia.org/r/579800 (owner: Alex Monk)
[21:15:03] (PS2) Alex Monk: Add public replica view for oauth_registered_consumer [puppet] - https://gerrit.wikimedia.org/r/579800
[22:08:49] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:11:19] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:51:41] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.396e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[22:59:19] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
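Many of the icinga alerts throughout this log print their thresholds as "(C)<crit> gt (W)<warn> gt <value>": the check goes CRITICAL when the metric exceeds the critical threshold, WARNING when it only exceeds the warning threshold, and OK otherwise (so "102 gt 100" on the JVM GC check is CRITICAL, while "(C)100 gt (W)80 gt 72.2" in the recovery message is OK). A minimal sketch of that comparison, illustrative only and not the actual check plugin:

```python
def check_state(value: float, warn: float, crit: float) -> str:
    """Map a metric value to an icinga-style state using the
    "(C)crit gt (W)warn gt value" convention seen in the log above."""
    if value > crit:
        return "CRITICAL"
    if value > warn:
        return "WARNING"
    return "OK"

# "102 gt 100" from the cloudelastic1001 GC alert (warn=80 taken from the
# recovery line) evaluates to CRITICAL; the recovery value 72.2 is OK.
print(check_state(102, 80, 100))   # prints "CRITICAL"
print(check_state(72.2, 80, 100))  # prints "OK"
```

The same shape covers the Kafka MirrorMaker lag alert: a value of 1.219e+05 against "(C)1e+05 gt (W)1e+04" is CRITICAL.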