[00:00:11] ACKNOWLEDGEMENT - TFTP service on bast5001 is CRITICAL: NRPE: Command check_atftpd not defined daniel_zahn replaced by install5001 https://wikitech.wikimedia.org/wiki/Monitoring/atftpd
[00:02:09] PROBLEM - ensure kvm processes are running on cloudvirt1012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:04:13] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:19] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:57] (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/630703 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:09:02] (PS2) Jdlrobson: Move search in header for anons [mediawiki-config] - https://gerrit.wikimedia.org/r/630207 (https://phabricator.wikimedia.org/T263032)
[00:09:56] !log TFTP/install server for eqsin switched from bast5001 to install5001 - T252526
[00:10:01] I'll overrun the B&C window a bit, I need to deploy a followup patch
[00:10:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:10:02] T252526: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526
[00:10:55] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:12:09] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[00:15:03] (PS1) Gergő Tisza: Add (and increment) CacheDecorator cache version [extensions/GrowthExperiments] (wmf/1.36.0-wmf.10) - https://gerrit.wikimedia.org/r/630421 (https://phabricator.wikimedia.org/T264029)
[00:15:27] (CR) Gergő Tisza: [C: +2] Add (and increment) CacheDecorator cache version [extensions/GrowthExperiments] (wmf/1.36.0-wmf.10) - https://gerrit.wikimedia.org/r/630421 (https://phabricator.wikimedia.org/T264029) (owner: Gergő Tisza)
[00:17:25] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:23:59] (Merged) jenkins-bot: Add (and increment) CacheDecorator cache version [extensions/GrowthExperiments] (wmf/1.36.0-wmf.10) - https://gerrit.wikimedia.org/r/630421 (https://phabricator.wikimedia.org/T264029) (owner: Gergő Tisza)
[00:24:39] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:26:17] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:28:55] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.639 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[00:29:09] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:30:09] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:31:34] !log tgr@deploy1001 Synchronized php-1.36.0-wmf.10/extensions/GrowthExperiments/includes/NewcomerTasks/TaskSuggester/CacheDecorator.php: Backport: [[gerrit:630421|Add (and increment) CacheDecorator cache version ([PHABRICATOR-TASK])]] (duration: 00m 58s)
[00:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:31] !log B&C done
[00:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:35:27] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[00:36:49] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[00:37:07] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[00:38:26] (PS2) CDanis: base/phaste.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:39:46] (CR) CDanis: [C: +1] base/phaste.py: Port to Python 3 [puppet] - https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[00:44:03] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:29] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:52:33] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:57:49] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:17] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:04:33] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:08:05] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:14:55] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:17:57] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37
[01:18:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=G
[01:20:13] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:21:49] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.168 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:23:13] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.437 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[01:27:38] Operations, ops-codfw: elastic2037 DIMM errors logged in racadm getsel - https://phabricator.wikimedia.org/T263714 (Papaul) return information {F32367182}
[01:28:55] Operations, netops: Investigate Juniper storm control - https://phabricator.wikimedia.org/T245192 (Papaul) a:Papaul→None
[01:29:35] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[01:29:41] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:31:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[01:31:27] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:31:47] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 55.93 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37
[01:35:53] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:21] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[01:39:17] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:43:48] (CR) Cwhite: [C: +1] am: tweak alert labels/annotations [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/630554 (owner: Filippo Giunchedi)
[01:44:23] (CR) Cwhite: [C: +1] alertmanager: group alerts and add severity: page [puppet] - https://gerrit.wikimedia.org/r/630556 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[01:44:31] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:45:53] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[01:46:47] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 4.906 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[01:46:49] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.099 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:54:03] (PS1) Andrew Bogott: Further attempt to reimage cloudvirts while preserving /srv [puppet] - https://gerrit.wikimedia.org/r/630708 (https://phabricator.wikimedia.org/T263677)
[01:54:25] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[01:54:43] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:55:16] (CR) Andrew Bogott: [C: +2] Further attempt to reimage cloudvirts while preserving /srv [puppet] - https://gerrit.wikimedia.org/r/630708 (https://phabricator.wikimedia.org/T263677) (owner: Andrew Bogott)
[01:56:21] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[01:56:29] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[01:57:49] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:01:15] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:01:39] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:05:03] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:05] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.11 [core] (wmf/1.36.0-wmf.11) - https://gerrit.wikimedia.org/r/630710
[02:10:17] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:07] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:20:02] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[02:20:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:20:35] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:22:23] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:22:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[02:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:23:55] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:26:33] (PS1) Andrew Bogott: Update nic names for cloudvirt1012, 1013, 1014 [puppet] - https://gerrit.wikimedia.org/r/630712 (https://phabricator.wikimedia.org/T259399)
[02:26:59] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 6.273 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:27:55] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.114 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[02:29:33] PROBLEM - dump of analytics_meta in eqiad on alert1001 is CRITICAL: Last dump for analytics_meta at eqiad (db1108.eqiad.wmnet:3352) taken on 2020-09-29 02:07:47 is 2 GB, but previous one was 1 GB, a change of 64.9% https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:29:58] (CR) Andrew Bogott: [C: +2] Update nic names for cloudvirt1012, 1013, 1014 [puppet] - https://gerrit.wikimedia.org/r/630712 (https://phabricator.wikimedia.org/T259399) (owner: Andrew Bogott)
[02:31:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:32:37] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:33:05] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:34:45] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[02:34:47] (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.11 [core] (wmf/1.36.0-wmf.11) - https://gerrit.wikimedia.org/r/630710 (https://phabricator.wikimedia.org/T263177) (owner: TrainBranchBot)
[02:34:51] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[02:35:33] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:36:19] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:38:05] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[02:38:49] PROBLEM - snapshot of s5 in eqiad on alert1001 is CRITICAL: snapshot for s5 at eqiad taken more than 3 days ago: Most recent backup 2020-09-26 02:36:03 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting
[02:40:55] (PS1) Andrew Bogott: Return cloudvirt1012, 1013, 1014 to the standard labvirt partman [puppet] - https://gerrit.wikimedia.org/r/630713
[02:41:33] (CR) Andrew Bogott: [C: +2] Return cloudvirt1012, 1013, 1014 to the standard labvirt partman [puppet] - https://gerrit.wikimedia.org/r/630713 (owner: Andrew Bogott)
[02:44:25] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:48:59] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:56:59] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:02:41] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:37] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:09:59] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[03:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:11:05] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[03:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:11:45] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:12:01] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[03:12:02] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime
[03:12:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:13:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[03:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:14:48] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.405 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:15:52] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[03:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:17:32] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:18:00] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[03:19:44] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:06] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:27:22] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:36:04] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:43:28] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:48:20] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:51:06] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:57:08] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[03:58:12] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[04:00:22] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[04:02:36] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[04:08:10] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[04:08:38] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:11:28] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:14:24] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:20:24] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:28] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:30:20] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:33:40] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:35:18] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[04:40:20] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:10] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:34] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:16] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 5.597 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:53:28] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:57:52] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[04:58:09] (PS1) Marostegui: backup2002: Change es1 replica [puppet] - https://gerrit.wikimedia.org/r/630715 (https://phabricator.wikimedia.org/T263740)
[04:58:33] (PS1) Marostegui: mariadb: Decommission es2013 [puppet] - https://gerrit.wikimedia.org/r/630716 (https://phabricator.wikimedia.org/T263740)
[04:59:13] (CR) jerkins-bot: [V: -1] backup2002: Change es1 replica [puppet] - https://gerrit.wikimedia.org/r/630715 (https://phabricator.wikimedia.org/T263740) (owner: Marostegui)
[04:59:52] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:59:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.decommission
[05:00:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:06] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[05:04:56] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:05:53] (03PS2) 10Marostegui: backup2002: Change es1 replica [puppet] - 10https://gerrit.wikimedia.org/r/630715 (https://phabricator.wikimedia.org/T263740) [05:06:06] (03CR) 10Marostegui: [C: 03+2] mariadb: Decommission es2013 [puppet] - 10https://gerrit.wikimedia.org/r/630716 (https://phabricator.wikimedia.org/T263740) (owner: 10Marostegui) [05:06:33] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [05:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:08:02] (03PS1) 10Marostegui: dns: Remove es2013 production entries [dns] - 10https://gerrit.wikimedia.org/r/630717 (https://phabricator.wikimedia.org/T263740) [05:08:30] (03CR) 10Marostegui: [C: 03+2] dns: Remove es2013 production entries [dns] - 10https://gerrit.wikimedia.org/r/630717 (https://phabricator.wikimedia.org/T263740) (owner: 10Marostegui) [05:09:38] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware, 10Patch-For-Review: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (10Marostegui) a:05Marostegui→03Papaul Ready for #dc-ops [05:09:56] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:17] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (10Marostegui) [05:10:23] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (10Marostegui) [05:10:51] !log Remove es2013 from tendril and zarcillo T263740 [05:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:10:57] T263740: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 [05:12:34] (03PS1) 10Marostegui: es2026: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/630718 (https://phabricator.wikimedia.org/T263837) [05:12:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2026 T263837', diff saved to https://phabricator.wikimedia.org/P12822 and previous config saved to /var/cache/conftool/dbconfig/20200929-051236-marostegui.json [05:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:12:42] T263837: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 [05:13:05] (03CR) 10Marostegui: [C: 03+2] es2026: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/630718 (https://phabricator.wikimedia.org/T263837) (owner: 10Marostegui) [05:13:36] !log Stop mysql and reboot es2026 - T263837 [05:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:14:16] (03CR) 10Samwilson: [C: 03+1] Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630653 (https://phabricator.wikimedia.org/T260461) (owner: 10Dmaza) [05:14:40] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=G [05:14:52] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:15:44] 10Operations, 10Data-Services, 10SRE-Access-Requests, 10cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (10ArielGlenn) 05Open→03Resolved a:03ArielGlenn Closing this then, thanks! [05:15:46] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 7.517 second response time https://wikitech.wikimedia.org/wiki/Application_servers [05:15:52] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.196 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:21:27] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [05:29:43] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [05:29:52] ACKNOWLEDGEMENT - MegaRAID on es2026 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T264062 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [05:29:55] 10Operations, 10ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T264062 (10ops-monitoring-bot) [05:30:00] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) a:05Marostegui→03Papaul @Papaul I think we need to 
ask for another disk or advice from Dell. These are the controller logs after the reboot: ` Time: Tue Sep 29 05:17:... [05:30:29] 10Operations, 10ops-codfw: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T264062 (10Marostegui) [05:31:02] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: Degraded RAID on es2026 - https://phabricator.wikimedia.org/T263837 (10Marostegui) [05:33:27] PROBLEM - Check systemd state on stat1007 is CRITICAL: CRITICAL - bdegraded: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:42:27] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [05:53:00] 10Operations, 10DBA, 10Performance-Team, 10Platform Engineering, 10User-Kormat: Remove groups from db configs - https://phabricator.wikimedia.org/T263127 (10Marostegui) Not sure what `category` is as it doesn't appear on https://noc.wikimedia.org/dbconfig/eqiad.json From there we have these active group... 
[06:02:53] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es2034 as es3 master in codfw T261717', diff saved to https://phabricator.wikimedia.org/P12823 and previous config saved to /var/cache/conftool/dbconfig/20200929-060253-marostegui.json [06:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:03:00] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [06:05:32] (03PS1) 10Marostegui: es2019: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/630719 (https://phabricator.wikimedia.org/T264063) [06:05:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2019 T264063', diff saved to https://phabricator.wikimedia.org/P12824 and previous config saved to /var/cache/conftool/dbconfig/20200929-060538-marostegui.json [06:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:05:44] T264063: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 [06:06:01] (03CR) 10Marostegui: [C: 03+2] es2019: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/630719 (https://phabricator.wikimedia.org/T264063) (owner: 10Marostegui) [06:06:58] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:10:24] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:13:38] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.111 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:16:46] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.489 second response time https://wikitech.wikimedia.org/wiki/Application_servers [06:17:48] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:18:47] 10Operations, 10netops: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (10ayounsi) After manually setting `check_fail = 2` overnight the service stopped being randomly depooled. Bird restarts didn't trigger damping either. [06:21:21] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:25:15] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:28:35] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:30] (03PS1) 10Ayounsi: Anycast: add check_fail [puppet] - 10https://gerrit.wikimedia.org/r/630757 (https://phabricator.wikimedia.org/T262372) [06:31:57] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:35:39] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:38:01] (03CR) 10Ayounsi: [C: 04-1] "@John: PCC currently fails with:" [puppet] - 10https://gerrit.wikimedia.org/r/630757 (https://phabricator.wikimedia.org/T262372) (owner: 10Ayounsi) [06:39:35] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:40:59] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:43:47] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:45:17] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 650 bytes in 8.560 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:45:19] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:46:47] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=G [06:48:13] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:51:29] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:51:41] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [06:52:31] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [06:53:11] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:55:52] (03CR) 10Muehlenhoff: base/firewall/check_conntrack.py: Port to Python3 (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/630690 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [06:56:25] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [06:56:27] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:59:49] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:01:34] (03CR) 10Muehlenhoff: "From what I can tell, the script (and the rest of tlsproxy::ocsp) is obsolete. Brandon can comment best whether there's a reason to retain" [puppet] - 10https://gerrit.wikimedia.org/r/630693 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [07:07:48] (03CR) 10Jcrespo: "Stevie is driving the puppet refactoring efforts, Dzahn." [puppet] - 10https://gerrit.wikimedia.org/r/630317 (owner: 10Dzahn) [07:09:33] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:09:57] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 102 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:14:11] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:22] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:21:01] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 11 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [07:21:57] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [07:22:01] PROBLEM - Maps - OSM synchronization lag - eqiad on alert1001 is CRITICAL: 2.149e+07 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1 [07:22:21] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:24:19] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [07:25:19] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:27:13] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:30:17] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:31:52] (03CR) 10Jcrespo: "Cas: The alert seems to work as expected, but I got the following message: "CRITICAL - b'degraded': unexpected". Check if you would be exp" [puppet] - 10https://gerrit.wikimedia.org/r/624733 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [07:35:15] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:38:25] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:40:55] (03PS1) 10Jcrespo: base/check_systemd_state.py: Decode bytes before strip [puppet] - 10https://gerrit.wikimedia.org/r/630759 (https://phabricator.wikimedia.org/T247364) [07:41:49] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.083 second response time https://wikitech.wikimedia.org/wiki/Application_servers [07:42:24] (03CR) 10Jcrespo: "I believe this could be a consequence of moving to python3?" [puppet] - 10https://gerrit.wikimedia.org/r/630759 (https://phabricator.wikimedia.org/T247364) (owner: 10Jcrespo) [07:42:39] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 649 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:43:57] (03PS1) 10Marostegui: instances.yaml: Remove es2019 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/630760 (https://phabricator.wikimedia.org/T264063) [07:44:03] (03PS2) 10Jcrespo: base/check_systemd_state.py: Decode bytes before strip [puppet] - 10https://gerrit.wikimedia.org/r/630759 (https://phabricator.wikimedia.org/T247364) [07:44:57] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Remove es2019 from dbctl [puppet] - 10https://gerrit.wikimedia.org/r/630760 (https://phabricator.wikimedia.org/T264063) (owner: 10Marostegui) [07:46:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Remove es2019 from dbctl T264063', diff saved to https://phabricator.wikimedia.org/P12825 and previous config saved to /var/cache/conftool/dbconfig/20200929-074602-marostegui.json [07:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:10] T264063: decommission es2019.codfw.wmnet - https://phabricator.wikimedia.org/T264063 [07:46:15] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [07:46:18] !log Stop MySQL on es2019 before decommissioning T264063 [07:46:20] (03CR) 10Ema: [C: 03+2] cache: upgrade Varnish to v6 in esams [puppet] - 10https://gerrit.wikimedia.org/r/630566 (https://phabricator.wikimedia.org/T263557) (owner: 10Ema) [07:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:47:01] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [07:47:19] (03CR) 10Jcrespo: [C: 03+2] backup2002: Change es1 replica [puppet] - 10https://gerrit.wikimedia.org/r/630715 (https://phabricator.wikimedia.org/T263740) (owner: 10Marostegui) [07:47:49] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [07:48:01] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=G [07:48:43] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [07:51:19] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [07:51:20] (03CR) 10Jcrespo: "I have deployed the change and updated the grants on the new host. Thanks for the heads up!" [puppet] - 10https://gerrit.wikimedia.org/r/630715 (https://phabricator.wikimedia.org/T263740) (owner: 10Marostegui) [07:53:19] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:54:51] (03CR) 10Jcrespo: [C: 04-1] "I don't believe this is the right fix, but the bug is real- needs more research." [puppet] - 10https://gerrit.wikimedia.org/r/630759 (https://phabricator.wikimedia.org/T247364) (owner: 10Jcrespo) [07:55:02] !log badblocks check on wdqs1009 - T263125 [07:55:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:07] T263125: Check for errors on wdqs1009 disks - https://phabricator.wikimedia.org/T263125 [08:01:35] !log cp3050: varnish upgrade to 6.0.6-1wm1 T263557 [08:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:41] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 [08:01:41] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 110.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [08:01:57] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, 
unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:01:59] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 2.502e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:02:13] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:03:39] RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 6 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:05:13] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:06:57] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:07:13] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:09:01] (03PS1) 10Jcrespo: mariadb: Reenable notifications on db1150 after setup [puppet] - 10https://gerrit.wikimedia.org/r/630761 (https://phabricator.wikimedia.org/T257551) [08:10:28] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reenable notifications on db1150 after setup [puppet] - 10https://gerrit.wikimedia.org/r/630761 (https://phabricator.wikimedia.org/T257551) (owner: 10Jcrespo) [08:11:23] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=G [08:13:21] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:43] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:14:37] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 636 bytes in 5.207 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:15:40] ok time to stop the experiments on mwdebug [08:16:25] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:19:32] (03PS2) 10JMeybohm: Revert "Temporarily remove conf1006 from client SRV records" [dns] - 10https://gerrit.wikimedia.org/r/630407 [08:19:51] PROBLEM - PHP7 rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:20:16] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - https://phabricator.wikimedia.org/T259909 (10akosiaris) @Cmjohnson, Wednesday it's fine. In fact, I think we can do all of these in a single maint window (say a 2-3hours). Since gracefully powering off a host (via... 
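Note on the "b'degraded': unexpected" output reported at 07:31:52 and the "Decode bytes before strip" patch proposed at 07:40:55: the following is a minimal sketch of how that symptom can arise under Python 3, not the actual contents of check_systemd_state.py or of the proposed patch. It assumes the check reads the state with subprocess.check_output; the helper name read_state and the stand-in command FAKE_CMD are hypothetical and exist only so the sketch runs on any machine. Note that the patch author later marked the change -1 as possibly not the right fix, so this illustrates the symptom rather than a confirmed resolution.

```python
import subprocess
import sys

# Hypothetical stand-in for `systemctl is-system-running`, so the sketch
# runs on hosts without systemd. It just prints the word "degraded".
FAKE_CMD = [sys.executable, "-c", "print('degraded')"]


def read_state(cmd):
    """Return the command's stdout as text, without surrounding whitespace."""
    raw = subprocess.check_output(cmd)  # bytes under Python 3
    return raw.decode("utf-8").strip()


# Without decoding, the bytes repr leaks into the formatted message.
raw = subprocess.check_output(FAKE_CMD).strip()
undecoded_message = "CRITICAL - {}: unexpected".format(raw)

# With decoding, the message contains the plain state string.
decoded_message = "CRITICAL - {}: unexpected".format(read_state(FAKE_CMD))

print(undecoded_message)
print(decoded_message)
```

Under Python 3 the undecoded value is a bytes object, so comparing it with a str such as "degraded" is always false, which would send a check down its "unexpected" branch. Calling .decode() explicitly is one option that behaves the same on older Python 3 releases; the text=True argument to subprocess functions is only available from Python 3.7.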
[08:20:39] (03CR) 10JMeybohm: [C: 03+2] Revert "pybal: Move from conf1006 to conf1005 as config_host in esams" [puppet] - 10https://gerrit.wikimedia.org/r/630408 (owner: 10JMeybohm) [08:20:53] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [08:21:01] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:21:17] (03CR) 10Volans: "As this is installed everywhere please make sure it works with python3.[4,5,7]" [puppet] - 10https://gerrit.wikimedia.org/r/630697 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [08:21:23] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:47] !log switching esams pybal back to conf1006 - T196487 [08:21:51] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 635 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Application_servers [08:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:53] T196487: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 [08:21:57] RECOVERY - mcrouter process on mwdebug1001 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter [08:21:59] RECOVERY - PHP7 rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 302 Found - 648 bytes in 0.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [08:24:01] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [08:24:12] (03CR) 10JMeybohm: [C: 03+2] Revert "Temporarily remove conf1006 from client SRV records" [dns] - 10https://gerrit.wikimedia.org/r/630407 (owner: 10JMeybohm) [08:33:08] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10Patch-For-Review: Move mathoid to use TLS only - https://phabricator.wikimedia.org/T255875 (10JMeybohm) [08:34:29] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10JMeybohm) [08:45:15] (03PS1) 10Vgutierrez: ATS: Turn DHE-RSA-AES128-SHA support off [puppet] - 10https://gerrit.wikimedia.org/r/630768 (https://phabricator.wikimedia.org/T258405) [08:55:11] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 55.93 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [08:57:54] 10Operations, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10ema) [08:58:05] 10Operations, 10Analytics, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10ema) [08:59:56] (03CR) 10Hashar: [C: 03+2] "+2ing to save Mukunda a bunch of time later today :]" [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/630710 (https://phabricator.wikimedia.org/T263177) (owner: 10TrainBranchBot) [09:02:21] (03PS9) 10Muehlenhoff: reboot-groups [cookbooks] - 
10https://gerrit.wikimedia.org/r/625597 [09:02:53] (03PS1) 10JMeybohm: citoid: Add zotero TLS port to egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/630769 (https://phabricator.wikimedia.org/T255869) [09:03:17] (03CR) 10jerkins-bot: [V: 04-1] reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [09:06:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] otrs: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621368 (owner: 10Dzahn) [09:06:53] (03PS2) 10Alexandros Kosiaris: otrs: remove jessie support [puppet] - 10https://gerrit.wikimedia.org/r/621368 (owner: 10Dzahn) [09:08:09] !log update rails on puppetmasters [09:08:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:50] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: group alerts and add severity: page [puppet] - 10https://gerrit.wikimedia.org/r/630556 (https://phabricator.wikimedia.org/T258948) (owner: 10Filippo Giunchedi) [09:09:33] (03PS2) 10Filippo Giunchedi: alertmanager: group alerts and add severity: page [puppet] - 10https://gerrit.wikimedia.org/r/630556 (https://phabricator.wikimedia.org/T258948) [09:10:29] (03PS2) 10Filippo Giunchedi: am: tweak alert labels/annotations [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/630554 [09:11:40] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: tweak alert labels/annotations [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/630554 (owner: 10Filippo Giunchedi) [09:12:38] (03PS1) 10Marostegui: dbproxy1019: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/630770 [09:12:40] (03CR) 10Jbond: oozie: hiera->lookup, add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn) [09:13:57] (03PS10) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [09:14:34] (03PS1) 10Alexandros Kosiaris: citoid: Allow zotero access to HTTPS port [deployment-charts] - 
10https://gerrit.wikimedia.org/r/630771 (https://phabricator.wikimedia.org/T255869) [09:15:25] are you trolling me, akosiaris? :-P https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/630769 [09:15:40] (03CR) 10Marostegui: [C: 03+2] dbproxy1019: Depool labsdb1010 [puppet] - 10https://gerrit.wikimedia.org/r/630770 (owner: 10Marostegui) [09:16:14] jayme: kinda :P [09:16:27] lemme abandon mine and comment on that [09:16:54] but you have the default-policy as well ...which I don't get [09:17:03] !log Depool labsdb1010 from web role [09:17:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:18] why do we need that at all? [09:17:59] cause the change you did is currently "informational". As in we need to upgrade calico for that to be honored. The other one is the one we do need, but that's going to die after the upgrade [09:18:25] the addition of those into values.yaml was prep work for the upgrade [09:18:57] so we are in a bit of a pickle of having to remember to update both places while doing the upgrade (which should finally go into next Q OKRs) [09:19:03] (03CR) 10Muehlenhoff: reboot-groups (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [09:20:03] (03Abandoned) 10Alexandros Kosiaris: citoid: Allow zotero access to HTTPS port [deployment-charts] - 10https://gerrit.wikimedia.org/r/630771 (https://phabricator.wikimedia.org/T255869) (owner: 10Alexandros Kosiaris) [09:20:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] "helmfile.d/admin/common/calico/default-kubernetes-policy.yaml needs an update too" [deployment-charts] - 10https://gerrit.wikimedia.org/r/630769 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [09:20:45] 10Operations, 10SRE-Access-Requests: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T263992 (10ArielGlenn) @phuedx Please poke me via google when you are here today, so that I can verify that it's really you. 
[09:21:01] (03PS1) 10Marostegui: mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/630773 (https://phabricator.wikimedia.org/T263227) [09:21:15] akosiaris: ah, I see! Then I'm at least fine in clicking "submit" at the pending phab task browser window :) [09:21:36] (03Merged) 10jenkins-bot: Branch commit for wmf/1.36.0-wmf.11 [core] (wmf/1.36.0-wmf.11) - 10https://gerrit.wikimedia.org/r/630710 (https://phabricator.wikimedia.org/T263177) (owner: 10TrainBranchBot) [09:21:43] (03CR) 10Jbond: "See inline" (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/628460 (owner: 10Dzahn) [09:22:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] tools: puppetdb reduce postgres memory usage [puppet] - 10https://gerrit.wikimedia.org/r/630574 (owner: 10Jbond) [09:22:41] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/630773 (https://phabricator.wikimedia.org/T263227) (owner: 10Marostegui) [09:24:16] (03PS1) 10Marostegui: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/630774 (https://phabricator.wikimedia.org/T263227) [09:24:38] (03CR) 10Jbond: cassandra: add data types, remove validation code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630312 (owner: 10Dzahn) [09:24:43] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/630774 (https://phabricator.wikimedia.org/T263227) (owner: 10Marostegui) [09:24:48] (03PS2) 10Marostegui: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/630774 (https://phabricator.wikimedia.org/T263227) [09:29:55] 10Operations, 10LDAP-Access-Requests: Access to Superset - https://phabricator.wikimedia.org/T263868 (10ArielGlenn) @JRabah I have added you to the wmf LDAP group. This should be all you need for access to Superset. You should now (or in a few minutes) be able to log in with your Wikitech username and password... 
[09:31:36] (03CR) 10Jbond: docker: add data types (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/630661 (owner: 10Dzahn) [09:33:07] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 101.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [09:36:20] (03CR) 10Effie Mouzeli: [C: 03+2] mwdebug1001: remove opcache tuning [puppet] - 10https://gerrit.wikimedia.org/r/630558 (owner: 10Effie Mouzeli) [09:37:41] (03PS3) 10Kormat: mariadb: Promote db1104 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/629707 (https://phabricator.wikimedia.org/T239238) [09:38:01] (03PS2) 10Kormat: wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/629716 (https://phabricator.wikimedia.org/T239238) [09:38:33] 10Operations, 10LDAP-Access-Requests: Add Bereket teshome to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T262921 (10ArielGlenn) @bete You have been added to the wmde and nda LDAP groups. Please verify that you have the access you need, and I'll close this task. 
[09:38:35] (03PS2) 10JMeybohm: citoid: Add zotero TLS port to egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/630769 (https://phabricator.wikimedia.org/T255869) [09:39:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] citoid: Add zotero TLS port to egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/630769 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [09:40:03] (03CR) 10Hnowlan: [C: 03+2] restbase: set role for new nodes [puppet] - 10https://gerrit.wikimedia.org/r/630641 (https://phabricator.wikimedia.org/T261512) (owner: 10Hnowlan) [09:40:12] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1001/25489/maps1004.eqiad.wmnet/change.maps1004.eqiad.wmnet.err" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629439 (owner: 10Dzahn) [09:40:15] (03CR) 10Kormat: [C: 03+1] wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/630774 (https://phabricator.wikimedia.org/T263227) (owner: 10Marostegui) [09:41:01] (03CR) 10JMeybohm: [C: 03+2] citoid: Add zotero TLS port to egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/630769 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [09:41:14] (03PS2) 10Hnowlan: restbase: set role for new nodes [puppet] - 10https://gerrit.wikimedia.org/r/630641 (https://phabricator.wikimedia.org/T261512) [09:41:25] (03CR) 10Jbond: [C: 03+1] "lgtm ill rebase mine when this is merged" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [09:41:42] godog: ^^ this one is ready for review now, ill refactor mine when this is merged [09:41:45] 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Elitre) Per the page mentioned above, pinging @Legoktm , @wctaiwan and @DAlangi_WMF in case they have any... 
[09:43:05] (03CR) 10Kormat: [C: 03+1] mariadb: Promote db1131 to s6 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630773 (https://phabricator.wikimedia.org/T263227) (owner: 10Marostegui) [09:43:07] (03Merged) 10jenkins-bot: citoid: Add zotero TLS port to egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/630769 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [09:43:10] jbond42: ack, thanks! I'll take a look [09:43:40] (03CR) 10Jbond: wmcs::postgres: hiera->lookup and add data types (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/628459 (owner: 10Dzahn) [09:44:02] (03CR) 10Marostegui: [C: 04-2] mariadb: Promote db1131 to s6 master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630773 (https://phabricator.wikimedia.org/T263227) (owner: 10Marostegui) [09:47:16] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [09:47:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:05] I've just wracked the staging cluster - stay tuned [09:50:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/630703 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [09:50:39] 10Operations, 10SRE-Access-Requests: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T263992 (10ArielGlenn) Identity of requestor and the request itself verified via Google Meet. Patch coming up. 
[09:51:00] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:06] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:51:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:36] !log kormat@cumin1001 dbctl commit (dc=all): 'Set db1104 with weight 0 T239238', diff saved to https://phabricator.wikimedia.org/P12829 and previous config saved to /var/cache/conftool/dbconfig/20200929-095135-kormat.json [09:51:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:41] T239238: Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC - https://phabricator.wikimedia.org/T239238 [09:54:43] 10Operations, 10Analytics, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10elukey) I also checked with `top` on cp5011 to better visualize the graph, and the usage is really too much from what I used to see. We are in the process of evaluating atskafka but it w... 
[09:56:16] 10Operations, 10SRE-Access-Requests: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T263992 (10ArielGlenn) p:05Triage→03Medium [09:56:26] (03PS2) 10Jbond: Anycast: add check_fail [puppet] - 10https://gerrit.wikimedia.org/r/630757 (https://phabricator.wikimedia.org/T262372) (owner: 10Ayounsi) [09:57:24] (03PS1) 10ArielGlenn: update Sam Smith's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/630777 (https://phabricator.wikimedia.org/T263992) [09:59:38] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [09:59:38] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [09:59:39] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [09:59:41] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:59:41] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [09:59:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:52] (03CR) 10ArielGlenn: [C: 03+2] update Sam Smith's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/630777 (https://phabricator.wikimedia.org/T263992) (owner: 10ArielGlenn) [10:01:03] PROBLEM - kubelet operational latencies on kubestage1001 is CRITICAL: instance=kubestage1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:01:19] me...I guess [10:01:21] (03CR) 10Jbond: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/630757 (https://phabricator.wikimedia.org/T262372) (owner: 10Ayounsi) [10:01:35] (03PS4) 10Effie Mouzeli: WIP mcrouter: install ohhost memcached on 
MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340) [10:01:39] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:01:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:37] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/630757 (https://phabricator.wikimedia.org/T262372) (owner: 10Ayounsi) [10:02:43] (03CR) 10jerkins-bot: [V: 04-1] WIP mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [10:04:19] (03PS4) 10Kormat: mariadb: Promote db1104 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/629707 (https://phabricator.wikimedia.org/T239238) [10:04:24] (03PS1) 10Arturo Borrero Gonzalez: openstack: cloudgw: drop unused orig_nic hiera key [puppet] - 10https://gerrit.wikimedia.org/r/630778 (https://phabricator.wikimedia.org/T261724) [10:04:33] RECOVERY - kubelet operational latencies on kubestage1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:05:11] (03CR) 10Kormat: [C: 03+2] mariadb: Promote db1104 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/629707 (https://phabricator.wikimedia.org/T239238) (owner: 10Kormat) [10:05:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cloudgw: drop unused orig_nic hiera key [puppet] - 10https://gerrit.wikimedia.org/r/630778 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [10:05:57] !log Starting s8 eqiad failover from db1109 to db1104 - T239238 [10:06:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:02] T239238: Switchover s8 primary database master db1109 -> db1104 - 2020-09-29 08:00 UTC - https://phabricator.wikimedia.org/T239238 [10:07:24] !log kormat@cumin1001 dbctl commit (dc=all): 'Promote db1104 on s8 eqiad master T239238', diff saved to https://phabricator.wikimedia.org/P12830 and previous config saved to /var/cache/conftool/dbconfig/20200929-100723-kormat.json [10:07:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:08:09] RECOVERY - snapshot of s5 in eqiad on alert1001 is OK: Last snapshot for s5 at eqiad (db1145.eqiad.wmnet:3315) taken on 2020-09-29 09:03:30 (665 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Alerting [10:09:00] (03PS3) 10Volans: Migrate EQSIN to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630644 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [10:10:09] (03CR) 10Kormat: [C: 03+2] wmnet: Update s8-master alias [dns] - 10https://gerrit.wikimedia.org/r/629716 (https://phabricator.wikimedia.org/T239238) (owner: 10Kormat) [10:10:19] (03PS1) 10Arturo Borrero Gonzalez: interfaces: drop aggregate (bonding) dead code [puppet] - 10https://gerrit.wikimedia.org/r/630779 [10:16:50] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for 
release 'rbac-deploy-clusterrole' . [10:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:21] (03PS4) 10Volans: Migrate EQSIN to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630644 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [10:20:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good and confirmed unused in current Puppet." [puppet] - 10https://gerrit.wikimedia.org/r/630779 (owner: 10Arturo Borrero Gonzalez) [10:21:38] (03PS5) 10Volans: Migrate EQSIN to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630644 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [10:22:52] (03CR) 10Volans: [C: 03+1] "In the current status the change looks good to me. All records are accounted for unless they have an inline comment." (035 comments) [dns] - 10https://gerrit.wikimedia.org/r/630644 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [10:22:56] (03CR) 10Jbond: [C: 03+1] Anycast: add check_fail (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630757 (https://phabricator.wikimedia.org/T262372) (owner: 10Ayounsi) [10:25:25] (03PS1) 10Alexandros Kosiaris: termbox: Harmonize service runner metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/630780 [10:27:35] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [10:28:31] (03CR) 10Jbond: "LGTM and PCC is pretty much noop, but check this is compatible with changes Stevie may be working on" [puppet] - 10https://gerrit.wikimedia.org/r/630317 (owner: 10Dzahn) [10:29:41] (03CR) 10Jbond: [C: 03+2] tools: puppetdb reduce postgres memory usage [puppet] - 
10https://gerrit.wikimedia.org/r/630574 (owner: 10Jbond) [10:29:53] (03PS2) 10Alexandros Kosiaris: termbox: Harmonize service runner metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/630780 [10:31:18] 10Operations, 10SRE-Access-Requests: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T263992 (10ArielGlenn) @phuedx Please try bast1002.wikimedia.org and verify that the new key is working, and I'll close this task. Note that it will take up to a half hour for the key to make it around to th... [10:31:52] 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Legoktm) While MassMessage is how users see the problem (e.g. no one really notices a page's cache being... [10:32:00] 10Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (10Kormat) [10:32:12] 10Operations, 10SRE-Access-Requests: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T263992 (10phuedx) >>! In T263992#6501367, @ArielGlenn wrote: > @phuedx Please try bast1002.wikimedia.org and verify that the new key is working, and I'll close this task. Note that it will take up to a half... [10:32:14] 10Operations, 10serviceops, 10Patch-For-Review: docker-reporter-releng-images failed on deneb - https://phabricator.wikimedia.org/T251918 (10ArielGlenn) 05Resolved→03Open New failure. Here's the output: ` ep 28 16:38:10 deneb docker-report-releng[23588]: INFO[docker-report] Building debmonitor report for... [10:32:19] 10Operations, 10SRE-Access-Requests: Add new SSH key for Sam Smith - https://phabricator.wikimedia.org/T263992 (10ArielGlenn) 05Open→03Resolved a:03ArielGlenn Great! Closing. 
[10:32:28] 10Operations, 10LDAP-Access-Requests: Access to Superset for Jack Rabah - https://phabricator.wikimedia.org/T263868 (10Aklapper) [10:32:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 25%: After reboot to troubleshoot a degraded RAID', diff saved to https://phabricator.wikimedia.org/P12831 and previous config saved to /var/cache/conftool/dbconfig/20200929-103253-root.json [10:32:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:09] (03CR) 10Jcrespo: "Just to be clear, I am 100% onboard with this, just mentioning it because core_test is a carbon copy of core, and it may be refactored dee" [puppet] - 10https://gerrit.wikimedia.org/r/630317 (owner: 10Dzahn) [10:35:38] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/630779 (owner: 10Arturo Borrero Gonzalez) [10:36:23] (03PS1) 10Filippo Giunchedi: am: quote URLs in annotations [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/630782 [10:39:40] !log installing libdbi-perl security updates for stretch/buster [10:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:24] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [10:40:24] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [10:40:25] !log hnowlan@cumin1001 START - Cookbook sre.hosts.downtime [10:40:27] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:40:27] !log hnowlan@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [10:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[10:42:21] !log hnowlan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:42:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:42] !log re-enable TFTP ALGs on all mr [10:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/25496/ms-fe1005.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [10:47:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 50%: After reboot to troubleshoot a degraded RAID', diff saved to https://phabricator.wikimedia.org/P12832 and previous config saved to /var/cache/conftool/dbconfig/20200929-104757-root.json [10:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:30] (03PS3) 10Jbond: role::analytics_test_cluster::coordinator: add analytics users without ssh keys [puppet] - 10https://gerrit.wikimedia.org/r/630218 (https://phabricator.wikimedia.org/T262660) (owner: 10Elukey) [10:54:28] (03CR) 10Jbond: "I rebased this to try and recreate https://phabricator.wikimedia.org/T263876 however i notice that `analytics_test_cluster::coordinator` n" [puppet] - 10https://gerrit.wikimedia.org/r/630218 (https://phabricator.wikimedia.org/T262660) (owner: 10Elukey) [10:55:30] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [10:57:16] (03PS1) 10Jbond: admin: test no ascii characters [puppet] - 10https://gerrit.wikimedia.org/r/630783 (https://phabricator.wikimedia.org/T263876) [11:00:04] Amir1, Lucas_WMDE, awight, and 
Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for European mid-day backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200929T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:20] yup, looks like there’s nothing to do :) [11:00:26] hi Lucas_WMDE :) [11:00:30] o/ [11:00:37] indeed, empty window :/ [11:01:21] nice easy deploys :-D [11:01:29] hehe [11:03:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 75%: After reboot to troubleshoot a degraded RAID', diff saved to https://phabricator.wikimedia.org/P12833 and previous config saved to /var/cache/conftool/dbconfig/20200929-110300-root.json [11:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:37] (03PS1) 10Muehlenhoff: Enabled managed sources.list for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/630785 (https://phabricator.wikimedia.org/T158562) [11:06:43] (03PS1) 10Volans: junos: catch exceptions in rollback [software/homer] - 10https://gerrit.wikimedia.org/r/630807 [11:10:40] (03PS1) 10Arturo Borrero Gonzalez: openstack: cloudgw: add hiera keys for basic network bits [puppet] - 10https://gerrit.wikimedia.org/r/630809 (https://phabricator.wikimedia.org/T261724) [11:11:31] (03PS3) 10Ayounsi: Anycast: add check_fail [puppet] - 10https://gerrit.wikimedia.org/r/630757 (https://phabricator.wikimedia.org/T262372) [11:13:10] (03PS2) 10Arturo Borrero Gonzalez: openstack: cloudgw: add hiera keys for basic network bits [puppet] - 10https://gerrit.wikimedia.org/r/630809 (https://phabricator.wikimedia.org/T261724) [11:13:15] (03CR) 10Ayounsi: [C: 03+2] "PCC looks good! 
https://puppet-compiler.wmflabs.org/compiler1003/25502/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630757 (https://phabricator.wikimedia.org/T262372) (owner: 10Ayounsi) [11:14:53] (03CR) 10Arturo Borrero Gonzalez: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25503/labtestvirt2003.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630809 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:14:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: cloudgw: add hiera keys for basic network bits [puppet] - 10https://gerrit.wikimedia.org/r/630809 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [11:18:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'es2026 (re)pooling @ 100%: After reboot to troubleshoot a degraded RAID', diff saved to https://phabricator.wikimedia.org/P12834 and previous config saved to /var/cache/conftool/dbconfig/20200929-111804-root.json [11:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:20] (03PS2) 10Jbond: python3: add tox checks for python3 [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/615793 [11:20:25] (03CR) 10Vgutierrez: [C: 03+2] ATS: Turn DHE-RSA-AES128-SHA support off [puppet] - 10https://gerrit.wikimedia.org/r/630768 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [11:23:28] 10Operations, 10netops, 10Patch-For-Review: Configure BGP route damping on Anycast sessions - https://phabricator.wikimedia.org/T262372 (10ayounsi) 05Open→03Resolved Fixed. 
[11:24:32] (03PS5) 10Effie Mouzeli: mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340) [11:25:40] (03CR) 10jerkins-bot: [V: 04-1] mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [11:28:38] !log disabling DHE-RSA-AES128-SHA support - T258405 [11:28:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:44] T258405: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 [11:31:40] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25504/" [puppet] - 10https://gerrit.wikimedia.org/r/630785 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [11:35:59] PROBLEM - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is CRITICAL: 100.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [11:36:03] PROBLEM - ats-tls HTTPS en.wikipedia.org RSA on cp2033 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [11:37:37] (03CR) 10Ayounsi: "🚀" (035 comments) [dns] - 10https://gerrit.wikimedia.org/r/630644 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [11:42:29] (03PS6) 10Effie Mouzeli: mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340) [11:43:32] (03CR) 10jerkins-bot: [V: 04-1] mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 
(https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [11:47:22] (03PS1) 10Arturo Borrero Gonzalez: openstack: cloudgw: introduce native vlan for easier reimaging [puppet] - 10https://gerrit.wikimedia.org/r/630812 (https://phabricator.wikimedia.org/T263622) [11:48:45] PROBLEM - ats-tls HTTPS wikiworkshop.org RSA on cp3064 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [11:54:44] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [11:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:39] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [12:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:06:03] (03PS7) 10Effie Mouzeli: mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340) [12:07:42] (03CR) 10Kormat: [C: 03+1] "LGTM. I plan to do the same with role::mariadb::core next Q, thanks for taking care of this one already :)" [puppet] - 10https://gerrit.wikimedia.org/r/630317 (owner: 10Dzahn) [12:08:37] (03CR) 10Jcrespo: [C: 03+1] mariadb::core_test: convert role to profile, add data types [puppet] - 10https://gerrit.wikimedia.org/r/630317 (owner: 10Dzahn) [12:12:35] (03Abandoned) 10Elukey: admin: test no ascii characters [puppet] - 10https://gerrit.wikimedia.org/r/630783 (https://phabricator.wikimedia.org/T263876) (owner: 10Jbond) [12:12:45] (03Restored) 10Elukey: admin: test no ascii characters [puppet] - 10https://gerrit.wikimedia.org/r/630783 (https://phabricator.wikimedia.org/T263876) (owner: 10Jbond) [12:12:54] (03CR) 10Elukey: "sorry wrong patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/630783 (https://phabricator.wikimedia.org/T263876) (owner: 10Jbond) [12:13:48] (03CR) 10Elukey: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/630218 (https://phabricator.wikimedia.org/T262660) (owner: 10Elukey) [12:25:51] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team (CI & Testing services), and 2 others: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (10hashar) Note we fetch from http://pkg.jenkins-ci.org/debian-stable/... [12:28:12] !log kormat@cumin1001 dbctl commit (dc=all): 'Temporarily add db2126 to dump/vslow T259831', diff saved to https://phabricator.wikimedia.org/P12835 and previous config saved to /var/cache/conftool/dbconfig/20200929-122811-kormat.json [12:28:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:17] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [12:28:36] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:28:36] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:15] !log kormat@cumin1001 dbctl commit (dc=all): 'db2108 depooling: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12836 and previous config saved to /var/cache/conftool/dbconfig/20200929-122914-kormat.json [12:29:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:09] RECOVERY - Rate of JVM GC Old generation-s runs - elastic2029-production-search-psi-codfw on elastic2029 is OK: (C)100 gt (W)80 gt 79.32 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-codfw&var-instance=elastic2029&panelId=37 [12:40:15] (03PS1) 10JMeybohm: services_proxy: switch zotero to the TLS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/630788 (https://phabricator.wikimedia.org/T255869) [12:42:45] 10Operations, 10Continuous-Integration-Infrastructure, 10Jenkins, 10Release-Engineering-Team (CI & Testing services), and 2 others: Review process to fetch Jenkins Debian package from upstream - https://phabricator.wikimedia.org/T260282 (10MoritzMuehlenhoff) I think the exceptional case of double LTS relea... [12:44:40] (03PS1) 10Hnowlan: restbase: install libjemalloc2 if on buster or later [puppet] - 10https://gerrit.wikimedia.org/r/630829 (https://phabricator.wikimedia.org/T264092) [12:46:07] (03CR) 10JMeybohm: [C: 03+2] services_proxy: switch zotero to the TLS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/630788 (https://phabricator.wikimedia.org/T255869) (owner: 10JMeybohm) [12:47:08] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM (there's already some conditional code for stretch/buster above, would also be an option to simply fold it in there, but that's fine " [puppet] - 10https://gerrit.wikimedia.org/r/630829 (https://phabricator.wikimedia.org/T264092) (owner: 10Hnowlan) [12:50:26] (03CR) 10Ayounsi: [C: 03+2] Remove need for codfw only SNMP community [puppet] - 10https://gerrit.wikimedia.org/r/627514 (owner: 10Ayounsi) [12:50:32] (03PS2) 10Ayounsi: Remove need for codfw only SNMP community [puppet] - 10https://gerrit.wikimedia.org/r/627514 [12:53:42] !log installing QT security updates [12:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:08] !log kormat@cumin1001 dbctl commit (dc=all): 'db2108 (re)pooling @ 25%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12837 and previous config saved to /var/cache/conftool/dbconfig/20200929-125508-kormat.json 
[12:55:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:16] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [12:55:34] (03CR) 10Filippo Giunchedi: [C: 03+1] Enabled managed sources.list for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/630785 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [12:56:28] (03PS1) 10Ayounsi: Remove unused mock passwords [labs/private] - 10https://gerrit.wikimedia.org/r/630830 [12:56:48] !log jayme@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'citoid' for release 'staging' . [12:56:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:19] (03CR) 10Ayounsi: "Then remove the same ones from the real secret dungeon." [labs/private] - 10https://gerrit.wikimedia.org/r/630830 (owner: 10Ayounsi) [13:04:47] 10Operations, 10LDAP-Access-Requests: Access to Superset for Jack Rabah - https://phabricator.wikimedia.org/T263868 (10JRabah) @ArielGlenn I am all set! Thank you, feel free to resolve this task.
[13:10:12] !log kormat@cumin1001 dbctl commit (dc=all): 'db2108 (re)pooling @ 50%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12838 and previous config saved to /var/cache/conftool/dbconfig/20200929-131011-kormat.json [13:10:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:17] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [13:20:19] (03Abandoned) 10Effie Mouzeli: mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [13:21:20] (03PS1) 10Effie Mouzeli: mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/630845 (https://phabricator.wikimedia.org/T244340) [13:24:00] 10Operations, 10LDAP-Access-Requests: Access to Superset for Jack Rabah - https://phabricator.wikimedia.org/T263868 (10ArielGlenn) 05Open→03Resolved a:03ArielGlenn [13:25:15] !log kormat@cumin1001 dbctl commit (dc=all): 'db2108 (re)pooling @ 75%: schema change T259831', diff saved to https://phabricator.wikimedia.org/P12839 and previous config saved to /var/cache/conftool/dbconfig/20200929-132515-kormat.json [13:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:21] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [13:28:15] !log installing lua5.3 security updates [13:28:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:56] !log jayme@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[13:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:31:30] (03PS1) 10BBlack: unified cert: esams and eqsin to use LE for now [puppet] - 10https://gerrit.wikimedia.org/r/630847 (https://phabricator.wikimedia.org/T261419) [13:32:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:32:58] 10Operations, 10Traffic, 10Patch-For-Review: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 (10Vgutierrez) [13:34:09] (03PS1) 10Andrew Bogott: Revert "Return cloudvirt1012, 1013, 1014 to the standard labvirt partman" [puppet] - 10https://gerrit.wikimedia.org/r/630848 [13:36:11] !log upload@esams: rolling varnish upgrade to 6.0.6-1wm1 T263557 [13:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:16] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 [13:39:17] (03PS1) 10Jbond: differ: update encoding [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/630849 [13:39:19] (03PS1) 10Jbond: 0.8.1: release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/630850 [13:39:44] (03CR) 10jerkins-bot: [V: 04-1] differ: update encoding [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/630849 (owner: 10Jbond) [13:39:50] (03CR) 10jerkins-bot: [V: 04-1] 0.8.1: release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/630850 (owner: 10Jbond) [13:40:18] !log kormat@cumin1001 dbctl commit (dc=all): 'db2108 (re)pooling @ 100%: schema change T259831', diff 
saved to https://phabricator.wikimedia.org/P12840 and previous config saved to /var/cache/conftool/dbconfig/20200929-134018-kormat.json [13:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:26] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [13:40:27] I'm no longer getting live notifications on phabricator. did something change? [13:40:48] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Remove unused mock passwords [labs/private] - 10https://gerrit.wikimedia.org/r/630830 (owner: 10Ayounsi) [13:40:56] (03PS1) 10Volans: Add group titles to missing ones [cookbooks] - 10https://gerrit.wikimedia.org/r/630851 [13:41:18] (03CR) 10Effie Mouzeli: [C: 03+2] change variable use_onhost_memcache to use_onhost_memcached [puppet] - 10https://gerrit.wikimedia.org/r/629441 (owner: 10Effie Mouzeli) [13:41:55] (03CR) 10jerkins-bot: [V: 04-1] Add group titles to missing ones [cookbooks] - 10https://gerrit.wikimedia.org/r/630851 (owner: 10Volans) [13:42:57] (03CR) 10Kormat: "One comment. The recipe looks fine." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630848 (owner: 10Andrew Bogott) [13:43:59] (03Abandoned) 10Effie Mouzeli: hieradata: enable onhost memcached on mwdeb1001 [puppet] - 10https://gerrit.wikimedia.org/r/629369 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [13:47:10] !log text@esams: rolling varnish upgrade to 6.0.6-1wm1 T263557 [13:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:16] T263557: Upgrade production cache nodes to Varnish 6 - https://phabricator.wikimedia.org/T263557 [13:49:10] (03PS8) 10Gehel: Extracting obvious reporting code to a Reporter class. 
[software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) [13:49:26] !log kormat@cumin1001 dbctl commit (dc=all): 'Remove db2126 from dump/vslow T259831', diff saved to https://phabricator.wikimedia.org/P12841 and previous config saved to /var/cache/conftool/dbconfig/20200929-134926-kormat.json [13:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:32] T259831: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 [13:49:44] (03PS2) 10Andrew Bogott: Revert "Return cloudvirt1012, 1013, 1014 to the standard labvirt partman" [puppet] - 10https://gerrit.wikimedia.org/r/630848 [13:50:25] (03CR) 10Kormat: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/630848 (owner: 10Andrew Bogott) [13:50:27] (03CR) 10Andrew Bogott: [C: 03+2] Revert "Return cloudvirt1012, 1013, 1014 to the standard labvirt partman" [puppet] - 10https://gerrit.wikimedia.org/r/630848 (owner: 10Andrew Bogott) [13:51:20] (03CR) 10jerkins-bot: [V: 04-1] Extracting obvious reporting code to a Reporter class. [software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [13:53:26] (03PS9) 10Gehel: Extracting obvious reporting code to a Reporter class. [software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) [13:53:45] (03PS2) 10Effie Mouzeli: mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/630845 (https://phabricator.wikimedia.org/T244340) [13:54:28] !log jayme@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'citoid' for release 'production' . 
[13:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:08] (03PS1) 10Effie Mouzeli: hieradata: enable onhost memcached on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/630856 (https://phabricator.wikimedia.org/T263958) [13:57:48] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [13:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:54] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [13:57:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:08] (03CR) 10Hnowlan: [C: 04-1] Expose /page/descrtion API (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498) (owner: 10Ppchelko) [14:00:35] (03PS2) 10Volans: Add group titles to missing ones [cookbooks] - 10https://gerrit.wikimedia.org/r/630851 [14:00:37] (03PS1) 10Volans: Fix newly reported pylint issues [cookbooks] - 10https://gerrit.wikimedia.org/r/630858 [14:01:50] (03CR) 10jerkins-bot: [V: 04-1] Add group titles to missing ones [cookbooks] - 10https://gerrit.wikimedia.org/r/630851 (owner: 10Volans) [14:03:41] (03PS2) 10Volans: Fix newly reported pylint issues [cookbooks] - 10https://gerrit.wikimedia.org/r/630858 [14:03:43] (03PS3) 10Volans: Add group titles to missing ones [cookbooks] - 10https://gerrit.wikimedia.org/r/630851 [14:04:11] (03PS3) 10Effie Mouzeli: mcrouter: install ohhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/630845 (https://phabricator.wikimedia.org/T244340) [14:04:46] (03PS2) 10Jbond: differ: update encoding [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/630849 [14:04:50] (03PS6) 10Volans: Migrate EQSIN to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630644 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [14:04:55] (03PS2) 10Jbond: 0.8.1: release [software/puppet-compiler] - 
10https://gerrit.wikimedia.org/r/630850 [14:05:21] (03PS1) 10CDanis: WIP: NEL all the things [puppet] - 10https://gerrit.wikimedia.org/r/630860 [14:05:30] (03CR) 10Jbond: [C: 03+2] differ: update encoding [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/630849 (owner: 10Jbond) [14:05:44] (03CR) 10Jbond: [C: 03+2] 0.8.1: release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/630850 (owner: 10Jbond) [14:06:08] (03Merged) 10jenkins-bot: differ: update encoding [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/630849 (owner: 10Jbond) [14:06:12] !log installing facter updates from Buster 10.6 point release [14:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:32] (03Merged) 10jenkins-bot: 0.8.1: release [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/630850 (owner: 10Jbond) [14:07:44] 10Operations: Integrate Buster 10.6 point update - https://phabricator.wikimedia.org/T263974 (10MoritzMuehlenhoff) [14:07:51] (03PS2) 10Effie Mouzeli: hieradata: enable onhost memcached on mwdebug1001 [puppet] - 10https://gerrit.wikimedia.org/r/630856 (https://phabricator.wikimedia.org/T263958) [14:07:57] (03PS1) 10Elukey: Add profile::hadoop::worker::gpu to Hadoop workers' role [puppet] - 10https://gerrit.wikimedia.org/r/630861 (https://phabricator.wikimedia.org/T255138) [14:09:19] (03PS3) 10Ppchelko: Expose /page/descrtion API [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498) [14:09:21] (03PS2) 10Elukey: Add profile::hadoop::worker::gpu to Hadoop workers' role [puppet] - 10https://gerrit.wikimedia.org/r/630861 (https://phabricator.wikimedia.org/T255138) [14:09:23] (03PS4) 10Ppchelko: Expose /page/descrtion API [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498) [14:10:54] (03CR) 10Ppchelko: Expose /page/descrtion API (032 comments) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498) (owner: 10Ppchelko) [14:11:44] (03CR) 10Jbond: "PCC: now working" [puppet] - 10https://gerrit.wikimedia.org/r/630783 (https://phabricator.wikimedia.org/T263876) (owner: 10Jbond) [14:12:04] (03Abandoned) 10Jbond: admin: test no ascii characters [puppet] - 10https://gerrit.wikimedia.org/r/630783 (https://phabricator.wikimedia.org/T263876) (owner: 10Jbond) [14:12:34] (03PS2) 10CDanis: launch Network Error Logging on all WMF domains [puppet] - 10https://gerrit.wikimedia.org/r/630860 (https://phabricator.wikimedia.org/T257527) [14:12:48] (03CR) 10Effie Mouzeli: "PCC is NOOP https://puppet-compiler.wmflabs.org/compiler1001/25514/" [puppet] - 10https://gerrit.wikimedia.org/r/630845 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [14:14:44] (03CR) 10Effie Mouzeli: "PCC https://puppet-compiler.wmflabs.org/compiler1003/25517/mwdebug1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630856 (https://phabricator.wikimedia.org/T263958) (owner: 10Effie Mouzeli) [14:15:08] (03CR) 10Effie Mouzeli: "PCC when it is enabled: https://puppet-compiler.wmflabs.org/compiler1003/25517/mwdebug1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630845 (https://phabricator.wikimedia.org/T244340) (owner: 10Effie Mouzeli) [14:15:17] (03CR) 10Vgutierrez: [C: 03+1] "this is going to need an ats-tls-restart on the affected servers" [puppet] - 10https://gerrit.wikimedia.org/r/630847 (https://phabricator.wikimedia.org/T261419) (owner: 10BBlack) [14:15:21] (03CR) 10Jbond: [C: 03+1] "LGTM" [software/homer] - 10https://gerrit.wikimedia.org/r/630807 (owner: 10Volans) [14:16:03] (03PS5) 10Ppchelko: Expose /page/descrtion API [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498) [14:16:08] (03CR) 10BBlack: [C: 03+1] Migrate EQSIN to Netbox Automation [dns] - 10https://gerrit.wikimedia.org/r/630644 
(https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [14:16:52] (03CR) 10Ppchelko: Expose /page/descrtion API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498) (owner: 10Ppchelko) [14:16:59] (03CR) 10Hnowlan: Expose /page/descrtion API (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498) (owner: 10Ppchelko) [14:17:36] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/630218 (https://phabricator.wikimedia.org/T262660) (owner: 10Elukey) [14:18:30] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/630785 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [14:19:10] (03PS1) 10Herron: logstash: add kibana7 SANs to kibana certificate [puppet] - 10https://gerrit.wikimedia.org/r/630862 [14:20:41] (03CR) 10Ayounsi: [C: 03+1] "Tested and lgtm too!" [software/homer] - 10https://gerrit.wikimedia.org/r/630807 (owner: 10Volans) [14:22:42] (03PS1) 10Giuseppe Lavagetto: wikifeeds: use the service proxy for restbase as well in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/630863 [14:22:44] (03PS1) 10Giuseppe Lavagetto: wikifeeds: use the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/630864 [14:22:49] (03CR) 10Muehlenhoff: [C: 03+2] Enabled managed sources.list for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/630785 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [14:22:51] (03PS6) 10Ppchelko: Expose /page/descrtion API [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498) [14:22:56] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add kibana7 SANs to kibana certificate [puppet] - 10https://gerrit.wikimedia.org/r/630862 (owner: 10Herron) [14:23:21] 10Operations, 10ops-eqiad, 10DC-Ops: (Need By: TBD) install memory upgrades in ores100[1-9] - 
https://phabricator.wikimedia.org/T259909 (10Cmjohnson) @akosiaris Thanks! I will get this done for you tomorrow. [14:23:31] (03PS1) 10CDanis: VCL: don't serve Set-Cookies for domains that aren't ours [puppet] - 10https://gerrit.wikimedia.org/r/630865 [14:24:04] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/630862 (owner: 10Herron) [14:24:06] (03CR) 10Volans: [C: 03+2] junos: catch exceptions in rollback [software/homer] - 10https://gerrit.wikimedia.org/r/630807 (owner: 10Volans) [14:25:03] (03CR) 10Herron: [C: 03+2] logstash: add kibana7 SANs to kibana certificate [puppet] - 10https://gerrit.wikimedia.org/r/630862 (owner: 10Herron) [14:25:24] (03PS1) 10Ssingh: dnsdist: improve formatting of dnsdist.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/630866 [14:25:57] (03Merged) 10jenkins-bot: junos: catch exceptions in rollback [software/homer] - 10https://gerrit.wikimedia.org/r/630807 (owner: 10Volans) [14:26:11] (03CR) 10BBlack: [C: 03+2] unified cert: esams and eqsin to use LE for now [puppet] - 10https://gerrit.wikimedia.org/r/630847 (https://phabricator.wikimedia.org/T261419) (owner: 10BBlack) [14:26:42] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/compiler1001/25519/malmok.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630866 (owner: 10Ssingh) [14:27:11] (03CR) 10Ppchelko: "Confirmed that we are only using this URI to directly issue requests now and not storing it in the content anywhere." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/630863 (owner: 10Giuseppe Lavagetto) [14:27:24] (03PS1) 10Jbond: profile::prometheus::snmp_exporter: update snmp ro string [puppet] - 10https://gerrit.wikimedia.org/r/630867 [14:27:47] (03CR) 10Jbond: "See inline" (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/630830 (owner: 10Ayounsi) [14:28:47] (03CR) 10Cwhite: [C: 03+1] am: quote URLs in annotations [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/630782 (owner: 10Filippo Giunchedi) [14:28:57] 10Operations, 10DBA, 10Blocked-on-schema-change, 10User-Kormat: Schema change to make change_tag.ct_rc_id unsigned - https://phabricator.wikimedia.org/T259831 (10Kormat) [14:28:58] (03PS2) 10Jbond: profile::prometheus::snmp_exporter: update snmp ro string [puppet] - 10https://gerrit.wikimedia.org/r/630867 [14:29:31] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/630867 (owner: 10Jbond) [14:30:07] (03CR) 10jerkins-bot: [V: 04-1] profile::prometheus::snmp_exporter: update snmp ro string [puppet] - 10https://gerrit.wikimedia.org/r/630867 (owner: 10Jbond) [14:30:26] !log switching eqsin and esams public-facing unified certs to letsencrypt - https://gerrit.wikimedia.org/r/c/operations/puppet/+/630847 [14:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:30:42] (03PS3) 10Jbond: profile::prometheus::snmp_exporter: update snmp ro string [puppet] - 10https://gerrit.wikimedia.org/r/630867 [14:30:43] !log elukey@cumin1001 START - Cookbook sre.hadoop.init-hadoop-workers [14:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:21] (03CR) 10Hnowlan: [C: 03+2] Expose /page/descrtion API [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498) (owner: 10Ppchelko) [14:32:46] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) [14:32:49] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:34] (03PS2) 10CDanis: VCL: don't serve Set-Cookies for domains that aren't ours [puppet] - 10https://gerrit.wikimedia.org/r/630865 [14:34:40] (03Merged) 10jenkins-bot: Expose /page/descrtion API [deployment-charts] - 10https://gerrit.wikimedia.org/r/630657 (https://phabricator.wikimedia.org/T262498) (owner: 10Ppchelko) [14:34:56] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [14:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:08] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/630858 (owner: 10Volans) [14:35:25] (03CR) 10Hnowlan: [C: 03+2] restbase: install libjemalloc2 if on buster or later [puppet] - 10https://gerrit.wikimedia.org/r/630829 (https://phabricator.wikimedia.org/T264092) (owner: 10Hnowlan) [14:35:52] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/630851 (owner: 10Volans) [14:36:17] (03CR) 10Mholloway: [C: 03+1] wikifeeds: use the service proxy for restbase as well in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/630863 (owner: 10Giuseppe Lavagetto) [14:38:10] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10elukey) @RobH I used the related spicerack cookbook to init an-worker1102 (install all partitions with proper labels etc..) and as far as I can see now... 
[14:38:15] (03PS3) 10CDanis: VCL: don't serve Set-Cookies for domains that aren't ours [puppet] - 10https://gerrit.wikimedia.org/r/630865 [14:38:49] (03CR) 10Mholloway: [C: 03+1] wikifeeds: use the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/630864 (owner: 10Giuseppe Lavagetto) [14:38:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wikifeeds: use the service proxy for restbase as well in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/630863 (owner: 10Giuseppe Lavagetto) [14:38:59] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:46] (03Merged) 10jenkins-bot: wikifeeds: use the service proxy for restbase as well in eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/630863 (owner: 10Giuseppe Lavagetto) [14:41:48] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [14:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:58] 10Operations, 10CheckUser, 10Traffic: Log source port for anonymous users and expose it for sysops/checkusers - https://phabricator.wikimedia.org/T181368 (10NickK) >>! In T181368#6488698, @BBlack wrote: > Is this still desirable for checkusers? Infrastructure has changed since then and is still-changing, bu... 
[14:43:10] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] am: quote URLs in annotations [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/630782 (owner: 10Filippo Giunchedi) [14:43:47] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:43:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:14] (03PS4) 10CDanis: VCL: don't serve Set-Cookies for domains that aren't ours [puppet] - 10https://gerrit.wikimedia.org/r/630865 [14:46:58] (03PS1) 10Cmjohnson: Adding production dns for frdata1002 and frmx1001 [dns] - 10https://gerrit.wikimedia.org/r/630873 (https://phabricator.wikimedia.org/T260181) [14:47:06] (03CR) 10Jbond: reboot-groups (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [14:47:29] (03CR) 10jerkins-bot: [V: 04-1] Adding production dns for frdata1002 and frmx1001 [dns] - 10https://gerrit.wikimedia.org/r/630873 (https://phabricator.wikimedia.org/T260181) (owner: 10Cmjohnson) [14:48:22] 10Operations, 10Analytics, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10ema) >>! In T264074#6501340, @elukey wrote: > I think it is better to know if the increase is brought by the new VUT/VSL api or if it is something else. Other units such as `varnishmta... [14:48:38] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . 
[14:48:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:48] (03PS1) 10Muehlenhoff: autoinstall: Also use mirrors.wikimedia.org for publi/esams [puppet] - 10https://gerrit.wikimedia.org/r/630876 (https://phabricator.wikimedia.org/T158562) [14:48:57] (03PS3) 10Volans: Fix newly reported pylint issues [cookbooks] - 10https://gerrit.wikimedia.org/r/630858 [14:49:56] (03PS2) 10Cmjohnson: Adding production dns for frdata1002 and frmx1001 [dns] - 10https://gerrit.wikimedia.org/r/630873 (https://phabricator.wikimedia.org/T260181) [14:51:42] (03CR) 10Ssingh: [C: 03+2] dnsdist: improve formatting of dnsdist.conf.erb [puppet] - 10https://gerrit.wikimedia.org/r/630866 (owner: 10Ssingh) [14:53:17] (03PS5) 10CDanis: VCL: don't serve Set-Cookies for domains that aren't ours [puppet] - 10https://gerrit.wikimedia.org/r/630865 [14:53:59] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp5008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [14:54:26] (03CR) 10Jbond: [C: 03+1] Fix newly reported pylint issues [cookbooks] - 10https://gerrit.wikimedia.org/r/630858 (owner: 10Volans) [14:54:50] (03CR) 10Cmjohnson: [C: 03+2] Adding production dns for frdata1002 and frmx1001 [dns] - 10https://gerrit.wikimedia.org/r/630873 (https://phabricator.wikimedia.org/T260181) (owner: 10Cmjohnson) [14:55:57] (03CR) 10Volans: [C: 03+2] Fix newly reported pylint issues [cookbooks] - 10https://gerrit.wikimedia.org/r/630858 (owner: 10Volans) [14:56:12] bblack: ^^ that's been triggered by your restart? 
[14:56:14] (03PS4) 10Volans: Add group titles to missing ones [cookbooks] - 10https://gerrit.wikimedia.org/r/630851 [14:56:22] (03PS1) 10Muehlenhoff: Enabled managed sources.list for esams/ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/630879 (https://phabricator.wikimedia.org/T158562) [14:56:24] (03CR) 10Volans: [C: 03+2] Add group titles to missing ones [cookbooks] - 10https://gerrit.wikimedia.org/r/630851 (owner: 10Volans) [14:57:35] vgutierrez: I assume so [14:57:38] (03Merged) 10jenkins-bot: Fix newly reported pylint issues [cookbooks] - 10https://gerrit.wikimedia.org/r/630858 (owner: 10Volans) [14:57:52] the timing's about right, checking [14:58:17] (but also, shouldn't ats restart be seamless if it doesn't depool?) [14:58:21] (03Merged) 10jenkins-bot: Add group titles to missing ones [cookbooks] - 10https://gerrit.wikimedia.org/r/630851 (owner: 10Volans) [14:59:21] bblack: ats-tls-restart depools and repools the node [14:59:34] (03CR) 10Alexandros Kosiaris: [C: 03+2] termbox: Harmonize service runner metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/630780 (owner: 10Alexandros Kosiaris) [15:00:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] wikifeeds: use the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/630864 (owner: 10Giuseppe Lavagetto) [15:00:18] oh I see it now [15:00:19] ok [15:00:21] so, yes [15:00:54] !log restarting acme-chief on acmechief1001 [15:00:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:05] but yeah, the stapling validity warnings are something else [15:01:13] I assume that's what you're fixing now heh [15:01:20] indeed [15:02:05] was it down or? [15:02:31] yeah.. 
this weird state that we've seen once in the past where a reload is triggered but nothing happens [15:02:43] (03Merged) 10jenkins-bot: termbox: Harmonize service runner metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/630780 (owner: 10Alexandros Kosiaris) [15:02:45] (03CR) 10jerkins-bot: [V: 04-1] wikifeeds: use the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/630864 (owner: 10Giuseppe Lavagetto) [15:02:48] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . [15:02:48] https://www.irccloud.com/pastebin/DPskE0mG/ [15:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:52] like that [15:03:11] (03PS2) 10Giuseppe Lavagetto: wikifeeds: use the service proxy everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/630864 [15:03:23] 10Operations, 10Analytics, 10Traffic: varnishkafka 1.1.0 CPU usage increase - https://phabricator.wikimedia.org/T264074 (10elukey) https://github.com/wikimedia/varnishkafka/commit/b0675e80c2a059ba3a508d8ebfc16a79bee3e154 shows a big change in usage of VUT/VSL, that afaics should be easier (more responsibilit... [15:04:47] the LE restarts are working on the last eqsin node now, then moving onto esams for another 45 minutes or so [15:07:03] (03CR) 10CRusnov: "hello! Is there a way to repro this in practice? I'll look at it" [puppet] - 10https://gerrit.wikimedia.org/r/630759 (https://phabricator.wikimedia.org/T247364) (owner: 10Jcrespo) [15:07:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:09:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:10:56] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [15:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:32] (03CR) 10Jcrespo: [C: 04-1] "Reproduce:" [puppet] - 10https://gerrit.wikimedia.org/r/630759 (https://phabricator.wikimedia.org/T247364) (owner: 10Jcrespo) [15:12:30] (03Abandoned) 10Jcrespo: base/check_systemd_state.py: Decode bytes before strip [puppet] - 10https://gerrit.wikimedia.org/r/630759 (https://phabricator.wikimedia.org/T247364) (owner: 10Jcrespo) [15:12:56] (03CR) 10CRusnov: "Haha okay. Thanks. I'll open a ticket for this problem 😊" [puppet] - 10https://gerrit.wikimedia.org/r/630759 (https://phabricator.wikimedia.org/T247364) (owner: 10Jcrespo) [15:13:23] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (10Papaul) ` papaul@asw-d-codfw# show | compare [edit interfaces interface-range vlan-private1-d-codfw] - member ge-1/0/4; [edit interfaces interface-range disabled]... 
[15:13:43] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2013.codfw.wmnet - https://phabricator.wikimedia.org/T263740 (10Papaul) [15:14:07] PROBLEM - ats-tls HTTPS wikiworkshop.org ECDSA on cp3052 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [15:15:04] (03Abandoned) 10Giuseppe Lavagetto: wikifeeds: use the service proxy for reaching the MediaWiki api [deployment-charts] - 10https://gerrit.wikimedia.org/r/628756 (https://phabricator.wikimedia.org/T255878) (owner: 10Giuseppe Lavagetto) [15:15:09] ^ also the LE ats-tls-restarts [15:15:34] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox [15:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:43] RECOVERY - ats-tls HTTPS en.wikipedia.org RSA on cp2033 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 416658 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2020-12-17 10:00:19 +0000 (expires in 78 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:15:45] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp3052 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 355457 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2020-12-10 17:00:15 +0000 (expires in 72 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:17:07] 10Puppet, 10SRE-tools, 10Patch-For-Review, 10Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10crusnov) [15:23:47] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:39] 10Operations, 10ops-codfw, 10DC-Ops, 10decommission-hardware: decommission es2013.codfw.wmnet - 
https://phabricator.wikimedia.org/T263740 (10Papaul) [15:25:57] (03CR) 10Volans: [C: 03+1] "LGTM. Integration and integration-min tests still passing, I think we can merge and fix any potential minor issue that might arise later i" [software/cumin] - 10https://gerrit.wikimedia.org/r/626660 (https://phabricator.wikimedia.org/T212783) (owner: 10Gehel) [15:28:20] RECOVERY - ats-tls HTTPS wikiworkshop.org RSA on cp3064 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 354701 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2020-12-10 17:00:15 +0000 (expires in 72 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:33:47] (03PS11) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [15:35:43] (03CR) 10jerkins-bot: [V: 04-1] reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [15:35:56] (03PS1) 10Catrope: GrowthExperiments: Enable for newcomers on ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630886 (https://phabricator.wikimedia.org/T255027) [15:37:15] (03PS1) 10Jbond: puppetdb: add ability to mount the stockpile queue dir as tmpfs [puppet] - 10https://gerrit.wikimedia.org/r/630887 (https://phabricator.wikimedia.org/T263578) [15:37:17] (03PS1) 10Jbond: puppetdb2002: update puppetdb server to use tmpfs stockpile queue [puppet] - 10https://gerrit.wikimedia.org/r/630888 (https://phabricator.wikimedia.org/T263578) [15:37:19] (03PS1) 10Jbond: puppetmaster::puppetdb enable tmpfs stockpile queue by default in production [puppet] - 10https://gerrit.wikimedia.org/r/630889 (https://phabricator.wikimedia.org/T263578) [15:38:05] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/630887 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [15:38:21] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/630888 (https://phabricator.wikimedia.org/T263578) 
(owner: 10Jbond) [15:38:59] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25525/" [puppet] - 10https://gerrit.wikimedia.org/r/630889 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [15:39:46] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:39:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:47] (03CR) 10CRusnov: "Turns out the .strip.decode thing might have also been a problem, but the real problem was in the next section where it also wasn't decodi" [puppet] - 10https://gerrit.wikimedia.org/r/630891 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:42:39] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630879 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [15:43:00] 10Operations, 10SRE-swift-storage, 10Goal: Plan logical and physical design for media backups - https://phabricator.wikimedia.org/T262669 (10jcrespo) @LSobanski What do you think we should have as blockers to close this task? Obviously the analysis document needs some final touches- but what should be the go... [15:44:20] (03CR) 10Jcrespo: "Ah! So I wasn't too far from the problem! +1 as long as you have tested it and see it fixing ongoing issues." 
[puppet] - 10https://gerrit.wikimedia.org/r/630891 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:44:36] (03CR) 10Jcrespo: [C: 03+1] base/check_systemd_state.py: Fix encoding issue [puppet] - 10https://gerrit.wikimedia.org/r/630891 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:44:56] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:46:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:46:20] (03CR) 10CRusnov: [C: 03+2] base/check_systemd_state.py: Fix encoding issue [puppet] - 10https://gerrit.wikimedia.org/r/630891 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:46:38] RECOVERY - ats-tls HTTPS wikiworkshop.org ECDSA on cp5008 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 353603 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2020-12-10 17:00:15 +0000 (expires in 72 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:46:59] !log hnowlan@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [15:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:25] !log hnowlan@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [15:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:40] !log hnowlan@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
[15:48:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:15] (03CR) 10Ebernhardson: [C: 04-2] "I don't think we can deploy 60 shards to commonswiki without further investigation into the impacts of such a change. Initially i'm worrie" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628980 (https://phabricator.wikimedia.org/T260083) (owner: 10Ryan Kemper) [15:49:19] (03PS12) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [15:50:28] 10Puppet, 10SRE-tools, 10Patch-For-Review, 10Python3-Porting, and 3 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (10crusnov) [15:51:02] (03CR) 10jerkins-bot: [V: 04-1] reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [15:51:31] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:41] (03PS1) 10Volans: scripts: don't allocate primary IPs in frack [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/630895 [15:56:38] (03PS1) 10Matthias Mullie: [WikibaseMediaInfo] Add config for related terms API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630896 (https://phabricator.wikimedia.org/T256431) [15:57:20] (03PS13) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [15:58:00] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/630895 (owner: 10Volans) [15:58:51] (03CR) 10jerkins-bot: [V: 04-1] reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [15:58:53] (03CR) 10Volans: [C: 03+2] scripts: don't allocate primary IPs in frack [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/630895 (owner: 10Volans) [15:58:55] (03PS1) 10Jbond: puppet-merge: correctly handle passing through 
sha1 vs FETCH_HEAD [puppet] - 10https://gerrit.wikimedia.org/r/630897 (https://phabricator.wikimedia.org/T264014) [15:59:13] (03CR) 10jerkins-bot: [V: 04-1] [WikibaseMediaInfo] Add config for related terms API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630896 (https://phabricator.wikimedia.org/T256431) (owner: 10Matthias Mullie) [16:00:04] jbond42 and cdanis: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200929T1600). [16:00:04] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:41] o/ [16:01:02] (03PS14) 10Muehlenhoff: reboot-groups [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 [16:01:31] (03PS2) 10Jbond: puppet-merge: correctly handle passing through sha1 vs FETCH_HEAD [puppet] - 10https://gerrit.wikimedia.org/r/630897 (https://phabricator.wikimedia.org/T264014) [16:02:27] (03CR) 10Muehlenhoff: reboot-groups (034 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [16:02:50] tgr_: I’d like to test something on mwdebug; can you let me know when you’re done? [16:02:57] or are you waiting for a puppet deployer anyways? [16:03:25] yeah. I don't think puppet deploys interfere with mwdebug, in any case. 
[16:04:22] they don't in general, but mwdebug wouldn't be a bad place to test that particular patch [16:04:22] ok thanks [16:04:39] (I'm happy to swing in if jbond42 and cdanis aren't available, will give them a few minutes though) [16:04:50] (03PS2) 10Matthias Mullie: [WikibaseMediaInfo] Add config for related terms API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630896 (https://phabricator.wikimedia.org/T256431) [16:05:09] rzl: just running pcc [16:05:15] 👍 [16:05:28] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) @RLazarus, any news about the date confirmation and the task? [16:05:41] (03CR) 10jerkins-bot: [V: 04-1] [WikibaseMediaInfo] Add config for related terms API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630896 (https://phabricator.wikimedia.org/T256431) (owner: 10Matthias Mullie) [16:06:54] 10Operations, 10ops-eqiad, 10Analytics-Radar: an-presto1004 down - https://phabricator.wikimedia.org/T253438 (10RobH) Self dispatch SR1038108849 entered with Chris as the contact. They should call him to schedule the on-site work. Since this has 'undefined' broken parts, it is easier overall to schedule th... [16:07:25] (03CR) 10Jbond: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/629853 (https://phabricator.wikimedia.org/T263800) (owner: 10Gergő Tisza) [16:07:38] tgr_: merging now [16:07:57] (03CR) 10Jbond: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/25528/" [puppet] - 10https://gerrit.wikimedia.org/r/629853 (https://phabricator.wikimedia.org/T263800) (owner: 10Gergő Tisza) [16:09:33] tgr_: i have deployed that change and deployed to mwdebug1002 [16:10:06] looking [16:10:15] hm, shouldn’t we use 200* during the datacenter switch? 
[16:10:56] Lucas_WMDE: 200* is probably better but for this change i suspect its fine [16:11:04] ok [16:11:27] (03CR) 10Volans: [C: 03+1] "I'm not familiar with the details of puppet's mount options but the change looks sane to me." [puppet] - 10https://gerrit.wikimedia.org/r/630887 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [16:11:42] (03CR) 10Volans: [C: 03+1] "LGTM from the compiler result" [puppet] - 10https://gerrit.wikimedia.org/r/630888 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [16:12:30] (03CR) 10Volans: [C: 03+1] "If all goes well in the single host testing no reason why not making it the default." [puppet] - 10https://gerrit.wikimedia.org/r/630889 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [16:13:39] (03PS3) 10Jbond: puppet-merge: correctly handle passing through sha1 vs FETCH_HEAD [puppet] - 10https://gerrit.wikimedia.org/r/630897 (https://phabricator.wikimedia.org/T264014) [16:13:47] jbond42: hm, it is working but I made a mistake in the patch (should not have / at the end of the url) [16:13:54] can I do a quick followup? [16:14:03] sure [16:14:11] tgr_: ping me when its up [16:15:27] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10RLazarus) Apologies, I've had a hectic few days. :) It turns out the engprod offsite is the week of Oct 26, so there already won't be a MW... 
[16:16:23] (03PS1) 10Gergő Tisza: Fix .well-known/change-password URL [puppet] - 10https://gerrit.wikimedia.org/r/630899 (https://phabricator.wikimedia.org/T263800) [16:16:51] jbond42: ^ [16:17:06] tgr_: on it [16:17:34] (03CR) 10Jbond: [C: 03+2] Fix .well-known/change-password URL [puppet] - 10https://gerrit.wikimedia.org/r/630899 (https://phabricator.wikimedia.org/T263800) (owner: 10Gergő Tisza) [16:18:09] (03PS3) 10Matthias Mullie: [WikibaseMediaInfo] Add config for related terms API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630896 (https://phabricator.wikimedia.org/T256431) [16:18:59] (03CR) 10jerkins-bot: [V: 04-1] [WikibaseMediaInfo] Add config for related terms API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630896 (https://phabricator.wikimedia.org/T256431) (owner: 10Matthias Mullie) [16:19:06] (03PS1) 10Filippo Giunchedi: am: categorise netbox links [debs/prometheus-icinga-exporter] - 10https://gerrit.wikimedia.org/r/630900 [16:19:15] tgr_: deployed to mwdebug2002 [16:21:53] it looks like I messed up the URL formatting, it redirects to https://en.wikipedia.org/wiki/Special:ChangeCredentials/MediaWikiCAuthCPasswordAuthenticationRequest [16:22:02] not https://en.wikipedia.org/wiki/Special:ChangeCredentials/MediaWiki%5CAuth%5CPasswordAuthenticationRequest as it should [16:22:18] I guess % needs to be escaped in rewrite rules? [16:25:24] tgr_: just looking [16:26:22] (03CR) 10Filippo Giunchedi: [C: 03+1] "Modulo what John mentioned" [puppet] - 10https://gerrit.wikimedia.org/r/630879 (https://phabricator.wikimedia.org/T158562) (owner: 10Muehlenhoff) [16:26:40] seems like %5 would be interpreted as a RewriteCond backreference. I don't see anything about escaping in the doc though. [16:28:16] tgr_: haven't looked closely yet, but maybe you need http://httpd.apache.org/docs/2.2/rewrite/flags.html#flag_ne ? [16:28:26] or maybe it should not be urlencoded at all? 
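Note on the redirect above: the observed result (MediaWikiCAuthC... instead of MediaWiki%5CAuth%5C...) is consistent with the guess that %5 was treated as a RewriteCond backreference. A minimal sketch of that behaviour follows, as a toy model in Python rather than Apache's actual implementation. The function name expand_backrefs is hypothetical, and the model assumes that an unmatched backreference expands to an empty string and that a backslash-escaped percent sign is kept literally.

```python
import re


def expand_backrefs(substitution, cond_groups):
    """Toy model of %N expansion in a rewrite substitution string.

    Assumptions (not taken from Apache's source):
    - %N (single digit) is replaced by RewriteCond group N,
      or by an empty string when no such group was captured;
    - a backslash-escaped percent sign yields a literal '%'.
    """
    def repl(match):
        if match.group(0).startswith("\\"):
            return "%"  # escaped percent sign: keep it literal
        return cond_groups.get(int(match.group(1)), "")

    return re.sub(r"\\%|%(\d)", repl, substitution)


base = "/wiki/Special:ChangeCredentials/"
unescaped = base + "MediaWiki%5CAuth%5CPasswordAuthenticationRequest"
escaped = base + "MediaWiki\\%5CAuth\\%5CPasswordAuthenticationRequest"

# No RewriteCond groups captured: "%5" vanishes and the "C" is left behind.
print(expand_backrefs(unescaped, {}))
# With the percent sign escaped, the percent-encoding survives.
print(expand_backrefs(escaped, {}))
```

Under these assumptions the first call reproduces the broken path seen on mwdebug, and the second keeps the %5C sequences intact.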
[16:28:32] or that, I think [16:28:47] according to the NE flag doc it is supposed to hexcode anything not hexencoded [16:28:50] looking also -- btw this seems like a good place to plug https://wikitech.wikimedia.org/wiki/Httpbb :D [16:28:52] tgr_: please don't feel any need to do this urgently (although it might help you debug), but it'd be cool if at some point you wrote an httpbb assertion for this -- https://wikitech.wikimedia.org/wiki/httpbb and modules [16:28:56] wow rzl [16:29:13] modules/profile/files/httpbb has the test cases [16:30:31] fair, I should have tested this beforehand. I always underestimate Apache's capacity for making simple tasks complicated. [16:30:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:31:16] that is the reason why httpbb exists, tbh [16:31:17] no shame intended, the whole situation is a mess [16:33:47] no gerrit bot? [16:33:58] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:34:08] !log deploying eqsin automated DNS [16:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:22] rzl: did you ever look at running httpbb against mediawiki-vagrant btw? (also, tbh I'm not sure if mediawiki-vagrant is still in wide use, although I imagine it is?) [16:35:48] I have not tried that, it's a cool idea [16:38:06] I'm not quite sure how it's implemented but I think it specifies a bunch of puppet roles to be installed? if so, it should probably just include httpbb's role [16:38:30] jbond42: ^ actually tested this one, sorry for the trouble. 
[16:38:46] hm, gerritbot is being lazy [16:38:53] it's https://gerrit.wikimedia.org/r/c/operations/puppet/+/630902 [16:38:55] (this was my optimistic conclusion of a thought train that started with "how could tgr do a code/test/debug cycle on this patch that didn't suck) [16:39:26] tgr_: just curious if you also tested \%5C? [16:41:56] jbond42: did now. It works as well, the result is the same (the c still gets lowercased) [16:42:23] * volans in a meeting but seems that wikibugs is not writing anymore here [16:42:26] if anyone could have a look [16:43:13] tgr_: deployed on mwdebug2002 lgtm, thanks [16:43:52] will look into httpbb later. In this case though just adding to some config file in Vagrant manually was simple. It just never occurred to me that Apache will process the URL (which in hindsight seems obvious) [16:45:12] jbond42: yeah it works properly now, thanks. [16:45:19] np [16:45:48] interesting that on Vagrant the hex digits get lowercased and in production uppercased [16:46:21] hm, what's the apache version on vagrant? [16:46:55] production is Version: 2.4.25-3+deb9u9 [16:47:10] of course there's other layers in between in the onion of production [16:48:36] 10Operations, 10Puppet, 10Patch-For-Review: unbound variable error when calling puppet-merge script with an explicit treeish - https://phabricator.wikimedia.org/T264014 (10jbond) test updated [16:48:41] the version is the same [16:48:48] ok, some other part of the onion then :) [16:49:16] volans: i have used the instructions at https://www.mediawiki.org/wiki/Wikibugs#Restarting_wikibugs it and looks ike its back [16:49:31] jbond42: thanks a lot! 
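Note on the upper/lowercase hex digits mentioned above: in percent-encoding, %5C and %5c decode to the same character, so either form should reach the same special page. A small sketch using Python's standard urllib.parse shows this. It says nothing about which layer in production changes the case; the variable names are illustrative only.

```python
from urllib.parse import quote, unquote

# The authentication request class name, with literal backslashes.
name = "MediaWiki\\Auth\\PasswordAuthenticationRequest"

# quote() percent-encodes the backslash using uppercase hex digits.
encoded = quote(name)
print(encoded)

# Same string with lowercase hex digits, as seen in the Location header.
lowercased = encoded.replace("%5C", "%5c")

# Both spellings decode back to the original name.
print(unquote(encoded) == unquote(lowercased) == name)
```

Both forms round-trip to the same name, which fits the report that the redirect works despite the case difference.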
[16:49:45] np [16:50:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:51:41] (03PS1) 10Ahmon Dancy: Disable more settings for train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630904 [16:52:18] (03CR) 10Ahmon Dancy: [C: 03+2] Disable more settings for train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630904 (owner: 10Ahmon Dancy) [16:52:28] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:52:37] (03PS1) 10Jgreen: add frmx2001 to nsca_frack.cfg.erb, alphabetize config [puppet] - 10https://gerrit.wikimedia.org/r/630905 [16:53:07] tgr_: im seeing lowercase from mwdebug at least (Location: https://en.wikipedia.org/wiki/Special:ChangeCredentials/MediaWiki%5cAuth%5cPasswordAuthenticationRequest [16:53:32] (03CR) 10Jgreen: [C: 03+2] add frmx2001 to nsca_frack.cfg.erb, alphabetize config [puppet] - 10https://gerrit.wikimedia.org/r/630905 (owner: 10Jgreen) [16:53:35] (03Merged) 10jenkins-bot: Disable more settings for train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/630904 (owner: 10Ahmon Dancy) [16:56:34] (03CR) 10Jbond: [C: 03+2] puppetdb: add ability to mount the stockpile queue dir as tmpfs [puppet] - 10https://gerrit.wikimedia.org/r/630887 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [16:56:43] (03CR) 10Jbond: [C: 03+2] puppetdb2002: update puppetdb server to use tmpfs stockpile queue [puppet] - 10https://gerrit.wikimedia.org/r/630888 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [16:57:59] !log disable puppet to deploy puppetdb change [16:58:02] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:22] (03PS1) 10Jason Linehan: clientError: Expand coverage to all Wikipedias besides enwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630908 (https://phabricator.wikimedia.org/T255585) [17:00:04] chrisalbon and accraze: (Dis)respected human, time to deploy Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200929T1700). Please do the needful. [17:02:12] !log re-enable puppet to post deploy puppetdb change [17:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:24] tgr_, jbond42: what’s the status of the redirect deployment? :) [17:03:05] Lucas_WMDE: sorry that is complete you are good to proceed [17:03:10] ok thanks! [17:03:19] (I assume nothing’s happening in the Graphoid window, ping me otherwise) [17:03:45] (03CR) 10Jbond: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/630889 (https://phabricator.wikimedia.org/T263578) (owner: 10Jbond) [17:04:01] testing some IS.php changes on mwdebug2001, then [17:05:33] 10Operations, 10Traffic: External Monitoring alerting on 400 Bad Request errors - https://phabricator.wikimedia.org/T264111 (10colewhite) I cannot find any indication that the 400s are originating from our servers either in webrequest log or turnilo. I have temporarily disabled the alerts until this can be lo... [17:05:50] (03CR) 10Jbond: reboot-groups (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/625597 (owner: 10Muehlenhoff) [17:11:48] (03CR) 10Herron: logstash-next: change backend naming from kibana-next to kibana7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/616124 (owner: 10Herron) [17:12:15] (03CR) 10Jdlrobson: [C: 04-1] clientError: Expand coverage to all Wikipedias besides enwiki. 
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630908 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [17:12:26] !log update libdbi-perl on dbmonitor1001 and helium [17:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:36] (03PS2) 10Jason Linehan: clientError: Expand coverage to all Wikipedias besides enwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630908 (https://phabricator.wikimedia.org/T255585) [17:13:53] (03CR) 10Jason Linehan: clientError: Expand coverage to all Wikipedias besides enwiki. (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630908 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [17:14:52] (03CR) 10Jdlrobson: [C: 03+1] clientError: Expand coverage to all Wikipedias besides enwiki. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630908 (https://phabricator.wikimedia.org/T255585) (owner: 10Jason Linehan) [17:16:19] PROBLEM - Disk space on puppetdb2002 is CRITICAL: DISK CRITICAL - /var/lib/puppetdb/stockpile/cmd/q is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=puppetdb2002&var-datasource=codfw+prometheus/ops [17:22:54] ok, I’m done experimenting on mwdebug2001 :) [17:23:02] did a `scap pull` to ensure everything’s back to order [17:23:37] (03PS3) 10Jason Linehan: clientError: Enable on Wikidata + all Wikipedias besides enwiki. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/630908 (https://phabricator.wikimedia.org/T255585) [17:27:06] ooooh [17:27:08] exciting [17:30:08] !log ported cassandra-tools-wmf to wikimedia-buster [17:30:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:44:49] (03PS2) 10Muehlenhoff: Enabled managed sources.list for esams/eqsin [puppet] - 10https://gerrit.wikimedia.org/r/630879 (https://phabricator.wikimedia.org/T158562) [17:51:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:54:57] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200929T1800) [18:00:05] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:03:27] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) >>! In T244808#6502397, @RLazarus wrote: > Apologies, I've had a hectic few days. :) It is a feeling I know very well. :) > T... 
[18:11:33] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3050 is OK: HTTP OK: HTTP/1.0 200 OK - 23563 bytes in 0.256 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:16:21] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:17:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:29:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:31:09] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:47:45] (03PS1) 10Volans: scripts: dns, mark eqsin as migrated to Netbox [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/630918 (https://phabricator.wikimedia.org/T258729) [18:49:13] (03PS1) 10Volans: Set eqsin as migrated to the DNS Netbox automation [cookbooks] - 10https://gerrit.wikimedia.org/r/630919 (https://phabricator.wikimedia.org/T258729) [18:53:03] (03CR) 10Volans: [C: 03+2] "eqsin has been migrated" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/630918 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [18:53:15] (03CR) 10Volans: [C: 03+2] "eqsin has been migrated" [cookbooks] - 10https://gerrit.wikimedia.org/r/630919 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [18:54:19] (03Merged) 10jenkins-bot: Set eqsin as migrated to the DNS Netbox automation [cookbooks] - 10https://gerrit.wikimedia.org/r/630919 (https://phabricator.wikimedia.org/T258729) (owner: 10Volans) [18:55:59] 10Operations, 10LDAP-Access-Requests: Add Bereket teshome to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T262921 (10Dzahn) @ArielGlenn LDAP-only users also need to be added to puppet admins module nowadays. [19:00:04] twentyafterfour and hashar: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Mediawiki train - American+European Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200929T1900). 
[19:07:08] (03PS1) 10Dzahn: admin: add Bereket Teshome to LDAP-only admins (wmde,nda) [puppet] - 10https://gerrit.wikimedia.org/r/630925 (https://phabricator.wikimedia.org/T262921) [19:14:30] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "thank you both for the prompt reviews :) https://puppet-compiler.wmflabs.org/compiler1002/25529/db1133.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/630317 (owner: 10Dzahn) [19:16:51] (03CR) 10Dzahn: "> Patch Set 8:" [puppet] - 10https://gerrit.wikimedia.org/r/630317 (owner: 10Dzahn) [19:19:43] (03CR) 10Dzahn: "confirmed on db1133 - puppet changed the motd slightly but that is all that happened:" [puppet] - 10https://gerrit.wikimedia.org/r/630317 (owner: 10Dzahn) [19:29:06] !log Checked out mediawiki 1.36.0-wmf.11 on deploy1001 see T263177 [19:29:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:29:12] T263177: 1.36.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T263177 [19:32:26] (03PS1) 1020after4: testwikis wikis to 1.36.0-wmf.11 refs T257978 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630929 [19:32:28] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.36.0-wmf.11 refs T257978 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630929 (owner: 1020after4) [19:32:33] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 132, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:32:51] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:33:35] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.11 refs T257978 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630929 (owner: 1020after4) [19:35:27] !log twentyafterfour@deploy1001 Started scap: testwikis to 
1.36.0-wmf.11 refs T263177 [19:35:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:33] T263177: 1.36.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T263177 [19:40:11] !log temp. disabling puppet on ms-fe (swift-proxy) hosts, applying puppet refactoring change carefully [19:40:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:41:32] (03CR) 10Dzahn: [C: 03+2] "> Patch Set 6: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [19:44:30] (03CR) 10Dzahn: "re-enabled puppet first on ms-fe1005 - complete NOOP. then same on ms-fe2006 - complete NOOP" [puppet] - 10https://gerrit.wikimedia.org/r/628970 (owner: 10Dzahn) [19:44:41] !log apt-get update && apt-get upgrade on wikitech-static [19:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:51] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki configuration Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 1665 bytes in 0.127 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [19:52:25] PROBLEM - Wikitech-static main page has content on cloudweb2001-dev is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki configuration Error - string Wikitech not found on https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 1665 bytes in 0.128 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [19:52:42] (03CR) 10Dzahn: docker: add data types (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/630661 (owner: 10Dzahn) [19:52:45] (03PS4) 10Dzahn: docker: add data types [puppet] - 10https://gerrit.wikimedia.org/r/630661 [19:52:51] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 MediaWiki configuration Error - string Wikitech not found on 
https://wikitech-static.wikimedia.org:443/wiki/Main_Page?debug=true - 1665 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [19:52:55] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 134, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:53:11] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:57:31] (03CR) 10Dzahn: oozie: hiera->lookup, add data types (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/629443 (owner: 10Dzahn) [19:57:45] (03PS1) 10CDanis: logstash: add throttle-exempt; don't throttle NEL or client errors [puppet] - 10https://gerrit.wikimedia.org/r/630931 (https://phabricator.wikimedia.org/T257527) [19:58:04] (03PS4) 10Dzahn: oozie: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/629443 [19:58:56] (03CR) 10jerkins-bot: [V: 04-1] logstash: add throttle-exempt; don't throttle NEL or client errors [puppet] - 10https://gerrit.wikimedia.org/r/630931 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [19:59:52] (03CR) 10Dzahn: "This needs a rebase now since the parent change has been merged. The button in Gerrit web UI did not get this one though." 
[puppet] - 10https://gerrit.wikimedia.org/r/630568 (https://phabricator.wikimedia.org/T247956) (owner: 10Jbond) [20:00:00] (03PS2) 10CDanis: logstash: add throttle-exempt; don't throttle NEL or client errors [puppet] - 10https://gerrit.wikimedia.org/r/630931 (https://phabricator.wikimedia.org/T257527) [20:01:19] 10Operations: Please replace Shannon Baileys SSH key - https://phabricator.wikimedia.org/T264127 (10Sbailey) [20:08:03] (03CR) 10Dzahn: [C: 03+1] "seems good, it's just that I would like to compile it and running that on "P:wmcs::nfsclient" as it also appears in the commit message her" [puppet] - 10https://gerrit.wikimedia.org/r/630589 (owner: 10Jbond) [20:08:31] (03CR) 10Herron: [C: 03+1] logstash: add throttle-exempt; don't throttle NEL or client errors [puppet] - 10https://gerrit.wikimedia.org/r/630931 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [20:09:36] (03PS3) 10CDanis: logstash: add throttle-exempt; don't throttle NEL or client errors [puppet] - 10https://gerrit.wikimedia.org/r/630931 (https://phabricator.wikimedia.org/T257527) [20:10:54] (03PS1) 10Gehel: Adding some type annotations to clustershell.py [software/cumin] - 10https://gerrit.wikimedia.org/r/630934 [20:16:01] (03CR) 10ArielGlenn: [C: 03+1] admin: add Bereket Teshome to LDAP-only admins (wmde,nda) [puppet] - 10https://gerrit.wikimedia.org/r/630925 (https://phabricator.wikimedia.org/T262921) (owner: 10Dzahn) [20:19:14] (03PS3) 10Dzahn: start DHCP service on install5001, stop it on bast5001 [puppet] - 10https://gerrit.wikimedia.org/r/629849 (https://phabricator.wikimedia.org/T252526) [20:19:16] (03CR) 10Dzahn: [C: 03+2] admin: add Bereket Teshome to LDAP-only admins (wmde,nda) [puppet] - 10https://gerrit.wikimedia.org/r/630925 (https://phabricator.wikimedia.org/T262921) (owner: 10Dzahn) [20:21:03] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Bereket teshome to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T262921 (10Dzahn) [20:22:34] 
10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: Add Bereket teshome to the ldap/wmde and ldap/nda group - https://phabricator.wikimedia.org/T262921 (10Dzahn) a:05ArielGlenn→03bete @bete You have been added to both groups as requested. Things should work now. Let us know if any issues. [20:27:05] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 25556 bytes in 0.234 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [20:27:39] RECOVERY - Wikitech-static main page has content on cloudweb2001-dev is OK: HTTP OK: HTTP/1.1 200 OK - 25557 bytes in 0.278 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [20:27:58] (03CR) 10Cwhite: [C: 03+2] logstash: add throttle-exempt; don't throttle NEL or client errors [puppet] - 10https://gerrit.wikimedia.org/r/630931 (https://phabricator.wikimedia.org/T257527) (owner: 10CDanis) [20:28:07] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 25556 bytes in 0.235 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [20:34:55] (03PS3) 10MusikAnimal: Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630653 (https://phabricator.wikimedia.org/T260461) (owner: 10Dmaza) [20:35:36] (03CR) 10Bstorm: [C: 04-1] "This is one of the most dangerous classes to change for cloud clients because applying it incorrectly will disconnect all NFS clients. 
The" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/630589 (owner: 10Jbond) [20:40:06] (03PS1) 10Andrew Bogott: Cloudvirt1016 -> Buster/Ceph [puppet] - 10https://gerrit.wikimedia.org/r/630941 (https://phabricator.wikimedia.org/T259399) [20:40:46] 10Operations, 10Projects-Cleanup, 10Release-Engineering-Team-TODO, 10fixcopyright.wikimedia.org, 10Wiki-Setup (Delete / Redirect): Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10BBlack) [20:42:19] (03CR) 10Andrew Bogott: [C: 03+2] Cloudvirt1016 -> Buster/Ceph [puppet] - 10https://gerrit.wikimedia.org/r/630941 (https://phabricator.wikimedia.org/T259399) (owner: 10Andrew Bogott) [20:45:11] !log twentyafterfour@deploy1001 Finished scap: testwikis to 1.36.0-wmf.11 refs T263177 (duration: 69m 57s) [20:45:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:18] T263177: 1.36.0-wmf.11 deployment blockers - https://phabricator.wikimedia.org/T263177 [20:47:42] (03PS4) 10Dzahn: start DHCP service on install5001, stop it on bast5001 [puppet] - 10https://gerrit.wikimedia.org/r/629849 (https://phabricator.wikimedia.org/T252526) [20:48:49] (03CR) 10Dzahn: [C: 03+2] start DHCP service on install5001, stop it on bast5001 [puppet] - 10https://gerrit.wikimedia.org/r/629849 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [20:49:07] PROBLEM - Host db2125 is DOWN: PING CRITICAL - Packet loss = 100% [20:49:18] 10Operations, 10Puppet, 10Traffic, 10User-herron: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239 (10BBlack) [20:49:57] 10Operations, 10Puppet, 10User-herron: Puppet hosts with signed certificate present on agent but not master - https://phabricator.wikimedia.org/T185239 (10BBlack) `lvs100[789]` don't exist anymore, removing #Traffic from this. 
[20:51:41] RECOVERY - Host db2125 is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [20:51:46] !log DHCP server for EQSIN switched from bast5001 to install5001 (T252526) [20:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:53] T252526: serve tftpboot environment from the install servers and create one in each edge POP - https://phabricator.wikimedia.org/T252526 [20:53:48] PROBLEM - mysqld processes #page on db2125 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [20:53:53] PROBLEM - MariaDB Replica SQL: s2 #page on db2125 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:53:56] 👋 [20:54:00] * volans here [20:54:00] PROBLEM - MariaDB Replica IO: s2 #page on db2125 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:54:06] 👋 [20:54:08] ACK [20:54:11] hi [20:54:17] o/ [20:54:17] do we just depool? [20:54:29] it seemed like maintenance at first [20:54:32] \o [20:54:35] PROBLEM - logstash syslog TCP port on logstash1009 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused https://wikitech.wikimedia.org/wiki/Logstash [20:54:38] about if needed. [20:54:39] <_joe_> let's depool first, ask questions later? [20:54:40] yeah just depool [20:54:50] and downtime it for 24h [20:54:53] !log cdanis@cumin1001 dbctl commit (dc=all): 'depool db2125', diff saved to https://phabricator.wikimedia.org/P12843 and previous config saved to /var/cache/conftool/dbconfig/20200929-205453-cdanis.json [20:54:54] host got rebooted [20:54:54] +1 [20:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:59] up 3 min [20:54:59] * jbond42 here [20:55:03] that host is having HW issues [20:55:03] go ahead with depool [20:55:06] is that a familiar name? 
[20:55:10] yes [20:55:11] but yeah mysqld may come back post-reboot, but who knows stability until we have time to look [20:55:13] ah! [20:55:16] * godog here too [20:55:22] here [20:55:28] just depool and leave it to me for tomorrow [20:55:32] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [20:55:33] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:55:35] db2125. are you kidding me. [20:55:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:41] 24h downtime done [20:55:51] is someone depooling? [20:55:59] c.danis already did [20:56:02] ok [20:56:10] I was going to suggest a 16h downtime in case Manuel forgets [20:56:16] but seems fine [20:56:17] RECOVERY - logstash syslog TCP port on logstash1009 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 https://wikitech.wikimedia.org/wiki/Logstash [20:56:21] db2125 is from the set of servers with the notorious freezes which once hit the caches a lot: https://phabricator.wikimedia.org/T238305 [20:56:22] let me double check mw errors [20:56:34] sent ACK number to VO [20:56:37] cdanis: even 24 is fine [20:56:59] also reopened the last db2125 task [20:57:12] cdanis excellent thank you [20:57:32] marostegui: 💜 [20:57:37] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10CDanis) 05Resolved→03Open crashed again [20:57:48] wikibugs asleep on the job [20:58:01] I am going back to bed. thank you everyone [20:58:08] cdanis: second time, was already restarted earlier [20:58:22] good night marostegui [21:00:04] mdholloway: Dear deployers, time to do the Push notifications deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200929T2100).
[21:00:12] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [21:00:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:53] I've added the hw error to the tas [21:00:55] *task [21:01:38] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Volans) from HW logs ` -------------------------------------------------------------------------------- SeqNumber = 286 Message ID... [21:02:16] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [21:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:02] 10Operations, 10Traffic: Servers freezing across the caching cluster - https://phabricator.wikimedia.org/T238305 (10Dzahn) Just a few minutes ago: db2125 crashed - mgmt iface also not available T260670 [21:03:35] twentyafterfour: train done for today? [21:03:45] and to that tracking task for crashing servers [21:04:25] (03CR) 10Bstorm: [C: 04-1] labstore::nfs_mount: drop support for empty string share_path (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/630589 (owner: 10Jbond) [21:05:17] mdholloway: I still need to deploy to group0, just sync'd to testwikis so far [21:05:20] (03PS6) 10Mholloway: Echo: Enable push on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628343 (https://phabricator.wikimedia.org/T262936) [21:05:22] ok, no rush [21:06:01] 10Operations, 10Traffic, 10Performance-Team (Radar): Consider allowing H2 coalesce for upload.wikimedia.org for images used in wiki articles - https://phabricator.wikimedia.org/T116132 (10BBlack) All the perf tradeoffs and relatively-trivial work aside, the major blocker we still face here is the likely pro... 
[21:09:21] !log rebooting testvm5001 for install test after switching DHCP/TFTP in eqsin to new dedicated VM [21:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:12:36] 10Operations, 10Puppet, 10Traffic, 10Technical-Debt: Fix rule violation in the lvs balancer role - https://phabricator.wikimedia.org/T264132 (10BBlack) [21:12:54] 10Operations, 10Puppet, 10Traffic, 10Technical-Debt: Fix rule violation in the lvs balancer role - https://phabricator.wikimedia.org/T264132 (10BBlack) p:05Triage→03Low [21:14:32] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10BBlack) [21:15:14] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10BBlack) Added a subtask for the one #traffic case I can find here, and removing our tag from this. [21:16:15] (03PS1) 10Cicalese: Add beta config for API Portal/OAuth communications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630947 (https://phabricator.wikimedia.org/T261358) [21:17:02] 10Operations, 10Puppet, 10Cloud-Services, 10Patch-For-Review, 10Technical-Debt: Convert all of our site.pp/roles to the role/profile paradigm - https://phabricator.wikimedia.org/T159412 (10Dzahn) This topic branch shows previous work I had done for this but never linked to this ticket: 2 more were just... 
[21:17:15] (03CR) 10jerkins-bot: [V: 04-1] Add beta config for API Portal/OAuth communications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630947 (https://phabricator.wikimedia.org/T261358) (owner: 10Cicalese) [21:19:25] 10Operations, 10Analytics, 10Traffic, 10User-Elukey: Sort out analytics service dependency issues for cp* cache hosts - https://phabricator.wikimedia.org/T128374 (10BBlack) 05Open→03Declined This is too-stale now and a lot of these bits have been replaced over time and are known to have their deps corr... [21:19:31] (03CR) 10Ppchelko: [C: 04-1] Add beta config for API Portal/OAuth communications (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630947 (https://phabricator.wikimedia.org/T261358) (owner: 10Cicalese) [21:21:34] ugh [21:21:43] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/630797 [21:22:42] I doubt I can resolve that conflict so I might have to roll back the train. [21:22:59] !log temp stopping DHCP service on install2003 for a test [21:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:24:56] * twentyafterfour is currently on slow internet connection so that's not gonna make it easier [21:25:12] (03PS2) 10Cicalese: Add beta config for API Portal/OAuth communications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630947 (https://phabricator.wikimedia.org/T261358) [21:26:11] live notifications on phab are no longer working for me. Did something change? 
[21:26:51] (03PS3) 10Cicalese: Add beta config for API Portal/OAuth communications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630947 (https://phabricator.wikimedia.org/T261358) [21:27:03] 10Operations, 10Commons, 10SRE-swift-storage, 10Traffic, and 2 others: upload-lb.ulsfo.wikimedia.org still allow access to some deleted files - https://phabricator.wikimedia.org/T133819 (10BBlack) [21:27:08] 10Operations, 10Commons, 10SRE-swift-storage, 10Traffic, and 2 others: Deleted files sometimes remain visible to non-privileged users if permanently linked - https://phabricator.wikimedia.org/T109331 (10BBlack) [21:27:14] 10Operations, 10Commons, 10MediaWiki-File-management, 10Traffic, 10Patch-For-Review: Image cache issue when 'over-writing' an image on commons - https://phabricator.wikimedia.org/T119038 (10BBlack) [21:27:18] DannyS712: that would be aphlict. no changes I am aware of but will take a look [21:27:34] (03CR) 10Ppchelko: [C: 03+1] "please self-merge when ready for testing" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630947 (https://phabricator.wikimedia.org/T261358) (owner: 10Cicalese) [21:28:08] so first of all.. the service for that is still running [21:28:08] 10Operations, 10Traffic, 10serviceops, 10Performance-Team (Radar), 10Sustainability: Make CDN purges reliable - https://phabricator.wikimedia.org/T133821 (10BBlack) 05Stalled→03Resolved a:03ema This should've been closed back when T250781 closed - all purge traffic now goes via kafka queues and mul... 
[21:30:27] !log started DHCP service on install2003 again [21:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:24] DannyS712: so there is a dedicated VM running that is just the notification service for Phabricator and I can confirm it is both running and still receiving messages [21:31:47] (03PS1) 1020after4: testwikis wikis to 1.36.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630949 [21:31:49] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.36.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630949 (owner: 1020after4) [21:31:58] hmm, odd. If its still not sending me notifications in a few hours or tomorrow should I file a bug? [21:32:20] (03PS1) 10Andrew Bogott: cloudvirt1016: update nic names [puppet] - 10https://gerrit.wikimedia.org/r/630950 [21:32:26] (03Merged) 10jenkins-bot: testwikis wikis to 1.36.0-wmf.10 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630949 (owner: 1020after4) [21:32:38] I too seem to be disconnected from aphlict [21:32:39] DannyS712: yes, please do. this is called "aphlict" [21:32:52] twentyafterfour: should we try a simple restart first? 
[21:32:59] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1016: update nic names [puppet] - 10https://gerrit.wikimedia.org/r/630950 (owner: 10Andrew Bogott) [21:33:03] but i DO see it receiving messages [21:33:15] in aphlict.log [21:33:23] (03PS4) 10Dmaza: Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630653 (https://phabricator.wikimedia.org/T260461) [21:33:31] when I refresh the page the notifications are there, I just don't get live ones [21:34:02] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.10 [21:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:24] mutante: I'm not sure what's up, it doesn't seem like phab is even trying to connect [21:34:28] DannyS712: yea, that would be consistent with aphlict being down [21:34:30] maybe the config got changed somehow [21:35:24] (03CR) 10Dmaza: "This change is ready for review." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630951 (https://phabricator.wikimedia.org/T260461) (owner: 10Dmaza) [21:35:46] aphlict itself seems to be working. so yea, will look at phab side [21:36:36] mutante: notification.servers is empty in the config [21:36:51] in phab's local settings [21:37:08] `sudo /srv/phab/phabricator/bin/config get notification.servers` [21:37:25] twentyafterfour: ugh.. OK.. well I wonder why [21:37:27] which is managed by puppet [21:37:41] PROBLEM - PHP opcache health on mwdebug1002 is CRITICAL: CRITICAL: opcache free space is below 50 MB https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:37:50] oooh.. yes, looking [21:38:03] twentyafterfour: you can worry about the train thing for now, brb [21:38:12] I think it was taking the database value before but for some reason it seems like it decided to take the local setting version from puppet instead of db? [21:39:00] I ..eh.. can't confirm. 
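Note on the notification.servers question above: the hypothesis in the channel is that an empty value in the puppet-managed local settings is being used instead of the value stored in the database. A minimal Python sketch of that kind of layered lookup follows. It assumes, as hypothesised above and not confirmed here, that a local value takes precedence over a database value; the function name `resolve` and the example server entry are hypothetical and are not Phabricator's actual code or configuration.

```python
from typing import Any, Mapping, Sequence

# Sentinel so that an explicitly empty value ([] or "") still counts as "set".
_MISSING = object()


def resolve(key: str, sources: Sequence[Mapping[str, Any]]) -> Any:
    """Return the value of `key` from the first source that defines it.

    `sources` is ordered from highest to lowest priority.
    """
    for source in sources:
        value = source.get(key, _MISSING)
        if value is not _MISSING:
            return value
    raise KeyError(key)


# Hypothetical data: the local layer defines the key but leaves it empty,
# while the database layer holds a placeholder server entry.
local_settings = {"notification.servers": []}
database_settings = {
    "notification.servers": [{"type": "client", "host": "example.invalid"}]
}

# Assumed priority order: local first, then database.
shadowed = resolve("notification.servers", [local_settings, database_settings])
# With the key absent from the local layer, the database value is used.
fallback = resolve("notification.servers", [{}, database_settings])
```

Under these assumptions an empty local entry hides a populated database entry, which would match the "disconnected" symptom; if the real precedence is the other way round, the sketch does not apply.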
[21:39:06] (03PS1) 10Andrew Bogott: Update partman recipes for cloudvirts [puppet] - 10https://gerrit.wikimedia.org/r/630954 (https://phabricator.wikimedia.org/T263677) [21:39:10] I do see the notification server now [21:39:16] oh? [21:39:56] ugh I made the mistake of running deploy-promote to revert to wmf.10, but that's gonna take forever [21:39:59] you are right, it SHOULD use the values from the database, which is what I'm seeing [21:40:09] mutante: hmmm interesting [21:40:17] but local is empty [21:40:33] right ... local shouldn't really be empty I think [21:40:55] but maybe I'm wrong about whether it's using the db or the local setting [21:41:08] I just see "disconnected" in the notification menu [21:41:28] I made this change recently: https://gerrit.wikimedia.org/r/c/operations/puppet/+/630309/2/modules/phabricator/manifests/aphlict.pp [21:41:57] it's just the data type [21:42:38] hmmm [21:42:51] I don't see how it would break.. but that was recent [21:43:37] i was thinking for a moment a change to a template would have triggered puppet refresh which somehow failed..
but does not look like it either [21:47:45] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.10 (duration: 13m 45s) [21:47:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:55] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:48:13] the service is listening on its port but i dont see any traffic for it with tcpdump, i will try restarting aphlict [21:48:23] RECOVERY - PHP opcache health on mwdebug1002 is OK: OK: opcache is healthy https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_opcache_health [21:49:27] and now I DO see traffic again [21:49:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:49:47] but when I clicked a "test notification" button I did not see it yet.. [21:49:57] !log restarted aphlict service on aphlict1001 [21:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:50:13] DannyS712: any change? i see traffic between phab1001 and aphlict1001 now [21:50:53] I'm still seeing it disconnected in the notification menu [21:51:21] twentyafterfour: on both sides I can see some activity now with "tcpdump port 22281" [21:51:48] hitting refresh in one tab - once I get a notification I'll switch to the other and see if it was shown automatically [21:52:02] can we tell phab to remove and add it again? 
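Note on the connectivity checks above: the question of whether one host can open a TCP connection to the notification port can be sketched in a few lines of Python. This is a minimal sketch, not the tooling used in the channel; the helper name `can_connect` is hypothetical, and 22281 is simply the port quoted in the tcpdump command above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example call (host is a placeholder, port as quoted in the log):
#   can_connect("aphlict.example.invalid", 22281)
```

A successful connect only shows that the port is open from that host. It does not show that a browser's WebSocket upgrade through any proxy in front of the service would succeed.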
[21:52:26] hmm maybe [21:54:28] suddenly much more activity [21:55:33] I manually set notification.servers with config set [21:55:41] but puppet will overwrite it next run [21:56:18] the really strange part is I still don't have a connection to aphlict [21:56:26] maybe the status is cached somewhere [21:57:29] twentyafterfour: if we disable puppet, change config, re-enable puppet and it re-breaks it, that would tell us something at least [21:58:14] yeah [21:58:29] twentyafterfour: this works too: @phab1001:~# telnet aphlict.discovery.wmnet 22281 [21:59:08] it seems to me it's just the config that's broken in phab especially since I don't see any errors anywhere other than "disconnected" [21:59:18] !log temp. disabled puppet on phab1001 [21:59:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:00:04] confirmed to still be an issue [22:01:08] strange I guess setting the config isn't all that's needed... [22:01:15] twentyafterfour: remind me if you can change the global notification server settings in web UI (while puppet is disabled) [22:01:29] oh, you did [22:01:31] not in the web ui but from the cli I can (and I did) [22:01:35] ok [22:01:56] but it seems not to have fixed it. and I don't even see an attempt at a websocket connection from my browser [22:02:49] PROBLEM - High average GET latency for mw requests on appserver in codfw on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [22:04:25] oh now I see the error: https://phabricator.wikimedia.org/config/cluster/notifications/ [22:04:27] RECOVERY - High average GET latency for mw requests on appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=appserver&var-method=GET [22:04:43] " Got HTTP 200, but expected HTTP 501 (WebSocket Upgrade)!" [22:04:46] ok. so.. service is running, phab can connect to it, there seems to be some activity between them. can it be at traffic layer? [22:04:49] oh [22:05:14] yeah so I guess it's envoy [22:05:44] phab->aphlict is working but client->aphlict isn't [22:05:51] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:07:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:11:30] well, the envoy config has upgrade_type: "websocket" and it's running [22:11:56] hmm very strange [22:12:25] DannyS712: do we know when it stopped? [22:14:03] (03Abandoned) 10Reedy: Update size dblists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/607605 (owner: 10Reedy) [22:25:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:26:34] !log phab1001 - re-enabled puppet and running it [22:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:13] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:33:03] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (5) node(s) change every puppet run: restbase1028.eqiad.wmnet, restbase1029.eqiad.wmnet, wdqs1009.eqiad.wmnet, icinga1001.wikimedia.org, restbase1030.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [22:33:58] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [22:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:04] (03PS1) 10Dzahn: DHCP: switch TFTP server for ulsfo from bast4002 to install4001 [puppet] - 10https://gerrit.wikimedia.org/r/630964 (https://phabricator.wikimedia.org/T252526) [22:44:43] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) {F32368154} Nope no deploys have happened recently. 
It has been happening every few hours since the 24th [22:45:36] (03PS1) 10Dzahn: DHCP switch TFTP server for esams from bast3004 to install3001 [puppet] - 10https://gerrit.wikimedia.org/r/630966 (https://phabricator.wikimedia.org/T252526) [22:49:44] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [22:49:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:00] (03PS2) 10Dzahn: DHCP switch TFTP server for esams from bast3004 to install3001 [puppet] - 10https://gerrit.wikimedia.org/r/630966 (https://phabricator.wikimedia.org/T252526) [22:54:25] (03PS2) 10Dzahn: DHCP: switch TFTP server for ulsfo from bast4002 to install4001 [puppet] - 10https://gerrit.wikimedia.org/r/630964 (https://phabricator.wikimedia.org/T252526) [22:55:56] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) ` 2020-09-29 20:47:59 287 SYS1003 System CPU Resetting. 2020-09-29 20:47:51 286 PWR2270 The Intel Management Engine has encounte... [23:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Evening backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200929T2300) [23:00:04] dmaza: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:25] I can deploy today! 
[23:00:34] Thank you very much Urbanecm [23:00:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:01:03] (03PS5) 10Urbanecm: Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630653 (https://phabricator.wikimedia.org/T260461) (owner: 10Dmaza) [23:01:11] (03CR) 10Urbanecm: [C: 03+2] Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630653 (https://phabricator.wikimedia.org/T260461) (owner: 10Dmaza) [23:01:20] (03CR) 10Dzahn: [C: 04-2] "this should include manually removing package/service/firewall hole ..." [puppet] - 10https://gerrit.wikimedia.org/r/629496 (https://phabricator.wikimedia.org/T252526) (owner: 10Dzahn) [23:02:19] (03Merged) 10jenkins-bot: Enable watchlist expiry feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630653 (https://phabricator.wikimedia.org/T260461) (owner: 10Dmaza) [23:02:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:02:57] (03PS1) 10Dzahn: DHCP: set TFTP servers in other DC to bootstrap install servers [puppet] - 10https://gerrit.wikimedia.org/r/630971 (https://phabricator.wikimedia.org/T252526) [23:03:02] !log dzahn@cumin1001 START - Cookbook sre.ganeti.makevm [23:03:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:03:28] dmaza: could you test the first change (https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/630653) at mwdebug2001, please? [23:03:46] yes, thank you. 
Testing no [23:03:51] *now [23:05:19] (03PS2) 10Urbanecm: Enable watchlist expiry feature (wikisource) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630951 (https://phabricator.wikimedia.org/T260461) (owner: 10Dmaza) [23:05:33] Urbanecm: I'm getting readonly mode [23:05:37] yeah same here [23:05:41] (03CR) 10Urbanecm: [C: 03+2] Enable watchlist expiry feature (wikisource) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630951 (https://phabricator.wikimedia.org/T260461) (owner: 10Dmaza) [23:06:02] Urbanecm: last time the other deployer had to switch to the other datacenter (codfw) [23:06:15] (03Merged) 10jenkins-bot: Enable watchlist expiry feature (wikisource) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630951 (https://phabricator.wikimedia.org/T260461) (owner: 10Dmaza) [23:06:17] dmaza: musikanimal: can you confirm you use mwdebug2001? [23:06:34] doh!! my bad [23:06:58] dmaza: it doesn't turn readonly on my end [23:07:13] it works fine.. sorry.. I choose the wrong server [23:07:21] give me a few to run more tests [23:07:22] no problem, can happen to anybody :) [23:07:27] sure, take your time [23:11:22] Urbanecm: looks good [23:11:58] dmaza: thanks, syncing [23:13:39] (03PS1) 10Dzahn: add testvm3001.esams.wmnet [dns] - 10https://gerrit.wikimedia.org/r/630972 [23:13:40] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: bc6dda2885906b2561c757b4ac54c90c052b8df0: Enable watchlist expiry feature (T260461) (duration: 00m 58s) [23:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:45] T260461: Watchlist Expiry: Enable feature for Group 1 pilot wikis [Start: Sept 28/29] - https://phabricator.wikimedia.org/T260461 [23:13:45] thanks [23:13:57] no problem dmaza :) [23:14:01] (03CR) 10jerkins-bot: [V: 04-1] add testvm3001.esams.wmnet [dns] - 10https://gerrit.wikimedia.org/r/630972 (owner: 10Dzahn) [23:14:25] dmaza: the second patch is now at mwdebug2001, can you test, please? 
[23:14:33] testing... [23:15:38] (03PS2) 10Dzahn: add testvm3001.esams.wmnet [dns] - 10https://gerrit.wikimedia.org/r/630972 [23:17:18] (03PS3) 10Dzahn: add testvm3001.esams.wmnet [dns] - 10https://gerrit.wikimedia.org/r/630972 [23:18:04] @Urbanecm everything looks good [23:18:12] thanks, syncing [23:19:02] (03CR) 10Dzahn: [C: 03+2] add testvm3001.esams.wmnet [dns] - 10https://gerrit.wikimedia.org/r/630972 (owner: 10Dzahn) [23:19:11] (03PS4) 10Dzahn: add testvm3001.esams.wmnet [dns] - 10https://gerrit.wikimedia.org/r/630972 [23:19:36] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 68d7af9cb38de09b4cb8655f0b095b60d470fbbc: Enable watchlist expiry feature (wikisource; T260461) (duration: 00m 58s) [23:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:19:41] T260461: Watchlist Expiry: Enable feature for Group 1 pilot wikis [Start: Sept 28/29] - https://phabricator.wikimedia.org/T260461 [23:19:44] dmaza: should be live :) [23:19:48] anything else? [23:19:56] that's all.. thank you very much for your help [23:20:13] (03PS7) 10Mholloway: Echo: Enable push on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628343 (https://phabricator.wikimedia.org/T262936) [23:20:45] no problem! [23:20:56] !log Evening B&C window completed [23:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:21:27] Urbanecm: what decides which server a particular change's going to be merged to before syncing? 
ie mwdebug1001 and mwdebug2001 [23:21:28] (03CR) 10Mholloway: [C: 03+2] Echo: Enable push on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628343 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [23:22:03] dont|panic: me :D (or whoever leads the given window) [23:22:14] (03Merged) 10jenkins-bot: Echo: Enable push on all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/628343 (https://phabricator.wikimedia.org/T262936) (owner: 10Mholloway) [23:22:29] oh lol [23:24:52] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable Echo app push on all Wikipedias (T262936) (duration: 00m 59s) [23:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:58] T262936: Enable app push notifications in production - https://phabricator.wikimedia.org/T262936 [23:26:11] dont|panic: and "); [23:26:21] $wmfMasterDatacenter = $etcdConfig->get( 'common/WMFMasterDatacenter' ); decides which one is readonly [23:30:38] yes, I can theoretically pull a change to mwdebug1002, let's say, but you'd be in read only mode, because eqiad is read-only right now :- [23:30:59] !log dzahn@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [23:31:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:32:57] mutante: shouldn't cookbook's messages have some message, similar to what scap publishes? 
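[note] The read-only behaviour discussed above (testing on mwdebug1001 or mwdebug1002 while eqiad is read-only, versus mwdebug2001 in codfw) can be sketched roughly as follows. This is only an illustration, not the actual MediaWiki configuration code: the `SITE_BY_DIGIT` mapping is an assumption inferred from the hostnames in this log, `etcd_config` is a hypothetical plain dictionary standing in for the etcd lookup quoted above, and `datacenter_of` / `is_read_only` are made-up helper names.

```python
import re

# Assumption inferred from this log: the first digit of a host's number
# encodes its datacenter (mwdebug1xxx -> eqiad, mwdebug2xxx -> codfw).
SITE_BY_DIGIT = {"1": "eqiad", "2": "codfw"}


def datacenter_of(hostname: str) -> str:
    """Return the datacenter a short hostname such as 'mwdebug2001' belongs to."""
    match = re.search(r"(\d)\d{3}$", hostname)
    if match is None or match.group(1) not in SITE_BY_DIGIT:
        raise ValueError(f"cannot infer datacenter from {hostname!r}")
    return SITE_BY_DIGIT[match.group(1)]


def is_read_only(hostname: str, master_datacenter: str) -> bool:
    """A host is treated as read-only when it is not in the master datacenter."""
    return datacenter_of(hostname) != master_datacenter


# Hypothetical stand-in for the etcd-backed config lookup quoted above.
etcd_config = {"common/WMFMasterDatacenter": "codfw"}
master = etcd_config["common/WMFMasterDatacenter"]

for host in ("mwdebug1001", "mwdebug1002", "mwdebug2001"):
    state = "read-only" if is_read_only(host, master) else "read-write"
    print(host, state)
```

With the master datacenter assumed to be codfw, as the conversation implies for this window, the sketch reports the eqiad debug hosts as read-only and mwdebug2001 as writable, which matches what dmaza saw after picking the wrong server.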
[23:34:55] Urbanecm: yes [23:35:57] !log created testvm3001.esams.wmnet to test install3001 [23:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:36:46] (03PS1) 10Dzahn: DHCP: add testvm3001, testvm4001 [puppet] - 10https://gerrit.wikimedia.org/r/630974 [23:41:49] (03CR) 10Dzahn: [C: 03+2] DHCP: add testvm3001, testvm4001 [puppet] - 10https://gerrit.wikimedia.org/r/630974 (owner: 10Dzahn) [23:41:55] (03PS2) 10Dzahn: DHCP: add testvm3001, testvm4001 [puppet] - 10https://gerrit.wikimedia.org/r/630974 [23:46:45] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:48:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:52:58] I'll tack on a last-minute patch to B&C [23:55:09] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [23:56:43] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX