[00:46:49] Operations, Performance-Team, media-storage, serviceops, and 2 others: Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (Krinkle)
[00:47:21] (PS2) Krinkle: Remove unused wmgReduceStartupExpiry logic in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/542641 (https://phabricator.wikimedia.org/T235314)
[00:47:29] (CR) Krinkle: [C: +2] Remove unused wmgReduceStartupExpiry logic in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/542641 (https://phabricator.wikimedia.org/T235314) (owner: Krinkle)
[00:48:17] (Merged) jenkins-bot: Remove unused wmgReduceStartupExpiry logic in CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/542641 (https://phabricator.wikimedia.org/T235314) (owner: Krinkle)
[00:51:45] (PS2) Krinkle: Remove wmgReduceStartupExpiry (no longer used) [mediawiki-config] - https://gerrit.wikimedia.org/r/542642 (https://phabricator.wikimedia.org/T235314)
[00:52:27] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: ec77b1b515940c73 (duration: 00m 55s)
[00:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:29:43] PROBLEM - Host mr1-eqsin.oob is DOWN: PING CRITICAL - Packet loss = 100%
[01:41:21] RECOVERY - Host mr1-eqsin.oob is UP: PING OK - Packet loss = 0%, RTA = 222.49 ms
[02:24:59] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1259831664 and 67 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:25:17] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 270353424 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:31:31] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 78464 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:31:49] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 44400 and 20 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:35:09] PROBLEM - Host db2068 is DOWN: PING CRITICAL - Packet loss = 100%
[03:50:07] RECOVERY - Host db2068 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms
[03:53:25] PROBLEM - DPKG on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[03:53:35] PROBLEM - Disk space on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2068&var-datasource=codfw+prometheus/ops
[03:53:41] PROBLEM - configured eth on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[03:53:51] PROBLEM - Check whether ferm is active by checking the default input chain on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[03:53:53] PROBLEM - dhclient process on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[03:54:01] PROBLEM - Check size of conntrack table on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[03:54:09] PROBLEM - MariaDB disk space on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:54:31] PROBLEM - MariaDB Slave IO: s7 on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[03:54:35] PROBLEM - MariaDB Slave SQL: s7 on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[03:54:47] PROBLEM - MariaDB read only s7 on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:54:47] PROBLEM - Check systemd state on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:54:51] PROBLEM - mysqld processes on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[03:56:07] PROBLEM - puppet last run on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[04:05:51] PROBLEM - MariaDB Slave Lag: s7 on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:11:37] PROBLEM - HP RAID on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[04:13:19] PROBLEM - Check the NTP synchronisation status of timesyncd on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP
[04:35:47] PROBLEM - IPMI Sensor Status on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[04:39:51] PROBLEM - MariaDB Slave Lag: m1 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 348.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[04:49:39] RECOVERY - MariaDB Slave Lag: m1 on db2078 is OK: OK slave_sql_lag Replication lag: 0.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[06:57:50] Operations, DBA: db2068 is misbehaving (but is depooled) - https://phabricator.wikimedia.org/T235366 (jijiki)
[07:02:34] ACKNOWLEDGEMENT - Check size of conntrack table on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[07:02:34] ACKNOWLEDGEMENT - Check systemd state on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:02:34] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/NTP
[07:02:34] ACKNOWLEDGEMENT - Check whether ferm is active by checking the default input chain on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[07:02:34] ACKNOWLEDGEMENT - DPKG on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[07:02:35] ACKNOWLEDGEMENT - Disk space on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2068&var-datasource=codfw+prometheus/ops
[07:02:35] ACKNOWLEDGEMENT - HP RAID on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[07:02:36] ACKNOWLEDGEMENT - IPMI Sensor Status on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[07:02:36] ACKNOWLEDGEMENT - MariaDB Slave IO: s7 on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:02:37] ACKNOWLEDGEMENT - MariaDB Slave Lag: s7 on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:02:37] ACKNOWLEDGEMENT - MariaDB Slave SQL: s7 on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[07:02:38] ACKNOWLEDGEMENT - MariaDB disk space on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:02:38] ACKNOWLEDGEMENT - MariaDB read only s7 on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:02:39] ACKNOWLEDGEMENT - configured eth on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[07:02:39] ACKNOWLEDGEMENT - dhclient process on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[07:02:40] ACKNOWLEDGEMENT - mysqld processes on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[07:02:40] ACKNOWLEDGEMENT - puppet last run on db2068 is CRITICAL: connect to address 10.192.48.20 port 5666: Connection refused Effie Mouzeli Host is depooled T235366 https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:26:01] (CR) Daimona Eaytoy: [C: +1] build: Upgrade mediawiki-codesniffer to v28.0.0 [mediawiki-config] - https://gerrit.wikimedia.org/r/542522 (owner: Jforrester)
[13:06:55] (CR) CDanis: prometheus global: add rules for correct global HTTP avail (1 comment) [puppet] - https://gerrit.wikimedia.org/r/540676 (https://phabricator.wikimedia.org/T234567) (owner: CDanis)
[13:07:53] PROBLEM - Check systemd state on elastic2037 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:09:58] (Abandoned) Urbanecm: Grant autocreateaccount to everyone on closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/540380 (https://phabricator.wikimedia.org/T222117) (owner: Urbanecm)
[13:13:58] (PS1) Urbanecm: Allow certain users to create account at closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/542755 (https://phabricator.wikimedia.org/T222117)
[13:14:47] (CR) jerkins-bot: [V: -1] Allow certain users to create account at closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/542755 (https://phabricator.wikimedia.org/T222117) (owner: Urbanecm)
[13:16:46] (PS2) Urbanecm: Allow certain users to create account at closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/542755 (https://phabricator.wikimedia.org/T222117)
[13:17:36] (CR) jerkins-bot: [V: -1] Allow certain users to create account at closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/542755 (https://phabricator.wikimedia.org/T222117) (owner: Urbanecm)
[13:18:21] (PS3) Urbanecm: Allow certain users to create account at closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/542755 (https://phabricator.wikimedia.org/T222117)
[13:19:22] (CR) jerkins-bot: [V: -1] Allow certain users to create account at closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/542755 (https://phabricator.wikimedia.org/T222117) (owner: Urbanecm)
[13:22:03] (PS4) Urbanecm: Allow certain users to create account at closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/542755 (https://phabricator.wikimedia.org/T222117)
[13:23:01] (CR) jerkins-bot: [V: -1] Allow certain users to create account at closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/542755 (https://phabricator.wikimedia.org/T222117) (owner: Urbanecm)
[13:24:02] (PS5) Urbanecm: Allow certain users to create account at closed wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/542755 (https://phabricator.wikimedia.org/T222117)
[13:27:19] RECOVERY - Check systemd state on elastic2037 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:32] Operations, SRE-Access-Requests: Requesting access to 'analytics-privatedata-users' and 'researchers' for Jerrie Kumalah - https://phabricator.wikimedia.org/T234433 (jkumalah) @Aklapper i figured out the issue. I had the wrong ssh key file in my configuration.
[16:34:07] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[16:37:21] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[22:12:41] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:23:17] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[22:36:19] (CR) DannyS712: "Is there another commit that I can't find for banwiki? The following seem to be missing:" [mediawiki-config] - https://gerrit.wikimedia.org/r/541527 (https://phabricator.wikimedia.org/T234768) (owner: Urbanecm)
[22:37:24] (CR) Urbanecm: "> Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/541527 (https://phabricator.wikimedia.org/T234768) (owner: Urbanecm)
[22:48:11] (PS4) DannyS712: Initial configuration for banwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/541527 (https://phabricator.wikimedia.org/T234768) (owner: Urbanecm)
[22:48:39] (CR) DannyS712: "> > Patch Set 3:" [mediawiki-config] - https://gerrit.wikimedia.org/r/541527 (https://phabricator.wikimedia.org/T234768) (owner: Urbanecm)
[22:50:06] (PS6) DannyS712: Add `autopatrol` to translation administrators on mediawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/541057