[00:25:30] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:30:40] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:52:00] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:55:08] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:57:10] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:57:22] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:57:42] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [00:59:54] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:00:50] RECOVERY - Check systemd state on boron is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:52] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:10:06] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:10:26] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:10:28] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:11:44] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:13:02] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:14:14] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:18:15] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:21:30] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:21:30] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:22:40] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:26:48] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:29:52] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:31:52] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 40 [01:31:52] ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:35:04] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:37:02] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:37:30] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:48:42] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:48:52] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:51:12] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:56:32] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [01:58:00] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:05:58] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:10:00] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:14:28] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:16:24] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:16:46] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:23:52] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:24:18] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:25:12] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:29:24] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:30:16] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:31:50] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returne [02:31:50] status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:34:28] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:35:26] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:35:50] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:38:22] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:42:10] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:43:06] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:44:30] Operations, LDAP-Access-Requests: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (Jpita) [02:47:16] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:50:52] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is 
CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:53:26] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:54:54] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:58:32] PROBLEM - Long running screen/tmux on people1001 is CRITICAL: CRIT: Long running tmux process. (user: cdanis PID: 18618, 1729058s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [02:58:40] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [02:59:56] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:00:46] Operations, Release-Engineering-Team, serviceops, Performance-Team (Radar), and 2 others: All debug hosts give (likely spurious) message: PHP Fatal error: The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (Krinkle) 
[03:01:14] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:02:30] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:02:34] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:03:28] Operations, CommRel-Specialists-Support, Core Platform Team, Editing-team, and 10 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Krinkle) [03:04:03] Operations, Performance-Team, serviceops: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Krinkle) [03:06:00] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:06:00] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:08:30] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:08:30] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:12:18] PROBLEM - Check systemd state on boron is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:16:38] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:19:12] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:21:50] (PS1) Guozr.im: RemoteExecution: Typo in class CommandReturn [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 [03:22:54] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:26:59] (CR) Guozr.im: "Hi guys," [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 (owner: Guozr.im) [03:30:34] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:37:52] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:38:14] PROBLEM - 
recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:40:48] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:45:34] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:46:54] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 503 (ex [03:46:54] domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:47:20] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:52:32] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 
(expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:54:30] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:54:58] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:55:02] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:56:34] RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [03:58:20] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:58:40] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [03:59:38] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 
200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:01:14] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:02:10] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:03:24] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:07:42] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:10:12] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:32:34] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:33:10] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:35:08] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:35:46] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:54:48] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:57:22] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [04:58:24] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:07:42] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:08:36] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:12:46] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:20:16] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:22:48] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:28:02] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:34:06] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:34:06] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:34:34] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:36:40] RECOVERY - recommendation_api endpoints health on 
scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:36:40] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:37:06] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:40:42] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:40:48] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:43:16] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:44:18] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 40 [05:44:18] ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:49:22] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:51:52] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: 
/{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:54:26] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [05:56:48] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:02:34] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:05:08] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:07:00] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:11:18] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:13:50] RECOVERY - recommendation_api endpoints health on scb1003 is OK: 
All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:17:40] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:17:54] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:20:14] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:23:00] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:25:02] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:30:54] PROBLEM - ores uWSGI web app on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores [06:31:04] PROBLEM - MD RAID on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:31:22] PROBLEM - dhclient 
process on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:31:34] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 40 [06:31:34] ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:31:44] PROBLEM - Check size of conntrack table on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:31:46] PROBLEM - configured eth on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:31:50] PROBLEM - DPKG on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:31:50] PROBLEM - Check whether ferm is active by checking the default input chain on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:32:12] PROBLEM - Check systemd state on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:32:28] PROBLEM - Disk space on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1002&var-datasource=eqiad+prometheus/ops [06:32:30] RECOVERY - recommendation_api endpoints health on 
scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:34:06] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:35:06] PROBLEM - puppet last run on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:38:20] <_joe_> !log restart envoy with 10 requests per connection on mw2231, T247484 [06:38:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:38:26] T247484: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484 [06:40:22] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:41:08] RECOVERY - MD RAID on ores1002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [06:41:26] RECOVERY - dhclient process on ores1002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient [06:41:36] RECOVERY - puppet last run on ores1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:41:50] RECOVERY - Check size of conntrack table on ores1002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack [06:41:50] RECOVERY - configured eth on ores1002 is OK: OK - interfaces up 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [06:41:54] RECOVERY - DPKG on ores1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [06:41:56] RECOVERY - Check whether ferm is active by checking the default input chain on ores1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [06:41:59] Operations: decom racktables? - https://phabricator.wikimedia.org/T247646 (MoritzMuehlenhoff) I think deploying it on Buster will be unproblematic, the current host is already on Stretch, so the big incompatibilities between PHP 5 and 7 are already addressed. Racktables is also still maintained (last mainten... [06:42:20] RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:42:34] RECOVERY - Disk space on ores1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1002&var-datasource=eqiad+prometheus/ops [06:42:52] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:42:52] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:44:58] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:45:05] Operations: 
Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff) [06:45:26] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:47:30] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:50:08] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:52:40] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [06:54:29] !log removing some library packages from jessie/stretch after labstore1006/1007 dist-upgrade to buster [06:54:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:01:05] (PS2) Brian Wolff: Add wikidata.beta.wmflabs.org to beta csp [mediawiki-config] - https://gerrit.wikimedia.org/r/578183 [07:02:00] Operations, DBA, OTRS, Recommendation-API, Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (Marostegui) [07:02:09] (CR) jerkins-bot: [V: -1] Add wikidata.beta.wmflabs.org to beta csp [mediawiki-config] - https://gerrit.wikimedia.org/r/578183 (owner: Brian Wolff) [07:02:44] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:03:12] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:03:58] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:05:12] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:05:16] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:05:46] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:06:30] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:06:50] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 
200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:07:44] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:09:08] (PS5) Elukey: admin: simplify and document some analytics posix groups [puppet] - https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) [07:09:20] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:12:13] (CR) Elukey: [C: +2] admin: simplify and document some analytics posix groups [puppet] - https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) (owner: Elukey) [07:14:49] !log installing libgd2 security updates on jessie [07:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:08] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:21:34] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:22:40] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:24:04] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:24:26] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:24:51] (PS1) Marostegui: Revert "install_server: Allow reimage db2121" [puppet] - https://gerrit.wikimedia.org/r/579886 [07:25:14] (PS2) Marostegui: Revert "install_server: Allow reimage db2121" [puppet] - https://gerrit.wikimedia.org/r/579886 [07:26:58] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:27:53] (CR) Marostegui: [C: +2] Revert "install_server: Allow reimage db2121" [puppet] - https://gerrit.wikimedia.org/r/579886 (owner: Marostegui) [07:30:16] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:32:46] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:36:42] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:40:44] PROBLEM - recommendation_api 
endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:41:42] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:45:46] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:47:15] !log installing lxml security updates [07:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:16] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:51:08] Operations, Traffic: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968 (ema) Resolved→Open The issue occurred again on cp4025. Reopening. ` Mar 14 15:51:49 cp4025 varnishd[20511]: Child (20592) not responding to CLI, killed it. Mar 14 15:51:49 cp4025 varnishd[205... 
[07:52:42] !log cp4025: restart varnish-fe to clear 'child restarted' alert T185968 [07:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:46] T185968: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968 [07:54:16] RECOVERY - Varnish frontend child restarted on cp4025 is OK: (C)2 ge (W)2 ge 0 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4025&var-datasource=ulsfo+prometheus/ops [07:54:16] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [07:54:28] (PS1) KartikMistry: apertium-es-pt: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-es-pt] - https://gerrit.wikimedia.org/r/579889 (https://phabricator.wikimedia.org/T247585) [07:57:10] (PS3) KartikMistry: apertium-br-fr: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-br-fr] - https://gerrit.wikimedia.org/r/579463 (https://phabricator.wikimedia.org/T247585) [07:57:40] !log installing libxslt security updates [07:57:41] (PS2) KartikMistry: apertium-cy-en: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-cy-en] - https://gerrit.wikimedia.org/r/579683 (https://phabricator.wikimedia.org/T247585) [07:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:58:20] (PS3) KartikMistry: apertium-cat-ita: Fix FTBFS with apertium 3.6 + 0.2.1 release [debs/contenttranslation/apertium-ca-it] - https://gerrit.wikimedia.org/r/579509 (https://phabricator.wikimedia.org/T247585) [07:58:43] Operations, Traffic: OOM killer killed varnihsd cache-main on cp3053 - https://phabricator.wikimedia.org/T247195 (ema) [07:58:46] Operations, Traffic: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968 (ema) [07:59:26] (PS2) 
KartikMistry: apertium-en-es: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-en-es] - https://gerrit.wikimedia.org/r/579757 (https://phabricator.wikimedia.org/T247585) [08:02:48] !log cp4025 restart trafficserver-tls to clear 'tls process restarted' alert T241593 T185968 [08:02:52] (PS3) Brian Wolff: Add wikidata.beta.wmflabs.org + prod domains to beta csp [mediawiki-config] - https://gerrit.wikimedia.org/r/578183 [08:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:54] (PS1) Brian Wolff: Make CSP enforce on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/579890 (https://phabricator.wikimedia.org/T244124) [08:02:55] T185968: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968 [08:02:55] T241593: cp1083: ats-tls and varnish-fe crashed due to insufficient memory - https://phabricator.wikimedia.org/T241593 [08:06:32] RECOVERY - traffic_server tls process restarted on cp4025 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=ulsfo+prometheus/ops&var-instance=cp4025&var-layer=tls [08:08:30] ACKNOWLEDGEMENT - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T247561 [08:08:30] (CR) Jcrespo: "The diff is ok, but the commit messages doesn't follow the guidelines:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 (owner: Guozr.im) [08:09:29] (CR) Jcrespo: "Note there is not a single verb + subject sentence on the commit, all of them should be, IMHO." 
[software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 (owner: Guozr.im) [08:13:04] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:13:52] (PS1) Jcrespo: mariadb-backups: Update RemoteCommandExecution to the latest version [puppet] - https://gerrit.wikimedia.org/r/579891 [08:15:36] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:15:58] !log Review and enable events on recently migrated 10.4 hosts - T247728 [08:16:02] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:04] T247728: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728 [08:16:56] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:18:34] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:19:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [08:24:40] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:27:14] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:31:06] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:31:20] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:38:22] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:38:38] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:39:48] (PS2) Jcrespo: mariadb-backups: Update 
RemoteCommandExecution to the latest version [puppet] - https://gerrit.wikimedia.org/r/579891 [08:39:50] (PS1) Jcrespo: mariadb-backups: Move x1 backups from dbprov[12]001 to dbprov[12]002 [puppet] - https://gerrit.wikimedia.org/r/579892 (https://phabricator.wikimedia.org/T138562) [08:40:52] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:41:26] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:41:36] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:41:38] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:42:24] (CR) Marostegui: [C: +1] mariadb-backups: Move x1 backups from dbprov[12]001 to dbprov[12]002 [puppet] - https://gerrit.wikimedia.org/r/579892 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo) [08:42:49] (PS2) Jcrespo: mariadb-backups: Move x1 backups from dbprov[12]001 to dbprov[12]002 [puppet] - https://gerrit.wikimedia.org/r/579892 (https://phabricator.wikimedia.org/T138562) [08:43:01] (PS3) Jcrespo: mariadb-backups: Move x1 backups from dbprov[12]001 to dbprov[12]002 [puppet] - https://gerrit.wikimedia.org/r/579892 (https://phabricator.wikimedia.org/T138562) 
[08:44:10] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:49:08] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:51:05] (CR) Alexandros Kosiaris: [C: +2] "The topic was wrong btw. It's eventgate-analytics, not evenstreams." [deployment-charts] - https://gerrit.wikimedia.org/r/579324 (https://phabricator.wikimedia.org/T247484) (owner: Alexandros Kosiaris) [08:51:22] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:51:40] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:52:40] (PS1) KartikMistry: Enable Content Translation in Malay, Azerbaijani and Estonian WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/579893 (https://phabricator.wikimedia.org/T246622) [08:53:38] (CR) Jcrespo: [C: +2] mariadb-backups: Move x1 backups from dbprov[12]001 to dbprov[12]002 [puppet] - https://gerrit.wikimedia.org/r/579892 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo) [08:53:41] (CR) jerkins-bot: [V: -1] Enable Content Translation in Malay, Azerbaijani and Estonian WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/579893 (https://phabricator.wikimedia.org/T246622) (owner: 
KartikMistry) [08:53:46] (PS4) Jcrespo: mariadb-backups: Move x1 backups from dbprov[12]001 to dbprov[12]002 [puppet] - https://gerrit.wikimedia.org/r/579892 (https://phabricator.wikimedia.org/T138562) [08:53:52] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:54:10] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [08:55:50] (CR) KartikMistry: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/579893 (https://phabricator.wikimedia.org/T246622) (owner: KartikMistry) [08:55:58] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 4.328e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:00:11] (PS1) Jcrespo: mariadb-backups: Change bacula backup frequency of dbprov to weekly [puppet] - https://gerrit.wikimedia.org/r/579894 (https://phabricator.wikimedia.org/T138562) [09:02:46] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article t [09:02:46] unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:03:30] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article 
title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:07:48] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:08:30] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:22:59] (CR) Jcrespo: [C: +2] mariadb-backups: Change bacula backup frequency of dbprov to weekly [puppet] - https://gerrit.wikimedia.org/r/579894 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo) [09:23:02] Operations, LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (darthmon_wmde) [09:23:40] Operations, LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (darthmon_wmde) Do they need to have a phabricator/wikitech account first, @Aklapper ? 
[09:24:26] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:26:28] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:26:58] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:28:42] PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 4.332e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [09:29:57] (03PS2) 10Jcrespo: mariadb-backups: Change bacula backup frequency of dbprov to weekly [puppet] - 10https://gerrit.wikimedia.org/r/579894 (https://phabricator.wikimedia.org/T138562) [09:30:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1011 to es2 master, this is a NOOP T239791', diff saved to https://phabricator.wikimedia.org/P10700 and previous config saved to /var/cache/conftool/dbconfig/20200316-093048-marostegui.json [09:30:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:55] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 [09:31:30] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:32:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1015 for upgrade and restart T239791', diff saved to https://phabricator.wikimedia.org/P10701 and previous config saved to /var/cache/conftool/dbconfig/20200316-093228-marostegui.json [09:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:42] PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:35:14] RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:35:28] (03CR) 10Volans: "Couple of questions inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [09:37:28] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_eventgate_analytics_external_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:39:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:41:16] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:42:02] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the [09:42:02] s 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:43:50] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:44:36] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:45:02] 10Operations, 10SRE-Access-Requests: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (10ArielGlenn) If there are no objection on this task by Wednesday Mar 18, I'll prepare a patch and this request can go ahead. 
[09:45:16] 10Operations, 10Research: reccommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (10elukey) [09:45:32] 10Operations, 10Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (10elukey) [09:46:15] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title [09:46:15] xpected status 404 (expecting: 200) Elukey T247732 https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:46:15] ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) Elukey T247732 https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:46:27] ok acked [09:51:45] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:52:49] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [09:55:24] (03CR) 10Filippo Giunchedi: [C: 03+1] kibana: refactor kibana role to kibana profile [puppet] - 10https://gerrit.wikimedia.org/r/579396 
(https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [09:56:58] (03PS3) 10Jcrespo: mariadb-backups: Update RemoteCommandExecution to the latest version [puppet] - 10https://gerrit.wikimedia.org/r/579891 [09:57:00] (03PS1) 10Jcrespo: mariadb-backups: Change bacula backup frequency of dbprov to weekly [puppet] - 10https://gerrit.wikimedia.org/r/579901 (https://phabricator.wikimedia.org/T138562) [09:58:09] (03PS2) 10Jcrespo: mariadb-backups: Change bacula backup frequency of dbprov to weekly [puppet] - 10https://gerrit.wikimedia.org/r/579901 (https://phabricator.wikimedia.org/T138562) [09:58:54] 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10ArielGlenn) @darthmon_wmde ldap entries are predicated on having a wikitech account. A manager from WMDE should sign off on this here on the task as well. And... [09:59:11] 10Operations, 10Traffic: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968 (10ema) The OOM killer intervened due to "Normal" (non-DMA) free memory on NUMA node 0 going below min (1380412 < 1387544): ` [Sat Mar 14 15:51:23 2020] Node 0 Normal free:1380412kB min:1387544kB low:17... 
[10:01:28] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Change bacula backup frequency of dbprov to weekly [puppet] - 10https://gerrit.wikimedia.org/r/579901 (https://phabricator.wikimedia.org/T138562) (owner: 10Jcrespo) [10:02:20] !log start of ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=50 --sleep=0 --file=15march2217-holes-nulls.list on screen (T219123) [10:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:25] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [10:04:02] (03CR) 10Elukey: kibana: refactor kibana role to kibana profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: 10Mstyles) [10:06:14] (03CR) 10Filippo Giunchedi: ELk7: add curator job to require disktype hdd after 7 days (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/579422 (https://phabricator.wikimedia.org/T247376) (owner: 10Herron) [10:09:23] (03CR) 10Filippo Giunchedi: prometheus::ops: Add prometheus job to scrape Netbox scripts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) (owner: 10CRusnov) [10:10:45] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (10ArielGlenn) Hi @Jpita, because your phabricator account is seemingly not linked to your official (https://meta.wikimedia.org/wiki/Special:CentralAuth/JPita_(WMF)) account, I can't easily v... 
[10:10:57] !log Stop mysql for upgrade on es1015 T239791 [10:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:03] T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 [10:12:38] 10Operations, 10Puppet, 10DBA, 10User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (10Marostegui) [10:13:38] 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10Aklapper) >>! In T247731#5971402, @darthmon_wmde wrote: > Do they need to have a phabricator/wikitech account first? @darthmon_wmde: See the instructions on... [10:13:59] 10Operations, 10Traffic: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968 (10ema) Also worth mentioning that in the specific case of cp4025, the trouble was caused by a sudden [[https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&var-server=cp4025&var-datas... 
[10:14:31] (03PS1) 10Giuseppe Lavagetto: profile::services_proxy: allow defining a retry policy [puppet] - 10https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484) [10:14:33] (03PS1) 10Giuseppe Lavagetto: eventgate-analytics: allow retries on connection reset [puppet] - 10https://gerrit.wikimedia.org/r/579904 (https://phabricator.wikimedia.org/T247484) [10:15:28] (03CR) 10jerkins-bot: [V: 04-1] profile::services_proxy: allow defining a retry policy [puppet] - 10https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [10:16:35] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:17:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1015', diff saved to https://phabricator.wikimedia.org/P10702 and previous config saved to /var/cache/conftool/dbconfig/20200316-101707-marostegui.json [10:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:43] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:19:59] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:20:04] 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 
(10darthmon_wmde) > ldap entries are predicated on having a wikitech account. got it, thanks! > A manager from WMDE should sign off on this here on the task as... [10:20:53] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:22:05] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:25:03] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:25:54] (03PS1) 10Ema: cache: decrease varnish-frontend malloc cache size [puppet] - 10https://gerrit.wikimedia.org/r/579906 (https://phabricator.wikimedia.org/T185968) [10:26:32] ok ack is not really working for recommendation api [10:26:50] downtime only the service probably is best [10:28:29] 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10ArielGlenn) >>! In T247731#5971523, @darthmon_wmde wrote: > >> A manager from WMDE should sign off on this here on the task as well. > That'd be me as Engin... 
[10:28:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1015', diff saved to https://phabricator.wikimedia.org/P10703 and previous config saved to /var/cache/conftool/dbconfig/20200316-102829-marostegui.json [10:28:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:04] jan_drewniak: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T1030). [10:31:28] <_joe_> elukey: just a q: did you try to restart the service? [10:31:36] (03CR) 10Alexandros Kosiaris: [C: 04-1] profile::services_proxy: allow defining a retry policy (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [10:32:14] (03CR) 10Alexandros Kosiaris: [C: 04-1] "typo, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579904 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [10:32:23] _joe_ nope, but I can try now [10:32:45] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:33:11] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:34:22] (03PS1) 10Ema: cache: limit upload transient storage usage [puppet] - 10https://gerrit.wikimedia.org/r/579907 (https://phabricator.wikimedia.org/T185968) [10:35:53] (03CR) 10Giuseppe Lavagetto: profile::services_proxy: allow defining a retry policy (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [10:36:04] !log roll restart of recommendation service on scb* as attempt to fix the flapping alerts - T247732 [10:36:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:11] T247732: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 [10:37:52] (03PS1) 10Ladsgroup: Revert "Revert "Set term store to WRITE_BOTH for all of Wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579908 [10:38:13] (03PS2) 10Ladsgroup: Revert "Revert "Set term store to WRITE_BOTH for all of Wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579908 (https://phabricator.wikimedia.org/T219123) [10:38:40] I'm deploying this quickly [10:38:41] ^ [10:39:19] (03CR) 10Ladsgroup: [C: 03+2] Revert "Revert "Set term store to WRITE_BOTH for all of Wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579908 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [10:40:03] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1015', diff saved to https://phabricator.wikimedia.org/P10704 and previous config saved to /var/cache/conftool/dbconfig/20200316-104002-marostegui.json [10:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:40:39] (03Merged) 10jenkins-bot: Revert "Revert "Set term store to WRITE_BOTH for all of Wikidata"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579908 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [10:40:41] marostegui: ^ FYI [10:41:22] Amir1: ok [10:43:26] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: "Set term store to WRITE_BOTH for all of Wikidata" (T219123) (duration: 01m 13s) [10:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:31] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [10:43:54] (03PS2) 10Jbond: data_admin: delete this file it seems to be unused [puppet] - 10https://gerrit.wikimedia.org/r/579591 (https://phabricator.wikimedia.org/T247364) [10:44:51] PROBLEM - recommendation_api 
endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:45:09] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: "Set term store to WRITE_BOTH for all of Wikidata" (T219123), take II (duration: 01m 07s) [10:45:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:45:35] hey here, sorry posted this in the wrong channel. I won't be doing the portal deploy today, there are a few bugs I need to fix in the build... [10:46:57] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 40 [10:46:57] ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:47:11] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:47:11] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:47:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1015', diff saved to https://phabricator.wikimedia.org/P10705 and previous config saved to 
/var/cache/conftool/dbconfig/20200316-104723-marostegui.json [10:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:41] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 503 (ex [10:47:41] ps://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:48:06] That's not a great line split... [10:49:42] yeah [10:51:03] There's a few crappy ones in there [10:51:06] * Reedy dumps on a bug [10:51:34] T230799 [10:51:34] T230799: icinga-wm bot truncating long messages - https://phabricator.wikimedia.org/T230799 [10:52:37] marostegui: it's time to bother you more, which replicas I need to warm before I move to reading? on s8 I mean [10:52:39] 10Operations, 10Icinga, 10observability: icinga-wm bot truncating long messages - https://phabricator.wikimedia.org/T230799 (10Reedy) Relatedly... There's some that spill over to multiple lines, and lose letters `lang=irc [10:46:57] PROBLEM - recommendation_api endpoints health on scb2001 is CRI... 
[10:52:45] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:53:12] I am downtiming those as they come --^ [10:53:46] Amir1: checking, db1126 and db1111 for sure [10:53:49] let me double check [10:54:37] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:55:02] 10Operations, 10Wikimedia-Mailing-lists: Request for new mailing list - https://phabricator.wikimedia.org/T247737 (10Lantus) [10:55:08] Amir1: db1111, db1126 db1104 and ideally db1092 if possible to [10:55:12] just to be on the safe side [10:55:44] !log warming up db1026 for up to Q35M for the new term store (T219123) [10:55:48] cool [10:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:49] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [10:55:54] marostegui: I keep it in mind [10:55:58] Thanks [10:56:55] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [10:57:12] db1026? [10:57:17] does that exist? 
[10:57:50] sorry, db1126 [10:57:52] it was decommed at T174763 [10:57:54] T174763: Decommission db1026 - https://phabricator.wikimedia.org/T174763 [10:57:58] oh, I see [10:57:59] 10Operations, 10Wikimedia-Mailing-lists: Request for new mailing list - https://phabricator.wikimedia.org/T247737 (10Reedy) a:05Lantus→03None [10:58:06] np [10:59:19] (03PS2) 10Giuseppe Lavagetto: profile::services_proxy: allow defining a retry policy [puppet] - 10https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484) [10:59:21] (03PS2) 10Giuseppe Lavagetto: eventgate-analytics: allow retries on connection reset [puppet] - 10https://gerrit.wikimedia.org/r/579904 (https://phabricator.wikimedia.org/T247484) [10:59:36] (03CR) 10Jbond: [C: 03+2] data_admin: delete this file it seems to be unused [puppet] - 10https://gerrit.wikimedia.org/r/579591 (https://phabricator.wikimedia.org/T247364) (owner: 10Jbond) [10:59:47] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (dbprov2001), No backups: 2 (dbprov2001, ...), Fresh: 96 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [10:59:57] ^I will ack [11:00:00] it is expected [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. 
[11:00:10] will fix itself once a new run happens [11:01:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] Make update-special-pages handle dblist comments [puppet] - 10https://gerrit.wikimedia.org/r/579876 (https://phabricator.wikimedia.org/T247716) (owner: 10Reedy) [11:01:24] ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: All failures: 1 (dbprov2001), No backups: 2 (dbprov2001, ...), Fresh: 96 jobs Jcrespo running backups under new name https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [11:04:04] !log Warming up InnoDB buffer pool cache in db1111, db1126, db1104, db1092 (T219123) [11:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:10] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [11:04:38] !log ... for Q30M-Q35M of the new term store [11:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:46] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (10ArielGlenn) I have verified via google hangout using his wikimedia email account (and checking his image against the office picture :-)) that it's really JPita asking for access. 
[11:06:51] 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey) [11:07:42] (03CR) 10Alexandros Kosiaris: [C: 03+1] profile::services_proxy: allow defining a retry policy [puppet] - 10https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [11:07:52] (03PS1) 10Ladsgroup: Set up read new term store up to Q35M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579913 (https://phabricator.wikimedia.org/T219123) [11:08:06] (03CR) 10Alexandros Kosiaris: [C: 03+1] eventgate-analytics: allow retries on connection reset [puppet] - 10https://gerrit.wikimedia.org/r/579904 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [11:09:21] (03CR) 10Jbond: pick_nodes: add ability to pick nodes based on a puppet class (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond) [11:13:24] (03PS1) 10Ayounsi: Prometheus Squid exporter, specify proxy port [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) [11:21:44] (03PS2) 10Ayounsi: Prometheus Squid exporter, specify proxy port [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) [11:21:47] (03CR) 10Alexandros Kosiaris: "> Having said that, I do not think we should rush and merge this right away - changeprop chart wasn't deployed yet and new followups are b" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [11:22:42] (03PS3) 10Ayounsi: Prometheus Squid exporter, specify proxy port [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) [11:22:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/21442/mw1331.eqiad.wmnet/ this 
is in all effects a noop." [puppet] - 10https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [11:25:02] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1003/21446/install1003.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [11:26:21] 10Operations, 10Analytics, 10DC-Ops, 10netops: kafka-jumbo1006 and stat1005 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) I had a chat with Arzhel today and we didn't find a lot. From his perspective, it seems that something in the middle between the switch and stat1005 is not worki... [11:32:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] eventgate-analytics: allow retries on connection reset [puppet] - 10https://gerrit.wikimedia.org/r/579904 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto) [11:38:19] (03CR) 10Vgutierrez: [C: 03+1] cache: decrease varnish-frontend malloc cache size [puppet] - 10https://gerrit.wikimedia.org/r/579906 (https://phabricator.wikimedia.org/T185968) (owner: 10Ema) [11:38:40] (03CR) 10Vgutierrez: [C: 03+1] cache: limit upload transient storage usage [puppet] - 10https://gerrit.wikimedia.org/r/579907 (https://phabricator.wikimedia.org/T185968) (owner: 10Ema) [11:40:27] 10Operations, 10Wikimedia-Mailing-lists: Request for new mailing list Deutschschweiz - https://phabricator.wikimedia.org/T247737 (10Aklapper) [11:52:22] !log manually fix prometheus squid exporter on install1003 [11:52:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:54:04] (03CR) 10Jbond: Prometheus Squid exporter, specify proxy port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [11:57:41] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:59:08] (03CR) 10Ladsgroup: [C: 03+2] Set up read new term store up to Q35M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579913 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [12:00:00] (03Merged) 10jenkins-bot: Set up read new term store up to Q35M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579913 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [12:00:19] (03PS1) 10Jbond: New release 1.0.4 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/579918 [12:05:32] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579913|Set up read new term store up to Q35M (T219123)]] (duration: 01m 08s) [12:05:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:37] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [12:09:39] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579913|Set up read new term store up to Q35M (T219123)]], take II (duration: 01m 07s) [12:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:53] !log warming up cache for Q35M to Q40M for new term store on db1111, db1126, db1104, db1092 (T219123) [12:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:45] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:15:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/578609 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [12:17:47] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:20:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:20:17] 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10darthmon_wmde) [12:21:42] 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10darthmon_wmde) 2 ouf of the 3 have already wikitech accounts. Could we proceed with these two? [12:22:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "+1 for the openstack part. Please collect a +1 from someone related to the logstash/kafka thing." [puppet] - 10https://gerrit.wikimedia.org/r/566323 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [12:22:54] 10Operations: Offboard Sandra Müllrick from WMF systems - https://phabricator.wikimedia.org/T247750 (10WMDE-leszek) [12:23:13] 10Operations, 10LDAP-Access-Requests: Offboard Sandra Müllrick from WMF systems - https://phabricator.wikimedia.org/T247750 (10WMDE-leszek) [12:24:38] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Please collect +1/+2 from somebody more related with this redis module." 
[puppet] - 10https://gerrit.wikimedia.org/r/578611 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [12:26:38] 10Operations, 10CommRel-Specialists-Support, 10Core Platform Team, 10Editing-team, and 10 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Esanders) >>! In T120085#5545232, @Krinkle wrote: > So the question is whether it would be a problem... [12:27:49] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:31:27] 10Operations, 10LDAP-Access-Requests: Offboard Sandra Müllrick from WMF systems - https://phabricator.wikimedia.org/T247750 (10Aklapper) [12:31:42] (03PS2) 10Jbond: New release 1.0.4 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/579918 [12:33:13] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:34:19] (03PS3) 10Jbond: New release 1.0.5 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/579918 [12:35:39] (03PS4) 10Jbond: New release 1.0.4 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/579918 [12:35:45] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:35:50] (03PS5) 10Jbond: New release 1.0.5 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/579918 [12:36:18] 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - 
https://phabricator.wikimedia.org/T228924 (10Volans) ganeti1009 is set as Staged in Netbox and missing in PuppetDB, so it's reported by the Netbox report. What should be the correct state for now? [12:36:21] (03CR) 10Holger Knust: "Thanks, we were targeting end of this week to get both charts in a "potentially deployable" state, meaning tweaks aside the bulk of the wo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [12:36:40] (03CR) 10Jbond: [C: 03+2] New release 1.0.5 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/579918 (owner: 10Jbond) [12:37:49] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:42:49] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:43:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es1015', diff saved to https://phabricator.wikimedia.org/P10706 and previous config saved to /var/cache/conftool/dbconfig/20200316-124309-marostegui.json [12:43:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:53:19] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [12:58:05] 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10ArielGlenn) >>! In T247731#5971941, @darthmon_wmde wrote: > 2 out of the 3 already have wikitech accounts. Could we proceed with these two? Sure. We should m... [12:58:21] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:01:34] (03PS4) 10Ayounsi: Prometheus Squid exporter, specify proxy port [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) [13:03:44] (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/21447/install1003.wikimedia.org/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [13:07:01] (03PS1) 10Jbond: Gemfile: update puppet-lint-wmf_styleguide-check [puppet] - 10https://gerrit.wikimedia.org/r/579923 [13:08:05] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (10ArielGlenn) I see you already in the group: ariel@mwmaint1002:~$ ldapsearch -x cn=wmf | grep josepita member: uid=josepita,ou=people,dc=wikimedia,dc=org [13:08:43] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 102 probes of 542 (alerts on 50) - 
https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:09:33] (03CR) 10jerkins-bot: [V: 04-1] Gemfile: update puppet-lint-wmf_styleguide-check [puppet] - 10https://gerrit.wikimedia.org/r/579923 (owner: 10Jbond) [13:13:13] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 45 probes of 623 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:13:29] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:14:05] downtiming --^ [13:15:27] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:15:57] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [13:17:17] (03PS2) 10Jbond: Gemfile: update puppet-lint-wmf_styleguide-check [puppet] - 10https://gerrit.wikimedia.org/r/579923 [13:17:25] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10elukey) @Cmjohnson there seems to be a problem with the host's serial: https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/ [13:17:34] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10elukey) 
05Resolved→03Open [13:19:36] (03CR) 10jerkins-bot: [V: 04-1] Gemfile: update puppet-lint-wmf_styleguide-check [puppet] - 10https://gerrit.wikimedia.org/r/579923 (owner: 10Jbond) [13:19:40] 10Operations, 10netops, 10Patch-For-Review, 10User-Elukey: can aggregated netflow data include the router it was sampled from? - https://phabricator.wikimedia.org/T246186 (10ayounsi) >>! In T246186#5960144, @elukey wrote: > If the cardinality of the three new dimensions are not too big we could definitely... [13:25:31] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:26:06] (03PS1) 10Ladsgroup: Set up read new term store up to Q40M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579925 (https://phabricator.wikimedia.org/T219123) [13:28:52] (03PS3) 10Jbond: Gemfile: update puppet-lint-wmf_styleguide-check [puppet] - 10https://gerrit.wikimedia.org/r/579923 [13:30:12] jbond42: hi! do you need the CI container bumped/updated? 
[13:30:22] !log depooling wdqs1005 to catch up on lag [13:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:31] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:30:57] hashar: i have https://gerrit.wikimedia.org/r/#/c/integration/config/+/579924 but just making sure i have all the ruby dependencies in order first, will ping when the gemfile changes are merged, thanks [13:32:15] (03CR) 10Jbond: [C: 03+2] Gemfile: update puppet-lint-wmf_styleguide-check [puppet] - 10https://gerrit.wikimedia.org/r/579923 (owner: 10Jbond) [13:32:34] jbond42: cool :] [13:32:45] 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10darthmon_wmde) [13:33:03] hashar: just merged so if you could merge https://gerrit.wikimedia.org/r/#/c/integration/config/+/579924 that would be great :) [13:33:23] 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10darthmon_wmde) Awesome, thanks @ArielGlenn ! I just added the third user name =) [13:33:33] doing so! [13:33:44] thanks <3 [13:34:50] 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10akosiaris) >>! In T228924#5971978, @Volans wrote: > ganeti1009 is set as Staged in Netbox and missing in PuppetDB, so it's reported by the Netbox report. > What sh... [13:36:22] the container is building, I will bump the jobs [13:37:17] great thanks [13:40:29] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:41:04] ACKNOWLEDGEMENT - WDQS high update lag on wdqs1005 is CRITICAL: 4.676e+04 ge 4.32e+04 Gehel currently depooled to catch up on lag https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [13:42:21] !log restarting blazegraph on wdqs1007 [13:42:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:43] (03PS1) 10Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) [13:44:10] 10Operations: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (10MoritzMuehlenhoff) I'll revert 576921 (that was a leftover of testing), but with the service ID pointing to 443 (and CASRootProxiedAs set to https://cas-logstash.wikimedia.org (as Envoy only goes one way and other it would report... [13:45:09] ACKNOWLEDGEMENT - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad Ayounsi https://phabricator.wikimedia.org/T245176#5972066 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:47:06] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) (owner: 10Jbond) [13:48:43] jbond42: the container has failed. Tox could not reach some network resource through the proxy. 
I am trying again [13:49:48] hashar: ack thanks [13:49:50] !log upload atskafka 0.1 to buster-wikimedia T237993 [13:49:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:54] (03PS1) 10Muehlenhoff: Revert service ID for Logstash [puppet] - 10https://gerrit.wikimedia.org/r/579954 (https://phabricator.wikimedia.org/T246998) [13:49:55] T237993: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993 [13:51:04] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 623 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:56:38] (03CR) 10Jbond: [C: 03+1] "LGTM and thanks for adding types/lookup to the other parameter <3" [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [13:56:59] (03PS2) 10Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) [13:59:18] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 623 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:01:13] Cannot connect to proxy.', timeout('timed out [14:01:14] bah [14:01:26] jbond42: I am not sure what is going on. Gotta dig into it [14:02:28] PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:02:33] hashar: i know that mutante and moritzm added a new web proxy server last week i wonder if that could be the cause of the problems [14:04:09] indeed, webproxy in prod now uses install1003.wikimedia.org, where did it fail? somewhere in labs? [14:04:32] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:04:48] this alert is known btw ^ being worked on [14:05:06] !log installing libxslt security updates [14:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:44] on contint1001 inside a docker container. I am not even sure which url it uses as a proxy ;) [14:06:20] http://webproxy.eqiad.wmnet:8080 [14:06:34] 10Operations, 10ops-eqiad: mw1373 power supply redundancy ipmi alert - https://phabricator.wikimedia.org/T247755 (10fgiunchedi) [14:06:45] which indeed seems to point to install1003 [14:06:57] ACKNOWLEDGEMENT - IPMI Sensor Status on mw1373 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] Filippo Giunchedi https://phabricator.wikimedia.org/T247755 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:08:18] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 36 probes of 542 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:10:48] (03CR) 10Ladsgroup: [C: 03+2] Set up read new term store up to Q40M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579925 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [14:10:50] (03PS1) 10Ema: ATS: 
add tls and backend log config for cloud [puppet] - 10https://gerrit.wikimedia.org/r/579955 [14:11:46] (03Merged) 10jenkins-bot: Set up read new term store up to Q40M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579925 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [14:15:02] !log ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --from-id 87500000 --to-id 87767570 --batch-size=10 --sleep=5 (T219123) [14:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:07] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [14:15:25] 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) While there are many features that I would like to improve (like adding some state management for new and removed jobs, total size monitoring,... 
[14:16:51] !log rolling restart of FPM on mw1261-mw1265 to pick up libxslt security updates [14:16:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:55] jbond42: moritzm: so it seems the Docker container resolves webproxy.eqiad.wmnet to install1003 and the connections time out :/ [14:18:54] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q40M (T219123)]] (duration: 01m 07s) [14:18:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:08] hashar: that seems to be something with the Docker container setup, I see plenty of successful accesses from contint1001 (for pypi and other URLs) on install1003 [14:22:13] !log warming up cache for Q40M to Q50M for new term store on db1111, db1126, db1104, db1092 (T219123) [14:22:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:18] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [14:22:19] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q40M (T219123)]], take II (duration: 01m 06s) [14:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:30] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:23:58] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 28 probes of 623 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:26:00] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 56 probes of 542 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:27:32] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={squid,swagger_check_cxserver_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:27:42] 10Operations, 10Thumbor, 10Wikimedia-Logstash, 10observability, and 2 others: Stream Thumbor logs to logstash - https://phabricator.wikimedia.org/T212946 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is complete (i.e. {T242609}), resolving. Feel free to reopen though! [14:29:18] (03PS6) 10Ema: atskafka: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/579247 (https://phabricator.wikimedia.org/T247497) [14:29:20] (03PS4) 10Ema: cache: add atskafka webrequest test instance [puppet] - 10https://gerrit.wikimedia.org/r/579513 (https://phabricator.wikimedia.org/T247497) [14:29:22] (03PS4) 10Ema: cache: test atskafka webrequest on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/579514 (https://phabricator.wikimedia.org/T247497) [14:30:02] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:32:49] (03CR) 10Andrew Bogott: [C: 03+2] nova policy.json: remove all obsolete v2 policy rules [puppet] - 10https://gerrit.wikimedia.org/r/579634 (https://phabricator.wikimedia.org/T247573) (owner: 10Andrew Bogott) [14:32:57] 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10Cmjohnson) 05Open→03Resolved Thanks, @elukey fixed the issue in netbox [14:34:34] (03CR) 10jerkins-bot: [V: 04-1] cache: add atskafka webrequest test instance [puppet] - 10https://gerrit.wikimedia.org/r/579513 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema) [14:35:08] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={icinga,squid,swagger_check_cxserver_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:37:12] RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:27] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add apertium-recursive package [debs/contenttranslation/apertium-recursive] - 10https://gerrit.wikimedia.org/r/578704 (https://phabricator.wikimedia.org/T234181) (owner: 10KartikMistry) [14:38:24] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add apertium-anaphora package [debs/contenttranslation/apertium-anaphora] - 10https://gerrit.wikimedia.org/r/578705 (https://phabricator.wikimedia.org/T234181) (owner: 10KartikMistry) [14:39:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] mediawiki: Change php-wmerrors channel from "fatal" to as "exception" [puppet] - 10https://gerrit.wikimedia.org/r/577645 (https://phabricator.wikimedia.org/T247113) (owner: 10Krinkle) [14:40:21] (03CR) 10Ayounsi: [C: 03+2] Prometheus Squid 
exporter, specify proxy port [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi) [14:40:42] (03PS4) 10Andrew Bogott: nova policy.json: replace 'admin_or_member' with 'admin_or_owner' [puppet] - 10https://gerrit.wikimedia.org/r/579635 (https://phabricator.wikimedia.org/T247573) [14:42:25] 10Operations, 10Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10hashar) When building a docker container on contint1001.wikimedia.org with docker-pkg, pip gets proxy timeout error when using `http://webproxy.eqiad.wmnet:8080`. I have manually switched to the... [14:42:52] moritzm: webproxy.codfw.wmnet works fine though! I have commented about it on the task [14:42:59] jbond42: container built, I am updating the jobs [14:43:06] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] Add support for redirecting to toolforge.org (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [14:43:28] XioNoX: ^ suspicious timeouts reaching the proxy from the outside too, like the exporter is experiencing [14:43:45] (03CR) 10Andrew Bogott: [C: 03+2] nova policy.json: replace 'admin_or_member' with 'admin_or_owner' [puppet] - 10https://gerrit.wikimedia.org/r/579635 (https://phabricator.wikimedia.org/T247573) (owner: 10Andrew Bogott) [14:44:19] :/ [14:44:38] webproxy.codfw.wmnet worked for me [14:45:14] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:47:02] (03CR) 10Marostegui: "The only thing I have in mind at the moment is...given that this script is very specific for es (ie: will only work with single PKs, which" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/577224 (https://phabricator.wikimedia.org/T244884) (owner: 10Jcrespo) [14:47:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Jgreen) Machine is built and has accounts for fr-analytics. [14:48:06] hashar: ack thanks :) [14:48:24] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) (owner: 10Jbond) [14:49:38] 10Operations: eqiad squid performances issue - https://phabricator.wikimedia.org/T247759 (10ayounsi) p:05Triage→03High [14:49:54] godog: https://phabricator.wikimedia.org/T247759 [14:50:08] nobody owns squid though [14:50:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:51:47] XioNoX: thanks! 
yeah in my mind it is foundations, but I don't want to voluntell anyone heh [14:51:56] (03PS1) 10Giuseppe Lavagetto: mediawiki::jobrunner: raise max_execution_time [puppet] - 10https://gerrit.wikimedia.org/r/579961 (https://phabricator.wikimedia.org/T247622) [14:52:12] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 39 probes of 542 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:52:49] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) (owner: 10Jbond) [14:53:09] godog: I bolded it in today's meeting so at least people can be aware of it [14:53:43] +1 thank you [14:55:12] maybe one can pull out install1003 from webproxy dns entry? [14:55:16] or maybe it is just overloaded [14:57:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:58:17] godog: I'm wondering if pulling it from prometheus doesn't make things worse? [14:59:22] XioNoX: possible for sure! 
should be easy to quickly test by pulling the prometheus ferm rules from install1003 [14:59:56] (03PS4) 10Andrew Bogott: nova policy.json: sort all policy rules [puppet] - 10https://gerrit.wikimedia.org/r/579636 (https://phabricator.wikimedia.org/T247573) [14:59:58] (03PS2) 10Guozr.im: RemoteExecution: Fix typo in class CommandReturn [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/579882 [15:01:05] (03CR) 10Guozr.im: "> Patch Set 1:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/579882 (owner: 10Guozr.im) [15:01:36] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:02:35] !log rolling restart of FPM/apache on netmon* to pick up libxslt security updates [15:02:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:21] !log T234181 upload apertium-anaphora_0.0.4-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main [15:04:21] !log T234181 upload apertium-recursive_0.0.1-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main [15:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:30] T234181: Package apertium-anaphora and apertium-recursive - https://phabricator.wikimedia.org/T234181 [15:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:58] (03CR) 10Andrew Bogott: [C: 03+2] nova policy.json: sort all policy rules [puppet] - 10https://gerrit.wikimedia.org/r/579636 (https://phabricator.wikimedia.org/T247573) (owner: 10Andrew Bogott) [15:05:18] (03PS1) 10KartikMistry: apertium-fr-es: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/580053 (https://phabricator.wikimedia.org/T247585) [15:05:18] 10Operations, 10Patch-For-Review: 
Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (10ayounsi) I opened T247759 to track this issue. [15:05:30] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:06:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:08:17] (03CR) 10Gergő Tisza: [C: 03+1] Add public replica view for oauth_registered_consumer [puppet] - 10https://gerrit.wikimedia.org/r/579800 (owner: 10Alex Monk) [15:10:00] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/21448/ the change does the right thing. Merging." [puppet] - 10https://gerrit.wikimedia.org/r/579961 (https://phabricator.wikimedia.org/T247622) (owner: 10Giuseppe Lavagetto) [15:10:09] (03CR) 10Ayounsi: [C: 03+1] "LGTM, please sync up so I can update the router's forwarders." [puppet] - 10https://gerrit.wikimedia.org/r/569684 (https://phabricator.wikimedia.org/T224576) (owner: 10Dzahn) [15:13:06] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:15:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (10Jgreen) [15:16:40] /query ema [15:16:43] uff [15:16:59] elukey: <3 [15:17:00] (03CR) 10Jbond: [C: 03+1] Revert service ID for Logstash [puppet] - 10https://gerrit.wikimedia.org/r/579954 (https://phabricator.wikimedia.org/T246998) (owner: 10Muehlenhoff) [15:17:14] elukey: italian mafia leak detected [15:17:26] * vgutierrez hides [15:17:26] (03CR) 10Jcrespo: [C: 03+1] RemoteExecution: Fix typo in class CommandReturn [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/579882 (owner: 10Guozr.im) [15:17:30] vgutierrez: you got me [15:17:38] (03CR) 10Muehlenhoff: [C: 03+2] Revert service ID for Logstash [puppet] - 10https://gerrit.wikimedia.org/r/579954 (https://phabricator.wikimedia.org/T246998) (owner: 10Muehlenhoff) [15:19:55] (03CR) 10Jcrespo: [C: 03+2] RemoteExecution: Fix typo in class CommandReturn [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/579882 (owner: 10Guozr.im) [15:19:57] (03CR) 10Marostegui: [C: 03+1] RemoteExecution: Fix typo in class CommandReturn [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/579882 (owner: 10Guozr.im) [15:19:59] (03CR) 10Jbond: [C: 03+2] "lgtm thanks will merge" [puppet] - 10https://gerrit.wikimedia.org/r/578609 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [15:20:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:20:49] (03CR) 10Jbond: [C: 03+2] "LGTM thanks will merge" [puppet] - 10https://gerrit.wikimedia.org/r/578611 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [15:22:11] (03CR) 10Jbond: [C: 03+2] "lgtm, 
thanks will merge" [puppet] - 10https://gerrit.wikimedia.org/r/578575 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK) [15:22:42] (03PS4) 10Jcrespo: mariadb-backups: Update RemoteCommandExecution to the latest version [puppet] - 10https://gerrit.wikimedia.org/r/579891 [15:23:02] (03PS4) 10CRusnov: puppetdb uservice: Add individual host queries, expand for interface automation [puppet] - 10https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153) [15:23:47] (03CR) 10KartikMistry: "recheck" [debs/contenttranslation/apertium-en-ca] - 10https://gerrit.wikimedia.org/r/578707 (https://phabricator.wikimedia.org/T233700) (owner: 10KartikMistry) [15:25:17] (03CR) 10Jcrespo: "I hope it was clear that my purpose here was not to annoy you with red tape 0:-D" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/579882 (owner: 10Guozr.im) [15:27:34] (03PS5) 10CRusnov: puppetdb uservice: Add individual host queries, expand for interface automation [puppet] - 10https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153) [15:30:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:32:58] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Update RemoteCommandExecution to the latest version [puppet] - 10https://gerrit.wikimedia.org/r/579891 (owner: 10Jcrespo) [15:34:26] * Krinkle testing on mwdebug1002 [15:34:57] 10Operations, 10SDC General, 10Structured Data Engineering, 10Structured-Data-Backlog, and 2 others: Create CQS puppet configs by applying query_service module - https://phabricator.wikimedia.org/T237089 (10Sarahmarie1981) [15:35:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={icinga,squid} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:35:58] (03PS2) 10Krinkle: wgConf: Remove unused 'fullLoadCallback' property assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579643 (https://phabricator.wikimedia.org/T169821) [15:36:03] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 7 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Sarahmarie1981) [15:36:09] (03CR) 10Krinkle: [C: 03+2] wgConf: Remove unused 'fullLoadCallback' property assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579643 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [15:37:02] (03Merged) 10jenkins-bot: wgConf: Remove unused 'fullLoadCallback' property assignment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579643 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [15:38:03] 10Operations, 10Patch-For-Review: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (10MoritzMuehlenhoff) >>! In T246998#5966192, @colewhite wrote: > Can the idp redirect to https? What happens when this is configured? 
The server-side IDP and Apache config has been adapted, if anyone wants to... [15:39:06] (03PS4) 10KartikMistry: apertium-en-ca: Update to new upstream 1.0.1 [debs/contenttranslation/apertium-en-ca] - 10https://gerrit.wikimedia.org/r/578707 (https://phabricator.wikimedia.org/T233700) [15:39:43] (03CR) 10CRusnov: "Thanks!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [15:42:52] (03CR) 10Guozr.im: "> Patch Set 2:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/579882 (owner: 10Guozr.im) [15:46:40] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:49:04] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [15:49:28] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:53:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:53:50] (03PS1) 10Muehlenhoff: Add new insetup roles to Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/580074 [15:54:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:54:50] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/580074 (owner: 10Muehlenhoff) [15:55:08] (03CR) 10Dzahn: "isn't it "insetup_noferm" ?" [puppet] - 10https://gerrit.wikimedia.org/r/580074 (owner: 10Muehlenhoff) [15:55:50] (03CR) 10Jcrespo: "> Hi Jcrespo," [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [15:56:06] mutante: ack, amended in PS2 [15:56:09] (03PS2) 10Muehlenhoff: Add new insetup roles to Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/580074 [15:56:23] 10Operations, 10Traffic, 10observability, 10User-fgiunchedi: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668 (10ema) 05Open→03Resolved a:03ema Metrics added a while ago, closing! [15:56:26] (03CR) 10Dzahn: [C: 03+1] Add new insetup roles to Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/580074 (owner: 10Muehlenhoff) [15:56:27] moritzm: yep, +1 [15:58:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:58:30] (03CR) 10Vgutierrez: [C: 03+1] ATS: add tls and backend log config for cloud [puppet] - 10https://gerrit.wikimedia.org/r/579955 (owner: 10Ema) [15:59:04] (03CR) 10Ema: [C: 03+2] ATS: add tls and backend log config for cloud [puppet] - 10https://gerrit.wikimedia.org/r/579955 (owner: 10Ema) [15:59:09] (03CR) 10QEDK: [C: 03+1] Fix typos (boostrap -> bootstrap) [puppet] - 10https://gerrit.wikimedia.org/r/566323 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [16:02:12] (03CR) 10Jbond: [C: 03+2] "LGTM will merge thanks :)" [puppet] - 
10https://gerrit.wikimedia.org/r/566323 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712) [16:03:24] (03CR) 10Guozr.im: "> Patch Set 2:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: 10Guozr.im) [16:06:28] (03CR) 10Muehlenhoff: [C: 03+2] Add new insetup roles to Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/580074 (owner: 10Muehlenhoff) [16:07:02] (03PS1) 10Krinkle: Revert "wgConf: Remove unused 'fullLoadCallback' property assignment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580078 [16:07:21] (03CR) 10Krinkle: [C: 03+2] "Will reconsider later in the stack" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580078 (owner: 10Krinkle) [16:08:16] (03Merged) 10jenkins-bot: Revert "wgConf: Remove unused 'fullLoadCallback' property assignment" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580078 (owner: 10Krinkle) [16:08:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:09:15] (03PS2) 10Krinkle: wgConf: Assign $wgLocalDatabases normally instead of by-ref [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579645 (https://phabricator.wikimedia.org/T169821) [16:09:30] (03CR) 10Jforrester: [C: 03+1] Add wikidata.beta.wmflabs.org + prod domains to beta csp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578183 (owner: 10Brian Wolff) [16:09:38] (03CR) 10Jforrester: [C: 03+1] Make CSP enforce on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579890 (https://phabricator.wikimedia.org/T244124) (owner: 10Brian Wolff) [16:09:42] (03CR) 10Krinkle: [C: 03+2] wgConf: Assign $wgLocalDatabases normally instead of by-ref [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579645 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [16:10:40] (03Merged) 10jenkins-bot: wgConf: Assign $wgLocalDatabases normally instead of by-ref [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579645 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [16:13:58] (03PS3) 10Krinkle: wgConf: Move wgLocalDatabases to CommonSettings.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579648 [16:14:52] !log krinkle@deploy1001 Synchronized wmf-config/wgConf.php: Ie9002d9095ee (duration: 01m 08s) [16:14:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:30] (03PS3) 10Krinkle: wgConf: Move wgLocalDatabases to CommonSettings.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579649 [16:15:46] (03CR) 10Krinkle: [C: 03+2] wgConf: Move wgLocalDatabases to CommonSettings.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579648 (owner: 10Krinkle) [16:15:54] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:16:40] (03Merged) 10jenkins-bot: wgConf: Move wgLocalDatabases to CommonSettings.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579648 (owner: 10Krinkle) [16:21:25] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I08af45e2e47 (duration: 01m 07s) [16:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:35] (03PS5) 10DannyS712: trwiki: Grant interface editors editprotected & editsemiprotected [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579772 (https://phabricator.wikimedia.org/T247672) [16:22:20] !log copied envoyproxy_1.13.1-1 from buster-wikimedia to stretch-wikimedia [16:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:29] (03PS2) 10Krinkle: Move $wgConf to CommonSettings.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579651 (https://phabricator.wikimedia.org/T169821) [16:22:31] (03PS2) 10Krinkle: Move $wgConf to CommonSettings.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579652 (https://phabricator.wikimedia.org/T169821) [16:22:57] (03CR) 10Krinkle: [C: 03+2] wgConf: Move wgLocalDatabases to CommonSettings.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579649 (owner: 10Krinkle) [16:24:06] (03Merged) 10jenkins-bot: wgConf: Move wgLocalDatabases to CommonSettings.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579649 (owner: 10Krinkle) [16:29:43] (03CR) 10Krinkle: [C: 03+2] Move $wgConf to CommonSettings.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579651 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [16:29:59] James_F: can do a few beta patches in the mean time if you want [16:30:03] !log krinkle@deploy1001 Synchronized wmf-config/wgConf.php: I870122f946d (duration: 01m 07s) [16:30:06] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:09] I could use a short break after this one [16:30:39] (03Merged) 10jenkins-bot: Move $wgConf to CommonSettings.php (1/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579651 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [16:33:07] !log krinkle@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: I498e2ebd8c9 (no-op) (duration: 01m 07s) [16:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:43] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I498e2ebd8c9 (duration: 01m 07s) [16:34:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:21] * Krinkle done with mwdebug1002 [16:38:00] (03PS1) 10Elukey: Raise MaxGCPauseMillis on Hadoop HDFS Namenodes' GC settings [puppet] - 10https://gerrit.wikimedia.org/r/580079 [16:40:37] ACKNOWLEDGEMENT - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. Ayounsi known https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [16:40:47] (03PS4) 10Jforrester: Add wikidata.beta.wmflabs.org + prod domains to beta csp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578183 (owner: 10Brian Wolff) [16:41:24] (03CR) 10Jforrester: [C: 03+2] Add wikidata.beta.wmflabs.org + prod domains to beta csp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578183 (owner: 10Brian Wolff) [16:42:15] (03Merged) 10jenkins-bot: Add wikidata.beta.wmflabs.org + prod domains to beta csp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/578183 (owner: 10Brian Wolff) [16:43:46] (03PS2) 10Jforrester: Make CSP enforce on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579890 (https://phabricator.wikimedia.org/T244124) (owner: 10Brian Wolff) [16:44:02] OK, let's do this. 
[16:45:55] (03CR) 10Jforrester: [C: 03+2] Make CSP enforce on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579890 (https://phabricator.wikimedia.org/T244124) (owner: 10Brian Wolff) [16:46:55] (03Merged) 10jenkins-bot: Make CSP enforce on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579890 (https://phabricator.wikimedia.org/T244124) (owner: 10Brian Wolff) [16:48:42] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set wmgUseCSP false everywhere T244124 (duration: 01m 07s) [16:48:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:49] T244124: Make CSP enforce on beta cluster - https://phabricator.wikimedia.org/T244124 [16:50:50] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 07s) [16:50:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:47] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Enforce Content Security Policy if wmgUseCSP is set T244124 (duration: 01m 06s) [16:52:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:54:06] RECOVERY - WDQS high update lag on wdqs1005 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.1e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen [16:54:49] !log repooling wdqs1005 [16:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:56:40] (03CR) 10Elukey: [C: 03+2] Raise MaxGCPauseMillis on Hadoop HDFS Namenodes' GC settings [puppet] - 10https://gerrit.wikimedia.org/r/580079 (owner: 10Elukey) [16:58:51] (03PS1) 10Ladsgroup: Set up read new term store up to Q50M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580082 (https://phabricator.wikimedia.org/T219123) [16:59:55] (03CR) 10Ladsgroup: [C: 03+2] Set up read new term store up to Q50M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580082 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [17:00:04] gehel and onimisionipe: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T1700). [17:01:26] (03Merged) 10jenkins-bot: Set up read new term store up to Q50M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580082 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup) [17:03:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:03:36] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q50M (T219123)]] (duration: 01m 06s) [17:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:41] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [17:06:56] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q50M (T219123)]], take II (duration: 01m 08s) [17:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:08:11] !log warming up cache 
for Q50M to Q60M for new term store on db1111, db1126, db1104, db1092 (T219123) [17:08:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:19] (03CR) 10Ottomata: "Have not totally followed this discussion so feel free to ignore this." [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust) [17:13:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:19:30] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:21:41] looks like that spike is the deploy or cache-warmup, Amir [17:24:08] yea, it's over [17:24:10] though [17:24:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:27:28] Amir1: addshore please keep in mind https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation?orgId=1 [17:27:39] jynus: thanks [17:27:41] I will [17:28:01] there was spikes on db1126, probably causing the issue below [17:28:08] *above [17:29:23] I keep it tamed for now [17:29:29] Commit failed on server(s) [17:31:16] that was errors on the wikidata master ^ [17:31:22] so not reads [17:31:34] oh the writes [17:31:37] let me check [17:32:05] give it a look- it passed so no big deal, but maybe interesting for code reasons [17:37:17] (03PS1) 10Dzahn: site/DHCP: remove cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/580087 (https://phabricator.wikimedia.org/T229586) [17:38:03] (03PS1) 10Dzahn: remove production IPs for cp1099 [dns] - 10https://gerrit.wikimedia.org/r/580089 (https://phabricator.wikimedia.org/T229586) [17:38:14] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:40:06] (03PS1) 10Dzahn: remove mgmt IPs for cp1099 [dns] - 10https://gerrit.wikimedia.org/r/580091 (https://phabricator.wikimedia.org/T229586) [17:40:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:42:01] is the prometheus because of it too? 
[17:43:40] I let it stay there for a bit [17:43:54] Amir1: no, unrelated [17:43:59] cool [17:44:04] afk for a bit [17:44:24] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:45:16] (03PS1) 10Dzahn: site: let labtest* use role(test), not spare::system [puppet] - 10https://gerrit.wikimedia.org/r/580092 [17:47:11] (03PS2) 10Dzahn: site/DHCP: remove cp1099 [puppet] - 10https://gerrit.wikimedia.org/r/580087 (https://phabricator.wikimedia.org/T229586) [17:47:49] it's page creations going up: https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops [17:47:56] https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&from=now-1h&to=now&fullscreen&panelId=10 [17:48:29] will go down quickly, it's going to flap, I'm already asking users to reduce the items created for a bit, until this is over [17:50:32] ok, thanks for the updates Amir [17:50:44] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:51:09] there's not much we can do about it, we are writing to the term stores for new items since today and it's going to overlap quite a lot [17:51:25] but we should improve it nonetheless [17:51:52] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:53:12] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:54:54] (03PS1) 10Krinkle: [WIP] logging: Remove useMicrosecondTimestamps(false) calls [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580096 (https://phabricator.wikimedia.org/T116550) [17:55:23] (03CR) 10Krinkle: "Blocked as WIP because we're on Monolog 1.25, not >= 2.x" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580096 (https://phabricator.wikimedia.org/T116550) (owner: 10Krinkle) [17:55:31] Reedy: ^ :D [17:58:14] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [17:58:19] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] toolforge: support canonical redirects in urlproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [18:00:04] RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T1800) [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:03:34] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:05:29] !log mforns@deploy1001 Started deploy [analytics/refinery@1681b92]: deploying refinery to add forgotten artifacts for v0.0.118 [18:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:44] (03PS3) 10Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) [18:11:01] (03PS3) 10Krinkle: Move $wgConf to CommonSettings.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579652 (https://phabricator.wikimedia.org/T169821) [18:11:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:13:18] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:14:31] (03PS4) 10Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) [18:15:39] jouncebot: next [18:15:39] In 1 hour(s) and 44 
minute(s): Services – Graphoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T2000) [18:15:41] jouncebot: now [18:15:41] For the next 0 hour(s) and 44 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T1800) [18:17:17] (03CR) 10Krinkle: [C: 03+2] Move $wgConf to CommonSettings.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579652 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [18:17:28] (03PS1) 10Dzahn: site/conftool: remove mw1221 through mw1226 [puppet] - 10https://gerrit.wikimedia.org/r/580101 [18:17:28] * Krinkle testing on mwdebug1002 [18:17:30] (03Merged) 10jenkins-bot: Move $wgConf to CommonSettings.php (2/2) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579652 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [18:17:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, please merge so we can work on follow up patches :-)" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/496564 (owner: 10BryanDavis) [18:18:09] (03CR) 10BryanDavis: toolforge: support canonical redirects in urlproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) (owner: 10Arturo Borrero Gonzalez) [18:18:31] !log mforns@deploy1001 Finished deploy [analytics/refinery@1681b92]: deploying refinery to add forgotten artifacts for v0.0.118 (duration: 13m 01s) [18:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:18:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM, please merge so we can followup with other patches." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: 10Nehajha) [18:19:12] (03PS4) 10Krinkle: Document Apache gzip sidestepping [puppet] - 10https://gerrit.wikimedia.org/r/539842 (https://phabricator.wikimedia.org/T232615) (owner: 10Gilles) [18:19:28] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:19:31] (03PS2) 10Dzahn: site/conftool: remove mw1221 through mw1226 [puppet] - 10https://gerrit.wikimedia.org/r/580101 [18:20:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM." [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578407 (owner: 10BryanDavis) [18:20:56] PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad [18:21:08] PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw [18:21:37] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578408 (https://phabricator.wikimedia.org/T246689) (owner: 10BryanDavis) [18:22:28] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Dvorapa) Any news? 
From possible solutions like T238751, T240442, T245144 and @Ladsgroup's T247459? La... [18:23:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:25:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. I would suggest to include the final help output in the commit message (if it fits, anyway). That should help with patch reviewing t" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578409 (owner: 10BryanDavis) [18:29:10] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:30:31] (03PS1) 10Dzahn: DHCP: remove mw1221 through mw1226 [puppet] - 10https://gerrit.wikimedia.org/r/580105 [18:31:13] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153) (owner: 10CRusnov) [18:34:18] !log krinkle@deploy1001 Synchronized docroot/noc/: I2c3217fb3 (duration: 01m 07s) [18:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. The benefit of using this vs using kubectl directly is that this is persisting the info into the manifest, right? 
so next start uses" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578412 (owner: 10BryanDavis) [18:35:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={icinga,squid} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:36:29] !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: no-op, courtesy of opcache (duration: 01m 06s) [18:36:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:36:53] 10Operations, 10serviceops: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn) [18:37:25] 10Operations, 10serviceops: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn) [18:37:39] 10Operations, 10serviceops: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn) [18:38:00] 10Operations, 10serviceops: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn) [18:38:04] !log krinkle@deploy1001 Synchronized wmf-config/: I2c3217fb3da8bb65 (duration: 01m 07s) [18:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:17] (03PS3) 10Dzahn: site/conftool: remove mw1221 through mw1226 [puppet] - 10https://gerrit.wikimedia.org/r/580101 (https://phabricator.wikimedia.org/T247780) [18:38:42] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] Add support for redirecting to toolforge.org (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: 10BryanDavis) [18:38:49] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn) [18:38:59] 10Operations, 10serviceops, 10Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (10Dzahn) [18:39:00] RECOVERY - Ensure 
traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22098 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [18:41:16] (03PS2) 10Dzahn: DHCP: remove mw1221 through mw1226 [puppet] - 10https://gerrit.wikimedia.org/r/580105 (https://phabricator.wikimedia.org/T247780) [18:41:26] (03PS1) 10Dzahn: remove production IPs of mw1221 through mw1226 [dns] - 10https://gerrit.wikimedia.org/r/580107 (https://phabricator.wikimedia.org/T247780) [18:41:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:42:18] (03CR) 10RLazarus: [C: 03+1] site/conftool: remove mw1221 through mw1226 [puppet] - 10https://gerrit.wikimedia.org/r/580101 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [18:43:13] PROBLEM - MariaDB Slave SQL: pc2 #page on pc1008 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [18:43:16] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:43:24] around [18:43:25] * volans looking [18:43:27] here [18:43:29] same [18:43:36] here [18:43:44] here [18:43:45] load average: 74.63, 128.69, 163.81 [18:43:45] MW exceptions don't seem critical [18:43:58] looking in [18:43:58] here [18:44:04] Amir1: hi, fyi [18:44:08] checking [18:44:16] level=error msg="Error pinging mysqld: Error 1040: Too many connections" [18:44:27] parsercache with too many? [18:44:32] there is "cache warm up" going on [18:44:34] massive invalidation perhaps? 
[18:44:35] apparently and very high load average [18:44:43] 17:08 < Amir1> !log warming up cache for Q50M to Q60M for new term store on db1111, db1126, db1104, db1092 (T219123) [18:44:44] T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123 [18:44:46] here [18:44:46] etc [18:44:56] mutante: don't think that should affect parsercache [18:44:59] ok [18:45:02] or at least it never did before [18:45:03] that should be about dbs cache not pc [18:45:09] "should" [18:45:29] marostegui: I cannot find the mysql error log could be it was never created? [18:45:32] back now, [18:45:41] let me read [18:46:07] lots of REPLACE /* SqlBagOStuff::updateTableKeys api.php@mw1313 * [18:46:28] that's not me [18:46:36] the cache warmup is direct query to datbase [18:46:38] it started just before 18:00 [18:46:39] *database [18:46:47] lets failover pc2? marostegui to see if server not service? [18:46:49] pc1008's disk is maxed out too [18:46:49] (03PS3) 10Krinkle: [WIP] Remove use of the $globals cache temporary file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821) [18:46:50] by cache we mean innodb buffer pool [18:46:54] or do you know it is service? 
[18:46:58] jynus: I am checking if the rest are the same [18:47:00] appserver latency also spiked at the same time, is still high https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1 [18:47:05] https://phabricator.wikimedia.org/T219123#5924185 [18:47:17] they all have a big increase in connections [18:47:24] since 18:00 [18:47:26] then lets not touch it [18:47:30] if train, revert [18:47:35] is not train [18:47:51] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Remove use of the $globals cache temporary file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [18:48:13] (03PS2) 10Krinkle: [WIP] Remove "Cache-control: no-cache" hack from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579815 (https://phabricator.wikimedia.org/T189966) [18:48:18] who's IC? [18:48:25] (03PS3) 10Krinkle: Remove "Cache-control: no-cache" hack from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579815 (https://phabricator.wikimedia.org/T189966) [18:48:51] there's definitely something happening around 18:00 [18:48:57] that had created lots of connections there [18:48:58] it is all updates [18:49:21] marostegui: write policy and BBU seems ok [18:49:27] it is sql [18:49:28] jynus: replaces from what I can see [18:49:29] not server [18:49:42] yes, I mean updates as processlists says [18:49:51] not UPDATE sql [18:50:01] maybe a massive expiration or something? 
[18:50:11] that could be caused by something expiring all keys [18:50:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:51:15] (03CR) 10Andrew Bogott: [C: 03+2] site: let labtest* use role(test), not spare::system [puppet] - 10https://gerrit.wikimedia.org/r/580092 (owner: 10Dzahn) [18:51:42] at 5000 connections the query killers kicks in [18:51:52] what's the current TTL for items in parsercache? [18:51:55] where is this connections? s8? [18:51:56] I am not seeing any traffic increase or anything [18:52:02] Amir1: no, parsercache [18:52:04] volans: like 3 months or something [18:52:04] Amir1: parsercache [18:52:08] volans: 30 days I reckon [18:52:25] could be someone asking a lot of uncached pages? [18:52:28] revisions? [18:52:40] then it's not the term store stuff [18:52:46] Amir1: it is not you [18:52:55] so I get out of the way [18:52:58] (03PS1) 10Andrew Bogott: Revert "site: let labtest* use role(test), not spare::system" [puppet] - 10https://gerrit.wikimedia.org/r/580109 [18:53:13] I cannot correlate the increase on parsercache with any increase on traffic [18:53:20] At least not with requests [18:53:21] RECOVERY - MariaDB Slave SQL: pc2 #page on pc1008 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [18:53:36] the others are a bit healthier [18:53:41] that's just a temporary recovery [18:53:49] (03CR) 10jerkins-bot: [V: 04-1] Revert "site: let labtest* use role(test), not spare::system" [puppet] - 10https://gerrit.wikimedia.org/r/580109 (owner: 10Andrew Bogott) [18:54:05] I wonder if we should try to depool, unless you have a better idea [18:54:13] can we get some MW expert in here? 
[18:54:24] as in failover to 1010 [18:54:26] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:54:27] jynus: The others are suffering kinda the same, just smaller hit [18:54:36] any hot key? [18:54:47] (03PS3) 10Andrew Bogott: nova policy.json: Remove all redundant policies [puppet] - 10https://gerrit.wikimedia.org/r/579637 (https://phabricator.wikimedia.org/T247573) [18:54:50] jynus: pc1010 might be worse, as it is probably cold :( [18:55:14] I know, that is why I was asking for a better idea [18:55:16] :-D [18:55:30] it smells like a massive expiry doesn't it [18:55:39] marostegui: I know a little bit about PC [18:55:39] it is weird that pc1008 is showing any packet drops at all [18:55:42] https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1 [18:55:46] hit rate plummeted [18:56:14] Amir1: help welcomed! :) [18:56:33] https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=1584384914633&to=1584384969481&fullscreen&panelId=6 [18:56:34] jynus: but only once things were bad [18:56:35] cdanis: it is being overloaded by connections [18:56:36] disk space [18:56:47] doc? 
https://docs.google.com/document/d/1GsyYu_ruw58SSIJrYTQncLJzoXvSoBQdWQDNrFED_iQ/edit?usp=sharing [18:57:00] volans: we are ok on disk space [18:57:01] sorry, those are small difs [18:57:04] it would take hours to [18:57:05] volans: note axes, that's almost no change [18:57:11] yeah it foolefd me :D [18:57:16] volans: check the axe, it is very little [18:57:18] thanks grafana :( [18:57:42] Amir1: to sum up, we are seeing a huge increase in connections on all parsercache hosts, being pc1008 the one with more [18:58:09] connections are going down now [18:58:10] (03CR) 10Andrew Bogott: [C: 03+2] nova policy.json: Remove all redundant policies [puppet] - 10https://gerrit.wikimedia.org/r/579637 (https://phabricator.wikimedia.org/T247573) (owner: 10Andrew Bogott) [18:58:12] also on pc1008 [18:58:40] something created either a cache miss avalanche or some weied mw pattern [18:58:43] volans: to answer your previous question, no, I see random tables being involved all the times and random keys [18:58:47] or memcached content got lost [18:58:50] many options [18:58:53] marostegui: ack, I see that too [18:58:55] this is just possible causes [18:59:10] yeah, different languages involved [18:59:34] okay, let me think, it can be that someone is trying to load pages in multilingual wikis (commons, wikidata) in non-main language [18:59:41] that would rebuild PC for each one of them [19:00:12] let me dig a bit [19:00:13] PROBLEM - MariaDB Slave Lag: pc2 #page on pc1008 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:00:20] yeah, it came back [19:00:52] Amir1: any possible way to identify that from a db point of view? 
[19:00:55] I don't see anything that unsuual in memcache dashboards so far [19:01:07] * chaomodus afk ping if necessary [19:01:10] (03PS2) 10Andrew Bogott: Revert "site: let labtest* use role(test), not spare::system" [puppet] - 10https://gerrit.wikimedia.org/r/580109 [19:01:10] on the memcached side, I don't see any abnormal activity from a quick look https://grafana.wikimedia.org/d/000000316/memcache?orgId=1 [19:01:17] they have keys, the keys are actually values of another key [19:01:31] querying sys.processlist is better here [19:01:33] not locking [19:01:54] for each PC entry there are two rows (in two different databases), one refers from a general key to a specific key, the other from the sepcific key to the value [19:02:29] (03CR) 10Andrew Bogott: [C: 03+2] Revert "site: let labtest* use role(test), not spare::system" [puppet] - 10https://gerrit.wikimedia.org/r/580109 (owner: 10Andrew Bogott) [19:02:50] I see a few enwiktionary:pcache:idoptions [19:02:53] I have the feeling that most of them are SqlBagOStuff::updateTableKeys api.php instead of SqlBagOStuff::updateTableKeys index.php [19:03:04] :pcache:idoptions: [19:03:09] is that normal? [19:03:13] PROBLEM - MariaDB Slave SQL: pc2 #page on pc1008 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:03:20] (03CR) 10Andrew Bogott: [C: 03+2] "I merged this in error, and reverted in" [puppet] - 10https://gerrit.wikimedia.org/r/580109 (owner: 10Andrew Bogott) [19:03:28] can someone downtime pc1008? [19:03:45] marostegui: doing [19:03:54] cdanis: thanks :* [19:04:17] jynus: yup, the value of that would point out to the key to the actual value... 
[19:04:24] why it's like this, I don't know [19:05:23] RECOVERY - MariaDB Slave Lag: pc2 #page on pc1008 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:06:01] marostegui: I still think its is ok to try to pool pc1010 [19:06:05] what do we have to lose? [19:06:10] we can revert [19:06:20] how are pc1007 and pc1009? [19:06:23] Let me prepare the patch [19:06:27] less loaded [19:06:31] it mostly hits pc1008 [19:06:34] ok, pc1010 replicates from pc1007 [19:06:37] although it affects all [19:06:42] from some grep I did [19:06:42] 157 updateTableKeys RunSingleJob.php@ [19:06:42] 222 updateTableKeys index.php@ [19:06:42] 331 updateTableKeys api.php@ [19:06:46] (03PS4) 10Krinkle: Remove "Cache-control: no-cache" hack from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579815 (https://phabricator.wikimedia.org/T247783) [19:06:49] so pc1010 will be pretty cold [19:06:53] I know [19:06:55] let me try to pool it [19:06:56] empty actually [19:07:08] yep [19:07:11] but we can discard the server, even if doesn't fix anything [19:07:11] preparing the patch [19:07:16] it is not dbctl [19:07:20] nope [19:07:23] and then a long tail of language names with few hits [19:07:29] (03PS4) 10Andrew Bogott: nova policy.json: only permit admin user to resize VMs [puppet] - 10https://gerrit.wikimedia.org/r/579638 (https://phabricator.wikimedia.org/T247573) [19:07:30] hmm, can it be the elastic jobs? 
let me check [19:07:41] I don't have a better suggestion at the moment [19:07:42] (03PS4) 10Andrew Bogott: nova policy.json: require projectadmin for delete/rebuild/reboot [puppet] - 10https://gerrit.wikimedia.org/r/579639 [19:07:49] Amir1: my sample data had 1336 lines [19:07:56] (03PS4) 10Krinkle: [WIP] Remove use of the $globals cache temporary file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821) [19:08:04] so half was the long tail [19:08:08] how is latency, availabillity affected? [19:08:08] and half those 3 [19:08:17] volans: can I see some of the tail? [19:08:21] RECOVERY - MariaDB Slave SQL: pc2 #page on pc1008 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [19:08:26] i am keeping notes on the doc. does this currently have user impact? [19:08:34] mutante: appserver latency elevated [19:08:36] that is my question [19:08:38] ok [19:08:41] ok [19:08:47] so I prefer to try something [19:08:48] (03CR) 10Andrew Bogott: [C: 03+2] nova policy.json: only permit admin user to resize VMs [puppet] - 10https://gerrit.wikimedia.org/r/579638 (https://phabricator.wikimedia.org/T247573) (owner: 10Andrew Bogott) [19:08:53] mostly long-tail -- 95%ile from 500ms to several seconds [19:08:57] if there is imapact, at least to help debugging [19:08:59] error rate looks okay [19:09:02] (03CR) 10Andrew Bogott: [C: 03+2] nova policy.json: require projectadmin for delete/rebuild/reboot [puppet] - 10https://gerrit.wikimedia.org/r/579639 (owner: 10Andrew Bogott) [19:09:05] discard server variables [19:09:10] (03PS1) 10Marostegui: db-eqiad.php: Emergency pool pc1010 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580117 [19:09:13] even if it is likely to make things worse [19:09:16] jynus: please review ^ [19:09:18] or equally bad [19:09:35] (03CR) 10Jcrespo: [C: 03+1] db-eqiad.php: Emergency pool pc1010 [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/580117 (owner: 10Marostegui) [19:09:37] Go [19:09:46] (03CR) 10Marostegui: [V: 03+2 C: 03+2] db-eqiad.php: Emergency pool pc1010 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580117 (owner: 10Marostegui) [19:09:49] we can at least get information for debugging [19:10:04] * addshore reads up [19:10:07] (03CR) 10Volans: "Those are still pooled AFAIC (except mw1221)" [puppet] - 10https://gerrit.wikimedia.org/r/580101 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [19:10:22] deploying [19:10:32] rlazarus: i guess you can call me IC if updating the doc counts [19:10:40] volans: yea, i was about to depool when this started [19:10:54] can someone check potential related deploys (even if unlikely) [19:11:12] and someone check changes in traffic pattern for requests that could lead to extra parsing [19:11:14] Amir1: hows the "check on elastic jobs" looking? [19:11:25] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool pc1010 instead of pc1008 as pc1008 is overloaded (duration: 01m 06s) [19:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:11:31] another person to check memcache hit rates [19:11:34] let's monitor pc1010 closely [19:11:44] we can prepare the revert [19:11:56] mutante: nothing out of ordinary [19:11:58] it is not as if I have high confidence on that fixing the issue [19:12:04] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:12:25] re: memcache hit rates: https://grafana.wikimedia.org/d/000000316/memcache?orgId=1 [19:12:30] connections growing [19:12:34] but let's wait [19:12:53] I don't see any deploy around 17:58 [19:13:01] appserver error rate increasing [19:13:03] memcached 
traffic looks stable? [19:13:07] thanks, those checks helps even if to discard [19:13:14] so far pc1010 has the same amount as pc1007 and 1009 [19:13:22] Amir1: thanks, ok [19:13:23] that woudl be better than pc1008 [19:13:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:13:40] if at least stays available [19:13:46] appserver tail and average latency is down a lot [19:13:49] 10Operations, 10ops-eqiad, 10DC-Ops: mw1373 power supply redundancy ipmi alert - https://phabricator.wikimedia.org/T247755 (10wiki_willy) a:03Jclark-ctr [19:13:51] parsercache hit ratio sinking, as expected [19:13:59] sure but that was the downside [19:14:08] maybe there was 2 issues [19:14:12] a request mw one [19:14:19] and a server one taht made the issue worse [19:14:23] but too early to say [19:14:36] yeah, but something definitely happen at 18:00 [19:14:40] requests seem capped at 500 [19:14:42] mcrouter traffic is elevated [19:14:43] is taht normal? [19:14:46] slightly before [19:14:46] but slightly [19:15:01] 10Operations, 10ops-eqiad, 10DC-Ops: mw1373 power supply redundancy ipmi alert - https://phabricator.wikimedia.org/T247755 (10wiki_willy) @Jclark-ctr - this might be just a loose power cord, but if its an actually power supply that's bad, it looks like this machine is under warranty. Thanks, Willy [19:15:09] should we diff pc1008 and pc1010? 
[19:15:09] pc1010 seems stable at around 800 connections [19:15:15] jynus: also hardware diff [19:15:20] ok, appserver error rate is back to 0, average latency and 95%ile is still down [19:15:21] like raid stripe and all that [19:15:34] so, marostegui so far the idea wasn't as crazy lol [19:15:49] i am voting for 2 issues [19:15:59] this will not be free, though [19:16:05] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: update nova_policy to conform with the current API policy.json [puppet] - 10https://gerrit.wikimedia.org/r/579640 (https://phabricator.wikimedia.org/T247573) (owner: 10Andrew Bogott) [19:16:07] increaded latency for days [19:16:07] not sure if it was mentioned, mediawiki appservers workers saturation had a spike at the same time [19:16:12] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [19:16:17] volans: good to know [19:16:33] volans: the issue is to diferenciate between cause and consequences [19:16:40] yeah that seems effect [19:16:41] mutante: 2 causes seem to be happening [19:16:55] pc1008 likely to have performance issues (source unknown) [19:17:04] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:17:04] that was mitigated (so far) [19:17:12] but there is still a change in mw parsing [19:17:17] which is overloading all servers [19:17:17] jynus: ok, ack [19:17:24] the errors I see are mostly commons [19:17:25] that is the most worrying one [19:17:29] things like /w/index.php?title=Image:71_Winter_Hawk.jpg&action=render&uselang=en [19:17:30] I hope we are not getting new keys or something, that means more disk usage [19:17:41] https://logstash.wikimedia.org/goto/6268ff8273f61fb3a2ff0a9f33028c5d [19:17:45] what time did this start? [19:17:46] marostegui: that is a worry for tomorrows manuel and jaime [19:17:54] :-D [19:17:54] I think it's a bot going crazy [19:17:56] addshore: 1800 UTC [19:17:58] addshore: around 18:00 UTC [19:17:59] Amir1: could be [19:18:02] 17:58 [19:18:08] I woudl search for requests parsing lots of pages [19:18:14] old revisions [19:18:15] addshore: with alerts at 18:43 [19:18:16] the only increase in appserver CPU I see is a recent one, at about 19:12 [19:18:18] but highly parallel [19:18:24] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:18:50] can I get a status check from everyone, are we better than before in avaiulability and performance? [19:18:56] jynus: yes [19:19:01] just to make sure we stay with pc1010 [19:19:09] anyone else agrees? 
[19:19:17] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&fullscreen&panelId=10 [19:19:19] for example [19:19:21] I agree from db layer [19:19:21] pc1010 is performing better [19:19:31] but would prefer someone from app later, like cdanis [19:19:32] and stable at around 800 connections, which is still quite a lot [19:19:41] yeah, that is the main issue here [19:19:45] +1 [19:19:50] https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&fullscreen&panelId=92 [19:19:53] seems better too [19:19:55] but at least we are on a more sane state [19:19:57] ok, let's stay at pc1010 [19:20:18] when you don't know what to do, do something, as long as it is safe [19:20:26] that's my philosophy [19:20:29] 10Operations, 10Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (10leila) @elukey thanks for flagging this. @bmansurov can you look into this and let me know what the best course of action is? 
[19:20:30] jynus: appserver errors are back to normal low [19:20:30] :-D [19:20:39] responses still slow, but let's see if they recover [19:20:39] 10Operations, 10Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (10leila) p:05Triage→03High [19:20:41] https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&fullscreen&panelId=88&from=now-3h&to=now [19:20:43] mutante: pc conenctions are not normal [19:20:46] appserver CPU is slightly increased, and appserver network traffic is also increased, I think a direct result of the lower hit rate [19:20:48] they are still elevated [19:21:14] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@f5600d6]: Update mobileapps to 8a6e403 [19:21:17] but all the other graphs are returning to ttheir previous state [19:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:19] so promising [19:21:51] mutante: we have around 250 connections normally, and now we have between 700 and 1000 [19:22:02] marostegui: I will stop replication on pc1010 [19:22:10] jynus: +1 [19:22:11] forgot that [19:22:34] there is also an increase in all traffic metrics for memcached, but nothing really horrible [19:22:43] pc1007-bin.080617:259138670 [19:23:06] !log stop replication at pc1010 at pos pc1007-bin.080617:259138670 [19:23:06] so the initial issue is mitigated, but we need MW follow up on what has generated this amount of connections [19:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:18] we need to replicate the codfw one from it now [19:23:23] but that can wait [19:23:44] yeah, no rush on that [19:24:05] almost 18:00 to the second [19:24:10] elukey: Can I see the graphs? 
It might be related to my work [19:24:19] a bit before actually [19:24:26] 17:58 a first spike [19:24:45] Amir1: it is https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=now-3h&to=now but around 20:11, when manuel switched the pc shard.. [19:24:47] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=pc1008&var-port=9104 [19:24:57] ^mutate for the doc, if you add the ranges [19:24:57] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . [19:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:25:04] err 19:11 UTC [19:25:28] 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10RStallman-legalteam) @ darthmon_wmde - Happy to help with the NDAs. Could you provide me with a physical (mailing) address for thephp.cc? A physical address i... [19:25:42] maybe we should check if there were some relevant cronjobs starting at around these times [19:25:43] ? [19:25:59] sure [19:26:05] I was checking more graphs [19:26:12] the overload is happening also on the others [19:26:22] they are not healthy [19:26:25] high latency [19:26:28] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' . 
[19:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:40] mmmm [19:26:46] pc1007 and pc1009 have fully recovered [19:26:52] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=pc1009&var-port=9104&fullscreen&panelId=37 [19:26:53] really= [19:27:00] https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&fullscreen&panelId=37&var-dc=eqiad%20prometheus%2Fops&var-server=pc1007&var-port=9104 [19:27:00] I am not 100% sure of that [19:27:10] check those graphs [19:27:14] sure [19:27:19] but see latency of scapring [19:27:23] very unusual [19:27:31] pc1010 still having 700 connections [19:27:34] this is super weird :-/ [19:27:59] normally it is 200 [19:28:02] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@f5600d6]: Update mobileapps to 8a6e403 (duration: 06m 48s) [19:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:09] the others are in around 300 [19:28:20] yeah, but a massive drop to almost normal values [19:28:21] sometimes more [19:28:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:28:25] from 800 to 300 [19:28:27] what I mean is that [19:28:33] they are not as good as they used to be [19:28:38] they are just "stable" [19:28:41] but not normal [19:29:01] however, if they were worse because pc1008 [19:29:03] I don't really see any increase in parser cache generation / misses [19:29:03] https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&var-contentModel=Campaign&var-contentModel=CollaborationHubContent&var-contentModel=CollaborationListContent&var-contentModel=JsonConfig_Dashiki&var-contentModel=JsonSchema&var-contentModel=JsonZeroConfig&var-contentModel=Json_JsonConfig&var-contentModel=Map_JsonConfig&var-contentModel=MassMessageListContent&var-contentModel=Scribunto&var-contentModel=SecurePoll&var-contentModel=Tabular_JsonConfig&var-contentModel=css&var-contentModel=flow_board&var-contentModel=hit&var-contentModel=javascript&var-contentModel=json&var-contentModel=miss&var-contentModel=proofread_index&var-contentModel=proofread_page&var-contentModel=sanitized_css&var-contentModel=text&var-contentModel=wikibase_item&var-contentModel=wikibase_lexeme&var-contentModel=wikibase_property&var-contentModel=wikitext&var-contentModel=yaml&from=1584381707179&to=1584386903777 [19:29:06] i cannot see why [19:29:08] urgf, bad lin [19:29:09] link [19:29:18] minimize [19:29:22] w.wiki [19:29:26] PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 15.14 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:29:42] https://bit.ly/2WtZD5V [19:29:43] could elastic parse stuff, someone proposed?
[19:29:50] Amir1 mentioned it [19:30:08] nothing out of ordinary in the jobs [19:30:39] marostegui: jynus that elastic related one was / is [19:30:40] https://phabricator.wikimedia.org/T239931 [19:30:43] https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1 [19:31:11] coudl it be one of those edits [19:31:17] that parses half of wikipedias [19:31:18] anything I can help with right now? [19:31:21] on wikidata [19:31:23] addshore: that sanitizer is disabled for wikidata, other wikis, I don't know if it would be big enough [19:31:24] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10Bstorm) [19:31:28] or somewhere else (a template) [19:31:30] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (10Bstorm) 05Open→03Stalled While we now have an improved failover experience with these systems, th... [19:31:42] jynus: that's disabled for a really long time now [19:31:46] jynus: but would have hit all the servers the same, or at least, the recovery, pc1010 still have double connections [19:31:54] RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.175 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash [19:32:42] gwtoolsetUploadMediafileJob is that normal? 
[19:32:49] https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&fullscreen&panelId=11&from=1584365565964&to=1584387165964&var-site=eqiad&var-type=All [19:32:58] I am just looking at random graphs [19:33:59] appserver latency is still up from before, but it's within what i think is a healthy range [19:34:05] mutante: if things are "normal" maybe we can stop here [19:34:22] and try to find a root cause tomorrow [19:34:27] cdanis: yeah, we still see more connections than usual, specially on pc1010 (which replaced pc1008) [19:34:37] plus the extra latency [19:34:39] jynus: i added the grafana links with proper date range now [19:34:42] I think we need 2 different tasks, one for pc1008 and another one for mw [19:34:51] jynus: it seems like it, yes [19:34:53] cdanis: marostegui: coming from invalidating 33% of our cache disk [19:34:56] (03PS3) 10Andrew Bogott: nova policy: convert from json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/579647 [19:35:13] jynus: we've never seen more connections because of the invalidation [19:35:21] I go eat for now, Reedy can you take a look? [19:35:26] maybe pc1008 had some issues that only showed up under high load [19:35:28] He knows PC very well [19:35:29] cdanis: can you explain this one? 
[19:35:30] https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&from=now-3h&to=now&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1238&fullscreen&panelId=88 [19:35:31] We will probably need CPT involved in the MW investigation [19:35:35] lets create a ticket [19:35:39] as all the other graphs in the same dashboard have recovered [19:35:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={icinga,squid} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:35:52] and ask sam tim or brad [19:35:54] Then let's include Reedy as well on the task :-) [19:36:39] (03CR) 10Andrew Bogott: [C: 03+2] nova policy: convert from json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/579647 (owner: 10Andrew Bogott) [19:36:47] should I create it or are you already? [19:36:49] volans: reparses? [19:36:55] not sure [19:37:00] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:37:06] jynus: I am creating the MW task [19:37:12] 2 tickets? one for CPT and one to check pc1008 for hardware issues? 
[19:37:16] yep [19:37:20] doing the other one [19:37:20] well, I didn't want to separate them yet [19:37:20] me neither has been almost 30m that the other graphs have recovered in the same dashboard [19:37:21] I am creating the one for CPT [19:37:33] until we know what's the deal with pc1008 [19:37:37] but it is ok [19:37:37] * volans dinner is coming up [19:37:47] if you don't have anything immediate for me I'll be afk for a bit [19:37:49] I wanted to create just one with "overload" [19:37:51] jynus: I prefer to separate them because we may spam stuff about HW, mysql config etc [19:38:00] We can always merge them later [19:38:20] more mw errors, mostly for wikidata and commonswiki [19:38:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:38:32] oh, they should be separate, but wouldn't know what to say about pc1008 yet [19:38:38] as we depooled it blindly [19:38:50] i think these are the same errors amir was talking about [19:38:58] T247787 [19:38:58] Operations: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Dzahn) [19:38:59] T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 [19:39:17] ok, a title works for now [19:39:24] let's add that to the doc [19:39:31] and I will add some tags [19:40:13] Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (wiki_willy) a:Christopher→Cmjohnson [19:40:16] Operations, DBA, Wikimedia-Incident, Wikimedia-production-error: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (jcrespo) [19:40:28] added
under actionables. and one to follow-up with CPT. [19:40:34] yep [19:40:34] 10Operations, 10ops-eqiad, 10serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (10wiki_willy) a:05Christopher→03Cmjohnson [19:40:50] status now: pc1008 failed over to pc1010 [19:41:00] 33% less cache on disk [19:41:16] still connection/load issues in general, but to a level we can cope [19:41:41] ^if you want to copy and paste [19:42:01] added [19:42:45] hit rate went from 81% to 64% [19:43:02] which is impacting but not out of the ordinary [19:43:20] we have 3 servers precisely for that :-D [19:44:31] 10Operations, 10Core Platform Team, 10MediaWiki-Cache: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10Marostegui) p:05Triage→03High [19:44:32] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:44:33] task created ^ [19:44:45] thanks, will link them [19:44:49] ok, as you suggested i guess at this point we can stop and it's not an active incident. 
but the search for root cause continues tomorrow [19:45:05] Operations, Core Platform Team, MediaWiki-Cache: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (Marostegui) [19:45:09] mutante: yeah, I believe so [19:45:19] Operations, Core Platform Team, MediaWiki-Cache: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (jcrespo) pc1008 coincidental hw issues handled separately at T247787 [19:45:21] alright, ACK [19:46:02] if everyone thinks mw state is ok until tomorrow (now that peak time will finish) we will research more tomorrow [19:46:15] Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (Cmjohnson) Open→Resolved [19:47:03] jynus: +1 [19:47:06] Let's continue tomorrow, it is quite late for EU time [19:47:11] yep [19:47:15] Operations, DBA, Wikimedia-Incident, Wikimedia-production-error: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (jcrespo) Reminder: pc1010 stopped replication, but pc2 on codfw needs to replicate from it. [19:47:18] Thanks everyone who showed up, especially mutante for coordinating! [19:47:19] ^added that reminder [19:47:29] Operations, Core Platform Team, MediaWiki-Cache, Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (Krinkle) [19:47:36] can you paste the url of incident somewhere?
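(Editorial note) A rough reading of the hit-rate figures quoted at 19:42:45, as a sketch only. It assumes request volume stayed constant and that every parser cache miss costs exactly one re-parse; the log confirms neither. The function name miss_multiplier is illustrative and does not come from any Wikimedia tool.

```python
def miss_multiplier(hit_rate_before: float, hit_rate_after: float) -> float:
    """Ratio of cache misses after vs. before, assuming constant request volume."""
    return (1.0 - hit_rate_after) / (1.0 - hit_rate_before)


# Figures taken from the log: hit rate went from 81% to 64%.
print(round(miss_multiplier(0.81, 0.64), 2))
```

Under those assumptions a 17-point drop in hit rate means the miss rate goes from 19% to 36%, so close to twice as many requests fall through to a re-parse.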
[19:47:43] we will put it on the wiki when we can [19:47:46] https://docs.google.com/document/d/1GsyYu_ruw58SSIJrYTQncLJzoXvSoBQdWQDNrFED_iQ/edit#heading=h.vg6rb6x2eccy [19:47:48] thanks [19:48:01] we will put it public soon [19:48:10] (tomorrow) [19:48:17] on wikitech [19:48:21] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10Krinkle) [19:50:13] Ok, I am going offline [19:50:14] Thanks everyone [19:50:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:51:13] thanks DBAs, good night [19:51:54] added "having pc1010 around to be able to pool" under "what went well" [19:52:53] 10Operations, 10DBA, 10Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (10jcrespo) Both pending documentation and more research, but it is mitigated by being depooled. [19:52:57] mutante: he he [20:00:04] halfak and accraze: May I have your attention please! Services – Graphoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T2000) [20:00:48] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:01:31] (03PS1) 10Andrew Bogott: nova policy: use context_is_admin instead of admin_api [puppet] - 10https://gerrit.wikimedia.org/r/580122 [20:01:44] well, it's true :) [20:02:11] (03PS1) 10Hashar: README.md: update docker-pkg command line [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580124 [20:02:19] (03CR) 10jerkins-bot: [V: 04-1] nova policy: use context_is_admin instead of admin_api [puppet] - 10https://gerrit.wikimedia.org/r/580122 (owner: 10Andrew Bogott) [20:03:30] (03PS2) 10Andrew Bogott: nova policy: use context_is_admin instead of admin_api [puppet] - 10https://gerrit.wikimedia.org/r/580122 (https://phabricator.wikimedia.org/T247573) [20:04:06] !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw122[1-6].eqiad.wmnet [20:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:37] (03CR) 10Jforrester: [C: 03+1] README.md: update docker-pkg command line [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580124 (owner: 10Hashar) [20:04:38] !log depool (yes->no) mw1221 - mw1226 (T247780) [20:04:39] (03CR) 10Andrew Bogott: [C: 03+2] nova policy: use context_is_admin instead of admin_api [puppet] - 10https://gerrit.wikimedia.org/r/580122 (https://phabricator.wikimedia.org/T247573) (owner: 10Andrew Bogott) [20:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:45] T247780: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 [20:05:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:06:58] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash 
job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:08:16] hrmm [20:08:40] ok, that is really small though compared to earlier [20:12:00] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:13:38] (03PS1) 10Andrew Bogott: nova policy: restrict VM creation to project admins [puppet] - 10https://gerrit.wikimedia.org/r/580125 [20:18:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:21:46] 10Operations, 10ops-eqiad, 10DC-Ops: mw1373 power supply redundancy ipmi alert - https://phabricator.wikimedia.org/T247755 (10Jclark-ctr) 05Open→03Resolved Reseated power cable [20:22:18] 10Operations, 10ops-eqiad, 10DC-Ops: mw1373 power supply redundancy ipmi alert - https://phabricator.wikimedia.org/T247755 (10wiki_willy) Thanks @Jclark-ctr [20:23:26] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:28:19] (03PS6) 10CRusnov: puppetdb uservice: Add individual host queries, expand for interface automation [puppet] - 10https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153) [20:29:14] RECOVERY - IPMI Sensor Status on mw1373 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK 
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [20:32:06] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:33:28] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:34:48] PROBLEM - Host ms-be1030 is DOWN: PING CRITICAL - Packet loss = 100% [20:35:28] RECOVERY - Host ms-be1030 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [20:35:58] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [20:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:35] (CR) CRusnov: [C: +2] puppetdb uservice: Add individual host queries, expand for interface automation [puppet] - https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov) [20:37:15] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) [20:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:37:21] Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1221.eqiad.wmnet` - mw1221.eqiad.wmnet (**PASS**) - Downtimed host on Icinga - Found... [20:37:40] mutante: if you're decomming hosts and you have a bunch [20:37:51] can you try multiple at the same time too please? [20:38:26] volans: yea, will do that.
i just wanted to be careful with the first one and the right order of things [20:38:36] sure, thanks a lot! [20:38:40] doing the other 5 at once then or so [20:39:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:39:37] volans: would you agree it's "pooled => 'no'" (but not 'inactive'), run decom script, remove from site.pp and conftool at once? [20:39:56] why not inactive? [20:40:24] because when i did that before running the decom script, the icinga alerts started about "not in dsh group" [20:40:34] and the decom script does "no -> inactive" ? [20:41:08] or maybe i should separate "remove from site" and "remove from conftool" [20:41:09] the decom cookbook doesn't touch conftool [20:41:11] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/hosts/decommission.py [20:41:15] ok [20:41:24] list of actions at the top or running it with -h [20:42:07] !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw122[1-6].eqiad.wmnet [20:42:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:17] !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission [20:43:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:53] volans: ATTENTION: destructive action for 5 hosts: mw[1222-1226].eqiad.wmnet [20:44:02] jouncebot: next [20:44:02] In 0 hour(s) and 15 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T2100) [20:44:15] mutante: yep :D [20:44:18] i gotta remove from conftool right after [20:44:21] before the deploy [20:44:24] k [20:44:30] or they get errors on scap [20:44:38] sure, once the decom has run they are gone [20:44:50] you can remove from 
everything but mgmt just in case [20:44:54] **Failed to wipe bootloaders, manual intervention required to make it unbootable [20:45:02] on one of them [20:45:07] was a broken one? [20:45:11] could we ssh into it? [20:45:34] i did not test that, it was 1223 [20:45:39] not the one already depooled [20:45:41] so it was pooled [20:45:48] the others did not have the issue [20:45:56] !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) [20:46:03] Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1222-1226].eqiad.wmnet` - mw1222.eqiad.wmnet (**PASS**) - Downtimed host on Icinga... [20:46:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:07] but it should have been working or we would have noticed in icinga before [20:46:35] let me see the logs [20:46:37] volans: well, it's exit_code=1 at the end but just because of that one step, all else looks good [20:46:46] k [20:47:25] (PS4) Dzahn: site/conftool: remove mw1221 through mw1226 [puppet] - https://gerrit.wikimedia.org/r/580101 (https://phabricator.wikimedia.org/T247780) [20:47:58] the wipe command returned 123 as status code [20:48:18] we could try to power it up again and see if I can repro [20:48:42] yea, it was serving until recently https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=mw1223&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver [20:48:47] depending at which part of the command it failed, it might have partially worked making the repro harder [20:48:48] ok, let me see if it boots [20:49:06] well..
after the merge [20:49:16] because of the deployment in 10 min [20:49:36] sure no hurry [20:50:25] (03CR) 10Dzahn: [C: 03+2] "depooled to inactive and ran decom cookbook" [puppet] - 10https://gerrit.wikimedia.org/r/580101 (https://phabricator.wikimedia.org/T247780) (owner: 10Dzahn) [20:52:42] !log 5 old API appservers in eqiad removed [20:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:54:44] !log powercycling mw1223 [20:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:27] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:55:57] hmmm.. kind of want a cumin alias for scap proxies to run puppet on them [20:56:43] volans: it boots into PXE / Debian installer [20:56:53] did not merge the DHCP removal :p [20:57:10] ok, so I guess no repro :D [20:58:01] !log mw1223 power down [20:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:11] thanks for trying [20:58:15] yw [20:58:37] (03PS3) 10Dzahn: DHCP: remove mw1221 through mw1226 [puppet] - 10https://gerrit.wikimedia.org/r/580105 (https://phabricator.wikimedia.org/T247780) [20:59:06] mutante: C:profile::mediawiki::scap_proxy? [20:59:19] volans: overall cookbook works very well. thx [20:59:30] wait no, that's every mw host o_O [20:59:31] rlazarus: ah, thanks, that looks good [20:59:41] heh [20:59:53] oh, I see # Sets up an rsync proxy for scap, if the server is set up to be one [20:59:58] mutante: thanks for testing it! [21:00:04] Reedy and sbassett: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T2100). 
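(Editorial note) The log uses two host notations during the decommission: the conftool selector name=mw122[1-6].eqiad.wmnet and the folded range mw[1222-1226].eqiad.wmnet reported by the cookbook. The sketch below shows how the two readings differ. It assumes the conftool selector value is interpreted as a regular expression (so [1-6] is a character class), and expand_range is a simplified, hypothetical helper written for this note, not the library that cumin or the cookbook actually uses.

```python
import re


def expand_range(folded: str) -> list[str]:
    """Expand a single 'prefix[START-END]suffix' numeric range (illustration only)."""
    match = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", folded)
    if match is None:
        return [folded]  # nothing to expand
    prefix, start, end, suffix = match.groups()
    width = len(start)  # keep zero padding, if any
    return [f"{prefix}{n:0{width}d}{suffix}" for n in range(int(start), int(end) + 1)]


# Folded range as printed by the decommission cookbook in the log.
decom_batch = expand_range("mw[1222-1226].eqiad.wmnet")

# Selector from the conftool log line, read as a regular expression.
# The unescaped dots are kept exactly as they appear in the log.
selector = re.compile(r"mw122[1-6].eqiad.wmnet")
candidates = [f"mw{n}.eqiad.wmnet" for n in range(1219, 1229)]
depooled = [host for host in candidates if selector.fullmatch(host)]

print(len(depooled), len(decom_batch))
print(sorted(set(depooled) - set(decom_batch)))
```

Read this way, the selector covers six hosts and the folded range five; the difference is mw1221, which matches the log, where mw1221 was decommissioned on its own first.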
[21:00:15] Reedy: are you deploying something? [21:00:20] Nope [21:00:29] ok, nice [21:00:46] then i won't worry about forcing puppet [21:01:11] mutante: 'R:ferm::service = rsyncd_scap_proxy' [21:01:43] rlazarus: ^^^ [21:01:52] nice, thanks [21:02:09] oh, via ferm, nice [21:02:35] alternatively 'R:rsync::server::module%path = /srv/mediawiki' [21:02:56] tries it.. 9 hosts [21:03:12] those seems to be the only two matching things looking at modules/profile/manifests/mediawiki/scap_proxy.pp [21:03:17] I'd use the first one, seems more safe [21:03:23] and simpler :D [21:03:54] using the ferm::service is a nice one that would work for other things as well [21:04:06] 100.0% (9/9) success ratio [21:04:30] ok, i hope 6 servers already makes a difference for power usage [21:04:37] doing more later [21:04:43] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:04:54] maybe more like tomorrow [21:05:44] (03PS1) 10Brian Wolff: Add prod domains to beta CSP policy to allow easier gadget testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580127 [21:12:07] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:13:55] (03CR) 10DannyS712: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/580127 (owner: 10Brian Wolff) [21:21:34] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:22:08] (03PS1) 10Hashar: Add an image for python2 app based on Buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580128 (https://phabricator.wikimedia.org/T215458) [21:23:33] (03CR) 10Hashar: "I have build it locally with:" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580128 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [21:24:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:24:27] (03CR) 10Hashar: "Here is a python2-build-buster image, that is to be able to craft the wheels for Zuul which is a python 2 app. We are too short to attempt" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580128 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [21:24:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:25:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:27:49] (03PS2) 10Hashar: Add an image for python2 app based on Buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580128 (https://phabricator.wikimedia.org/T215458) [21:28:42] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:29:48] (03CR) 10Andrew Bogott: [C: 03+2] nova policy: restrict VM creation to project admins [puppet] - 10https://gerrit.wikimedia.org/r/580125 (owner: 10Andrew Bogott) [21:33:00] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:34:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:34:47] (03PS1) 10Andrew Bogott: nova policy: restrict VM update to projectadmin [puppet] - 10https://gerrit.wikimedia.org/r/580129 [21:35:59] (03CR) 10Andrew Bogott: [C: 03+2] nova policy: restrict VM update to projectadmin [puppet] - 10https://gerrit.wikimedia.org/r/580129 (owner: 10Andrew Bogott) [21:37:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:44:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:51:20] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={squid,swagger_check_mobileapps_cluster_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:52:44] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:53:33] (03PS3) 10Hashar: Add an image for python2 app based on Buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/580128 (https://phabricator.wikimedia.org/T215458) [21:53:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:54:54] (03CR) 10BryanDavis: "> LGTM. The benefit of using this vs using kubectl directly is that" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/578412 (owner: 10BryanDavis) [21:57:46] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:58:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:06:16] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:07:06] (03PS1) 10Cmjohnson: Add kafka-jumbo100[7-9] to netboot.cfg and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/580132 (https://phabricator.wikimedia.org/T244506) [22:07:48] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: 10Hashar) [22:09:40] (03PS2) 10Cmjohnson: Add kafka-jumbo100[7-9] to netboot.cfg and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/580132 (https://phabricator.wikimedia.org/T244506) [22:11:56] (03CR) 10Cmjohnson: [C: 03+2] Add kafka-jumbo100[7-9] to netboot.cfg and dhcpd file [puppet] - 10https://gerrit.wikimedia.org/r/580132 (https://phabricator.wikimedia.org/T244506) (owner: 10Cmjohnson) [22:12:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:13:24] (03PS1) 10Bstorm: toolforge-k8s: remove old legacy-cluster code from bastion [puppet] - 10https://gerrit.wikimedia.org/r/580134 (https://phabricator.wikimedia.org/T246689) [22:13:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: 
job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:17:27] (03CR) 10Alex Monk: [C: 03+1] toolforge-k8s: remove old legacy-cluster code from bastion [puppet] - 10https://gerrit.wikimedia.org/r/580134 (https://phabricator.wikimedia.org/T246689) (owner: 10Bstorm) [22:18:47] i feel like i'm seeing a pattern of errors that echoes thursday's T247553 [22:19:23] at any rate, something seems to be up [22:21:18] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:22:48] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [22:25:24] (03CR) 10BryanDavis: "I think based on T172693 that we don't actually have these tables passing through from the sanitarium servers. I have an open task to actu" [puppet] - 10https://gerrit.wikimedia.org/r/579800 (owner: 10Alex Monk) [22:28:50] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:32:56] This issue might be related to what we saw today: https://phabricator.wikimedia.org/T247562 marostegui [22:34:48] Amir1, marostegui: any thoughts on that one are welcome. train still blocked and i'm at a bit of a loss whose attention to bring it to or how to further investigate. 
[22:35:58] brennen: sure, I'm at end of my day though [22:36:05] I will take a look ASAP [22:36:22] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:38:48] Amir1: ta - i'm also nearing EOD and i am unlikely to roll forward at this point, but it would be good if it were resolved before hashar is supposed to be cutting new branch tomorrow [22:39:04] still blocked ? :( [22:39:14] yes. :( [22:39:36] do we have any idea which code causes the issue? [22:40:10] if it is triggered by a specific api call for example, we might be able to identify the culprit [22:41:05] I went back and looked at a few of the endpoints that triggered the problem, afaict it wasn't limited to the api [22:42:25] (PS1) RLazarus: Add a default User-Agent. [software/httpbb] - https://gerrit.wikimedia.org/r/580135 [22:43:03] I am not savvy enough in mw anymore unfortunately :( [22:43:03] given that, I'd guess it's something more systemic; i.e., something adding pressure to the system that the system can't handle or something deep down in terms of layers of abstraction [22:43:54] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={squid,swagger_check_cxserver_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:45:32] https://phabricator.wikimedia.org/T247562#5974273 [22:49:06] nice [22:49:36] (CR) Ppchelko: "> Patch Set 4:" [deployment-charts] - https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: Holger Knust) [22:50:21] that feels bad, hashar [22:50:35] so something too large is stored in the cache [22:50:40] and somehow overloads memcached?
[22:51:18] 10Operations, 10Core Platform Team, 10MediaWiki-Cache, 10Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (10brennen) May be related to T247562. [22:51:20] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:51:38] (03PS1) 10Andrew Bogott: Keystone: convert policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580137 (https://phabricator.wikimedia.org/T247795) [22:51:40] (03PS1) 10Andrew Bogott: glance: move python.json files to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580138 (https://phabricator.wikimedia.org/T247795) [22:51:42] (03PS1) 10Andrew Bogott: designate: move policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580139 (https://phabricator.wikimedia.org/T247795) [22:51:52] parsercache would be... consistent with large values? 
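(Editorial note) On the "something too large is stored in the cache" hypothesis: a minimal sketch of a size check against memcached's default maximum item size of 1 MB. Whether the production memcached instances run with the default limit, and whether oversized values are actually involved in T247562, are open questions in the log; the helper name and the sample payloads are made up for illustration.

```python
# memcached's default maximum item size, in bytes; it can be raised at startup,
# so treat this value as an assumption about the deployment.
MEMCACHED_DEFAULT_ITEM_LIMIT = 1024 * 1024


def exceeds_item_limit(serialized: bytes, limit: int = MEMCACHED_DEFAULT_ITEM_LIMIT) -> bool:
    """Return True if an already-serialized value is larger than the item limit.

    The real limit also covers the key and per-item overhead, so a value just
    under the limit may still be rejected.
    """
    return len(serialized) > limit


small_value = b"x" * 512                 # hypothetical small cache entry
large_value = b"x" * (2 * 1024 * 1024)   # hypothetical 2 MiB entry

print(exceeds_item_limit(small_value), exceeds_item_limit(large_value))
```

A check like this only says whether a value could be stored at all; it says nothing about eviction pressure from many moderately large values, which is what the slab dashboards would show.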
[22:52:26] (03CR) 10jerkins-bot: [V: 04-1] Keystone: convert policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580137 (https://phabricator.wikimedia.org/T247795) (owner: 10Andrew Bogott) [22:53:17] (03CR) 10jerkins-bot: [V: 04-1] designate: move policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580139 (https://phabricator.wikimedia.org/T247795) (owner: 10Andrew Bogott) [22:55:28] (03PS2) 10Andrew Bogott: Keystone: convert policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580137 (https://phabricator.wikimedia.org/T247795) [22:55:30] (03PS2) 10Andrew Bogott: glance: move python.json files to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580138 (https://phabricator.wikimedia.org/T247795) [22:55:33] (03PS2) 10Andrew Bogott: designate: move policy.json to yaml [puppet] - 10https://gerrit.wikimedia.org/r/580139 (https://phabricator.wikimedia.org/T247795) [22:57:47] brennen: thcipriani we would need someone familiar with memcached maybe [22:57:52] there are bunch of dashboards on grafana [22:58:03] https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1 [22:58:37] that shows spikes of evictions for example [22:58:48] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:00:04] RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
[23:01:25] brennen: thcipriani can you let me know on the deployment task what I should do tomorrow [23:01:35] e.g. whether we freeze/pause or whether I still cut the branch :] [23:01:46] I guess I can cut / deploy to group0 [23:01:59] but 3 versions might be a nightmare [23:02:55] hashar: yeah, i feel like 3 versions is a bad idea. [23:03:01] (CR) Bstorm: [C: +2] toolforge-k8s: remove old legacy-cluster code from bastion [puppet] - https://gerrit.wikimedia.org/r/580134 (https://phabricator.wikimedia.org/T246689) (owner: Bstorm) [23:03:09] will leave notes on the task. [23:04:30] same [23:05:23] Operations: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (colewhite) That CSP works well. I think cas needs to respond with an appropriate Access-Control-Allow-Origin. https://apereo.github.io/cas/5.2.x/installation/Configuration-Properties.html#http-web-requests Observation from test... [23:13:44] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:14:27] !log reset email for "MNadrofsky (WMF)" on SUL and officewiki [23:14:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:39] (PS3) Alex Monk: Add public replica view for oauth_registered_consumer [puppet] - https://gerrit.wikimedia.org/r/579800 [23:18:44] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:28:46] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:31:14] (03CR) 10Volans: [C: 03+1] "Looks sae to me" [software/httpbb] - 10https://gerrit.wikimedia.org/r/580135 (owner: 10RLazarus) [23:32:41] marostegui: btw. db1126 has lots of CPU usage, I doubt it's my cache warming up (it's not showing itself anywhere else), is it something known? [23:32:48] (03PS4) 10Alex Monk: Add public replica view for oauth_registered_consumer [puppet] - 10https://gerrit.wikimedia.org/r/579800 (https://phabricator.wikimedia.org/T247800) [23:33:46] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:41:10] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:48:34] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:52:24] PROBLEM - snapshot of s4 in codfw on db1115 is CRITICAL: snapshot for s4 at codfw taken more than 3 days ago: Most recent backup 2020-03-13 23:42:48 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [23:56:02] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets