[00:25:30] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:30:40] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:52:00] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:55:08] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:57:10] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:57:22] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:57:42] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[00:59:54] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:00:50] <icinga-wm>	 RECOVERY - Check systemd state on boron is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:07:52] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:10:06] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:10:26] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:10:28] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:11:44] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:13:02] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:14:14] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:18:15] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:21:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:21:30] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:22:40] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:26:48] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:29:52] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:31:52] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 40
[01:31:52] <icinga-wm>	 ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:35:04] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:37:02] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:37:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:48:42] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:48:52] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:51:12] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:56:32] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[01:58:00] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:05:58] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:10:00] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:14:28] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:16:24] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:16:46] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:23:52] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:24:18] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:25:12] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:29:24] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:30:16] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:31:50] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returne
[02:31:50] <icinga-wm>	 status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:34:28] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:35:26] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:35:50] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:38:22] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:42:10] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:43:06] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:44:30] <wikibugs>	 Operations, LDAP-Access-Requests: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (Jpita)
[02:47:16] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:50:52] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:53:26] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:54:54] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:58:32] <icinga-wm>	 PROBLEM - Long running screen/tmux on people1001 is CRITICAL: CRIT: Long running tmux process. (user: cdanis PID: 18618, 1729058s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[02:58:40] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[02:59:56] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:00:46] <wikibugs>	 Operations, Release-Engineering-Team, serviceops, Performance-Team (Radar), and 2 others: All debug hosts give (likely spurious) message: PHP Fatal error:  The UdpSocket to 127.0.0.1:10514 has been closed (from Monolog/SyslogUdp) - https://phabricator.wikimedia.org/T214734 (Krinkle)
[03:01:14] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:02:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:02:34] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:03:28] <wikibugs>	 Operations, CommRel-Specialists-Support, Core Platform Team, Editing-team, and 10 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (Krinkle)
[03:04:03] <wikibugs>	 Operations, Performance-Team, serviceops: Test gutter pool failover in production and memcached 1.5.x - https://phabricator.wikimedia.org/T240684 (Krinkle)
[03:06:00] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:06:00] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:08:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:08:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:12:18] <icinga-wm>	 PROBLEM - Check systemd state on boron is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:38] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:19:12] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:21:50] <wikibugs>	 (PS1) Guozr.im: RemoteExecution: Typo in class CommandReturn [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882
[03:22:54] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:26:59] <wikibugs>	 (CR) Guozr.im: "Hi guys," [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 (owner: Guozr.im)
[03:30:34] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:37:52] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:38:14] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:40:48] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:45:34] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:46:54] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 503 (ex
[03:46:54] <icinga-wm>	 domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:47:20] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:52:32] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:54:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:54:58] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:55:02] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:56:34] <icinga-wm>	 RECOVERY - Too many messages in kafka logging-eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=eqiad+prometheus/ops&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[03:58:20] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:58:40] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[03:59:38] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[04:01:14] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[04:02:10] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[04:03:24] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[04:07:42] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[04:10:12] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[04:32:34] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[04:33:10] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[04:35:08] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[04:35:46] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[04:54:48] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[04:57:22] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[04:58:24] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:07:42] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:08:36] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:12:46] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:20:16] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:22:48] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:28:02] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:34:06] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:34:06] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:34:34] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:36:40] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:36:40] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:37:06] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:40:42] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:40:48] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:43:16] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:44:18] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 40
[05:44:18] <icinga-wm>	 ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:49:22] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:51:52] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:54:26] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[05:56:48] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:02:34] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:05:08] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:07:00] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:11:18] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:13:50] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:17:40] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:17:54] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:20:14] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:23:00] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:25:02] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:30:54] <icinga-wm>	 PROBLEM - ores uWSGI web app on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Services/ores
[06:31:04] <icinga-wm>	 PROBLEM - MD RAID on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:31:22] <icinga-wm>	 PROBLEM - dhclient process on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[06:31:34] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 40
[06:31:34] <icinga-wm>	 ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:31:44] <icinga-wm>	 PROBLEM - Check size of conntrack table on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:31:46] <icinga-wm>	 PROBLEM - configured eth on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[06:31:50] <icinga-wm>	 PROBLEM - DPKG on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[06:31:50] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:32:12] <icinga-wm>	 PROBLEM - Check systemd state on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:28] <icinga-wm>	 PROBLEM - Disk space on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1002&var-datasource=eqiad+prometheus/ops
[06:32:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:34:06] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:35:06] <icinga-wm>	 PROBLEM - puppet last run on ores1002 is CRITICAL: connect to address 10.64.0.52 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:38:20] <_joe_>	 !log restart envoy with 10 requests per connection on mw2231, T247484
[06:38:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:26] <stashbot>	 T247484: Lots of "EventBus: Unable to deliver all events" - https://phabricator.wikimedia.org/T247484
[06:40:22] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:41:08] <icinga-wm>	 RECOVERY - MD RAID on ores1002 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[06:41:26] <icinga-wm>	 RECOVERY - dhclient process on ores1002 is OK: PROCS OK: 0 processes with command name dhclient https://wikitech.wikimedia.org/wiki/Monitoring/check_dhclient
[06:41:36] <icinga-wm>	 RECOVERY - puppet last run on ores1002 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[06:41:50] <icinga-wm>	 RECOVERY - Check size of conntrack table on ores1002 is OK: OK: nf_conntrack is 0 % full https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack
[06:41:50] <icinga-wm>	 RECOVERY - configured eth on ores1002 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[06:41:54] <icinga-wm>	 RECOVERY - DPKG on ores1002 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[06:41:56] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on ores1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[06:41:59] <wikibugs>	 Operations: decom racktables? - https://phabricator.wikimedia.org/T247646 (MoritzMuehlenhoff) I think deploying it on Buster will be unproblematic, the current host is already on Stretch, so the big incompatibilities between PHP 5 and 7 are already addressed. Racktables is also still maintained (last mainten...
[06:42:20] <icinga-wm>	 RECOVERY - Check systemd state on ores1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:42:34] <icinga-wm>	 RECOVERY - Disk space on ores1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=ores1002&var-datasource=eqiad+prometheus/ops
[06:42:52] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:42:52] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:44:58] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:45:05] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[06:45:26] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:47:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:50:08] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:52:40] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[06:54:29] <moritzm>	 !log removing some library packages from jessie/stretch after labstore1006/1007 dist-upgrade to buster
[06:54:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:01:05] <wikibugs>	 (PS2) Brian Wolff: Add wikidata.beta.wmflabs.org to beta csp [mediawiki-config] - https://gerrit.wikimedia.org/r/578183
[07:02:00] <wikibugs>	 Operations, DBA, OTRS, Recommendation-API, Research: Upgrade and restart m2 primary database master (db1132) - https://phabricator.wikimedia.org/T246098 (Marostegui)
[07:02:09] <wikibugs>	 (CR) jerkins-bot: [V: -1] Add wikidata.beta.wmflabs.org to beta csp [mediawiki-config] - https://gerrit.wikimedia.org/r/578183 (owner: Brian Wolff)
[07:02:44] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:03:12] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:03:58] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:05:12] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:05:16] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:05:46] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:06:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:06:50] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:07:44] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:09:08] <wikibugs>	 (PS5) Elukey: admin: simplify and document some analytics posix groups [puppet] - https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578)
[07:09:20] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:12:13] <wikibugs>	 (CR) Elukey: [C: +2] admin: simplify and document some analytics posix groups [puppet] - https://gerrit.wikimedia.org/r/579228 (https://phabricator.wikimedia.org/T246578) (owner: Elukey)
[07:14:49] <moritzm>	 !log installing libgd2 security updates on jessie
[07:14:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:20:08] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:21:34] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:22:40] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:24:04] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:24:26] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:24:51] <wikibugs>	 (PS1) Marostegui: Revert "install_server: Allow reimage db2121" [puppet] - https://gerrit.wikimedia.org/r/579886
[07:25:14] <wikibugs>	 (PS2) Marostegui: Revert "install_server: Allow reimage db2121" [puppet] - https://gerrit.wikimedia.org/r/579886
[07:26:58] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:27:53] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "install_server: Allow reimage db2121" [puppet] - https://gerrit.wikimedia.org/r/579886 (owner: Marostegui)
[07:30:16] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:32:46] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:36:42] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:40:44] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:41:42] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:45:46] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:47:15] <moritzm>	 !log installing lxml security updates
[07:47:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:16] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:51:08] <wikibugs>	 Operations, Traffic: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968 (ema) Resolved→Open The issue occurred again on cp4025. Reopening.  ` Mar 14 15:51:49 cp4025 varnishd[20511]: Child (20592) not responding to CLI, killed it. Mar 14 15:51:49 cp4025 varnishd[205...
[07:52:42] <ema>	 !log cp4025: restart varnish-fe to clear 'child restarted' alert T185968
[07:52:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:52:46] <stashbot>	 T185968: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968
[07:54:16] <icinga-wm>	 RECOVERY - Varnish frontend child restarted on cp4025 is OK: (C)2 ge (W)2 ge 0 https://wikitech.wikimedia.org/wiki/Varnish https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?panelId=66&fullscreen&orgId=1&var-server=cp4025&var-datasource=ulsfo+prometheus/ops
[07:54:16] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[07:54:28] <wikibugs>	 (PS1) KartikMistry: apertium-es-pt: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-es-pt] - https://gerrit.wikimedia.org/r/579889 (https://phabricator.wikimedia.org/T247585)
[07:57:10] <wikibugs>	 (PS3) KartikMistry: apertium-br-fr: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-br-fr] - https://gerrit.wikimedia.org/r/579463 (https://phabricator.wikimedia.org/T247585)
[07:57:40] <moritzm>	 !log installing libxslt security updates
[07:57:41] <wikibugs>	 (PS2) KartikMistry: apertium-cy-en: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-cy-en] - https://gerrit.wikimedia.org/r/579683 (https://phabricator.wikimedia.org/T247585)
[07:57:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:20] <wikibugs>	 (PS3) KartikMistry: apertium-cat-ita: Fix FTBFS with apertium 3.6 + 0.2.1 release [debs/contenttranslation/apertium-ca-it] - https://gerrit.wikimedia.org/r/579509 (https://phabricator.wikimedia.org/T247585)
[07:58:43] <wikibugs>	 Operations, Traffic: OOM killer killed varnihsd cache-main on cp3053 - https://phabricator.wikimedia.org/T247195 (ema)
[07:58:46] <wikibugs>	 Operations, Traffic: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968 (ema)
[07:59:26] <wikibugs>	 (PS2) KartikMistry: apertium-en-es: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-en-es] - https://gerrit.wikimedia.org/r/579757 (https://phabricator.wikimedia.org/T247585)
[08:02:48] <ema>	 !log cp4025 restart trafficserver-tls to clear 'tls process restarted' alert T241593 T185968
[08:02:52] <wikibugs>	 (PS3) Brian Wolff: Add wikidata.beta.wmflabs.org + prod domains to beta csp [mediawiki-config] - https://gerrit.wikimedia.org/r/578183
[08:02:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:54] <wikibugs>	 (PS1) Brian Wolff: Make CSP enforce on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/579890 (https://phabricator.wikimedia.org/T244124)
[08:02:55] <stashbot>	 T185968: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968
[08:02:55] <stashbot>	 T241593: cp1083: ats-tls and varnish-fe crashed due to insufficient memory - https://phabricator.wikimedia.org/T241593
[08:06:32] <icinga-wm>	 RECOVERY - traffic_server tls process restarted on cp4025 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=ulsfo+prometheus/ops&var-instance=cp4025&var-layer=tls
[08:08:30] <icinga-wm>	 ACKNOWLEDGEMENT - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100% Elukey T247561
[08:08:30] <wikibugs>	 (CR) Jcrespo: "The diff is ok, but the commit messages doesn't follow the guidelines:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 (owner: Guozr.im)
[08:09:29] <wikibugs>	 (CR) Jcrespo: "Note there is not a single verb + subject sentence on the commit, all of them should be, IMHO." [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 (owner: Guozr.im)
[08:13:04] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:13:52] <wikibugs>	 (PS1) Jcrespo: mariadb-backups: Update RemoteCommandExecution to the latest version [puppet] - https://gerrit.wikimedia.org/r/579891
[08:15:36] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:15:58] <marostegui>	 !log Review and enable events on recently migrated 10.4 hosts - T247728
[08:16:02] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:16:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:04] <stashbot>	 T247728: Events set to SLAVESIDE_DISABLED when upgrading from 10.1 to 10.4 - https://phabricator.wikimedia.org/T247728
[08:16:56] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:18:34] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:19:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[08:24:40] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:27:14] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:31:06] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:31:20] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:38:22] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:38:38] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:39:48] <wikibugs>	 (PS2) Jcrespo: mariadb-backups: Update RemoteCommandExecution to the latest version [puppet] - https://gerrit.wikimedia.org/r/579891
[08:39:50] <wikibugs>	 (PS1) Jcrespo: mariadb-backups: Move x1 backups from dbprov[12]001 to dbprov[12]002 [puppet] - https://gerrit.wikimedia.org/r/579892 (https://phabricator.wikimedia.org/T138562)
[08:40:52] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:41:26] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:41:36] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:41:38] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:42:24] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb-backups: Move x1 backups from dbprov[12]001 to dbprov[12]002 [puppet] - https://gerrit.wikimedia.org/r/579892 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[08:42:49] <wikibugs>	 (PS2) Jcrespo: mariadb-backups: Move x1 backups from dbprov[12]001 to dbprov[12]002 [puppet] - https://gerrit.wikimedia.org/r/579892 (https://phabricator.wikimedia.org/T138562)
[08:43:01] <wikibugs>	 (PS3) Jcrespo: mariadb-backups: Move x1 backups from dbprov[12]001 to dbprov[12]002 [puppet] - https://gerrit.wikimedia.org/r/579892 (https://phabricator.wikimedia.org/T138562)
[08:44:10] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:49:08] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:51:05] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] "The topic was wrong btw. It's eventgate-analytics, not evenstreams." [deployment-charts] - https://gerrit.wikimedia.org/r/579324 (https://phabricator.wikimedia.org/T247484) (owner: Alexandros Kosiaris)
[08:51:22] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:51:40] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:52:40] <wikibugs>	 (PS1) KartikMistry: Enable Content Translation in Malay, Azerbaijani and Estonian WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/579893 (https://phabricator.wikimedia.org/T246622)
[08:53:38] <wikibugs>	 (CR) Jcrespo: [C: +2] mariadb-backups: Move x1 backups from dbprov[12]001 to dbprov[12]002 [puppet] - https://gerrit.wikimedia.org/r/579892 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[08:53:41] <wikibugs>	 (CR) jerkins-bot: [V: -1] Enable Content Translation in Malay, Azerbaijani and Estonian WPs [mediawiki-config] - https://gerrit.wikimedia.org/r/579893 (https://phabricator.wikimedia.org/T246622) (owner: KartikMistry)
[08:53:46] <wikibugs>	 (PS4) Jcrespo: mariadb-backups: Move x1 backups from dbprov[12]001 to dbprov[12]002 [puppet] - https://gerrit.wikimedia.org/r/579892 (https://phabricator.wikimedia.org/T138562)
[08:53:52] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:54:10] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[08:55:50] <wikibugs>	 (CR) KartikMistry: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/579893 (https://phabricator.wikimedia.org/T246622) (owner: KartikMistry)
[08:55:58] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 4.328e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:00:11] <wikibugs>	 (PS1) Jcrespo: mariadb-backups: Change bacula backup frequency of dbprov to weekly [puppet] - https://gerrit.wikimedia.org/r/579894 (https://phabricator.wikimedia.org/T138562)
[09:02:46] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article t
[09:02:46] <icinga-wm>	  unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:03:30] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:07:48] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:08:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:22:59] <wikibugs>	 (CR) Jcrespo: [C: +2] mariadb-backups: Change bacula backup frequency of dbprov to weekly [puppet] - https://gerrit.wikimedia.org/r/579894 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[09:23:02] <wikibugs>	 Operations, LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (darthmon_wmde)
[09:23:40] <wikibugs>	 Operations, LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (darthmon_wmde) Do they need to have a phabricator/wikitech account first, @Aklapper ?
[09:24:26] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:26:28] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:26:58] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:28:42] <icinga-wm>	 PROBLEM - WDQS high update lag on wdqs1005 is CRITICAL: 4.332e+04 ge 4.32e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[09:29:57] <wikibugs>	 (PS2) Jcrespo: mariadb-backups: Change bacula backup frequency of dbprov to weekly [puppet] - https://gerrit.wikimedia.org/r/579894 (https://phabricator.wikimedia.org/T138562)
[09:30:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote es1011 to es2 master, this is a NOOP T239791', diff saved to https://phabricator.wikimedia.org/P10700 and previous config saved to /var/cache/conftool/dbconfig/20200316-093048-marostegui.json
[09:30:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:30:55] <stashbot>	 T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791
[09:31:30] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:32:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es1015 for upgrade and restart T239791', diff saved to https://phabricator.wikimedia.org/P10701 and previous config saved to /var/cache/conftool/dbconfig/20200316-093228-marostegui.json
[09:32:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:42] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:35:14] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:35:28] <wikibugs>	 (CR) Volans: "Couple of questions inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov)
[09:37:28] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_eventgate_analytics_external_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:39:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[09:41:16] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:42:02] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the
[09:42:02] <icinga-wm>	 s 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:43:50] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:44:36] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:45:02] <wikibugs>	 Operations, SRE-Access-Requests: Requesting access to event logging data in hive for joewalsh - https://phabricator.wikimedia.org/T247636 (ArielGlenn) If there are no objection on this task by Wednesday Mar 18, I'll prepare a patch and this request can go ahead.
[09:45:16] <wikibugs>	 Operations, Research: reccommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (elukey)
[09:45:32] <wikibugs>	 Operations, Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (elukey)
[09:46:15] <icinga-wm>	 ACKNOWLEDGEMENT - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title
[09:46:15] <icinga-wm>	 xpected status 404 (expecting: 200) Elukey T247732 https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:46:15] <icinga-wm>	 ACKNOWLEDGEMENT - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) Elukey T247732 https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:46:27] <elukey>	 ok acked
[09:51:45] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:52:49] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[09:55:24] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] kibana: refactor kibana role to kibana profile [puppet] - https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: Mstyles)
[09:56:58] <wikibugs>	 (PS3) Jcrespo: mariadb-backups: Update RemoteCommandExecution to the latest version [puppet] - https://gerrit.wikimedia.org/r/579891
[09:57:00] <wikibugs>	 (PS1) Jcrespo: mariadb-backups: Change bacula backup frequency of dbprov to weekly [puppet] - https://gerrit.wikimedia.org/r/579901 (https://phabricator.wikimedia.org/T138562)
[09:58:09] <wikibugs>	 (PS2) Jcrespo: mariadb-backups: Change bacula backup frequency of dbprov to weekly [puppet] - https://gerrit.wikimedia.org/r/579901 (https://phabricator.wikimedia.org/T138562)
[09:58:54] <wikibugs>	 Operations, LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (ArielGlenn) @darthmon_wmde ldap entries are predicated on having a wikitech account. A manager from WMDE should sign off on this here on the task as well. And...
[09:59:11] <wikibugs>	 Operations, Traffic: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968 (ema) The OOM killer intervened due to "Normal" (non-DMA) free memory on NUMA node 0 going below min (1380412 < 1387544):  ` [Sat Mar 14 15:51:23 2020] Node 0 Normal free:1380412kB min:1387544kB low:17...
[10:01:28] <wikibugs>	 (CR) Jcrespo: [C: +2] mariadb-backups: Change bacula backup frequency of dbprov to weekly [puppet] - https://gerrit.wikimedia.org/r/579901 (https://phabricator.wikimedia.org/T138562) (owner: Jcrespo)
[10:02:20] <Amir1>	 !log start of ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --batch-size=50 --sleep=0 --file=15march2217-holes-nulls.list on screen (T219123)
[10:02:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:25] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[10:04:02] <wikibugs>	 (CR) Elukey: kibana: refactor kibana role to kibana profile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/579396 (https://phabricator.wikimedia.org/T246961) (owner: Mstyles)
[10:06:14] <wikibugs>	 (CR) Filippo Giunchedi: ELk7: add curator job to require disktype hdd after 7 days (2 comments) [puppet] - https://gerrit.wikimedia.org/r/579422 (https://phabricator.wikimedia.org/T247376) (owner: Herron)
[10:09:23] <wikibugs>	 (CR) Filippo Giunchedi: prometheus::ops: Add prometheus job to scrape Netbox scripts (2 comments) [puppet] - https://gerrit.wikimedia.org/r/576459 (https://phabricator.wikimedia.org/T243927) (owner: CRusnov)
[10:10:45] <wikibugs>	 Operations, LDAP-Access-Requests: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (ArielGlenn) Hi @Jpita, because your phabricator account is seemingly not linked to your official (https://meta.wikimedia.org/wiki/Special:CentralAuth/JPita_(WMF)) account, I can't easily v...
[10:10:57] <marostegui>	 !log Stop mysql for upgrade on es1015 T239791
[10:11:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:11:03] <stashbot>	 T239791: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791
[10:12:38] <wikibugs>	 Operations, Puppet, DBA, User-jbond: DB: perform rolling restart of mariadb daemons to pick up CA changes - https://phabricator.wikimedia.org/T239791 (Marostegui)
[10:13:38] <wikibugs>	 Operations, LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (Aklapper) >>! In T247731#5971402, @darthmon_wmde wrote: > Do they need to have a phabricator/wikitech account first?  @darthmon_wmde: See the instructions on...
[10:13:59] <wikibugs>	 Operations, Traffic: varnish 5.1.3 frontend child restarted - https://phabricator.wikimedia.org/T185968 (ema) Also worth mentioning that in the specific case of cp4025, the trouble was caused by a sudden [[https://grafana.wikimedia.org/d/000000330/varnish-machine-stats?orgId=1&var-server=cp4025&var-datas...
[10:14:31] <wikibugs>	 (PS1) Giuseppe Lavagetto: profile::services_proxy: allow defining a retry policy [puppet] - https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484)
[10:14:33] <wikibugs>	 (PS1) Giuseppe Lavagetto: eventgate-analytics: allow retries on connection reset [puppet] - https://gerrit.wikimedia.org/r/579904 (https://phabricator.wikimedia.org/T247484)
[10:15:28] <wikibugs>	 (CR) jerkins-bot: [V: -1] profile::services_proxy: allow defining a retry policy [puppet] - https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484) (owner: Giuseppe Lavagetto)
[10:16:35] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:17:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1015', diff saved to https://phabricator.wikimedia.org/P10702 and previous config saved to /var/cache/conftool/dbconfig/20200316-101707-marostegui.json
[10:17:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:18:43] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:19:59] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:20:04] <wikibugs>	 Operations, LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (darthmon_wmde) > ldap entries are predicated on having a wikitech account.  got it, thanks!  > A manager from WMDE should sign off on this here on the task as...
[10:20:53] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:22:05] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:25:03] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:25:54] <wikibugs>	 (PS1) Ema: cache: decrease varnish-frontend malloc cache size [puppet] - https://gerrit.wikimedia.org/r/579906 (https://phabricator.wikimedia.org/T185968)
[10:26:32] <elukey>	 ok ack is not really working for recommendation api
[10:26:50] <elukey>	 downtime only the service probably is best
[10:28:29] <wikibugs>	 Operations, LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (ArielGlenn) >>! In T247731#5971523, @darthmon_wmde wrote: >  >> A manager from WMDE should sign off on this here on the task as well.  > That'd be me as Engin...
[10:28:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1015', diff saved to https://phabricator.wikimedia.org/P10703 and previous config saved to /var/cache/conftool/dbconfig/20200316-102829-marostegui.json
[10:28:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:04] <jouncebot>	 jan_drewniak: I, the Bot under the Fountain, allow thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T1030).
[10:31:28] <_joe_>	 elukey: just a q: did you try to restart the service?
[10:31:36] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] profile::services_proxy: allow defining a retry policy (4 comments) [puppet] - https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484) (owner: Giuseppe Lavagetto)
[10:32:14] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] "typo, otherwise LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/579904 (https://phabricator.wikimedia.org/T247484) (owner: Giuseppe Lavagetto)
[10:32:23] <elukey>	 _joe_ nope, but I can try now
[10:32:45] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:33:11] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:34:22] <wikibugs>	 (PS1) Ema: cache: limit upload transient storage usage [puppet] - https://gerrit.wikimedia.org/r/579907 (https://phabricator.wikimedia.org/T185968)
[10:35:53] <wikibugs>	 (CR) Giuseppe Lavagetto: profile::services_proxy: allow defining a retry policy (4 comments) [puppet] - https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484) (owner: Giuseppe Lavagetto)
[10:36:04] <elukey>	 !log roll restart of recommendation service on scb* as attempt to fix the flapping alerts - T247732
[10:36:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:11] <stashbot>	 T247732: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732
[10:37:52] <wikibugs>	 (PS1) Ladsgroup: Revert "Revert "Set term store to WRITE_BOTH for all of Wikidata"" [mediawiki-config] - https://gerrit.wikimedia.org/r/579908
[10:38:13] <wikibugs>	 (PS2) Ladsgroup: Revert "Revert "Set term store to WRITE_BOTH for all of Wikidata"" [mediawiki-config] - https://gerrit.wikimedia.org/r/579908 (https://phabricator.wikimedia.org/T219123)
[10:38:40] <Amir1>	 I'm deploying this quickly
[10:38:41] <Amir1>	 ^
[10:39:19] <wikibugs>	 (CR) Ladsgroup: [C: +2] Revert "Revert "Set term store to WRITE_BOTH for all of Wikidata"" [mediawiki-config] - https://gerrit.wikimedia.org/r/579908 (https://phabricator.wikimedia.org/T219123) (owner: Ladsgroup)
[10:40:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1015', diff saved to https://phabricator.wikimedia.org/P10704 and previous config saved to /var/cache/conftool/dbconfig/20200316-104002-marostegui.json
[10:40:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:39] <wikibugs>	 (Merged) jenkins-bot: Revert "Revert "Set term store to WRITE_BOTH for all of Wikidata"" [mediawiki-config] - https://gerrit.wikimedia.org/r/579908 (https://phabricator.wikimedia.org/T219123) (owner: Ladsgroup)
[10:40:41] <Amir1>	 marostegui: ^ FYI
[10:41:22] <marostegui>	 Amir1: ok
[10:43:26] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: "Set term store to WRITE_BOTH for all of Wikidata" (T219123) (duration: 01m 13s)
[10:43:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:31] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[10:43:54] <wikibugs>	 (PS2) Jbond: data_admin: delete this file it seems to be unused [puppet] - https://gerrit.wikimedia.org/r/579591 (https://phabricator.wikimedia.org/T247364)
[10:44:51] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:45:09] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: "Set term store to WRITE_BOTH for all of Wikidata" (T219123), take II (duration: 01m 07s)
[10:45:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:35] <jan_drewniak>	 hey here, sorry posted this in the wrong channel. I won't be doing the portal deploy today, there are a few bugs I need to fix in the build...
[10:46:57] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 40
[10:46:57] <icinga-wm>	 ) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:47:11] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:47:11] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:47:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool es1015', diff saved to https://phabricator.wikimedia.org/P10705 and previous config saved to /var/cache/conftool/dbconfig/20200316-104723-marostegui.json
[10:47:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:41] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected status 503 (expecting: 200): /{domain}/v1/description/addition/{target} (Description addition suggestions) is CRITICAL: Test Description addition suggestions returned the unexpected status 503 (ex
[10:47:41] <icinga-wm>	 ps://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:48:06] <Reedy>	 That's not a great line split...
[10:49:42] <elukey>	 yeah
[10:51:03] <Reedy>	 There's a few crappy ones in there
[10:51:06] * Reedy dumps on a bug
[10:51:34] <Reedy>	 T230799
[10:51:34] <stashbot>	 T230799: icinga-wm bot truncating long messages - https://phabricator.wikimedia.org/T230799
[10:52:37] <Amir1>	 marostegui: it's time to bother you more, which replicas I need to warm before I move to reading? on s8 I mean
[10:52:39] <wikibugs>	 Operations, Icinga, observability: icinga-wm bot truncating long messages - https://phabricator.wikimedia.org/T230799 (Reedy) Relatedly... There's some that spill over to multiple lines, and lose letters  `lang=irc [10:46:57] <icinga-wm> PROBLEM - recommendation_api endpoints health on scb2001 is CRI...
[10:52:45] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:53:12] <elukey>	 I am downtiming those as they come --^
[10:53:46] <marostegui>	 Amir1: checking, db1126 and db1111 for sure
[10:53:49] <marostegui>	 let me double check
[10:54:37] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:55:02] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Request for new mailing list - https://phabricator.wikimedia.org/T247737 (Lantus)
[10:55:08] <marostegui>	 Amir1: db1111, db1126 db1104 and ideally db1092 if possible too
[10:55:12] <marostegui>	 just to be on the safe side
[10:55:44] <Amir1>	 !log warming up db1026 for up to Q35M for the new term store (T219123)
[10:55:48] <Amir1>	 cool
[10:55:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:55:49] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[10:55:54] <Amir1>	 marostegui: I keep it in mind
[10:55:58] <Amir1>	 Thanks
[10:56:55] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[10:57:12] <jynus>	 db1026?
[10:57:17] <jynus>	 does that exist?
[10:57:50] <Amir1>	 sorry, db1126
[10:57:52] <jynus>	 it was decommed at T174763
[10:57:54] <stashbot>	 T174763: Decommission db1026 - https://phabricator.wikimedia.org/T174763
[10:57:58] <jynus>	 oh, I see
[10:57:59] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Request for new mailing list - https://phabricator.wikimedia.org/T247737 (Reedy) a:Lantus→None
[10:58:06] <jynus>	 np
[10:59:19] <wikibugs>	 (PS2) Giuseppe Lavagetto: profile::services_proxy: allow defining a retry policy [puppet] - https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484)
[10:59:21] <wikibugs>	 (PS2) Giuseppe Lavagetto: eventgate-analytics: allow retries on connection reset [puppet] - https://gerrit.wikimedia.org/r/579904 (https://phabricator.wikimedia.org/T247484)
[10:59:36] <wikibugs>	 (CR) Jbond: [C: +2] data_admin: delete this file it seems to be unused [puppet] - https://gerrit.wikimedia.org/r/579591 (https://phabricator.wikimedia.org/T247364) (owner: Jbond)
[10:59:47] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (dbprov2001), No backups: 2 (dbprov2001, ...), Fresh: 96 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[10:59:57] <jynus>	 ^I will ack
[11:00:00] <jynus>	 it is expected
[11:00:04] <jouncebot>	 Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T1100).
[11:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[11:00:10] <jynus>	 will fix itself once a new run happens
[11:01:02] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Make update-special-pages handle dblist comments [puppet] - 10https://gerrit.wikimedia.org/r/579876 (https://phabricator.wikimedia.org/T247716) (owner: 10Reedy)
[11:01:24] <icinga-wm>	 ACKNOWLEDGEMENT - Backup freshness on backup1001 is CRITICAL: All failures: 1 (dbprov2001), No backups: 2 (dbprov2001, ...), Fresh: 96 jobs Jcrespo running backups under new name https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[11:04:04] <Amir1>	 !log Warming up InnoDB buffer pool cache in db1111, db1126, db1104, db1092 (T219123)
[11:04:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:10] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[11:04:38] <Amir1>	 !log ... for Q30M-Q35M of the new term store
[11:04:41] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:46] <wikibugs>	 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (10ArielGlenn) I have verified via google hangout using his wikimedia email account (and checking his image against the office picture :-)) that it's really JPita asking for access.
[11:06:51] <wikibugs>	 10Operations, 10ops-eqiad, 10Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (10elukey)
[11:07:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] profile::services_proxy: allow defining a retry policy [puppet] - 10https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto)
[11:07:52] <wikibugs>	 (03PS1) 10Ladsgroup: Set up read new term store up to Q35M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579913 (https://phabricator.wikimedia.org/T219123)
[11:08:06] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] eventgate-analytics: allow retries on connection reset [puppet] - 10https://gerrit.wikimedia.org/r/579904 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto)
[11:09:21] <wikibugs>	 (03CR) 10Jbond: pick_nodes: add ability to pick nodes based on a puppet class (031 comment) [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/579579 (https://phabricator.wikimedia.org/T245288) (owner: 10Jbond)
[11:13:24] <wikibugs>	 (03PS1) 10Ayounsi: Prometheus Squid exporter, specify proxy port [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176)
[11:21:44] <wikibugs>	 (03PS2) 10Ayounsi: Prometheus Squid exporter, specify proxy port [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176)
[11:21:47] <wikibugs>	 (03CR) 10Alexandros Kosiaris: "> Having said that, I do not think we should rush and merge this right away - changeprop chart wasn't deployed yet and new followups are b" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust)
[11:22:42] <wikibugs>	 (03PS3) 10Ayounsi: Prometheus Squid exporter, specify proxy port [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176)
[11:22:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/21442/mw1331.eqiad.wmnet/ this is in all effects a noop." [puppet] - 10https://gerrit.wikimedia.org/r/579903 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto)
[11:25:02] <wikibugs>	 (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1003/21446/install1003.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi)
[11:26:21] <wikibugs>	 10Operations, 10Analytics, 10DC-Ops, 10netops: kafka-jumbo1006 and stat1005 network issues - https://phabricator.wikimedia.org/T247561 (10elukey) I had a chat with Arzhel today and we didn't find a lot. From his perspective, it seems that something in the middle between the switch and stat1005 is not worki...
[11:32:37] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] eventgate-analytics: allow retries on connection reset [puppet] - 10https://gerrit.wikimedia.org/r/579904 (https://phabricator.wikimedia.org/T247484) (owner: 10Giuseppe Lavagetto)
[11:38:19] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] cache: decrease varnish-frontend malloc cache size [puppet] - 10https://gerrit.wikimedia.org/r/579906 (https://phabricator.wikimedia.org/T185968) (owner: 10Ema)
[11:38:40] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] cache: limit upload transient storage usage [puppet] - 10https://gerrit.wikimedia.org/r/579907 (https://phabricator.wikimedia.org/T185968) (owner: 10Ema)
[11:40:27] <wikibugs>	 10Operations, 10Wikimedia-Mailing-lists: Request for new mailing list Deutschschweiz - https://phabricator.wikimedia.org/T247737 (10Aklapper)
[11:52:22] <XioNoX>	 !log manually fix prometheus squid exporter on install1003
[11:52:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:54:04] <wikibugs>	 (03CR) 10Jbond: Prometheus Squid exporter, specify proxy port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi)
[11:57:41] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:59:08] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Set up read new term store up to Q35M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579913 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup)
[12:00:00] <wikibugs>	 (03Merged) 10jenkins-bot: Set up read new term store up to Q35M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579913 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup)
[12:00:19] <wikibugs>	 (03PS1) 10Jbond: New release 1.0.4 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/579918
[12:05:32] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579913|Set up read new term store up to Q35M (T219123)]] (duration: 01m 08s)
[12:05:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:05:37] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[12:09:39] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579913|Set up read new term store up to Q35M (T219123)]], take II (duration: 01m 07s)
[12:09:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:53] <Amir1>	 !log warming up cache for Q35M to Q40M for new term store on db1111, db1126, db1104, db1092 (T219123)
[12:10:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:45] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:15:37] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/578609 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK)
[12:17:47] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:20:17] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:20:17] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10darthmon_wmde)
[12:21:42] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10darthmon_wmde) 2 out of the 3 already have wikitech accounts. Could we proceed with these two?
[12:22:14] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "+1 for the openstack part. Please collect a +1 from someone related to the logstash/kafka thing." [puppet] - 10https://gerrit.wikimedia.org/r/566323 (https://phabricator.wikimedia.org/T201491) (owner: 10DannyS712)
[12:22:54] <wikibugs>	 10Operations: Offboard Sandra Müllrick from WMF systems - https://phabricator.wikimedia.org/T247750 (10WMDE-leszek)
[12:23:13] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Offboard Sandra Müllrick from WMF systems - https://phabricator.wikimedia.org/T247750 (10WMDE-leszek)
[12:24:38] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM. Please collect +1/+2 from somebody more related with this redis module." [puppet] - 10https://gerrit.wikimedia.org/r/578611 (https://phabricator.wikimedia.org/T201491) (owner: 10QEDK)
[12:26:38] <wikibugs>	 10Operations, 10CommRel-Specialists-Support, 10Core Platform Team, 10Editing-team, and 10 others: RFC: Serve Main Page of Wikimedia wikis from a consistent URL - https://phabricator.wikimedia.org/T120085 (10Esanders) >>! In T120085#5545232, @Krinkle wrote: > So the question is whether it would be a problem...
[12:27:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:31:27] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Offboard Sandra Müllrick from WMF systems - https://phabricator.wikimedia.org/T247750 (10Aklapper)
[12:31:42] <wikibugs>	 (03PS2) 10Jbond: New release 1.0.4 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/579918
[12:33:13] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:34:19] <wikibugs>	 (03PS3) 10Jbond: New release 1.0.5 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/579918
[12:35:39] <wikibugs>	 (03PS4) 10Jbond: New release 1.0.4 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/579918
[12:35:45] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:35:50] <wikibugs>	 (03PS5) 10Jbond: New release 1.0.5 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/579918
[12:36:18] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10Volans) ganeti1009 is set as Staged in Netbox and missing in PuppetDB, so it's reported by the Netbox report. What should be the correct state for now?
[12:36:21] <wikibugs>	 (03CR) 10Holger Knust: "Thanks, we were targeting end of this week to get both charts in a "potentially deployable" state, meaning tweaks aside the bulk of the wo" [deployment-charts] - 10https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: 10Holger Knust)
[12:36:40] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] New release 1.0.5 [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/579918 (owner: 10Jbond)
[12:37:49] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:42:49] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:43:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool es1015', diff saved to https://phabricator.wikimedia.org/P10706 and previous config saved to /var/cache/conftool/dbconfig/20200316-124309-marostegui.json
[12:43:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:50:17] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:53:19] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[12:58:05] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10ArielGlenn) >>! In T247731#5971941, @darthmon_wmde wrote: > 2 out of the 3 already have wikitech accounts. Could we proceed with these two?  Sure. We should m...
[12:58:21] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:01:34] <wikibugs>	 (03PS4) 10Ayounsi: Prometheus Squid exporter, specify proxy port [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176)
[13:03:44] <wikibugs>	 (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/compiler1002/21447/install1003.wikimedia.org/" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi)
[13:07:01] <wikibugs>	 (03PS1) 10Jbond: Gemfile: update puppet-lint-wmf_styleguide-check [puppet] - 10https://gerrit.wikimedia.org/r/579923
[13:08:05] <wikibugs>	 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Pita - https://phabricator.wikimedia.org/T247722 (10ArielGlenn) I see you already in the group:    ariel@mwmaint1002:~$ ldapsearch -x cn=wmf | grep  josepita   member: uid=josepita,ou=people,dc=wikimedia,dc=org
[13:08:43] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 102 probes of 542 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:09:33] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Gemfile: update puppet-lint-wmf_styleguide-check [puppet] - 10https://gerrit.wikimedia.org/r/579923 (owner: 10Jbond)
[13:13:13] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 45 probes of 623 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:13:29] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) is CRITICAL: Test article.creation.morelike - good article title returned the unexpected status 404 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:14:05] <elukey>	 downtiming --^
[13:15:27] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:15:57] <icinga-wm>	 RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:17:17] <wikibugs>	 (03PS2) 10Jbond: Gemfile: update puppet-lint-wmf_styleguide-check [puppet] - 10https://gerrit.wikimedia.org/r/579923
[13:17:25] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10elukey) @Cmjohnson there seems to be a problem with the host's serial: https://netbox.wikimedia.org/extras/reports/puppetdb.PhysicalHosts/
[13:17:34] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10elukey) 05Resolved→03Open
[13:19:36] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Gemfile: update puppet-lint-wmf_styleguide-check [puppet] - 10https://gerrit.wikimedia.org/r/579923 (owner: 10Jbond)
[13:19:40] <wikibugs>	 10Operations, 10netops, 10Patch-For-Review, 10User-Elukey: can aggregated netflow data include the router it was sampled from? - https://phabricator.wikimedia.org/T246186 (10ayounsi) >>! In T246186#5960144, @elukey wrote: > If the cardinality of the three new dimensions are not too big we could definitely...
[13:25:31] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:26:06] <wikibugs>	 (03PS1) 10Ladsgroup: Set up read new term store up to Q40M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579925 (https://phabricator.wikimedia.org/T219123)
[13:28:52] <wikibugs>	 (03PS3) 10Jbond: Gemfile: update puppet-lint-wmf_styleguide-check [puppet] - 10https://gerrit.wikimedia.org/r/579923
[13:30:12] <hashar>	 jbond42: hi! do you need the CI container bumped/updated?
[13:30:22] <gehel>	 !log depooling wdqs1005 to catch up on lag
[13:30:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:31] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:30:57] <jbond42>	 hashar: i have https://gerrit.wikimedia.org/r/#/c/integration/config/+/579924 but just making sure i have all the ruby dependencies in order first, will ping when the gemfile changes are merged, thanks
[13:32:15] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Gemfile: update puppet-lint-wmf_styleguide-check [puppet] - 10https://gerrit.wikimedia.org/r/579923 (owner: 10Jbond)
[13:32:34] <hashar>	 jbond42: cool :]
[13:32:45] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10darthmon_wmde)
[13:33:03] <jbond42>	 hashar: just merged so if you could merge https://gerrit.wikimedia.org/r/#/c/integration/config/+/579924 that would be great :)
[13:33:23] <wikibugs>	 10Operations, 10LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (10darthmon_wmde) Awesome, thanks @ArielGlenn !  I just added the third user name =)
[13:33:33] <hashar>	 doing so!
[13:33:44] <jbond42>	 thanks <3
[13:34:50] <wikibugs>	 10Operations, 10serviceops, 10Patch-For-Review: rack/setup/install ganeti10([09]|1[0-8]).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (10akosiaris) >>! In T228924#5971978, @Volans wrote: > ganeti1009 is set as Staged in Netbox and missing in PuppetDB, so it's reported by the Netbox report. > What sh...
[13:36:22] <hashar>	 the container is building, I will bump the jobs
[13:37:17] <jbond42>	 great thanks
[13:40:29] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:41:04] <icinga-wm>	 ACKNOWLEDGEMENT - WDQS high update lag on wdqs1005 is CRITICAL: 4.676e+04 ge 4.32e+04 Gehel currently depooled to catch up on lag https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[13:42:21] <gehel>	 !log restarting blazegraph on wdqs1007
[13:42:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:43:43] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617)
[13:44:10] <wikibugs>	 10Operations: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (10MoritzMuehlenhoff) I'll revert 576921 (that was a leftover of testing), but with the service ID pointing to 443 (and CASRootProxiedAs set to https://cas-logstash.wikimedia.org (as Envoy only goes one way and other it would report...
[13:45:09] <icinga-wm>	 ACKNOWLEDGEMENT - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad Ayounsi https://phabricator.wikimedia.org/T245176#5972066 https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:47:06] <wikibugs>	 (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) (owner: 10Jbond)
[13:48:43] <hashar>	 jbond42: the container has failed. Tox could not reach some network resource through the proxy. I am trying again
[13:49:48] <jbond42>	 hashar: ack thanks
[13:49:50] <ema>	 !log upload atskafka 0.1 to buster-wikimedia T237993
[13:49:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:49:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Revert service ID for Logstash [puppet] - 10https://gerrit.wikimedia.org/r/579954 (https://phabricator.wikimedia.org/T246998)
[13:49:55] <stashbot>	 T237993: Create replacement for Varnishkafka - https://phabricator.wikimedia.org/T237993
[13:51:04] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 623 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[13:56:38] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM and thanks for adding types/lookup to the other parameter <3" [puppet] - 10https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) (owner: 10Ayounsi)
[13:56:59] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - 10https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617)
[13:59:18] <icinga-wm>	 PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 623 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:01:13] <hashar>	 Cannot connect to proxy.', timeout('timed out
[14:01:14] <hashar>	 bah
[14:01:26] <hashar>	 jbond42: I am not sure what is going on. Gotta dig into it
[14:02:28] <icinga-wm>	 PROBLEM - Check systemd state on stat1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:02:33] <jbond42>	 hashar: i know that mutante and moritzm added a new web proxy server last week i wonder if that could be the cause of the problems
[14:04:09] <moritzm>	 indeed, webproxy in prod now uses install1003.wikimedia.org, where did it fail? somewhere in labs?
[14:04:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:04:48] <godog>	 this alert is known btw ^ being worked on
[14:05:06] <moritzm>	 !log installing libxslt security updates
[14:05:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:05:44] <hashar>	 on contint1001 inside a docker container. I am not even sure which url it uses as a proxy ;)
[14:06:20] <hashar>	 http://webproxy.eqiad.wmnet:8080
[14:06:34] <wikibugs>	 10Operations, 10ops-eqiad: mw1373 power supply redundancy ipmi alert - https://phabricator.wikimedia.org/T247755 (10fgiunchedi)
[14:06:45] <hashar>	 which indeed seems to point to install1003
[14:06:57] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on mw1373 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] Filippo Giunchedi https://phabricator.wikimedia.org/T247755 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:08:18] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 36 probes of 542 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:10:48] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Set up read new term store up to Q40M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579925 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup)
[14:10:50] <wikibugs>	 (03PS1) 10Ema: ATS: add tls and backend log config for cloud [puppet] - 10https://gerrit.wikimedia.org/r/579955
[14:11:46] <wikibugs>	 (03Merged) 10jenkins-bot: Set up read new term store up to Q40M [mediawiki-config] - 10https://gerrit.wikimedia.org/r/579925 (https://phabricator.wikimedia.org/T219123) (owner: 10Ladsgroup)
[14:15:02] <Amir1>	 !log ladsgroup@mwmaint1002:~$ mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=wikidatawiki --from-id 87500000 --to-id 87767570 --batch-size=10 --sleep=5 (T219123)
[14:15:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:15:07] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[14:15:25] <wikibugs>	 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) While there are many features that I would like to improve (like adding some state management for new and removed jobs, total size monitoring,...
[14:16:51] <moritzm>	 !log rolling restart of FPM on mw1261-mw1265 to pick up libxslt security updates
[14:16:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:55] <hashar>	 jbond42: moritzm: so it seems the Docker container resolves webproxy.eqiad.wmnet to install1003 and the connections time out :/
[14:18:54] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q40M (T219123)]] (duration: 01m 07s)
[14:18:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:08] <moritzm>	 hashar: that seems to be something with the Docker container setup, I see plenty of successful accesses from contint1001 (for pypi and other URLs) on install1003
[14:22:13] <Amir1>	 !log warming up cache for Q40M to Q50M for new term store on db1111, db1126, db1104, db1092 (T219123)
[14:22:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:18] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[14:22:19] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q40M (T219123)]], take II (duration: 01m 06s)
[14:22:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:22:30] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:23:58] <icinga-wm>	 RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 28 probes of 623 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:26:00] <icinga-wm>	 PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 56 probes of 542 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:27:32] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={squid,swagger_check_cxserver_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:27:42] <wikibugs>	 10Operations, 10Thumbor, 10Wikimedia-Logstash, 10observability, and 2 others: Stream Thumbor logs to logstash - https://phabricator.wikimedia.org/T212946 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi This is complete (i.e. {T242609}), resolving. Feel free to reopen though!
[14:29:18] <wikibugs>	 (03PS6) 10Ema: atskafka: add puppet module [puppet] - 10https://gerrit.wikimedia.org/r/579247 (https://phabricator.wikimedia.org/T247497)
[14:29:20] <wikibugs>	 (03PS4) 10Ema: cache: add atskafka webrequest test instance [puppet] - 10https://gerrit.wikimedia.org/r/579513 (https://phabricator.wikimedia.org/T247497)
[14:29:22] <wikibugs>	 (03PS4) 10Ema: cache: test atskafka webrequest on cp3050 [puppet] - 10https://gerrit.wikimedia.org/r/579514 (https://phabricator.wikimedia.org/T247497)
[14:30:02] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:32:49] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] nova policy.json: remove all obsolete v2 policy rules [puppet] - 10https://gerrit.wikimedia.org/r/579634 (https://phabricator.wikimedia.org/T247573) (owner: 10Andrew Bogott)
[14:32:57] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: (Need by: ASAP) rack/setup/install stat1008 - https://phabricator.wikimedia.org/T246472 (10Cmjohnson) 05Open→03Resolved Thanks, @elukey  fixed the issue in netbox
[14:34:34] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] cache: add atskafka webrequest test instance [puppet] - 10https://gerrit.wikimedia.org/r/579513 (https://phabricator.wikimedia.org/T247497) (owner: 10Ema)
[14:35:08] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={icinga,squid,swagger_check_cxserver_cluster_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:37:12] <icinga-wm>	 RECOVERY - Check systemd state on stat1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:37:27] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Add apertium-recursive package [debs/contenttranslation/apertium-recursive] - https://gerrit.wikimedia.org/r/578704 (https://phabricator.wikimedia.org/T234181) (owner: KartikMistry)
[14:38:24] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] Add apertium-anaphora package [debs/contenttranslation/apertium-anaphora] - https://gerrit.wikimedia.org/r/578705 (https://phabricator.wikimedia.org/T234181) (owner: KartikMistry)
[14:39:18] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +2] mediawiki: Change php-wmerrors channel from "fatal" to as "exception" [puppet] - https://gerrit.wikimedia.org/r/577645 (https://phabricator.wikimedia.org/T247113) (owner: Krinkle)
[14:40:21] <wikibugs>	 (CR) Ayounsi: [C: +2] Prometheus Squid exporter, specify proxy port [puppet] - https://gerrit.wikimedia.org/r/579915 (https://phabricator.wikimedia.org/T245176) (owner: Ayounsi)
[14:40:42] <wikibugs>	 (PS4) Andrew Bogott: nova policy.json: replace 'admin_or_member' with 'admin_or_owner' [puppet] - https://gerrit.wikimedia.org/r/579635 (https://phabricator.wikimedia.org/T247573)
[14:42:25] <wikibugs>	 Operations, Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (hashar) When building a docker container on contint1001.wikimedia.org with docker-pkg, pip gets proxy timeout error when using `http://webproxy.eqiad.wmnet:8080`.  I have manually switched to the...
[14:42:52] <hashar>	 moritzm:   webproxy.codfw.wmnet works fine though!  I have commented about it on the task
[14:42:59] <hashar>	 jbond42: container build, I am updating the jobs
[14:43:06] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: -1] Add support for redirecting to toolforge.org (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: BryanDavis)
[14:43:28] <godog>	 XioNoX: ^ suspicious timeouts reaching the proxy from the outside too, like the exporter is experiencing
[14:43:45] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova policy.json: replace 'admin_or_member' with 'admin_or_owner' [puppet] - https://gerrit.wikimedia.org/r/579635 (https://phabricator.wikimedia.org/T247573) (owner: Andrew Bogott)
[14:44:19] <XioNoX>	 :/
[14:44:38] <hashar>	 webproxy.codfw.wmnet worked for me
[14:45:14] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:47:02] <wikibugs>	 (CR) Marostegui: "The only thing I have in mind at the moment is...given that this script is very specific for es (ie: will only work with single PKs, which" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/577224 (https://phabricator.wikimedia.org/T244884) (owner: Jcrespo)
[14:47:46] <wikibugs>	 Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (Jgreen) Machine is built and has accounts for fr-analytics.
[14:48:06] <jbond42>	 hashar: ack thanks :)
[14:48:24] <wikibugs>	 (CR) Jbond: "recheck" [puppet] - https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) (owner: Jbond)
[14:49:38] <wikibugs>	 Operations: eqiad squid performances issue - https://phabricator.wikimedia.org/T247759 (ayounsi) p:Triage→High
[14:49:54] <XioNoX>	 godog: https://phabricator.wikimedia.org/T247759
[14:50:08] <XioNoX>	 nobody owns squid though
[14:50:18] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:51:47] <godog>	 XioNoX: thanks! yeah in my mind it is foundations, but I don't want to voluntell anyone heh
[14:51:56] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki::jobrunner: raise max_execution_time [puppet] - https://gerrit.wikimedia.org/r/579961 (https://phabricator.wikimedia.org/T247622)
[14:52:12] <icinga-wm>	 RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 39 probes of 542 (alerts on 50) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[14:52:49] <wikibugs>	 (CR) Jbond: "recheck" [puppet] - https://gerrit.wikimedia.org/r/559536 (https://phabricator.wikimedia.org/T241160) (owner: Jbond)
[14:53:09] <XioNoX>	 godog: I bolded it in today's meeting so at least people can be aware of it
[14:53:43] <godog>	 +1 thank you
[14:55:12] <hashar>	 maybe one can pull out install1003 from webproxy dns entry?
[14:55:16] <hashar>	 or maybe it is just overloaded
[14:57:56] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[14:58:17] <XioNoX>	 godog: I'm wondering if pulling it from prometheus doesn't make things worse?
[14:59:22] <godog>	 XioNoX: possible for sure! should be easy to quickly test by pulling the prometheus ferm rules from install1003
[14:59:56] <wikibugs>	 (PS4) Andrew Bogott: nova policy.json: sort all policy rules [puppet] - https://gerrit.wikimedia.org/r/579636 (https://phabricator.wikimedia.org/T247573)
[14:59:58] <wikibugs>	 (PS2) Guozr.im: RemoteExecution: Fix typo in class CommandReturn [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882
[15:01:05] <wikibugs>	 (CR) Guozr.im: "> Patch Set 1:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 (owner: Guozr.im)
[15:01:36] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:02:35] <moritzm>	 !log rolling restart of FPM/apache on netmon* to pick up libxslt security updates
[15:02:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:21] <akosiaris>	 !log T234181 upload apertium-anaphora_0.0.4-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main
[15:04:21] <akosiaris>	 !log T234181 upload apertium-recursive_0.0.1-1+wmf1 to apt.wikimedia.org jessie-wikimedia/main
[15:04:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:30] <stashbot>	 T234181: Package apertium-anaphora and apertium-recursive - https://phabricator.wikimedia.org/T234181
[15:04:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:04:58] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova policy.json: sort all policy rules [puppet] - https://gerrit.wikimedia.org/r/579636 (https://phabricator.wikimedia.org/T247573) (owner: Andrew Bogott)
[15:05:18] <wikibugs>	 (PS1) KartikMistry: apertium-fr-es: Fix FTBFS with apertium 3.6 [debs/contenttranslation/apertium-fr-es] - https://gerrit.wikimedia.org/r/580053 (https://phabricator.wikimedia.org/T247585)
[15:05:18] <wikibugs>	 Operations, Patch-For-Review: Upgrade install servers to Buster - https://phabricator.wikimedia.org/T224576 (ayounsi) I opened T247759 to track this issue.
[15:05:30] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:06:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:08:17] <wikibugs>	 (CR) Gergő Tisza: [C: +1] Add public replica view for oauth_registered_consumer [puppet] - https://gerrit.wikimedia.org/r/579800 (owner: Alex Monk)
[15:10:00] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/21448/ the change does the right thing. Merging." [puppet] - https://gerrit.wikimedia.org/r/579961 (https://phabricator.wikimedia.org/T247622) (owner: Giuseppe Lavagetto)
[15:10:09] <wikibugs>	 (CR) Ayounsi: [C: +1] "LGTM, please sync up so I can update the router's forwarders." [puppet] - https://gerrit.wikimedia.org/r/569684 (https://phabricator.wikimedia.org/T224576) (owner: Dzahn)
[15:13:06] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:15:55] <wikibugs>	 Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (Jgreen)
[15:16:40] <elukey>	  /query ema 
[15:16:43] <elukey>	 uff
[15:16:59] <ema>	 elukey: <3
[15:17:00] <wikibugs>	 (CR) Jbond: [C: +1] Revert service ID for Logstash [puppet] - https://gerrit.wikimedia.org/r/579954 (https://phabricator.wikimedia.org/T246998) (owner: Muehlenhoff)
[15:17:14] <vgutierrez>	 elukey: italian mafia leak detected
[15:17:26] * vgutierrez hides
[15:17:26] <wikibugs>	 (CR) Jcrespo: [C: +1] RemoteExecution: Fix typo in class CommandReturn [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 (owner: Guozr.im)
[15:17:30] <elukey>	 vgutierrez: you got me
[15:17:38] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Revert service ID for Logstash [puppet] - https://gerrit.wikimedia.org/r/579954 (https://phabricator.wikimedia.org/T246998) (owner: Muehlenhoff)
[15:19:55] <wikibugs>	 (CR) Jcrespo: [C: +2] RemoteExecution: Fix typo in class CommandReturn [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 (owner: Guozr.im)
[15:19:57] <wikibugs>	 (CR) Marostegui: [C: +1] RemoteExecution: Fix typo in class CommandReturn [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 (owner: Guozr.im)
[15:19:59] <wikibugs>	 (CR) Jbond: [C: +2] "lgtm thanks will merge" [puppet] - https://gerrit.wikimedia.org/r/578609 (https://phabricator.wikimedia.org/T201491) (owner: QEDK)
[15:20:42] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:20:49] <wikibugs>	 (CR) Jbond: [C: +2] "LGTM thanks will merge" [puppet] - https://gerrit.wikimedia.org/r/578611 (https://phabricator.wikimedia.org/T201491) (owner: QEDK)
[15:22:11] <wikibugs>	 (CR) Jbond: [C: +2] "lgtm, thanks will merge" [puppet] - https://gerrit.wikimedia.org/r/578575 (https://phabricator.wikimedia.org/T201491) (owner: QEDK)
[15:22:42] <wikibugs>	 (PS4) Jcrespo: mariadb-backups: Update RemoteCommandExecution to the latest version [puppet] - https://gerrit.wikimedia.org/r/579891
[15:23:02] <wikibugs>	 (PS4) CRusnov: puppetdb uservice: Add individual host queries, expand for interface automation [puppet] - https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153)
[15:23:47] <wikibugs>	 (CR) KartikMistry: "recheck" [debs/contenttranslation/apertium-en-ca] - https://gerrit.wikimedia.org/r/578707 (https://phabricator.wikimedia.org/T233700) (owner: KartikMistry)
[15:25:17] <wikibugs>	 (CR) Jcrespo: "I hope it was clear that my purpose here was not to annoy you with red tape 0:-D" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 (owner: Guozr.im)
[15:27:34] <wikibugs>	 (PS5) CRusnov: puppetdb uservice: Add individual host queries, expand for interface automation [puppet] - https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153)
[15:30:46] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:32:58] <wikibugs>	 (CR) Jcrespo: [C: +2] mariadb-backups: Update RemoteCommandExecution to the latest version [puppet] - https://gerrit.wikimedia.org/r/579891 (owner: Jcrespo)
[15:34:26] * Krinkle testing on mwdebug1002
[15:34:57] <wikibugs>	 Operations, SDC General, Structured Data Engineering, Structured-Data-Backlog, and 2 others: Create CQS puppet configs by applying query_service module - https://phabricator.wikimedia.org/T237089 (Sarahmarie1981)
[15:35:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={icinga,squid} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:35:58] <wikibugs>	 (PS2) Krinkle: wgConf: Remove unused 'fullLoadCallback' property assignment [mediawiki-config] - https://gerrit.wikimedia.org/r/579643 (https://phabricator.wikimedia.org/T169821)
[15:36:03] <wikibugs>	 Operations, Commons, MediaWiki-File-management, Multimedia, and 7 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (Sarahmarie1981)
[15:36:09] <wikibugs>	 (CR) Krinkle: [C: +2] wgConf: Remove unused 'fullLoadCallback' property assignment [mediawiki-config] - https://gerrit.wikimedia.org/r/579643 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[15:37:02] <wikibugs>	 (Merged) jenkins-bot: wgConf: Remove unused 'fullLoadCallback' property assignment [mediawiki-config] - https://gerrit.wikimedia.org/r/579643 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[15:38:03] <wikibugs>	 Operations, Patch-For-Review: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (MoritzMuehlenhoff) >>! In T246998#5966192, @colewhite wrote: > Can the idp redirect to https?  What happens when this is configured?  The server-side IDP and Apache config has been adapted, if anyone wants to...
[15:39:06] <wikibugs>	 (PS4) KartikMistry: apertium-en-ca: Update to new upstream 1.0.1 [debs/contenttranslation/apertium-en-ca] - https://gerrit.wikimedia.org/r/578707 (https://phabricator.wikimedia.org/T233700)
[15:39:43] <wikibugs>	 (CR) CRusnov: "Thanks!" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov)
[15:42:52] <wikibugs>	 (CR) Guozr.im: "> Patch Set 2:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/579882 (owner: Guozr.im)
[15:46:40] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (Get media in test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[15:49:04] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[15:49:28] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:53:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:53:50] <wikibugs>	 (PS1) Muehlenhoff: Add new insetup roles to Cumin alias [puppet] - https://gerrit.wikimedia.org/r/580074
[15:54:26] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:54:50] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/580074 (owner: Muehlenhoff)
[15:55:08] <wikibugs>	 (CR) Dzahn: "isn't it "insetup_noferm" ?" [puppet] - https://gerrit.wikimedia.org/r/580074 (owner: Muehlenhoff)
[15:55:50] <wikibugs>	 (CR) Jcrespo: "> Hi Jcrespo," [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im)
[15:56:06] <moritzm>	 mutante: ack, amended in PS2
[15:56:09] <wikibugs>	 (PS2) Muehlenhoff: Add new insetup roles to Cumin alias [puppet] - https://gerrit.wikimedia.org/r/580074
[15:56:23] <wikibugs>	 Operations, Traffic, observability, User-fgiunchedi: Per-backend ATS Prometheus metrics - https://phabricator.wikimedia.org/T227668 (ema) Open→Resolved a:ema Metrics added a while ago, closing!
[15:56:26] <wikibugs>	 (CR) Dzahn: [C: +1] Add new insetup roles to Cumin alias [puppet] - https://gerrit.wikimedia.org/r/580074 (owner: Muehlenhoff)
[15:56:27] <mutante>	 moritzm: yep, +1
[15:58:18] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[15:58:30] <wikibugs>	 (CR) Vgutierrez: [C: +1] ATS: add tls and backend log config for cloud [puppet] - https://gerrit.wikimedia.org/r/579955 (owner: Ema)
[15:59:04] <wikibugs>	 (CR) Ema: [C: +2] ATS: add tls and backend log config for cloud [puppet] - https://gerrit.wikimedia.org/r/579955 (owner: Ema)
[15:59:09] <wikibugs>	 (CR) QEDK: [C: +1] Fix typos (boostrap -> bootstrap) [puppet] - https://gerrit.wikimedia.org/r/566323 (https://phabricator.wikimedia.org/T201491) (owner: DannyS712)
[16:02:12] <wikibugs>	 (CR) Jbond: [C: +2] "LGTM will merge thanks :)" [puppet] - https://gerrit.wikimedia.org/r/566323 (https://phabricator.wikimedia.org/T201491) (owner: DannyS712)
[16:03:24] <wikibugs>	 (CR) Guozr.im: "> Patch Set 2:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/578623 (https://phabricator.wikimedia.org/T218189) (owner: Guozr.im)
[16:06:28] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add new insetup roles to Cumin alias [puppet] - https://gerrit.wikimedia.org/r/580074 (owner: Muehlenhoff)
[16:07:02] <wikibugs>	 (PS1) Krinkle: Revert "wgConf: Remove unused 'fullLoadCallback' property assignment" [mediawiki-config] - https://gerrit.wikimedia.org/r/580078
[16:07:21] <wikibugs>	 (CR) Krinkle: [C: +2] "Will reconsider later in the stack" [mediawiki-config] - https://gerrit.wikimedia.org/r/580078 (owner: Krinkle)
[16:08:16] <wikibugs>	 (Merged) jenkins-bot: Revert "wgConf: Remove unused 'fullLoadCallback' property assignment" [mediawiki-config] - https://gerrit.wikimedia.org/r/580078 (owner: Krinkle)
[16:08:18] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:09:15] <wikibugs>	 (PS2) Krinkle: wgConf: Assign $wgLocalDatabases normally instead of by-ref [mediawiki-config] - https://gerrit.wikimedia.org/r/579645 (https://phabricator.wikimedia.org/T169821)
[16:09:30] <wikibugs>	 (CR) Jforrester: [C: +1] Add wikidata.beta.wmflabs.org + prod domains to beta csp [mediawiki-config] - https://gerrit.wikimedia.org/r/578183 (owner: Brian Wolff)
[16:09:38] <wikibugs>	 (CR) Jforrester: [C: +1] Make CSP enforce on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/579890 (https://phabricator.wikimedia.org/T244124) (owner: Brian Wolff)
[16:09:42] <wikibugs>	 (CR) Krinkle: [C: +2] wgConf: Assign $wgLocalDatabases normally instead of by-ref [mediawiki-config] - https://gerrit.wikimedia.org/r/579645 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[16:10:40] <wikibugs>	 (Merged) jenkins-bot: wgConf: Assign $wgLocalDatabases normally instead of by-ref [mediawiki-config] - https://gerrit.wikimedia.org/r/579645 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[16:13:58] <wikibugs>	 (PS3) Krinkle: wgConf: Move wgLocalDatabases to CommonSettings.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579648
[16:14:52] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/wgConf.php: Ie9002d9095ee (duration: 01m 08s)
[16:14:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:15:30] <wikibugs>	 (PS3) Krinkle: wgConf: Move wgLocalDatabases to CommonSettings.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579649
[16:15:46] <wikibugs>	 (CR) Krinkle: [C: +2] wgConf: Move wgLocalDatabases to CommonSettings.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579648 (owner: Krinkle)
[16:15:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:16:40] <wikibugs>	 (Merged) jenkins-bot: wgConf: Move wgLocalDatabases to CommonSettings.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579648 (owner: Krinkle)
[16:21:25] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I08af45e2e47 (duration: 01m 07s)
[16:21:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:21:35] <wikibugs>	 (PS5) DannyS712: trwiki: Grant interface editors editprotected & editsemiprotected [mediawiki-config] - https://gerrit.wikimedia.org/r/579772 (https://phabricator.wikimedia.org/T247672)
[16:22:20] <rlazarus>	 !log copied envoyproxy_1.13.1-1 from buster-wikimedia to stretch-wikimedia
[16:22:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:22:29] <wikibugs>	 (PS2) Krinkle: Move $wgConf to CommonSettings.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579651 (https://phabricator.wikimedia.org/T169821)
[16:22:31] <wikibugs>	 (PS2) Krinkle: Move $wgConf to CommonSettings.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579652 (https://phabricator.wikimedia.org/T169821)
[16:22:57] <wikibugs>	 (CR) Krinkle: [C: +2] wgConf: Move wgLocalDatabases to CommonSettings.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579649 (owner: Krinkle)
[16:24:06] <wikibugs>	 (Merged) jenkins-bot: wgConf: Move wgLocalDatabases to CommonSettings.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579649 (owner: Krinkle)
[16:29:43] <wikibugs>	 (CR) Krinkle: [C: +2] Move $wgConf to CommonSettings.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579651 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[16:29:59] <Krinkle>	 James_F: can do a few beta patches in the mean time if you want
[16:30:03] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/wgConf.php: I870122f946d (duration: 01m 07s)
[16:30:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:30:09] <Krinkle>	 I could use a short break after this one
[16:30:39] <wikibugs>	 (Merged) jenkins-bot: Move $wgConf to CommonSettings.php (1/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579651 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[16:33:07] <logmsgbot>	 !log krinkle@deploy1001 Synchronized multiversion/MWConfigCacheGenerator.php: I498e2ebd8c9 (no-op) (duration: 01m 07s)
[16:33:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:43] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: I498e2ebd8c9 (duration: 01m 07s)
[16:34:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:21] * Krinkle done with mwdebug1002
[16:38:00] <wikibugs>	 (PS1) Elukey: Raise MaxGCPauseMillis on Hadoop HDFS Namenodes' GC settings [puppet] - https://gerrit.wikimedia.org/r/580079
[16:40:37] <icinga-wm>	 ACKNOWLEDGEMENT - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: RRDP status alert. Ayounsi known https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[16:40:47] <wikibugs>	 (PS4) Jforrester: Add wikidata.beta.wmflabs.org + prod domains to beta csp [mediawiki-config] - https://gerrit.wikimedia.org/r/578183 (owner: Brian Wolff)
[16:41:24] <wikibugs>	 (CR) Jforrester: [C: +2] Add wikidata.beta.wmflabs.org + prod domains to beta csp [mediawiki-config] - https://gerrit.wikimedia.org/r/578183 (owner: Brian Wolff)
[16:42:15] <wikibugs>	 (Merged) jenkins-bot: Add wikidata.beta.wmflabs.org + prod domains to beta csp [mediawiki-config] - https://gerrit.wikimedia.org/r/578183 (owner: Brian Wolff)
[16:43:46] <wikibugs>	 (PS2) Jforrester: Make CSP enforce on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/579890 (https://phabricator.wikimedia.org/T244124) (owner: Brian Wolff)
[16:44:02] <James_F>	 OK, let's do this.
[16:45:55] <wikibugs>	 (CR) Jforrester: [C: +2] Make CSP enforce on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/579890 (https://phabricator.wikimedia.org/T244124) (owner: Brian Wolff)
[16:46:55] <wikibugs>	 (Merged) jenkins-bot: Make CSP enforce on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/579890 (https://phabricator.wikimedia.org/T244124) (owner: Brian Wolff)
[16:48:42] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Set wmgUseCSP false everywhere T244124 (duration: 01m 07s)
[16:48:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:48:49] <stashbot>	 T244124: Make CSP enforce on beta cluster - https://phabricator.wikimedia.org/T244124
[16:50:50] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Touch and secondary sync of IS for cache-busting (duration: 01m 07s)
[16:50:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:47] <logmsgbot>	 !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: Enforce Content Security Policy if wmgUseCSP is set T244124 (duration: 01m 06s)
[16:52:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:54:06] <icinga-wm>	 RECOVERY - WDQS high update lag on wdqs1005 is OK: (C)4.32e+04 ge (W)2.16e+04 ge 2.1e+04 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:54:49] <gehel>	 !log repooling wdqs1005
[16:54:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:58] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:56:40] <wikibugs>	 (CR) Elukey: [C: +2] Raise MaxGCPauseMillis on Hadoop HDFS Namenodes' GC settings [puppet] - https://gerrit.wikimedia.org/r/580079 (owner: Elukey)
[16:58:51] <wikibugs>	 (PS1) Ladsgroup: Set up read new term store up to Q50M [mediawiki-config] - https://gerrit.wikimedia.org/r/580082 (https://phabricator.wikimedia.org/T219123)
[16:59:55] <wikibugs>	 (CR) Ladsgroup: [C: +2] Set up read new term store up to Q50M [mediawiki-config] - https://gerrit.wikimedia.org/r/580082 (https://phabricator.wikimedia.org/T219123) (owner: Ladsgroup)
[17:00:04] <jouncebot>	 gehel and onimisionipe: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T1700).
[17:01:26] <wikibugs>	 (Merged) jenkins-bot: Set up read new term store up to Q50M [mediawiki-config] - https://gerrit.wikimedia.org/r/580082 (https://phabricator.wikimedia.org/T219123) (owner: Ladsgroup)
[17:03:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:03:36] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q50M (T219123)]] (duration: 01m 06s)
[17:03:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:41] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[17:06:56] <logmsgbot>	 !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:579925|Set up read new term store up to Q50M (T219123)]], take II (duration: 01m 08s)
[17:07:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:08:11] <Amir1>	 !log warming up cache for Q50M to Q60M for new term store on db1111, db1126, db1104, db1092 (T219123)
[17:08:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:12:19] <wikibugs>	 (CR) Ottomata: "Have not totally followed this discussion so feel free to ignore this." [deployment-charts] - https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: Holger Knust)
[17:13:18] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:19:30] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:21:41] <mutante>	 looks like that spike is the deploy or cache-warmup, Amir
[17:24:08] <mutante>	 yea, it's over 
[17:24:10] <mutante>	 though
[17:24:28] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:27:28] <jynus>	 Amir1: addshore please keep in mind https://grafana.wikimedia.org/d/XyoE_N_Wz/wikidata-database-cpu-saturation?orgId=1
[17:27:39] <Amir1>	 jynus: thanks
[17:27:41] <Amir1>	 I will
[17:28:01] <jynus>	 there were spikes on db1126, probably causing the issue below
[17:28:08] <jynus>	 *above
[17:29:23] <Amir1>	 I keep it tamed for now
[17:29:29] <jynus>	 Commit failed on server(s) 
[17:31:16] <jynus>	 those were errors on the wikidata master ^
[17:31:22] <jynus>	 so not reads
[17:31:34] <Amir1>	 oh the writes
[17:31:37] <Amir1>	 let me check
[17:32:05] <jynus>	 give it a look- it passed so no big deal, but maybe interesting for code reasons
[17:37:17] <wikibugs>	 (PS1) Dzahn: site/DHCP: remove cp1099 [puppet] - https://gerrit.wikimedia.org/r/580087 (https://phabricator.wikimedia.org/T229586)
[17:38:03] <wikibugs>	 (PS1) Dzahn: remove production IPs for cp1099 [dns] - https://gerrit.wikimedia.org/r/580089 (https://phabricator.wikimedia.org/T229586)
[17:38:14] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:40:06] <wikibugs>	 (PS1) Dzahn: remove mgmt IPs for cp1099 [dns] - https://gerrit.wikimedia.org/r/580091 (https://phabricator.wikimedia.org/T229586)
[17:40:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:42:01] <Amir1>	 is the prometheus because of it too?
[17:43:40] <Amir1>	 I let it stay there for a bit
[17:43:54] <cdanis>	 Amir1: no, unrelated
[17:43:59] <Amir1>	 cool
[17:44:04] <Amir1>	 afk for a bit
[17:44:24] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:45:16] <wikibugs>	 (PS1) Dzahn: site: let labtest* use role(test), not spare::system [puppet] - https://gerrit.wikimedia.org/r/580092
[17:47:11] <wikibugs>	 (PS2) Dzahn: site/DHCP: remove cp1099 [puppet] - https://gerrit.wikimedia.org/r/580087 (https://phabricator.wikimedia.org/T229586)
[17:47:49] <Amir1>	 it's page creations going up: https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops
[17:47:56] <Amir1>	 https://grafana.wikimedia.org/d/000000170/wikidata-edits?orgId=1&from=now-1h&to=now&fullscreen&panelId=10
[17:48:29] <Amir1>	 will go down quickly, it's going to flap, I'm already asking users to reduce the items created for a bit, until this is over
[17:50:32] <mutante>	 ok, thanks for the updates Amir
[17:50:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:51:09] <Amir1>	 there's not much we can do about it, we are writing to the term stores for new items since today and it's going to overlap quite a lot
[17:51:25] <Amir1>	 but we should improve it nonetheless 
[17:51:52] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[17:53:12] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:54:54] <wikibugs>	 (PS1) Krinkle: [WIP] logging: Remove useMicrosecondTimestamps(false) calls [mediawiki-config] - https://gerrit.wikimedia.org/r/580096 (https://phabricator.wikimedia.org/T116550)
[17:55:23] <wikibugs>	 (CR) Krinkle: "Blocked as WIP because we're on Monolog 1.25, not >= 2.x" [mediawiki-config] - https://gerrit.wikimedia.org/r/580096 (https://phabricator.wikimedia.org/T116550) (owner: Krinkle)
[17:55:31] <Krinkle>	 Reedy: ^ :D
[17:58:14] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_mobileapps_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:58:19] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: -1] toolforge: support canonical redirects in urlproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[18:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: May I have your attention please! Morning SWAT(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T1800)
[18:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[18:00:46] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:03:34] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[18:05:29] <logmsgbot>	 !log mforns@deploy1001 Started deploy [analytics/refinery@1681b92]: deploying refinery to add forgotten artifacts for v0.0.118
[18:05:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:10:44] <wikibugs>	 (PS3) Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617)
[18:11:01] <wikibugs>	 (PS3) Krinkle: Move $wgConf to CommonSettings.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579652 (https://phabricator.wikimedia.org/T169821)
[18:11:56] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:13:18] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:14:31] <wikibugs>	 (PS4) Arturo Borrero Gonzalez: toolforge: support canonical redirects in urlproxy [puppet] - https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617)
[18:15:39] <Krinkle>	 jouncebot: next
[18:15:39] <jouncebot>	 In 1 hour(s) and 44 minute(s): Services – Graphoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T2000)
[18:15:41] <Krinkle>	 jouncebot: now
[18:15:41] <jouncebot>	 For the next 0 hour(s) and 44 minute(s): Morning SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T1800)
[18:17:17] <wikibugs>	 (CR) Krinkle: [C: +2] Move $wgConf to CommonSettings.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579652 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[18:17:28] <wikibugs>	 (PS1) Dzahn: site/conftool: remove mw1221 through mw1226 [puppet] - https://gerrit.wikimedia.org/r/580101
[18:17:28] * Krinkle testing on mwdebug1002
[18:17:30] <wikibugs>	 (Merged) jenkins-bot: Move $wgConf to CommonSettings.php (2/2) [mediawiki-config] - https://gerrit.wikimedia.org/r/579652 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[18:17:32] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, please merge so we can work on follow up patches :-)" [software/tools-webservice] - https://gerrit.wikimedia.org/r/496564 (owner: BryanDavis)
[18:18:09] <wikibugs>	 (CR) BryanDavis: toolforge: support canonical redirects in urlproxy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/579952 (https://phabricator.wikimedia.org/T234617) (owner: Arturo Borrero Gonzalez)
[18:18:31] <logmsgbot>	 !log mforns@deploy1001 Finished deploy [analytics/refinery@1681b92]: deploying refinery to add forgotten artifacts for v0.0.118 (duration: 13m 01s)
[18:18:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:18:37] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM, please merge so we can followup with other patches." [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: Nehajha)
[18:19:12] <wikibugs>	 (PS4) Krinkle: Document Apache gzip sidestepping [puppet] - https://gerrit.wikimedia.org/r/539842 (https://phabricator.wikimedia.org/T232615) (owner: Gilles)
[18:19:28] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:19:31] <wikibugs>	 (PS2) Dzahn: site/conftool: remove mw1221 through mw1226 [puppet] - https://gerrit.wikimedia.org/r/580101
[18:20:06] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM." [software/tools-webservice] - https://gerrit.wikimedia.org/r/578407 (owner: BryanDavis)
[18:20:56] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for eqiad on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe1005:9112 job=statsd_exporter site=eqiad https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad
[18:21:08] <icinga-wm>	 PROBLEM - mediawiki originals uploads -hourly- for codfw on icinga1001 is CRITICAL: account=mw-media class=originals cluster=swift instance=ms-fe2005:9112 job=statsd_exporter site=codfw https://wikitech.wikimedia.org/wiki/Swift/How_To https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw
[18:21:37] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM" [software/tools-webservice] - https://gerrit.wikimedia.org/r/578408 (https://phabricator.wikimedia.org/T246689) (owner: BryanDavis)
[18:22:28] <wikibugs>	 Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Dvorapa) Any news? From possible solutions like T238751, T240442, T245144 and @Ladsgroup's T247459?  La...
[18:23:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:25:14] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM. I would suggest to include the final help output in the commit message (if it fits, anyway). That should help with patch reviewing t" [software/tools-webservice] - https://gerrit.wikimedia.org/r/578409 (owner: BryanDavis)
[18:29:10] <icinga-wm>	 PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[18:30:31] <wikibugs>	 (PS1) Dzahn: DHCP: remove mw1221 through mw1226 [puppet] - https://gerrit.wikimedia.org/r/580105
[18:31:13] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov)
[18:34:18] <logmsgbot>	 !log krinkle@deploy1001 Synchronized docroot/noc/: I2c3217fb3 (duration: 01m 07s)
[18:34:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:35:33] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM. The benefit of using this vs using kubectl directly is that this is persisting the info into the manifest, right? so next start uses" [software/tools-webservice] - https://gerrit.wikimedia.org/r/578412 (owner: BryanDavis)
[18:35:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={icinga,squid} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:36:29] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/CommonSettings.php: no-op, courtesy of opcache (duration: 01m 06s)
[18:36:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:36:53] <wikibugs>	 Operations, serviceops: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (Dzahn)
[18:37:25] <wikibugs>	 Operations, serviceops: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (Dzahn)
[18:37:39] <wikibugs>	 Operations, serviceops: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (Dzahn)
[18:38:00] <wikibugs>	 Operations, serviceops: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (Dzahn)
[18:38:04] <logmsgbot>	 !log krinkle@deploy1001 Synchronized wmf-config/: I2c3217fb3da8bb65 (duration: 01m 07s)
[18:38:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:38:17] <wikibugs>	 (PS3) Dzahn: site/conftool: remove mw1221 through mw1226 [puppet] - https://gerrit.wikimedia.org/r/580101 (https://phabricator.wikimedia.org/T247780)
[18:38:42] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: -1] Add support for redirecting to toolforge.org (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/578413 (https://phabricator.wikimedia.org/T234617) (owner: BryanDavis)
[18:38:49] <wikibugs>	 Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (Dzahn)
[18:38:59] <wikibugs>	 Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (Dzahn)
[18:39:00] <icinga-wm>	 RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp3058 is OK: HTTP OK: HTTP/1.0 200 OK - 22098 bytes in 0.257 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[18:41:16] <wikibugs>	 (PS2) Dzahn: DHCP: remove mw1221 through mw1226 [puppet] - https://gerrit.wikimedia.org/r/580105 (https://phabricator.wikimedia.org/T247780)
[18:41:26] <wikibugs>	 (PS1) Dzahn: remove production IPs of mw1221 through mw1226 [dns] - https://gerrit.wikimedia.org/r/580107 (https://phabricator.wikimedia.org/T247780)
[18:41:56] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:42:18] <wikibugs>	 (CR) RLazarus: [C: +1] site/conftool: remove mw1221 through mw1226 [puppet] - https://gerrit.wikimedia.org/r/580101 (https://phabricator.wikimedia.org/T247780) (owner: Dzahn)
[18:43:13] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: pc2 #page on pc1008 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:43:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:43:24] <cdanis>	 around
[18:43:25] * volans looking
[18:43:27] <rlazarus>	 here
[18:43:29] <XioNoX>	 same
[18:43:36] <chaomodus>	 here
[18:43:44] <vgutierrez>	 here
[18:43:45] <volans>	  load average: 74.63, 128.69, 163.81
[18:43:45] <cdanis>	 MW exceptions don't seem critical
[18:43:58] <apergos>	 looking in
[18:43:58] <jbond42>	 here
[18:44:04] <mutante>	 Amir1: hi, fyi
[18:44:08] <marostegui>	 checking
[18:44:16] <volans>	 level=error msg="Error pinging mysqld: Error 1040: Too many connections"
[18:44:27] <marostegui>	 parsercache with too many?
[18:44:32] <mutante>	 there is "cache warm up" going on
[18:44:34] <marostegui>	 massive invalidation perhaps?
[18:44:35] <volans>	 apparently and very high load average
[18:44:43] <mutante>	 17:08 < Amir1> !log warming up cache for Q50M to Q60M for new term store on db1111, db1126, db1104, db1092 (T219123)
[18:44:44] <stashbot>	 T219123: Migrate to and read from new store for item terms - https://phabricator.wikimedia.org/T219123
[18:44:46] <jynus>	 here
[18:44:46] <mutante>	 etc
[18:44:56] <marostegui>	 mutante: don't think that should affect parsercache
[18:44:59] <mutante>	 ok
[18:45:02] <marostegui>	 or at least it never did before
[18:45:03] <apergos>	 that should be about dbs cache not pc
[18:45:09] <apergos>	 "should"
[18:45:29] <volans>	 marostegui: I cannot find the mysql error log could be it was never created?
[18:45:32] <Amir1>	 back now,
[18:45:41] <Amir1>	 let me read
[18:46:07] <marostegui>	 lots of REPLACE /* SqlBagOStuff::updateTableKeys api.php@mw1313 *
[18:46:28] <Amir1>	 that's not me
[18:46:36] <Amir1>	 the cache warmup is direct query to datbase
[18:46:38] <cdanis>	 it started just before 18:00
[18:46:39] <Amir1>	 *database
[18:46:47] <jynus>	 lets failover pc2? marostegui to see if server not service?
[18:46:49] <cdanis>	 pc1008's disk is maxed out too
[18:46:49] <wikibugs>	 (PS3) Krinkle: [WIP] Remove use of the $globals cache temporary file [mediawiki-config] - https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821)
[18:46:50] <Amir1>	 by cache we mean innodb buffer pool
[18:46:54] <jynus>	 or do you know it is service?
[18:46:58] <marostegui>	 jynus: I am checking if the rest are the same
[18:47:00] <rlazarus>	 appserver latency also spiked at the same time, is still high https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1
[18:47:05] <Amir1>	 https://phabricator.wikimedia.org/T219123#5924185 
[18:47:17] <marostegui>	 they all have a big increase in connections
[18:47:24] <marostegui>	 since 18:00
[18:47:26] <jynus>	 then lets not touch it
[18:47:30] <jynus>	 if train, revert
[18:47:35] <cdanis>	 is not train
[18:47:51] <wikibugs>	 (CR) jerkins-bot: [V: -1] [WIP] Remove use of the $globals cache temporary file [mediawiki-config] - https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821) (owner: Krinkle)
[18:48:13] <wikibugs>	 (PS2) Krinkle: [WIP] Remove "Cache-control: no-cache" hack from CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/579815 (https://phabricator.wikimedia.org/T189966)
[18:48:18] <rlazarus>	 who's IC?
[18:48:25] <wikibugs>	 (PS3) Krinkle: Remove "Cache-control: no-cache" hack from CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/579815 (https://phabricator.wikimedia.org/T189966)
[18:48:51] <marostegui>	 there's definitely something happening around 18:00
[18:48:57] <marostegui>	 that had created lots of connections there
[18:48:58] <jynus>	 it is all updates
[18:49:21] <volans>	 marostegui: write policy and BBU seems ok
[18:49:27] <jynus>	 it is sql
[18:49:28] <marostegui>	 jynus: replaces from what I can see
[18:49:29] <jynus>	 not server
[18:49:42] <jynus>	 yes, I mean updates as processlists says
[18:49:51] <jynus>	 not UPDATE sql
[18:50:01] <marostegui>	 maybe a massive expiration or something?
[18:50:11] <jynus>	 that could be caused by something expiring all keys
[18:50:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[18:51:15] <wikibugs>	 (CR) Andrew Bogott: [C: +2] site: let labtest* use role(test), not spare::system [puppet] - https://gerrit.wikimedia.org/r/580092 (owner: Dzahn)
[18:51:42] <jynus>	 at 5000 connections the query killer kicks in
[18:51:52] <volans>	 what's the current TTL for items in parsercache?
[18:51:55] <Amir1>	 where is this connections? s8?
[18:51:56] <marostegui>	 I am not seeing any traffic increase or anything
[18:52:02] <marostegui>	 Amir1: no, parsercache
[18:52:04] <jynus>	 volans: like 3 months or something
[18:52:04] <cdanis>	 Amir1: parsercache
[18:52:08] <marostegui>	 volans: 30 days I reckon
[18:52:25] <jynus>	 could be someone asking a lot of uncached pages?
[18:52:28] <jynus>	 revisions?
[18:52:40] <Amir1>	 then it's not the term store stuff
[18:52:46] <jynus>	 Amir1: it is not you
[18:52:55] <Amir1>	 so I get out of the way
[18:52:58] <wikibugs>	 (PS1) Andrew Bogott: Revert "site: let labtest* use role(test), not spare::system" [puppet] - https://gerrit.wikimedia.org/r/580109
[18:53:13] <marostegui>	 I cannot correlate the increase on parsercache with any increase on traffic
[18:53:20] <marostegui>	 At least not with requests
[18:53:21] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: pc2 #page on pc1008 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[18:53:36] <jynus>	 the others are a bit healthier
[18:53:41] <marostegui>	 that's just a temporary recovery
[18:53:49] <wikibugs>	 (CR) jerkins-bot: [V: -1] Revert "site: let labtest* use role(test), not spare::system" [puppet] - https://gerrit.wikimedia.org/r/580109 (owner: Andrew Bogott)
[18:54:05] <jynus>	 I wonder if we should try to depool, unless you have a better idea
[18:54:13] <marostegui>	 can we get some MW expert in here?
[18:54:24] <jynus>	 as in failover to 1010
[18:54:26] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[18:54:27] <marostegui>	 jynus: The others are suffering kinda the same, just smaller hit
[18:54:36] <volans>	 any hot key?
[18:54:47] <wikibugs>	 (PS3) Andrew Bogott: nova policy.json: Remove all redundant policies [puppet] - https://gerrit.wikimedia.org/r/579637 (https://phabricator.wikimedia.org/T247573)
[18:54:50] <marostegui>	 jynus: pc1010 might be worse, as it is probably cold :(
[18:55:14] <jynus>	 I know, that is why I was asking for a better idea
[18:55:16] <jynus>	 :-D
[18:55:30] <apergos>	 it smells like a massive expiry doesn't it
[18:55:39] <Amir1>	 marostegui: I know a little bit about PC
[18:55:39] <cdanis>	 it is weird that pc1008 is showing any packet drops at all
[18:55:42] <jynus>	 https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1
[18:55:46] <jynus>	 hit rate plummeted
[18:56:14] <marostegui>	 Amir1: help welcomed! :)
[18:56:33] <volans>	 https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=1584384914633&to=1584384969481&fullscreen&panelId=6
[18:56:34] <cdanis>	 jynus: but only once things were bad
[18:56:35] <jynus>	 cdanis: it is being overloaded by connections
[18:56:36] <volans>	 disk space
[18:56:47] <mutante>	 doc?  https://docs.google.com/document/d/1GsyYu_ruw58SSIJrYTQncLJzoXvSoBQdWQDNrFED_iQ/edit?usp=sharing
[18:57:00] <jynus>	 volans: we are ok on disk space
[18:57:01] <volans>	 sorry, those are small diffs
[18:57:04] <jynus>	 it would take hours to
[18:57:05] <rlazarus>	 volans: note axes, that's almost no change
[18:57:11] <volans>	 yeah it fooled me :D
[18:57:16] <marostegui>	 volans: check the axis, it is very little
[18:57:18] <rlazarus>	 thanks grafana :(
[18:57:42] <marostegui>	 Amir1: to sum up, we are seeing a huge increase in connections on all parsercache hosts, with pc1008 being the one with the most
[18:58:09] <jynus>	 connections are going down now
[18:58:10] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova policy.json: Remove all redundant policies [puppet] - https://gerrit.wikimedia.org/r/579637 (https://phabricator.wikimedia.org/T247573) (owner: Andrew Bogott)
[18:58:12] <jynus>	 also on pc1008
[18:58:40] <jynus>	 something created either a cache miss avalanche or some weird mw pattern
[18:58:43] <marostegui>	 volans: to answer your previous question, no, I see random tables being involved all the times and random keys
[18:58:47] <jynus>	 or memcached content got lost
[18:58:50] <jynus>	 many options
[18:58:53] <volans>	 marostegui: ack, I see that too
[18:58:55] <jynus>	 this is just possible causes
[18:59:10] <jynus>	 yeah, different languages involved
[18:59:34] <Amir1>	 okay, let me think, it can be that someone is trying to load pages in multilingual wikis (commons, wikidata) in non-main language 
[18:59:41] <Amir1>	 that would rebuild PC for each one of them
[19:00:12] <Amir1>	 let me dig a bit
[19:00:13] <icinga-wm>	 PROBLEM - MariaDB Slave Lag: pc2 #page on pc1008 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:00:20] <jynus>	 yeah, it came back
[19:00:52] <marostegui>	 Amir1: any possible way to identify that from a db point of view?
[19:00:55] <cdanis>	 I don't see anything that unusual in memcache dashboards so far
[19:01:07] * chaomodus afk ping if necessary
[19:01:10] <wikibugs>	 (PS2) Andrew Bogott: Revert "site: let labtest* use role(test), not spare::system" [puppet] - https://gerrit.wikimedia.org/r/580109
[19:01:10] <elukey>	 on the memcached side, I don't see any abnormal activity from a quick look https://grafana.wikimedia.org/d/000000316/memcache?orgId=1
[19:01:17] <Amir1>	 they have keys, the keys are actually values of another key
[19:01:31] <jynus>	 querying sys.processlist is better here
[19:01:33] <jynus>	 not locking
[19:01:54] <Amir1>	 for each PC entry there are two rows (in two different databases), one refers from a general key to a specific key, the other from the specific key to the value
[19:02:29] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Revert "site: let labtest* use role(test), not spare::system" [puppet] - https://gerrit.wikimedia.org/r/580109 (owner: Andrew Bogott)
[19:02:50] <jynus>	 I see a few enwiktionary:pcache:idoptions
[19:02:53] <marostegui>	 I have the feeling that most of them are  SqlBagOStuff::updateTableKeys api.php instead of  SqlBagOStuff::updateTableKeys index.php
[19:03:04] <jynus>	 :pcache:idoptions:
[19:03:09] <jynus>	 is that normal?
[19:03:13] <icinga-wm>	 PROBLEM - MariaDB Slave SQL: pc2 #page on pc1008 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:03:20] <wikibugs>	 (CR) Andrew Bogott: [C: +2] "I merged this in error, and reverted in" [puppet] - https://gerrit.wikimedia.org/r/580109 (owner: Andrew Bogott)
[19:03:28] <marostegui>	 can someone downtime pc1008?
[19:03:45] <cdanis>	 marostegui: doing
[19:03:54] <marostegui>	 cdanis: thanks :*
[19:04:17] <Amir1>	 jynus: yup, the value of that would point out to the key to the actual value...
[19:04:24] <Amir1>	 why it's like this, I don't know
[19:05:23] <icinga-wm>	 RECOVERY - MariaDB Slave Lag: pc2 #page on pc1008 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:06:01] <jynus>	 marostegui: I still think it is ok to try to pool pc1010
[19:06:05] <jynus>	 what do we have to lose?
[19:06:10] <jynus>	 we can revert
[19:06:20] <marostegui>	 how are pc1007 and pc1009?
[19:06:23] <marostegui>	 Let me prepare the patch
[19:06:27] <jynus>	 less loaded
[19:06:31] <jynus>	 it mostly hits pc1008
[19:06:34] <marostegui>	 ok, pc1010 replicates from pc1007
[19:06:37] <jynus>	 although it affects all
[19:06:42] <volans>	 from some grep I did
[19:06:42] <volans>	     157 updateTableKeys RunSingleJob.php@
[19:06:42] <volans>	     222 updateTableKeys index.php@
[19:06:42] <volans>	     331 updateTableKeys api.php@
[19:06:46] <wikibugs>	 (PS4) Krinkle: Remove "Cache-control: no-cache" hack from CommonSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/579815 (https://phabricator.wikimedia.org/T247783)
[19:06:49] <marostegui>	 so pc1010 will be pretty cold
[19:06:53] <jynus>	 I know
[19:06:55] <marostegui>	 let me try to pool it 
[19:06:56] <jynus>	 empty actually
[19:07:08] <marostegui>	 yep
[19:07:11] <jynus>	 but we can discard the server, even if doesn't fix anything
[19:07:11] <marostegui>	 preparing the patch
[19:07:16] <jynus>	 it is not dbctl
[19:07:20] <marostegui>	 nope
[19:07:23] <volans>	 and then a long tail of language names with few hits
[19:07:29] <wikibugs>	 (PS4) Andrew Bogott: nova policy.json: only permit admin user to resize VMs [puppet] - https://gerrit.wikimedia.org/r/579638 (https://phabricator.wikimedia.org/T247573)
[19:07:30] <Amir1>	 hmm, can it be the elastic jobs? let me check
[19:07:41] <jynus>	 I don't have a better suggestion at the moment
[19:07:42] <wikibugs>	 (PS4) Andrew Bogott: nova policy.json: require projectadmin for delete/rebuild/reboot [puppet] - https://gerrit.wikimedia.org/r/579639
[19:07:49] <volans>	 Amir1: my sample data had 1336 lines
[19:07:56] <wikibugs>	 (PS4) Krinkle: [WIP] Remove use of the $globals cache temporary file [mediawiki-config] - https://gerrit.wikimedia.org/r/579653 (https://phabricator.wikimedia.org/T169821)
[19:08:04] <volans>	 so half was the long tail
[19:08:08] <jynus>	 how is latency, availability affected?
[19:08:08] <volans>	 and half those 3
[19:08:17] <Amir1>	 volans: can I see some of the tail?
[19:08:21] <icinga-wm>	 RECOVERY - MariaDB Slave SQL: pc2 #page on pc1008 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave
[19:08:26] <mutante>	 i am keeping notes on the doc. does this currently have user impact?
[19:08:34] <cdanis>	 mutante: appserver latency elevated
[19:08:36] <jynus>	 that is my question
[19:08:38] <jynus>	 ok
[19:08:41] <mutante>	 ok
[19:08:47] <jynus>	 so I prefer to try something
[19:08:48] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova policy.json: only permit admin user to resize VMs [puppet] - https://gerrit.wikimedia.org/r/579638 (https://phabricator.wikimedia.org/T247573) (owner: Andrew Bogott)
[19:08:53] <cdanis>	 mostly long-tail -- 95%ile from 500ms to several seconds
[19:08:57] <jynus>	 if there is impact, at least to help debugging
[19:08:59] <cdanis>	 error rate looks okay
[19:09:02] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova policy.json: require projectadmin for delete/rebuild/reboot [puppet] - https://gerrit.wikimedia.org/r/579639 (owner: Andrew Bogott)
[19:09:05] <jynus>	 discard server variables
[19:09:10] <wikibugs>	 (PS1) Marostegui: db-eqiad.php: Emergency pool pc1010 [mediawiki-config] - https://gerrit.wikimedia.org/r/580117
[19:09:13] <jynus>	 even if it is likely to make things worse
[19:09:16] <marostegui>	 jynus: please review ^
[19:09:18] <jynus>	 or equally bad
[19:09:35] <wikibugs>	 (CR) Jcrespo: [C: +1] db-eqiad.php: Emergency pool pc1010 [mediawiki-config] - https://gerrit.wikimedia.org/r/580117 (owner: Marostegui)
[19:09:37] <jynus>	 Go
[19:09:46] <wikibugs>	 (CR) Marostegui: [V: +2 C: +2] db-eqiad.php: Emergency pool pc1010 [mediawiki-config] - https://gerrit.wikimedia.org/r/580117 (owner: Marostegui)
[19:09:49] <jynus>	 we can at least get information for debugging
[19:10:04] * addshore reads up
[19:10:07] <wikibugs>	 (CR) Volans: "Those are still pooled AFAIC (except mw1221)" [puppet] - https://gerrit.wikimedia.org/r/580101 (https://phabricator.wikimedia.org/T247780) (owner: Dzahn)
[19:10:22] <marostegui>	 deploying
[19:10:32] <mutante>	 rlazarus: i guess you can call me IC if updating the doc counts
[19:10:40] <mutante>	 volans: yea, i was about to depool when this started 
[19:10:54] <jynus>	 can someone check potential related deploys (even if unlikely)
[19:11:12] <jynus>	 and someone check changes in traffic pattern for requests that could lead to extra parsing
[19:11:14] <mutante>	 Amir1: hows the "check on elastic jobs" looking?
[19:11:25] <logmsgbot>	 !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool pc1010 instead of pc1008 as pc1008 is overloaded (duration: 01m 06s)
[19:11:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:31] <jynus>	 another person to check memcache hit rates
[19:11:34] <marostegui>	 let's monitor pc1010 closely
[19:11:44] <jynus>	 we can prepare the revert
[19:11:56] <Amir1>	 mutante: nothing out of ordinary 
[19:11:58] <jynus>	 it is not as if I have high confidence on that fixing the issue
[19:12:04] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:12:25] <mutante>	 re: memcache hit rates: https://grafana.wikimedia.org/d/000000316/memcache?orgId=1
[19:12:30] <jynus>	 connections growing
[19:12:34] <jynus>	 but let's wait
[19:12:53] <volans>	 I don't see any deploy around 17:58
[19:13:01] <cdanis>	 appserver error rate increasing
[19:13:03] <mutante>	 memcached traffic looks stable?
[19:13:07] <jynus>	 thanks, those checks help even if only to discard
[19:13:14] <marostegui>	 so far pc1010 has the same amount as pc1007 and 1009
[19:13:22] <mutante>	 Amir1: thanks, ok
[19:13:23] <jynus>	 that would be better than pc1008
[19:13:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:13:40] <jynus>	 if at least stays available
[19:13:46] <cdanis>	 appserver tail and average latency is down a lot
[19:13:49] <wikibugs>	 Operations, ops-eqiad, DC-Ops: mw1373 power supply redundancy ipmi alert - https://phabricator.wikimedia.org/T247755 (wiki_willy) a:Jclark-ctr
[19:13:51] <marostegui>	 parsercache hit ratio sinking, as expected
[19:13:59] <jynus>	 sure but that was the downside
[19:14:08] <jynus>	 maybe there were 2 issues
[19:14:12] <jynus>	 a request mw one
[19:14:19] <jynus>	 and a server one that made the issue worse
[19:14:23] <jynus>	 but too early to say
[19:14:36] <marostegui>	 yeah, but something definitely happened at 18:00
[19:14:40] <jynus>	 requests seem capped at 500
[19:14:42] <cdanis>	 mcrouter traffic is elevated
[19:14:43] <jynus>	 is that normal?
[19:14:46] <volans>	 slightly before
[19:14:46] <cdanis>	 but slightly
[19:15:01] <wikibugs>	 Operations, ops-eqiad, DC-Ops: mw1373 power supply redundancy ipmi alert - https://phabricator.wikimedia.org/T247755 (wiki_willy) @Jclark-ctr - this might be just a loose power cord, but if its an actually power supply that's bad, it looks like this machine is under warranty.  Thanks, Willy
[19:15:09] <jynus>	 should we diff pc1008 and pc1010?
[19:15:09] <marostegui>	 pc1010 seems stable at around 800 connections
[19:15:15] <marostegui>	 jynus: also hardware diff
[19:15:20] <cdanis>	 ok, appserver error rate is back to 0, average latency and 95%ile is still down
[19:15:21] <marostegui>	 like raid stripe and all that
[19:15:34] <jynus>	 so, marostegui so far the idea wasn't as crazy lol
[19:15:49] <jynus>	 i am voting for 2 issues
[19:15:59] <jynus>	 this will not be free, though
[19:16:05] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Horizon: update nova_policy to conform with the current API policy.json [puppet] - https://gerrit.wikimedia.org/r/579640 (https://phabricator.wikimedia.org/T247573) (owner: Andrew Bogott)
[19:16:07] <jynus>	 increased latency for days
[19:16:07] <volans>	 not sure if it was mentioned, mediawiki appservers workers saturation had a spike at the same time
[19:16:12] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[19:16:17] <cdanis>	 volans: good to know
[19:16:33] <jynus>	 volans: the issue is to differentiate between cause and consequences
[19:16:40] <volans>	 yeah that seems effect
[19:16:41] <jynus>	 mutante: 2 causes seem to be happening
[19:16:55] <jynus>	 pc1008 likely to have performance issues (source unknown)
[19:17:04] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:17:04] <jynus>	 that was mitigated (so far)
[19:17:12] <jynus>	 but there is still a change in mw parsing
[19:17:17] <jynus>	 which is overloading all servers
[19:17:17] <mutante>	 jynus: ok, ack
[19:17:24] <Amir1>	 the errors I see are mostly commons
[19:17:25] <jynus>	 that is the most worrying one
[19:17:29] <Amir1>	 things like /w/index.php?title=Image:71_Winter_Hawk.jpg&action=render&uselang=en
[19:17:30] <marostegui>	 I hope we are not getting new keys or something, that means more disk usage
[19:17:41] <Amir1>	 https://logstash.wikimedia.org/goto/6268ff8273f61fb3a2ff0a9f33028c5d
[19:17:45] <addshore>	 what time did this start?
[19:17:46] <jynus>	 marostegui: that is a worry for tomorrows manuel and jaime
[19:17:54] <jynus>	 :-D
[19:17:54] <Amir1>	 I think it's a bot going crazy
[19:17:56] <mutante>	 addshore: 1800 UTC
[19:17:58] <marostegui>	 addshore: around 18:00 UTC
[19:17:59] <jynus>	 Amir1: could be
[19:18:02] <volans>	 17:58
[19:18:08] <jynus>	 I would search for requests parsing lots of pages
[19:18:14] <jynus>	 old revisions
[19:18:15] <mutante>	 addshore: with alerts at 18:43
[19:18:16] <cdanis>	 the only increase in appserver CPU I see is a recent one, at about 19:12
[19:18:18] <jynus>	 but highly parallel
[19:18:24] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:18:50] <jynus>	 can I get a status check from everyone, are we better than before in availability and performance?
[19:18:56] <cdanis>	 jynus: yes
[19:19:01] <jynus>	 just to make sure we stay with pc1010
[19:19:09] <jynus>	 anyone else agrees?
[19:19:17] <cdanis>	 https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET&var-code=200&fullscreen&panelId=10
[19:19:19] <cdanis>	 for example
[19:19:21] <jynus>	 I agree from db layer
[19:19:21] <marostegui>	 pc1010 is performing better
[19:19:31] <jynus>	 but would prefer someone from app layer, like cdanis
[19:19:32] <marostegui>	 and stable at around 800 connections, which is still quite a lot
[19:19:41] <jynus>	 yeah, that is the main issue here
[19:19:45] <volans>	 +1
[19:19:50] <volans>	 https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&fullscreen&panelId=92
[19:19:53] <volans>	 seems better too
[19:19:55] <jynus>	 but at least we are on a more sane state
[19:19:57] <marostegui>	 ok, let's stay at pc1010
[19:20:18] <jynus>	 when you don't know what to do, do something, as long as it is safe
[19:20:26] <jynus>	 that's my philosophy
[19:20:29] <wikibugs>	 Operations, Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (leila) @elukey thanks for flagging this.  @bmansurov can you look into this and let me know what the best course of action is?
[19:20:30] <mutante>	 jynus: appserver errors are back to normal low
[19:20:30] <jynus>	 :-D
[19:20:39] <volans>	 responses still slow, but let's see if they recover
[19:20:39] <wikibugs>	 Operations, Research: recommendation api's test on scb nodes are flapping - https://phabricator.wikimedia.org/T247732 (leila) p:Triage→High
[19:20:41] <volans>	 https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&fullscreen&panelId=88&from=now-3h&to=now
[19:20:43] <jynus>	 mutante: pc connections are not normal
[19:20:46] <cdanis>	 appserver CPU is slightly increased, and appserver network traffic is also increased, I think a direct result of the lower hit rate
[19:20:48] <jynus>	 they are still elevated
[19:21:14] <logmsgbot>	 !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@f5600d6]: Update mobileapps to 8a6e403
[19:21:17] <volans>	 but all the other graphs are returning to their previous state
[19:21:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:21:19] <volans>	 so promising
[19:21:51] <marostegui>	 mutante: we have around 250 connections normally, and now we have between 700 and 1000
[19:22:02] <jynus>	 marostegui: I will stop replication on pc1010
[19:22:10] <marostegui>	 jynus: +1
[19:22:11] <jynus>	 forgot that
[19:22:34] <elukey>	 there is also an increase in all traffic metrics for memcached, but nothing really horrible
[19:22:43] <jynus>	 pc1007-bin.080617:259138670
[19:23:06] <jynus>	 !log stop replication at pc1010 at pos pc1007-bin.080617:259138670
[19:23:06] <marostegui>	 so the initial issue is mitigated, but we need MW follow up on what has generated this amount of connections
[19:23:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:23:18] <jynus>	 we need to replicate the codfw one from it now
[19:23:23] <jynus>	 but that can wait
[19:23:44] <marostegui>	 yeah, no rush on that
[19:24:05] <jynus>	 almost 18:00 to the second
[19:24:10] <Amir1>	 elukey: Can I see the graphs? It might be related to my work
[19:24:19] <jynus>	 a bit before actually
[19:24:26] <jynus>	 17:58 a first spike
[19:24:45] <elukey>	 Amir1: it is https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=now-3h&to=now but around 20:11, when manuel switched the pc shard.. 
[19:24:47] <jynus>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=pc1008&var-port=9104
[19:24:57] <jynus>	 ^mutante for the doc, if you add the ranges
[19:24:57] <logmsgbot>	 !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
[19:25:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:25:04] <elukey>	 err 19:11 UTC
[19:25:28] <wikibugs>	 Operations, LDAP-Access-Requests: Request for a ldap account and be added to nda ldap group for PHPCC - https://phabricator.wikimedia.org/T247731 (RStallman-legalteam) @ darthmon_wmde - Happy to help with the NDAs. Could you provide me with a physical (mailing) address for thephp.cc? A physical address i...
[19:25:42] <marostegui>	 maybe we should check if there were some relevant cronjobs starting at around these times
[19:25:43] <marostegui>	 ?
[19:25:59] <jynus>	 sure
[19:26:05] <jynus>	 I was checking more graphs
[19:26:12] <jynus>	 the overload is happening also on the others
[19:26:22] <jynus>	 they are not healthy
[19:26:25] <jynus>	 high latency
[19:26:28] <logmsgbot>	 !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventstreams' for release 'canary' .
[19:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:26:40] <marostegui>	 mmmm
[19:26:46] <marostegui>	 pc1007 and pc1009 have fully recovered
[19:26:52] <marostegui>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=pc1009&var-port=9104&fullscreen&panelId=37
[19:26:53] <jynus>	 really?
[19:27:00] <marostegui>	 https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&fullscreen&panelId=37&var-dc=eqiad%20prometheus%2Fops&var-server=pc1007&var-port=9104
[19:27:00] <jynus>	 I am not 100% sure of that
[19:27:10] <marostegui>	 check those graphs
[19:27:14] <jynus>	 sure
[19:27:19] <jynus>	 but see latency of scraping
[19:27:23] <jynus>	 very unusual
[19:27:31] <marostegui>	 pc1010 still having 700 connections
[19:27:34] <marostegui>	 this is super weird :-/
[19:27:59] <jynus>	 normally it is 200
[19:28:02] <logmsgbot>	 !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@f5600d6]: Update mobileapps to 8a6e403 (duration: 06m 48s)
[19:28:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:28:09] <jynus>	 the others are in around 300
[19:28:20] <marostegui>	 yeah, but a massive drop to almost normal values
[19:28:21] <jynus>	 sometimes more
[19:28:24] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:28:25] <marostegui>	 from 800 to 300
[19:28:27] <jynus>	 what I mean is that
[19:28:33] <jynus>	 they are not as good as they used to be
[19:28:38] <jynus>	 they are just "stable"
[19:28:41] <jynus>	 but not normal
[19:29:01] <jynus>	 however, if they were worse because pc1008
[19:29:03] <addshore>	 I dont really see any increase in parser cache generation / misses
[19:29:03] <addshore>	 https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&var-contentModel=Campaign&var-contentModel=CollaborationHubContent&var-contentModel=CollaborationListContent&var-contentModel=JsonConfig_Dashiki&var-contentModel=JsonSchema&var-contentModel=JsonZeroConfig&var-contentModel=Json_JsonConfig&var-contentModel=Map_JsonConfig&var-contentModel=MassMessageListContent&var-contentModel=Scribunto&var-contentModel=SecurePoll&var-co
[19:29:03] <addshore>	 ntentModel=Tabular_JsonConfig&var-contentModel=css&var-contentModel=flow_board&var-contentModel=hit&var-contentModel=javascript&var-contentModel=json&var-contentModel=miss&var-contentModel=proofread_index&var-contentModel=proofread_page&var-contentModel=sanitized_css&var-contentModel=text&var-contentModel=wikibase_item&var-contentModel=wikibase_lexeme&var-contentModel=wikibase_property&var-contentModel=wikitext&var-contentModel=ya
[19:29:03] <addshore>	 ml&from=1584381707179&to=1584386903777
[19:29:06] <jynus>	 i cannot see why
[19:29:08] <addshore>	 urgf, bad lin
[19:29:09] <addshore>	 link
[19:29:18] <jynus>	 minimize
[19:29:22] <jynus>	 w.wiki
[19:29:26] <icinga-wm>	 PROBLEM - Logstash Elasticsearch indexing errors on icinga1001 is CRITICAL: 15.14 ge 8 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:29:42] <addshore>	 https://bit.ly/2WtZD5V
[19:29:43] <jynus>	 could elastic parse stuff, someone proposed?
[19:29:50] <marostegui>	 Amir1 mentioned it
[19:30:08] <Amir1>	 nothing out of ordinary in the jobs 
[19:30:39] <addshore>	 marostegui: jynus that elastic related one was / is 
[19:30:40] <addshore>	 https://phabricator.wikimedia.org/T239931
[19:30:43] <Amir1>	 https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1
[19:31:11] <jynus>	 could it be one of those edits
[19:31:17] <jynus>	 that parses half of wikipedias
[19:31:18] <volans>	 anything I can help with right now? 
[19:31:21] <jynus>	 on wikidata
[19:31:23] <Amir1>	 addshore: that sanitizer is disabled for wikidata, other wikis, I don't know if it would be big enough 
[19:31:24] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Bstorm)
[19:31:28] <jynus>	 or somewhere else (a template)
[19:31:30] <wikibugs>	 Operations, Cloud-VPS (Debian Jessie Deprecation), Patch-For-Review, cloud-services-team (Kanban): Migrate labstore1004/labstore1005 to Stretch/Buster - https://phabricator.wikimedia.org/T224582 (Bstorm) Open→Stalled While we now have an improved failover experience with these systems, th...
[19:31:42] <Amir1>	 jynus: that's disabled for a really long time now
[19:31:46] <marostegui>	 jynus: but would have hit all the servers the same, or at least, the recovery, pc1010 still have double connections
[19:31:54] <icinga-wm>	 RECOVERY - Logstash Elasticsearch indexing errors on icinga1001 is OK: (C)8 ge (W)1 ge 0.175 https://wikitech.wikimedia.org/wiki/Logstash%23Indexing_errors https://logstash.wikimedia.org/goto/1cee1f1b5d4e6c5e06edb3353a2a4b83 https://grafana.wikimedia.org/dashboard/db/logstash
[19:32:42] <jynus>	 gwtoolsetUploadMediafileJob is that normal?
[19:32:49] <jynus>	 https://grafana.wikimedia.org/d/000000400/jobqueue-eventbus?orgId=1&fullscreen&panelId=11&from=1584365565964&to=1584387165964&var-site=eqiad&var-type=All
[19:32:58] <jynus>	 I am just looking at random graphs
[19:33:59] <cdanis>	 appserver latency is still up from before, but it's within what i think is a healthy range
[19:34:05] <jynus>	 mutante: if things are "normal" maybe we can stop here
[19:34:22] <jynus>	 and try to find a root cause tomorrow
[19:34:27] <marostegui>	 cdanis: yeah, we still see more connections than usual, specially on pc1010 (which replaced pc1008)
[19:34:37] <jynus>	 plus the extra latency
[19:34:39] <mutante>	 jynus: i added the grafana links with proper date range now
[19:34:42] <marostegui>	 I think we need 2 different tasks, one for pc1008 and another one for mw
[19:34:51] <mutante>	 jynus: it seems like it, yes
[19:34:53] <jynus>	 cdanis: marostegui: coming from invalidating 33% of our cache disk
[19:34:56] <wikibugs>	 (PS3) Andrew Bogott: nova policy: convert from json to yaml [puppet] - https://gerrit.wikimedia.org/r/579647
[19:35:13] <marostegui>	 jynus: we've never seen more connections because of the invalidation
[19:35:21] <Amir1>	 I go eat for now, Reedy can you take a look?
[19:35:26] <jynus>	 maybe pc1008 had some issues that only showed up under high load
[19:35:28] <Amir1>	 He knows PC very well
[19:35:29] <volans>	 cdanis: can you explain this one?
[19:35:30] <volans>	 https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&from=now-3h&to=now&var-source=eqiad%20prometheus%2Fops&var-cluster=appserver&var-node=mw1238&fullscreen&panelId=88
[19:35:31] <marostegui>	 We will probably need CPT involved in the MW investigation 
[19:35:35] <jynus>	 lets create a ticket
[19:35:39] <volans>	 as all the other graphs in the same dashboard have recovered
[19:35:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={icinga,squid} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:35:52] <jynus>	 and ask sam tim or brad
[19:35:54] <marostegui>	 Then let's include Reedy as well on the task :-)
[19:36:39] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova policy: convert from json to yaml [puppet] - https://gerrit.wikimedia.org/r/579647 (owner: Andrew Bogott)
[19:36:47] <jynus>	 should I create it or are you already?
[19:36:49] <cdanis>	 volans: reparses?
[19:36:55] <cdanis>	 not sure
[19:37:00] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:37:06] <marostegui>	 jynus: I am creating the MW task
[19:37:12] <mutante>	 2 tickets? one for CPT and one to check pc1008 for hardware issues?
[19:37:16] <marostegui>	 yep
[19:37:20] <mutante>	 doing the other one
[19:37:20] <jynus>	 well, I didn't want to separate them yet
[19:37:20] <volans>	 me neither, it has been almost 30m since the other graphs in the same dashboard recovered
[19:37:21] <marostegui>	 I am creating the one for CPT
[19:37:33] <jynus>	 until we know what's the deal with pc1008
[19:37:37] <jynus>	 but it is ok
[19:37:37] * volans dinner is coming up
[19:37:47] <volans>	 if you don't have anything immediate for me I'll be afk for a bit
[19:37:49] <jynus>	 I wanted to create just one with "overload"
[19:37:51] <marostegui>	 jynus: I prefer to separate them because we may spam stuff about HW, mysql config etc
[19:38:00] <marostegui>	 We can always merge them later
[19:38:20] <cdanis>	 more mw errors, mostly for wikidata and commonswiki
[19:38:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:38:32] <jynus>	 oh, they should be separate, but wouldn't know what to say about pc1008 yet
[19:38:38] <jynus>	 as we depooled it blindly
[19:38:50] <cdanis>	 i think these are the same errors amir was talking about
[19:38:58] <mutante>	 T247787
[19:38:58] <wikibugs>	 Operations: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Dzahn)
[19:38:59] <stashbot>	 T247787: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787
[19:39:17] <jynus>	 ok, a title works for now
[19:39:24] <jynus>	 lets add that to the doc
[19:39:31] <jynus>	 and I will add some tags
[19:40:13] <wikibugs>	 Operations, ops-eqiad, Analytics: (Need by: TBD) rack/setup/install kafka-jumbo100[789].eqiad.wmnet - https://phabricator.wikimedia.org/T244506 (wiki_willy) a:Christopher→Cmjohnson
[19:40:16] <wikibugs>	 Operations, DBA, Wikimedia-Incident, Wikimedia-production-error: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (jcrespo)
[19:40:28] <mutante>	 added under actionables. and one to follow-up with CPT.
[19:40:34] <jynus>	 yep
[19:40:34] <wikibugs>	 Operations, ops-eqiad, serviceops: (Need by: TBD) rack/setup/install kubernetes10[07-14].eqiad.wmnet - https://phabricator.wikimedia.org/T241850 (wiki_willy) a:Christopher→Cmjohnson
[19:40:50] <jynus>	 status now: pc1008 failed over to pc1010
[19:41:00] <jynus>	 33% less cache on disk
[19:41:16] <jynus>	 still connection/load issues in general, but to a level we can cope
[19:41:41] <jynus>	 ^if you want to copy and paste
[19:42:01] <mutante>	 added
[19:42:45] <jynus>	 hit rate went from 81% to 64%
[19:43:02] <jynus>	 which is impacting but not out of the ordinary
[19:43:20] <jynus>	 we have 3 servers precisely for that :-D
[19:44:31] <wikibugs>	 Operations, Core Platform Team, MediaWiki-Cache: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (Marostegui) p:Triage→High
[19:44:32] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[19:44:33] <marostegui>	 task created ^
[19:44:45] <jynus>	 thanks, will link them
[19:44:49] <mutante>	 ok, as you suggested i guess at this point we can stop and it's not an active incident. but the search for root cause continues tomorrow
[19:45:05] <wikibugs>	 Operations, Core Platform Team, MediaWiki-Cache: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (Marostegui)
[19:45:09] <marostegui>	 mutante: yeah, I believe so
[19:45:19] <wikibugs>	 Operations, Core Platform Team, MediaWiki-Cache: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (jcrespo) pc1008 coincidental hw issues handled separately at T247787
[19:45:21] <mutante>	 alright, ACK
[19:46:02] <jynus>	 if everyone thinks mw state is ok until tomorrow (now that peak time will finish) we will research more tomorrow
[19:46:15] <wikibugs>	 Operations, ops-eqiad, DC-Ops, fundraising-tech-ops: rack/setup/install fran1001 - https://phabricator.wikimedia.org/T245554 (Cmjohnson) Open→Resolved
[19:47:03] <mutante>	 jynus: +1
[19:47:06] <marostegui>	 Let's continue tomorrow, it is quite late for EU time
[19:47:11] <jynus>	 yep
[19:47:15] <wikibugs>	 Operations, DBA, Wikimedia-Incident, Wikimedia-production-error: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (jcrespo) Reminder: pc1010 stopped replication, but pc2 on codfw needs to replicate from it.
[19:47:18] <marostegui>	 Thanks everyone who showed up, specially mutante for coordinating!
[19:47:19] <jynus>	 ^added that reminder
[19:47:29] <wikibugs>	 Operations, Core Platform Team, MediaWiki-Cache, Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (Krinkle)
[19:47:36] <jynus>	 can you paste the url of incident somewhere?
[19:47:43] <jynus>	 we will put it on the wiki when we can
[19:47:46] <cdanis>	 https://docs.google.com/document/d/1GsyYu_ruw58SSIJrYTQncLJzoXvSoBQdWQDNrFED_iQ/edit#heading=h.vg6rb6x2eccy
[19:47:48] <jynus>	 thanks
[19:48:01] <jynus>	 we will put it public soon
[19:48:10] <jynus>	 (tomorrow)
[19:48:17] <jynus>	 on wikitech
[19:48:21] <wikibugs>	 Operations, DBA, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (Krinkle)
[19:50:13] <marostegui>	 Ok, I am going offline
[19:50:14] <marostegui>	 Thanks everyone
[19:50:52] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:51:13] <mutante>	 thanks DBAs, good night
[19:51:54] <mutante>	 added "having pc1010 around to be able to pool" under "what went well"
[19:52:53] <wikibugs>	 Operations, DBA, Wikimedia-Incident: investigate pc1008 for possible hardware issues / performance under high load - https://phabricator.wikimedia.org/T247787 (jcrespo) Both pending documentation and more research, but it is mitigated by being depooled.
[19:52:57] <jynus>	 mutante: he he
[20:00:04] <jouncebot>	 halfak and accraze: May I have your attention please! Services – Graphoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T2000)
[20:00:48] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:01:31] <wikibugs>	 (PS1) Andrew Bogott: nova policy: use context_is_admin instead of admin_api [puppet] - https://gerrit.wikimedia.org/r/580122
[20:01:44] <mutante>	 well, it's true :)
[20:02:11] <wikibugs>	 (PS1) Hashar: README.md: update docker-pkg command line [docker-images/production-images] - https://gerrit.wikimedia.org/r/580124
[20:02:19] <wikibugs>	 (CR) jerkins-bot: [V: -1] nova policy: use context_is_admin instead of admin_api [puppet] - https://gerrit.wikimedia.org/r/580122 (owner: Andrew Bogott)
[20:03:30] <wikibugs>	 (PS2) Andrew Bogott: nova policy: use context_is_admin instead of admin_api [puppet] - https://gerrit.wikimedia.org/r/580122 (https://phabricator.wikimedia.org/T247573)
[20:04:06] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=no; selector: name=mw122[1-6].eqiad.wmnet
[20:04:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:37] <wikibugs>	 (CR) Jforrester: [C: +1] README.md: update docker-pkg command line [docker-images/production-images] - https://gerrit.wikimedia.org/r/580124 (owner: Hashar)
[20:04:38] <mutante>	 !log depool (yes->no) mw1221 - mw1226 (T247780)
[20:04:39] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova policy: use context_is_admin instead of admin_api [puppet] - https://gerrit.wikimedia.org/r/580122 (https://phabricator.wikimedia.org/T247573) (owner: Andrew Bogott)
[20:04:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:04:45] <stashbot>	 T247780: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780
[20:05:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:06:58] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:08:16] <mutante>	 hrmm
[20:08:40] <mutante>	 ok, that is really small though compared to earlier
[20:12:00] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:13:38] <wikibugs>	 (PS1) Andrew Bogott: nova policy: restrict VM creation to project admins [puppet] - https://gerrit.wikimedia.org/r/580125
[20:18:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:21:46] <wikibugs>	 Operations, ops-eqiad, DC-Ops: mw1373 power supply redundancy ipmi alert - https://phabricator.wikimedia.org/T247755 (Jclark-ctr) Open→Resolved Reseated power cable
[20:22:18] <wikibugs>	 Operations, ops-eqiad, DC-Ops: mw1373 power supply redundancy ipmi alert - https://phabricator.wikimedia.org/T247755 (wiki_willy) Thanks @Jclark-ctr
[20:23:26] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:28:19] <wikibugs>	 (PS6) CRusnov: puppetdb uservice: Add individual host queries, expand for interface automation [puppet] - https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153)
[20:29:14] <icinga-wm>	 RECOVERY - IPMI Sensor Status on mw1373 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[20:32:06] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:33:28] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:34:48] <icinga-wm>	 PROBLEM - Host ms-be1030 is DOWN: PING CRITICAL - Packet loss = 100%
[20:35:28] <icinga-wm>	 RECOVERY - Host ms-be1030 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[20:35:58] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission
[20:36:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:36:35] <wikibugs>	 (CR) CRusnov: [C: +2] puppetdb uservice: Add individual host queries, expand for interface automation [puppet] - https://gerrit.wikimedia.org/r/579758 (https://phabricator.wikimedia.org/T244153) (owner: CRusnov)
[20:37:15] <logmsgbot>	 !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0)
[20:37:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:37:21] <wikibugs>	 Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1221.eqiad.wmnet` -  mw1221.eqiad.wmnet (**PASS**)   - Downtimed host on Icinga   - Found...
[20:37:40] <volans>	 mutante: if you're decomming hosts and you have a bunch
[20:37:51] <volans>	 can you try multiple at the same time too please?
[20:38:26] <mutante>	 volans: yea, will do that. i just wanted to be careful with the first one and the right order of things
[20:38:36] <volans>	 sure, thanks a lot!
[20:38:40] <mutante>	 doing the other 5 at once then or so
[20:39:36] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:39:37] <mutante>	 volans: would you agree it's "pooled => 'no'" (but not 'inactive'),   run decom script,  remove from site.pp and conftool at once?
[20:39:56] <volans>	 why not inactive?
[20:40:24] <mutante>	 because when i did that before running the decom script, the icinga alerts started about "not in dsh group"
[20:40:34] <mutante>	 and the decom script does "no -> inactive" ?
[20:41:08] <mutante>	 or maybe i should separate "remove from site" and "remove from conftool"
[20:41:09] <volans>	 the decom cookbook doesn't touch conftool
[20:41:11] <volans>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/master/cookbooks/sre/hosts/decommission.py
[20:41:15] <mutante>	 ok
[20:41:24] <volans>	 list of actions at the top or running it with -h
[20:42:07] <logmsgbot>	 !log dzahn@cumin1001 conftool action : set/pooled=inactive; selector: name=mw122[1-6].eqiad.wmnet
[20:42:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:17] <logmsgbot>	 !log dzahn@cumin1001 START - Cookbook sre.hosts.decommission
[20:43:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:53] <mutante>	 volans: ATTENTION: destructive action for 5 hosts: mw[1222-1226].eqiad.wmnet
[20:44:02] <mutante>	 jouncebot: next
[20:44:02] <jouncebot>	 In 0 hour(s) and 15 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T2100)
[20:44:15] <volans>	 mutante: yep :D
[20:44:18] <mutante>	 i gotta remove from conftool right after
[20:44:21] <mutante>	 before the deploy
[20:44:24] <volans>	 k
[20:44:30] <mutante>	 or they get errors on scap
[20:44:38] <volans>	 sure, once the decom has run they are gone
[20:44:50] <volans>	 you can remove from everything but mgmt just in case
[20:44:54] <mutante>	 **Failed to wipe bootloaders, manual intervention required to make it unbootable
[20:45:02] <mutante>	 on one of them
[20:45:07] <volans>	 was a broken one?
[20:45:11] <volans>	 could we ssh into it?
[20:45:34] <mutante>	 i did not test that, it was 1223
[20:45:39] <mutante>	 not the one already depooled
[20:45:41] <mutante>	 so it was pooled
[20:45:48] <mutante>	 the others did not have the issue
[20:45:56] <logmsgbot>	 !log dzahn@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1)
[20:46:03] <wikibugs>	 Operations, serviceops, Patch-For-Review: decom old appservers in eqiad - https://phabricator.wikimedia.org/T247780 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw[1222-1226].eqiad.wmnet` -  mw1222.eqiad.wmnet (**PASS**)   - Downtimed host on Icinga...
[20:46:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:46:07] <mutante>	 but it should have been working or we would have noticed in icinga before
[20:46:35] <volans>	 let me see the logs
[20:46:37] <mutante>	 volans: well, it's exit_code=1 at the end but just because of that one step, all else looks good
[20:46:46] <volans>	 k
[20:47:25] <wikibugs>	 (PS4) Dzahn: site/conftool: remove mw1221 through mw1226 [puppet] - https://gerrit.wikimedia.org/r/580101 (https://phabricator.wikimedia.org/T247780)
[20:47:58] <volans>	 the wipe command returned 123 as status code
[20:48:18] <volans>	 we could try to power it up again and see if I can repro
[20:48:42] <mutante>	 yea, it was serving until recently https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=mw1223&var-datasource=eqiad%20prometheus%2Fops&var-cluster=api_appserver
[20:48:47] <volans>	 depending on which part of the command it failed, it might have partially worked making the repro harder
[20:48:48] <mutante>	 ok, let me see if it boots
[20:49:06] <mutante>	 well.. after the merge 
[20:49:16] <mutante>	 because of the deployment in 10 min
[20:49:36] <volans>	 sure no hurry
[20:50:25] <wikibugs>	 (CR) Dzahn: [C: +2] "depooled to inactive and ran decom cookbook" [puppet] - https://gerrit.wikimedia.org/r/580101 (https://phabricator.wikimedia.org/T247780) (owner: Dzahn)
[20:52:42] <mutante>	 !log 5 old API appservers in eqiad removed
[20:52:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:54:44] <mutante>	 !log powercycling mw1223
[20:54:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:55:27] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:55:57] <mutante>	 hmmm.. kind of want a cumin alias for scap proxies to run puppet on them
[20:56:43] <mutante>	 volans: it boots into PXE / Debian installer 
[20:56:53] <mutante>	 did not merge the DHCP removal :p
[20:57:10] <volans>	 ok, so I guess no repro :D
[20:58:01] <mutante>	 !log mw1223 power down
[20:58:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:58:11] <volans>	 thanks for trying
[20:58:15] <mutante>	 yw
[20:58:37] <wikibugs>	 (PS3) Dzahn: DHCP: remove mw1221 through mw1226 [puppet] - https://gerrit.wikimedia.org/r/580105 (https://phabricator.wikimedia.org/T247780)
[20:59:06] <rlazarus>	 mutante: C:profile::mediawiki::scap_proxy?
[20:59:19] <mutante>	 volans: overall cookbook works very well. thx
[20:59:30] <rlazarus>	 wait no, that's every mw host o_O
[20:59:31] <mutante>	 rlazarus: ah, thanks, that looks good
[20:59:41] <mutante>	 heh
[20:59:53] <rlazarus>	 oh, I see # Sets up an rsync proxy for scap, if the server is set up to be one
[20:59:58] <volans>	 mutante: thanks for testing it!
[21:00:04] <jouncebot>	 Reedy and sbassett: Your horoscope predicts another unfortunate Weekly Security deployment window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T2100).
[21:00:15] <mutante>	 Reedy: are you deploying something?
[21:00:20] <Reedy>	 Nope
[21:00:29] <mutante>	 ok, nice
[21:00:46] <mutante>	 then i won't worry about forcing puppet
[21:01:11] <volans>	 mutante: 'R:ferm::service = rsyncd_scap_proxy'
[21:01:43] <volans>	 rlazarus: ^^^
[21:01:52] <rlazarus>	 nice, thanks
[21:02:09] <mutante>	 oh, via ferm, nice
[21:02:35] <volans>	 alternatively 'R:rsync::server::module%path = /srv/mediawiki'
[21:02:56] <mutante>	 tries it.. 9 hosts
[21:03:12] <volans>	 those seem to be the only two matching things looking at modules/profile/manifests/mediawiki/scap_proxy.pp
[21:03:17] <volans>	 I'd use the first one, seems more safe
[21:03:23] <volans>	 and simpler :D
[21:03:54] <mutante>	 using the ferm::service is a nice one that would work for other things as well
[21:04:06] <mutante>	 100.0% (9/9) success ratio  
[21:04:30] <mutante>	 ok, i hope 6 servers already makes a difference for power usage
[21:04:37] <mutante>	 doing more later
[21:04:43] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:04:54] <mutante>	 maybe more like tomorrow
[21:05:44] <wikibugs>	 (PS1) Brian Wolff: Add prod domains to beta CSP policy to allow easier gadget testing [mediawiki-config] - https://gerrit.wikimedia.org/r/580127
[21:12:07] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:13:55] <wikibugs>	 (CR) DannyS712: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/580127 (owner: Brian Wolff)
[21:21:34] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:22:08] <wikibugs>	 (PS1) Hashar: Add an image for python2 app based on Buster [docker-images/production-images] - https://gerrit.wikimedia.org/r/580128 (https://phabricator.wikimedia.org/T215458)
[21:23:33] <wikibugs>	 (CR) Hashar: "I have build it locally with:" [docker-images/production-images] - https://gerrit.wikimedia.org/r/580128 (https://phabricator.wikimedia.org/T215458) (owner: Hashar)
[21:24:10] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:24:27] <wikibugs>	 (CR) Hashar: "Here is a python2-build-buster image, that is to be able to craft the wheels for Zuul which is a python 2 app. We are too short to attempt" [docker-images/production-images] - https://gerrit.wikimedia.org/r/580128 (https://phabricator.wikimedia.org/T215458) (owner: Hashar)
[21:24:56] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:25:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:27:49] <wikibugs>	 (PS2) Hashar: Add an image for python2 app based on Buster [docker-images/production-images] - https://gerrit.wikimedia.org/r/580128 (https://phabricator.wikimedia.org/T215458)
[21:28:42] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:29:48] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova policy: restrict VM creation to project admins [puppet] - https://gerrit.wikimedia.org/r/580125 (owner: Andrew Bogott)
[21:33:00] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:34:00] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:34:47] <wikibugs>	 (PS1) Andrew Bogott: nova policy: restrict VM update to projectadmin [puppet] - https://gerrit.wikimedia.org/r/580129
[21:35:59] <wikibugs>	 (CR) Andrew Bogott: [C: +2] nova policy: restrict VM update to projectadmin [puppet] - https://gerrit.wikimedia.org/r/580129 (owner: Andrew Bogott)
[21:37:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:44:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:51:20] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={squid,swagger_check_mobileapps_cluster_codfw} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:52:44] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:53:33] <wikibugs>	 (PS3) Hashar: Add an image for python2 app based on Buster [docker-images/production-images] - https://gerrit.wikimedia.org/r/580128 (https://phabricator.wikimedia.org/T215458)
[21:53:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:54:54] <wikibugs>	 (CR) BryanDavis: "> LGTM. The benefit of using this vs using kubectl directly is that" [software/tools-webservice] - https://gerrit.wikimedia.org/r/578412 (owner: BryanDavis)
[21:57:46] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[21:58:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:06:16] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:07:06] <wikibugs>	 (PS1) Cmjohnson: Add kafka-jumbo100[7-9] to netboot.cfg and dhcpd file [puppet] - https://gerrit.wikimedia.org/r/580132 (https://phabricator.wikimedia.org/T244506)
[22:07:48] <wikibugs>	 (CR) Hashar: "recheck" [puppet] - https://gerrit.wikimedia.org/r/579587 (https://phabricator.wikimedia.org/T215458) (owner: Hashar)
[22:09:40] <wikibugs>	 (PS2) Cmjohnson: Add kafka-jumbo100[7-9] to netboot.cfg and dhcpd file [puppet] - https://gerrit.wikimedia.org/r/580132 (https://phabricator.wikimedia.org/T244506)
[22:11:56] <wikibugs>	 (CR) Cmjohnson: [C: +2] Add kafka-jumbo100[7-9] to netboot.cfg and dhcpd file [puppet] - https://gerrit.wikimedia.org/r/580132 (https://phabricator.wikimedia.org/T244506) (owner: Cmjohnson)
[22:12:46] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:13:24] <wikibugs>	 (PS1) Bstorm: toolforge-k8s: remove old legacy-cluster code from bastion [puppet] - https://gerrit.wikimedia.org/r/580134 (https://phabricator.wikimedia.org/T246689)
[22:13:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:17:27] <wikibugs>	 (CR) Alex Monk: [C: +1] toolforge-k8s: remove old legacy-cluster code from bastion [puppet] - https://gerrit.wikimedia.org/r/580134 (https://phabricator.wikimedia.org/T246689) (owner: Bstorm)
[22:18:47] <brennen>	 i feel like i'm seeing a pattern of errors that echoes thursday's T247553
[22:19:23] <brennen>	 at any rate, something seems to be up
[22:21:18] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:22:48] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[22:25:24] <wikibugs>	 (CR) BryanDavis: "I think based on T172693 that we don't actually have these tables passing through from the sanitarium servers. I have an open task to actu" [puppet] - https://gerrit.wikimedia.org/r/579800 (owner: Alex Monk)
[22:28:50] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:32:56] <Amir1>	 This issue might be related to what we saw today: https://phabricator.wikimedia.org/T247562 marostegui 
[22:34:48] <brennen>	 Amir1, marostegui: any thoughts on that one are welcome.  train still blocked and i'm at a bit of a loss whose attention to bring it to or how to further investigate.
[22:35:58] <Amir1>	 brennen: sure, I'm at end of my day though 
[22:36:05] <Amir1>	 I will take a look ASAP
[22:36:22] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:38:48] <brennen>	 Amir1: ta - i'm also nearing EOD and i am unlikely to roll forward at this point, but it would be good if it were resolved before hashar is supposed to be cutting new branch tomorrow
[22:39:04] <hashar>	 still blocked ? :(
[22:39:14] <brennen>	 yes. :(
[22:39:36] <hashar>	 do we have any idea which code causes the issue?
[22:40:10] <hashar>	 if it is triggered by a specific api call for example, we might be able to identify the culprit
[22:41:05] <thcipriani>	 I went back and looked at a few of the endpoints that triggered the problem, afaict it wasn't limited to the api
[22:42:25] <wikibugs>	 (PS1) RLazarus: Add a default User-Agent. [software/httpbb] - https://gerrit.wikimedia.org/r/580135
[22:43:03] <hashar>	 I am not savvy enough in mw anymore unfortunately :(
[22:43:03] <thcipriani>	 given that, I'd guess it's something more systemic; i.e., something adding pressure to the system that the system can't handle or something deep down in terms of layers of abstraction
[22:43:54] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={squid,swagger_check_cxserver_cluster_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:45:32] <brennen>	 https://phabricator.wikimedia.org/T247562#5974273
[22:49:06] <hashar>	 nice
[22:49:36] <wikibugs>	 (CR) Ppchelko: "> Patch Set 4:" [deployment-charts] - https://gerrit.wikimedia.org/r/575108 (https://phabricator.wikimedia.org/T220399) (owner: Holger Knust)
[22:50:21] <Platonides>	 that feels bad, hashar 
[22:50:35] <hashar>	 so something too large is stored in the cache
[22:50:40] <hashar>	 and somehow overloads memcached?
[22:51:18] <wikibugs>	 Operations, Core Platform Team, MediaWiki-Cache, Wikimedia-Incident: Parsercache sudden increase of connections - https://phabricator.wikimedia.org/T247788 (brennen) May be related to T247562.
[22:51:20] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:51:38] <wikibugs>	 (PS1) Andrew Bogott: Keystone: convert policy.json to yaml [puppet] - https://gerrit.wikimedia.org/r/580137 (https://phabricator.wikimedia.org/T247795)
[22:51:40] <wikibugs>	 (PS1) Andrew Bogott: glance: move python.json files to yaml [puppet] - https://gerrit.wikimedia.org/r/580138 (https://phabricator.wikimedia.org/T247795)
[22:51:42] <wikibugs>	 (PS1) Andrew Bogott: designate: move policy.json to yaml [puppet] - https://gerrit.wikimedia.org/r/580139 (https://phabricator.wikimedia.org/T247795)
[22:51:52] <brennen>	 parsercache would be... consistent with large values?
[22:52:26] <wikibugs>	 (CR) jerkins-bot: [V: -1] Keystone: convert policy.json to yaml [puppet] - https://gerrit.wikimedia.org/r/580137 (https://phabricator.wikimedia.org/T247795) (owner: Andrew Bogott)
[22:53:17] <wikibugs>	 (CR) jerkins-bot: [V: -1] designate: move policy.json to yaml [puppet] - https://gerrit.wikimedia.org/r/580139 (https://phabricator.wikimedia.org/T247795) (owner: Andrew Bogott)
[22:55:28] <wikibugs>	 (PS2) Andrew Bogott: Keystone: convert policy.json to yaml [puppet] - https://gerrit.wikimedia.org/r/580137 (https://phabricator.wikimedia.org/T247795)
[22:55:30] <wikibugs>	 (PS2) Andrew Bogott: glance: move python.json files to yaml [puppet] - https://gerrit.wikimedia.org/r/580138 (https://phabricator.wikimedia.org/T247795)
[22:55:33] <wikibugs>	 (PS2) Andrew Bogott: designate: move policy.json to yaml [puppet] - https://gerrit.wikimedia.org/r/580139 (https://phabricator.wikimedia.org/T247795)
[22:57:47] <hashar>	 brennen: thcipriani we would need someone familiar with memcached  maybe
[22:57:52] <hashar>	 there are bunch of dashboards on grafana
[22:58:03] <hashar>	 https://grafana.wikimedia.org/d/000000317/memcache-slabs?orgId=1
[22:58:37] <hashar>	 that shows spikes of evictions for example
[22:58:48] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:00:04] <jouncebot>	 RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Evening SWAT(Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200316T2300).
[23:00:04] <jouncebot>	 No GERRIT patches in the queue for this window AFAICS.
[23:01:25] <hashar>	 brennen: thcipriani can you let me know on the deployment task what I should do tomorrow  
[23:01:35] <hashar>	 eg whether we freeze/pause or whether I still cut the branch :]
[23:01:46] <hashar>	 I guess I can cut / deploy to group0
[23:01:59] <hashar>	 but 3 versions might be a nightmare
[23:02:55] <brennen>	 hashar: yeah, i feel like 3 versions is a bad idea.
[23:03:01] <wikibugs>	 (CR) Bstorm: [C: +2] toolforge-k8s: remove old legacy-cluster code from bastion [puppet] - https://gerrit.wikimedia.org/r/580134 (https://phabricator.wikimedia.org/T246689) (owner: Bstorm)
[23:03:09] <brennen>	 will leave notes on the task.
[23:04:30] <thcipriani>	 same
[23:05:23] <wikibugs>	 Operations: Enable SSO for Kibana - https://phabricator.wikimedia.org/T246998 (colewhite) That CSP works well.  I think cas needs to respond with an appropriate Access-Control-Allow-Origin.  https://apereo.github.io/cas/5.2.x/installation/Configuration-Properties.html#http-web-requests  Observation from test...
[23:13:44] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:14:27] <tzatziki>	 !log reset email for "MNadrofsky (WMF)" on SUL and officewiki
[23:14:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:39] <wikibugs>	 (PS3) Alex Monk: Add public replica view for oauth_registered_consumer [puppet] - https://gerrit.wikimedia.org/r/579800
[23:18:44] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:28:46] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:31:14] <wikibugs>	 (CR) Volans: [C: +1] "Looks sae to me" [software/httpbb] - https://gerrit.wikimedia.org/r/580135 (owner: RLazarus)
[23:32:41] <Amir1>	 marostegui: btw. db1126 has lots of CPU usage, I doubt it's my cache warming up (it's not showing itself anywhere else), is it something known?
[23:32:48] <wikibugs>	 (PS4) Alex Monk: Add public replica view for oauth_registered_consumer [puppet] - https://gerrit.wikimedia.org/r/579800 (https://phabricator.wikimedia.org/T247800)
[23:33:46] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:41:10] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:48:34] <icinga-wm>	 PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=squid site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:52:24] <icinga-wm>	 PROBLEM - snapshot of s4 in codfw on db1115 is CRITICAL: snapshot for s4 at codfw taken more than 3 days ago: Most recent backup 2020-03-13 23:42:48 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[23:56:02] <icinga-wm>	 RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets