[00:00:23] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:33:13] Hi everyone, [00:33:39] what are the differences between /srv/mediawiki, /srv/mediawiki/w, and /srv/mediawiki/php-...? [00:34:25] can we say that /srv/mediawiki is just a regular directory that contains different versions of MediaWiki? [00:59:09] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [01:00:55] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:42:09] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 173838944 and 15 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:46:33] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 23597424 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [05:52:51] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 930272 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:00:53] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 78768 and 59 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [06:02:55] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [06:04:43] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:01:30] (PS7) Jcrespo: Revert "Remove access for bmansurov" [puppet] - https://gerrit.wikimedia.org/r/559651 (https://phabricator.wikimedia.org/T241089) (owner: Dzahn) [09:03:49] (CR) Jcrespo: [C: +2] Revert "Remove access for bmansurov" [puppet] - https://gerrit.wikimedia.org/r/559651 (https://phabricator.wikimedia.org/T241089) (owner: Dzahn) [09:06:33] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:06:54] Operations, SRE-Access-Requests, Patch-For-Review: Restore access for bmansurov - https://phabricator.wikimedia.org/T241089 (jcrespo) Access has been restored. @bmansurov please wait 30 minutes after this comment for the puppet change to propagate to all servers and then test your access. If it works... [09:08:19] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:21:02] Operations: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (jcrespo) Also note in the past I tried to granularize more, for example I created a group for prometheus editing, as several users only required that and to follow least privileges good practices, but... 
[10:06:59] (PS4) Jcrespo: swift: Fix icinga+prometheus+grafana alert link (Dashboard not found) [puppet] - https://gerrit.wikimedia.org/r/560538 [10:07:01] (PS1) Jcrespo: admin: Add production access to Aroraakhil, including private data [puppet] - https://gerrit.wikimedia.org/r/560604 (https://phabricator.wikimedia.org/T241096) [10:07:03] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:07:35] (PS2) Jcrespo: admin: Add production access to Aroraakhil, including private data [puppet] - https://gerrit.wikimedia.org/r/560604 (https://phabricator.wikimedia.org/T241096) [10:08:49] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:21:04] Operations, Research, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users and researchers for Aroraakhil - https://phabricator.wikimedia.org/T241096 (jcrespo) @Leila researchers typically have time-limited MOUs, is this true in this case? If so, could you share... 
[10:22:19] (CR) Jcrespo: [C: -1] "Probably time limited, wait for answer at: https://phabricator.wikimedia.org/T241096#5763697" [puppet] - https://gerrit.wikimedia.org/r/560604 (https://phabricator.wikimedia.org/T241096) (owner: Jcrespo) [10:25:32] (PS6) Subscriptshoe9: Upload HD Logo for 9 Wikibooks Projects and 1 Wikipeida Project: [mediawiki-config] - https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) [10:31:17] (PS6) Subscriptshoe9: Add HD logos to IS.php [mediawiki-config] - https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) [10:59:15] Operations, netops: fastnetmon misreports attack type and protocol - https://phabricator.wikimedia.org/T241374 (jcrespo) p:Triage→Normal [11:00:47] Operations, Traffic: Add more detailed instructions to the "sec-advice" page - https://phabricator.wikimedia.org/T241309 (jcrespo) See also T240794, which if agreed could be done at the same time. [11:09:00] Operations, Wikimedia-Mailing-lists: Migrate archives of the OKFN-hosted Open-GLAM mailing list to Wikimedia's mailman - https://phabricator.wikimedia.org/T240929 (jcrespo) p:Triage→Low a:herron I am personally not familiar with mailman format. Maybe @Herron, our mail expert, knows how to pro... [11:10:54] Operations: Track services without a native systemd unit - https://phabricator.wikimedia.org/T240843 (jcrespo) How high priority would you say this has, to remove it from triage inbox? [11:11:58] Operations, Wikimedia-Logstash, User-fgiunchedi: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (jcrespo) p:Triage→High This seems high importance, feel free to tune down if necessary. [11:14:49] Operations, netops: fastnetmon spamming /var/log on netflow hosts leading to disk saturation - https://phabricator.wikimedia.org/T240658 (jcrespo) @ayounsi how prioritary would you say this ticket is worth? 
Spam is annoying but shouldn't have high- however the disk saturation could be dangerous (I don't... [11:16:23] Operations, DNS, Traffic: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (jcrespo) [11:16:49] Operations, DNS, Traffic: redirect non-existing wikimania2020.wikimedia.org to wikimania.wikimedia.org - https://phabricator.wikimedia.org/T240341 (jcrespo) Open→Stalled Stalled based on comments, waiting for T202684#5735025 response. [11:20:27] Operations, Beta-Cluster-Infrastructure, Release-Engineering-Team, Scap, serviceops: On beta, scap can't clear opcache on some mw servers - https://phabricator.wikimedia.org/T237033 (jcrespo) @hashar based on Dzhan's comment, is that something your team could handle, sending a puppet patch fo... [11:36:39] (PS1) MarcoAurelio: deployment-mediawiki-parsoid10: Switch labmon1001 to cloudmetrics1002 [puppet] - https://gerrit.wikimedia.org/r/560606 [11:37:18] (CR) jerkins-bot: [V: -1] deployment-mediawiki-parsoid10: Switch labmon1001 to cloudmetrics1002 [puppet] - https://gerrit.wikimedia.org/r/560606 (owner: MarcoAurelio) [11:38:19] (Abandoned) MarcoAurelio: deployment-mediawiki-parsoid10: Switch labmon1001 to cloudmetrics1002 [puppet] - https://gerrit.wikimedia.org/r/560606 (owner: MarcoAurelio) [11:47:26] jynus: Hi (&& feliz Navidad). Got a question re Horizon for beta-cluster. Not sure if you'd be able to help? 
[11:47:50] that sounds like a question for cloud [11:48:06] I am very unlikely to be able to help [11:48:17] got it, I'll ask on -cloud [11:48:35] it's for a gerrit repo andrewbogott set up [11:48:51] you may have to wait, not sure if anybody will be up yet during these dates [11:49:03] but worth asking there indeed [11:49:19] (up at cloud team, I mean) [11:49:27] looking at https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+log/master it looks like a bot does port the changes [12:56:01] (CR) Aklapper: "So who is not afraid (and has the permissions) to say +2 on this one-liner?" [puppet] - https://gerrit.wikimedia.org/r/542787 (https://phabricator.wikimedia.org/T127640) (owner: Aklapper) [12:57:54] (CR) Jcrespo: "I can deploy, but only if a Phabricator is around to handle potential fallout." [puppet] - https://gerrit.wikimedia.org/r/542787 (https://phabricator.wikimedia.org/T127640) (owner: Aklapper) [12:58:15] (CR) Jcrespo: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/542787 (https://phabricator.wikimedia.org/T127640) (owner: Aklapper) [14:42:48] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/560577 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9) [14:44:48] (CR) Urbanecm: [C: +1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/560580 (https://phabricator.wikimedia.org/T150618) (owner: Subscriptshoe9) [15:13:25] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=atlas_exporter site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:15:11] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:48:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:50:33] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:16:53] Operations, SRE-Access-Requests: Restore access for bmansurov - https://phabricator.wikimedia.org/T241089 (bmansurov) Open→Resolved Thanks, @jcrespo. I got my access back. [18:41:39] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:47:01] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [19:41:53] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [20:41:22] Operations, MediaWiki-Authentication-and-authorization, Security-Team, Traffic, Security: Investigate usefulness of SameSite cookies for logged-in accounts - https://phabricator.wikimedia.org/T158604 (sbassett) [20:55:49] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:55:55] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:57:37] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [20:57:43] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:00:23] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} 
(article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:02:11] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:02:11] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:02:59] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:02:59] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:03:59] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:04:19] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:06:33] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:06:35] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:07:53] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:10:07] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:10:07] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:10:15] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:13:13] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:13:41] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:13:41] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:14:39] PROBLEM - recommendation_api endpoints health on scb1001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:15:01] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:15:29] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:15:29] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:15:37] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:17:15] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:18:13] RECOVERY - recommendation_api endpoints health on scb1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:19:03] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:20:23] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:20:51] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:20:57] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:22:11] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:22:39] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:26:13] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: 
/{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:28:01] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:28:01] PROBLEM - recommendation_api endpoints health on scb2001 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:29:19] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:29:47] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:31:37] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:31:37] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected 
status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:31:37] RECOVERY - recommendation_api endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:33:25] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:33:25] PROBLEM - recommendation_api endpoints health on scb2002 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:33:33] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:34:43] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:38:49] PROBLEM - recommendation_api endpoints health on scb2005 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:38:49] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:40:35] RECOVERY - recommendation_api endpoints health on scb2005 is OK: All endpoints 
are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:42:25] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:44:13] RECOVERY - recommendation_api endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:49:05] PROBLEM - recommendation_api endpoints health on scb1003 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:49:35] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:49:41] PROBLEM - recommendation_api endpoints health on scb2006 is CRITICAL: /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:54:29] RECOVERY - recommendation_api endpoints health on scb1003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:55:05] RECOVERY - recommendation_api endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [21:58:33] PROBLEM - recommendation_api endpoints health on scb2003 is CRITICAL: 
/{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 500 (expecting: 404) https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:00:21] RECOVERY - recommendation_api endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:02:07] RECOVERY - recommendation_api endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [22:42:15] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 202239096 and 13 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:45:13] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 21747360 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:45:39] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [22:50:37] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 31811408 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:54:11] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 111280 and 28 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:54:11] RECOVERY - Postgres Replication Lag on maps1002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 118400 and 28 seconds 
https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:54:47] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 75408 and 65 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [23:31:23] Operations: Audit the WMF LDAP group and limit its permissions - https://phabricator.wikimedia.org/T240870 (Peachey88) Do we need a over-all wmf group at all? Would a group per service be better for a granularized access point of view and annual access auditing?