[00:02:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:06:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:22:07] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 52.78 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:28:25] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 71.84 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
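For readers skimming the Icinga output above: the Varnish traffic-drop check compares request volume now against 30 minutes earlier and alerts when too little traffic remains ("52.78 le 60" means the retained percentage fell to or below the critical threshold of 60; the recovery value 71.84 sits above both the warning and critical thresholds). A minimal Python sketch of that classification, with the 60/70 thresholds taken from the alert text and the ratio definition assumed, not taken from the production check:

# Illustrative sketch only: classifies a "traffic now vs. 30 minutes ago"
# percentage the way the alert output reads ("52.78 le 60" -> CRITICAL,
# "(C)60 le (W)70 le 71.84" -> OK). Thresholds and metric definition are
# assumptions based on the alert text, not the production check source.

def classify_traffic_drop(requests_now: float, requests_30m_ago: float,
                          warn_pct: float = 70.0, crit_pct: float = 60.0) -> str:
    """Return OK/WARNING/CRITICAL for the percentage of traffic retained."""
    if requests_30m_ago <= 0:
        return "UNKNOWN"            # no baseline to compare against
    retained_pct = 100.0 * requests_now / requests_30m_ago
    if retained_pct <= crit_pct:    # e.g. 52.78 le 60 -> CRITICAL
        return "CRITICAL"
    if retained_pct <= warn_pct:
        return "WARNING"
    return "OK"                     # e.g. 71.84 is above both thresholds

print(classify_traffic_drop(5278, 10000))   # CRITICAL
print(classify_traffic_drop(7184, 10000))   # OK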
[00:51:53] (03CR) 10Pppery: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) (owner: 10HitomiAkane)
[01:09:08] I'm having trouble loading gerrit / wmf projects / phab / wmcloud tools, but the rest of my internet is fine. Is something going on? chaomodus
[01:16:29] DannyS712: https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue
[01:16:35] also, clinic duty is a business hours thing :)
[01:17:12] ah, sorry. Everyone is in a different timezone, so I wasn't sure. Anyway, it works for me now
[01:17:47] ok :)
[02:36:13] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:37:47] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:49:05] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:58:45] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:03:33] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:09:59] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:14:47] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:17:59] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:22:47] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:25:59] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:32:23] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:34:16] 10Operations, 10Wikimedia-General-or-Unknown: Special:ActiveUsers misses some active users on some(?) wikis - https://phabricator.wikimedia.org/T263931 (10Base)
[03:53:11] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:07:37] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:10:25] 10Operations, 10Packaging: Update php-xdebug to 2.7.2 in apt.wikimedia.org - https://phabricator.wikimedia.org/T263933 (10Tgr)
[04:22:05] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:28:29] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:30:05] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:34:53] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:57:19] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:02:09] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:44:02] (03CR) 10Ryan Kemper: "Note I removed the 4s timeout since the corresponding ticket that prompted it to be included is no longer open. Thus the PCC changes showi" [puppet] - 10https://gerrit.wikimedia.org/r/629829 (https://phabricator.wikimedia.org/T263073) (owner: 10Ryan Kemper)
[06:36:37] !log powercycle analytics1048
[06:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:43] PROBLEM - Host analytics1048 is DOWN: PING CRITICAL - Packet loss = 100%
[06:39:49] RECOVERY - Host analytics1048 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms
[06:39:59] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:51:26] (all good, hadoop worker up again)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200927T0700)
[08:12:41] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[08:14:17] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[11:15:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:16:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:42:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:43:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:56:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:58:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
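The recurring "Prometheus jobs reduced availability" alerts above fire when some targets of a scrape job stop reporting. A hedged sketch of how one might list the affected targets via the standard Prometheus HTTP API, assuming the check is driven by the per-job `up` metric; the server URL in the usage comment is a placeholder, not the production alerting host:

# Hedged sketch, not the production alert rule: query a Prometheus server for
# targets of a given job that are currently reporting up == 0.
import requests

def down_targets(prometheus_url: str, job: str) -> list:
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": f'up{{job="{job}"}} == 0'},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [sample["metric"].get("instance", "?") for sample in result]

# Example (placeholder URL):
# print(down_targets("http://prometheus.example.org", "swagger_check_cxserver_cluster_codfw"))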
[13:13:04] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10hashar)
[13:31:17] 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Esanders) I've updated the list of maintainers: https://www.mediawiki.org/wiki/Special:Diff/4132328
[15:32:19] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[15:33:46] (03PS3) 10Effie Mouzeli: WIP mcrouter: install onhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340)
[15:33:59] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:06:59] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - apifeatureusage-2020.06.30[0](2020-09-24T09:12:16.665Z), apifeatureusage-2020.07.12[0](2020-09-24T09:12:16.665Z), apifeatureusage-2020.07.18[0](2020-09-24T09:12:16.664Z), apifeatureusage-2020.07.04[0](2020-09-24T09:12:16.664Z), apifeatureusage-2020.07.19[0](2020-09-24T09:12:16.664Z), apifeatureusage-2020.07.01[0](2020-09-24T09:
[16:06:59] eatureusage-2020.08.03[0](2020-09-24T09:12:16.665Z), apifeatureusage-2020.07.26[0](2020-09-24T09:12:16.665Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:17:57] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - enwiki_general_1587198756[0](2020-09-24T14:34:37.564Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
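The two ElasticSearch alerts above list indices with unassigned shards. A hedged sketch of pulling the same information with Elasticsearch's standard _cat/shards API; the cluster URL in the usage comment is taken from the alert text, but TLS and authentication details are assumptions, and this is not the production check itself:

# Illustrative sketch of listing unassigned shards via the _cat/shards API.
import requests

def unassigned_shards(es_url: str) -> list:
    resp = requests.get(
        f"{es_url}/_cat/shards",
        params={"format": "json", "h": "index,shard,state,unassigned.reason"},
        timeout=10,
    )
    resp.raise_for_status()
    return [(s["index"], s.get("unassigned.reason", ""))
            for s in resp.json() if s["state"] == "UNASSIGNED"]

# Example (host:port from the alert, connection details assumed):
# for index, reason in unassigned_shards("https://search.svc.codfw.wmnet:9243"):
#     print(index, reason)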
[16:41:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:43:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:59:42] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10SD0001) I have a bot that queries ORES a lot (I believe most of its requests are probably being fulfilled from the cac...
[19:55:15] PROBLEM - SSH on ores1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:57:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:58:26] 10Operations, 10serviceops, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10jijiki)
[19:58:54] 10Operations, 10serviceops, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10jijiki)
[19:58:59] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki)
[20:00:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:03:50] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) I've `sudo service uwsgi-ores restart` on all ORES200x boxes again. This fixes the problem temporarily (~24 ho...
[20:08:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:09:41] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) {F32365787} https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=13&orgId=1&refresh=1m&from=now-7d&to=now-1m
[20:10:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:10:49] PROBLEM - Long running screen/tmux on mwdebug1001 is CRITICAL: CRIT: Long running SCREEN process. (user: jiji PID: 8843, 1732902s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[20:29:47] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) Any ideas on causes and solutions @Halfak?
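For the "max number of clients reached" symptom in T263910, one quick diagnostic is to compare Redis's connected client count against its configured maxclients limit. A hedged Python sketch of that idea (the host is a placeholder and this is an illustration, not the ORES team's actual tooling):

# Hedged sketch: report connected clients vs. the maxclients limit on a Redis
# instance, which is the headroom the "max number of clients reached" error
# exhausts. Host is a placeholder.
import redis

def redis_client_headroom(host: str, port: int = 6379):
    r = redis.Redis(host=host, port=port, socket_timeout=5, decode_responses=True)
    connected = int(r.info("clients")["connected_clients"])
    maxclients = int(r.config_get("maxclients")["maxclients"])
    return connected, maxclients

# Example (placeholder host):
# used, limit = redis_client_headroom("ores-redis.example.wmnet")
# print(f"{used}/{limit} clients in use")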
[20:43:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:45:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:55:43] RECOVERY - SSH on ores1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
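The ores1007.mgmt alert pair above is a plain SSH reachability probe that gives up after 10 seconds; the output wording matches the standard Nagios/Icinga check_ssh plugin. A rough Python approximation of that kind of probe (the real check is not this code, and the hostname in the usage comment is a placeholder):

# Minimal sketch of an SSH reachability probe: open a TCP connection to port 22
# and read the SSH banner within a deadline, mirroring the 10-second timeout in
# the alert text.
import socket

def ssh_banner(host: str, port: int = 22, timeout: float = 10.0) -> str:
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        return sock.recv(256).decode("ascii", errors="replace").strip()

# Example (placeholder host):
# print(ssh_banner("ores1007.mgmt.example.wmnet"))   # e.g. "SSH-2.0-OpenSSH_7.0"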
[21:40:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:43:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:12:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:14:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:54:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:57:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:21:01] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 8405 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:21:37] PROBLEM - mcrouter process on mwdebug1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter
[23:24:13] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 1.004e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:29:03] RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 16 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:46:43] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 6870 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:47:19] RECOVERY - mcrouter process on mwdebug1001 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter
[23:48:19] RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 9 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:50:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:52:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
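The mcrouter and MediaWiki memcached error-rate alerts on mwdebug1001 at the end of the day appear to coincide with the on-host memcached testing tracked in T263958 and T244340, whose idea is to read from a machine-local memcached first and fall back to the shared cluster on a miss. A toy Python sketch of that read path; in production the routing is done by mcrouter rather than application code, and the hosts, ports, and TTL below are assumptions:

# Toy sketch of the "on-host memcached" read path from T244340: try the local
# instance, fall back to the shared pool, and warm the local copy on a miss.
from pymemcache.client.base import Client

local = Client(("127.0.0.1", 11211))                 # on-host instance (assumed port)
remote = Client(("mc-example.eqiad.wmnet", 11211))   # placeholder shared server

def cached_get(key: str):
    value = local.get(key)
    if value is not None:                 # local hit: no network round trip
        return value
    value = remote.get(key)
    if value is not None:                 # warm the local copy on a miss
        local.set(key, value, expire=60)  # short TTL limits staleness
    return value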