[00:02:07] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:06:49] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:22:07] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is CRITICAL: 52.78 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[00:28:25] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on alert1001 is OK: (C)60 le (W)70 le 71.84 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
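For readers skimming the Icinga output above: the Varnish traffic-drop check compares request volume now against 30 minutes earlier and alerts when too little traffic remains ("52.78 le 60" means the retained percentage fell to or below the critical threshold of 60; the recovery value 71.84 sits above both the warning and critical thresholds). A minimal Python sketch of that classification, with the 60/70 thresholds taken from the alert text and the ratio definition assumed, not taken from the production check:

# Illustrative sketch only: classifies a "traffic now vs. 30 minutes ago"
# percentage the way the alert output reads ("52.78 le 60" -> CRITICAL,
# "(C)60 le (W)70 le 71.84" -> OK). Thresholds and metric definition are
# assumptions based on the alert text, not the production check source.

def classify_traffic_drop(requests_now: float, requests_30m_ago: float,
                          warn_pct: float = 70.0, crit_pct: float = 60.0) -> str:
    """Return OK/WARNING/CRITICAL for the percentage of traffic retained."""
    if requests_30m_ago <= 0:
        return "UNKNOWN"            # no baseline to compare against
    retained_pct = 100.0 * requests_now / requests_30m_ago
    if retained_pct <= crit_pct:    # e.g. 52.78 le 60 -> CRITICAL
        return "CRITICAL"
    if retained_pct <= warn_pct:
        return "WARNING"
    return "OK"                     # e.g. 71.84 is above both thresholds

print(classify_traffic_drop(5278, 10000))   # CRITICAL
print(classify_traffic_drop(7184, 10000))   # OK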
[00:51:53] (03CR) 10Pppery: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/629738 (https://phabricator.wikimedia.org/T262218) (owner: 10HitomiAkane)
[01:09:08] I'm having trouble loading gerrit / wmf projects / phab / wmcloud tools, but the rest of my internet is fine. Is something going on? chaomodus
[01:16:29] DannyS712: https://wikitech-static.wikimedia.org/wiki/Reporting_a_connectivity_issue
[01:16:35] also, clinic duty is a business hours thing :)
[01:17:12] ah, sorry. Everyone is in a different timezone, so I wasn't sure. Anyway, it works for me now
[01:17:47] ok :)
[02:36:13] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:37:47] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:49:05] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[02:58:45] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:03:33] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:09:59] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:14:47] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:17:59] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:22:47] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:25:59] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:32:23] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[03:34:16] 10Operations, 10Wikimedia-General-or-Unknown: Special:ActiveUsers misses some active users on some(?) wikis - https://phabricator.wikimedia.org/T263931 (10Base)
[03:53:11] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:07:37] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:10:25] 10Operations, 10Packaging: Update php-xdebug to 2.7.2 in apt.wikimedia.org - https://phabricator.wikimedia.org/T263933 (10Tgr)
[04:22:05] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:28:29] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:30:05] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:34:53] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[04:57:19] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:02:09] PROBLEM - SSH on analytics1048 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:44:02] (03CR) 10Ryan Kemper: "Note I removed the 4s timeout since the corresponding ticket that prompted it to be included is no longer open. Thus the PCC changes showi" [puppet] - 10https://gerrit.wikimedia.org/r/629829 (https://phabricator.wikimedia.org/T263073) (owner: 10Ryan Kemper)
[06:36:37] !log powercycle analytics1048
[06:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:39:43] PROBLEM - Host analytics1048 is DOWN: PING CRITICAL - Packet loss = 100%
[06:39:49] RECOVERY - Host analytics1048 is UP: PING OK - Packet loss = 0%, RTA = 0.66 ms
[06:39:59] RECOVERY - SSH on analytics1048 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[06:51:26] (all good, hadoop worker up again)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200927T0700)
[08:12:41] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) is CRITICAL: Test Suggest source sections to translate returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/CX
[08:14:17] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[11:15:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_cxserver_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:16:45] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:42:25] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:43:59] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:56:53] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[11:58:29] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
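The recurring "Prometheus jobs reduced availability" alerts above fire when some targets of a scrape job stop reporting. A hedged sketch of how one might list the affected targets via the standard Prometheus HTTP API, assuming the check is driven by the per-job `up` metric; the server URL in the usage comment is a placeholder, not the production alerting host:

# Hedged sketch, not the production alert rule: query a Prometheus server for
# targets of a given job that are currently reporting up == 0.
import requests

def down_targets(prometheus_url: str, job: str) -> list:
    resp = requests.get(
        f"{prometheus_url}/api/v1/query",
        params={"query": f'up{{job="{job}"}} == 0'},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return [sample["metric"].get("instance", "?") for sample in result]

# Example (placeholder URL):
# print(down_targets("http://prometheus.example.org", "swagger_check_cxserver_cluster_codfw"))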
[13:13:04] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10hashar)
[13:31:17] 10Operations, 10Editing-team, 10MassMessage, 10WMF-JobQueue, 10Platform Team Workboards (Clinic Duty Team): Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (10Esanders) I've updated the list of maintainers: https://www.mediawiki.org/wiki/Special:Diff/4132328
[15:32:19] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /_info (retrieve service info) is CRITICAL: Test retrieve service info returned the unexpected status 503 (expecting: 200): /api (Zotero and citoid alive) is CRITICAL: Test Zotero and citoid alive returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Citoid
[15:33:46] (03PS3) 10Effie Mouzeli: WIP mcrouter: install onhost memcached on MediaWiki servers [puppet] - 10https://gerrit.wikimedia.org/r/629830 (https://phabricator.wikimedia.org/T244340)
[15:33:59] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[16:06:59] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - apifeatureusage-2020.06.30[0](2020-09-24T09:12:16.665Z), apifeatureusage-2020.07.12[0](2020-09-24T09:12:16.665Z), apifeatureusage-2020.07.18[0](2020-09-24T09:12:16.664Z), apifeatureusage-2020.07.04[0](2020-09-24T09:12:16.664Z), apifeatureusage-2020.07.19[0](2020-09-24T09:12:16.664Z), apifeatureusage-2020.07.01[0](2020-09-24T09:
[16:06:59] eatureusage-2020.08.03[0](2020-09-24T09:12:16.665Z), apifeatureusage-2020.07.26[0](2020-09-24T09:12:16.665Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[16:17:57] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - enwiki_general_1587198756[0](2020-09-24T14:34:37.564Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
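The two ElasticSearch alerts above list indices with unassigned shards. A hedged sketch of pulling the same information with Elasticsearch's standard _cat/shards API; the cluster URL in the usage comment is taken from the alert text, but TLS and authentication details are assumptions, and this is not the production check itself:

# Illustrative sketch of listing unassigned shards via the _cat/shards API.
import requests

def unassigned_shards(es_url: str) -> list:
    resp = requests.get(
        f"{es_url}/_cat/shards",
        params={"format": "json", "h": "index,shard,state,unassigned.reason"},
        timeout=10,
    )
    resp.raise_for_status()
    return [(s["index"], s.get("unassigned.reason", ""))
            for s in resp.json() if s["state"] == "UNASSIGNED"]

# Example (host:port from the alert, connection details assumed):
# for index, reason in unassigned_shards("https://search.svc.codfw.wmnet:9243"):
#     print(index, reason)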
[16:41:35] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[16:43:11] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:59:42] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10SD0001) I have a bot that queries ORES a lot (I believe most of its requests are probably being fulfilled from the cac...
[19:55:15] PROBLEM - SSH on ores1007.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:57:03] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[19:58:26] 10Operations, 10serviceops, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10jijiki)
[19:58:54] 10Operations, 10serviceops, 10User-jijiki: Test onhost memcached performance and functionality - https://phabricator.wikimedia.org/T263958 (10jijiki)
[19:58:59] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki)
[20:00:15] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:03:50] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) I've `sudo service uwsgi-ores restart` on all ORES200x boxes again. This fixes the problem temporarily (~24 ho...
[20:08:23] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:09:41] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) {F32365787} https://grafana.wikimedia.org/d/HIRrxQ6mk/ores?viewPanel=13&orgId=1&refresh=1m&from=now-7d&to=now-1m
[20:10:01] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:10:49] PROBLEM - Long running screen/tmux on mwdebug1001 is CRITICAL: CRIT: Long running SCREEN process. (user: jiji PID: 8843, 1732902s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[20:29:47] 10Operations, 10Machine Learning Platform, 10ORES, 10serviceops, 10Wikimedia-production-error: ORES redis: max number of clients reached... - https://phabricator.wikimedia.org/T263910 (10calbon) Any ideas on causes and solutions @Halfak?
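For the "max number of clients reached" symptom in T263910, one quick diagnostic is to compare Redis's connected client count against its configured maxclients limit. A hedged Python sketch of that idea (the host is a placeholder and this is an illustration, not the ORES team's actual tooling):

# Hedged sketch: report connected clients vs. the maxclients limit on a Redis
# instance, which is the headroom the "max number of clients reached" error
# exhausts. Host is a placeholder.
import redis

def redis_client_headroom(host: str, port: int = 6379):
    r = redis.Redis(host=host, port=port, socket_timeout=5, decode_responses=True)
    connected = int(r.info("clients")["connected_clients"])
    maxclients = int(r.config_get("maxclients")["maxclients"])
    return connected, maxclients

# Example (placeholder host):
# used, limit = redis_client_headroom("ores-redis.example.wmnet")
# print(f"{used}/{limit} clients in use")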
[20:43:47] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:45:25] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[20:55:43] RECOVERY - SSH on ores1007.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
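The ores1007.mgmt alert pair above is a plain SSH reachability probe that gives up after 10 seconds; the output wording matches the standard Nagios/Icinga check_ssh plugin. A rough Python approximation of that kind of probe (the real check is not this code, and the hostname in the usage comment is a placeholder):

# Minimal sketch of an SSH reachability probe: open a TCP connection to port 22
# and read the SSH banner within a deadline, mirroring the 10-second timeout in
# the alert text.
import socket

def ssh_banner(host: str, port: int = 22, timeout: float = 10.0) -> str:
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.settimeout(timeout)
        return sock.recv(256).decode("ascii", errors="replace").strip()

# Example (placeholder host):
# print(ssh_banner("ores1007.mgmt.example.wmnet"))   # e.g. "SSH-2.0-OpenSSH_7.0"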
[21:40:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[21:43:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:12:29] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:14:07] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:54:19] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[22:57:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:21:01] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 8405 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:21:37] PROBLEM - mcrouter process on mwdebug1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter
[23:24:13] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 1.004e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:29:03] RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 16 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:46:43] PROBLEM - MediaWiki memcached error rate on alert1001 is CRITICAL: 6870 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:47:19] RECOVERY - mcrouter process on mwdebug1001 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter https://wikitech.wikimedia.org/wiki/Mcrouter
[23:48:19] RECOVERY - MediaWiki memcached error rate on alert1001 is OK: (C)5000 gt (W)1000 gt 9 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:50:43] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[23:52:21] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
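The mcrouter and MediaWiki memcached error-rate alerts on mwdebug1001 at the end of the day appear to coincide with the on-host memcached testing tracked in T263958 and T244340, whose idea is to read from a machine-local memcached first and fall back to the shared cluster on a miss. A toy Python sketch of that read path; in production the routing is done by mcrouter rather than application code, and the hosts, ports, and TTL below are assumptions:

# Toy sketch of the "on-host memcached" read path from T244340: try the local
# instance, fall back to the shared pool, and warm the local copy on a miss.
from pymemcache.client.base import Client

local = Client(("127.0.0.1", 11211))                 # on-host instance (assumed port)
remote = Client(("mc-example.eqiad.wmnet", 11211))   # placeholder shared server

def cached_get(key: str):
    value = local.get(key)
    if value is not None:                 # local hit: no network round trip
        return value
    value = remote.get(key)
    if value is not None:                 # warm the local copy on a miss
        local.set(key, value, expire=60)  # short TTL limits staleness
    return value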