[00:03:48] PROBLEM - Check systemd state on netflow3001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:54] PROBLEM - Check systemd state on netflow4001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:03:58] PROBLEM - Check systemd state on netflow2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:50] PROBLEM - Check systemd state on netflow5001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:22] PROBLEM - Check systemd state on netflow1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:57:32] Operations, Security-Team, Wikimedia-Mailing-lists: Have a conversation about migrating from GNU Mailman 2.1 to GNU Mailman 3.0 - https://phabricator.wikimedia.org/T52864 (Ladsgroup) >>! In T52864#6242301, @Tgr wrote: >>>! In T52864#6242222, @Ladsgroup wrote: >> The only thing is that with disabling...
[01:03:54] (CR) Ladsgroup: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/606464 (owner: Ladsgroup)
[02:30:24] Operations, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Peachey88)
[02:30:52] Operations, ops-eqiad, DBA: Degraded RAID on db1088 - https://phabricator.wikimedia.org/T255928 (Peachey88)
[03:00:04] RECOVERY - Check systemd state on an-launcher1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:18:22] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 148.5 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[03:49:30] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:04] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:48] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 69.15 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200621T0700)
[08:57:46] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[08:59:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[10:27:24] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 138.3 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[11:55:52] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 75.25 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[12:14:52] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:16:40] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[12:18:15] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/suggest/sections/{title}/{from}/{to} (Suggest source sections to translate) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[12:19:54] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[15:41:44] Operations, ops-eqiad, DBA: Degraded RAID on db1088 - https://phabricator.wikimedia.org/T255928 (Zoranzoki21)
[15:41:47] Operations, DBA: db1088 crashed - https://phabricator.wikimedia.org/T255927 (Zoranzoki21)
[16:07:08] PROBLEM - Maps tiles generation on icinga1001 is CRITICAL: CRITICAL: 100.00% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[16:36:27] Operations, Wikimedia-Mailing-lists: Password reset for wikimediaindia-l mailing list - https://phabricator.wikimedia.org/T255910 (anirudhsbh) I have reached out to hpnadig and he asked me to follow up with the WMF technical team to have the password reset. The other two admins have not responded.
[16:36:40] (PS1) Ladsgroup: meet: Add /etc/meet-auth to store the configs and data [puppet] - https://gerrit.wikimedia.org/r/606824
[16:37:28] (Abandoned) Ladsgroup: meet: Change owner of account manager code to www-data [puppet] - https://gerrit.wikimedia.org/r/606464 (owner: Ladsgroup)
[16:37:48] (CR) jerkins-bot: [V: -1] meet: Add /etc/meet-auth to store the configs and data [puppet] - https://gerrit.wikimedia.org/r/606824 (owner: Ladsgroup)
[16:38:26] Operations, Wikimedia-Mailing-lists: Password reset for wikimediaindia-l mailing list - https://phabricator.wikimedia.org/T255910 (anirudhsbh) Hello, I have an update... one of the old passwords started working. Sorry about the confusion!
[16:38:55] (PS2) Ladsgroup: meet: Add /etc/meet-auth to store the configs and data [puppet] - https://gerrit.wikimedia.org/r/606824
[16:39:26] Operations, Wikimedia-Mailing-lists: Password reset for wikimediaindia-l mailing list - https://phabricator.wikimedia.org/T255910 (RhinosF1) Stalled→Invalid >>! In T255910#6242788, @anirudhsbh wrote: > Hello, I have an update... one of the old passwords started working. Sorry about the confusion!...
[17:13:10] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:14:56] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[17:44:31] (CR) Andrew Bogott: [C: +1] cloud nfs: only run nfs-exportd on the current active node (1 comment) [puppet] - https://gerrit.wikimedia.org/r/606543 (https://phabricator.wikimedia.org/T253353) (owner: Bstorm)
[17:45:28] (CR) Andrew Bogott: [C: +1] unattendedupgrades: allow configurable kernel cleanup [puppet] - https://gerrit.wikimedia.org/r/606234 (https://phabricator.wikimedia.org/T127374) (owner: Bstorm)
[17:57:20] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 155.6 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[18:23:28] Operations, Wikimedia-Mailing-lists: Creation of mailinglist for Board of WUG Esperanto and Free Knowledge - https://phabricator.wikimedia.org/T255951 (KuboF)
[18:57:08] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 62.03 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[19:09:56] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 103.7 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[19:45:50] PROBLEM - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is CRITICAL: 106.8 gt 100 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[21:03:28] RECOVERY - Rate of JVM GC Old generation-s runs - elastic1052-production-search-psi-eqiad on elastic1052 is OK: (C)100 gt (W)80 gt 26.44 https://wikitech.wikimedia.org/wiki/Search%23Using_jstack_or_jmap_or_other_similar_tools_to_view_logs https://grafana.wikimedia.org/d/000000462/elasticsearch-memory?orgId=1&var-exported_cluster=production-search-psi-eqiad&var-instance=elastic1052&panelId=37
[21:54:34] (PS2) QChris: gerrit: Add option to mark gerrit servers as upgraded [puppet] - https://gerrit.wikimedia.org/r/606530 (https://phabricator.wikimedia.org/T254158)
[21:54:36] (PS4) QChris: gerrit: Mark gerrit1002 (gerrit-test) as upgraded [puppet] - https://gerrit.wikimedia.org/r/606531 (https://phabricator.wikimedia.org/T254158)
[21:54:38] (PS3) QChris: gerrit: Add dedicated home dir for new Gerrit version [puppet] - https://gerrit.wikimedia.org/r/606532 (https://phabricator.wikimedia.org/T254158)
[21:54:40] (PS3) QChris: gerrit: Stop setting up a database for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606536 (https://phabricator.wikimedia.org/T254158)
[21:54:42] (PS4) QChris: gerrit: Drop its configuration for draft changes for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606533 (https://phabricator.wikimedia.org/T254158)
[21:54:44] (PS2) QChris: gerrit: Update its-phabricator templates for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606781 (https://phabricator.wikimedia.org/T254158)
[21:54:46] (PS2) QChris: gerrit: Update email templates for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606782 (https://phabricator.wikimedia.org/T254158)
[21:54:48] (PS2) QChris: gerrit: Drop empty unused Git config file [puppet] - https://gerrit.wikimedia.org/r/606783 (https://phabricator.wikimedia.org/T254158)
[21:54:50] (PS3) QChris: gerrit: Enable git protocol v2 on new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606784 (https://phabricator.wikimedia.org/T254158)
[21:54:52] (PS2) QChris: gerrit: Allow to use request tracing for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606795 (https://phabricator.wikimedia.org/T254158)
[21:54:54] (PS2) QChris: gerrit: Do not enable the ability to move changes for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606796 (https://phabricator.wikimedia.org/T254158)
[21:54:56] (PS15) QChris: Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - https://gerrit.wikimedia.org/r/520295 (https://phabricator.wikimedia.org/T254648) (owner: Paladox)
[21:54:58] (PS8) QChris: Gerrit: Migrate theme to support Polymer 2 [puppet] - https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: Paladox)
[21:55:00] (PS1) QChris: gerrit: Use `replica` instead of `slave` for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606839 (https://phabricator.wikimedia.org/T254158)
[21:55:02] (PS1) QChris: gerrit: Remove old Polymer <2 styles for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606840 (https://phabricator.wikimedia.org/T227509)
[21:55:04] (PS1) QChris: gerrit: Switch header styling for new Gerrits from component to style [puppet] - https://gerrit.wikimedia.org/r/606841 (https://phabricator.wikimedia.org/T227509)
[21:55:06] (PS1) QChris: gerrit: Use colored header bar also in dark theme for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606842 (https://phabricator.wikimedia.org/T227509)
[21:55:08] (PS1) QChris: gerrit: Have a proper light and dark style for new Gerrits [puppet] - https://gerrit.wikimedia.org/r/606843 (https://phabricator.wikimedia.org/T227509)
[21:56:55] (CR) QChris: [C: +1] Gerrit: Convert CoC and Privacy links to use the new PolyGerrit extension point [puppet] - https://gerrit.wikimedia.org/r/520295 (https://phabricator.wikimedia.org/T254648) (owner: Paladox)
[21:57:01] (CR) QChris: [C: +1] Gerrit: Migrate theme to support Polymer 2 [puppet] - https://gerrit.wikimedia.org/r/539180 (https://phabricator.wikimedia.org/T227509) (owner: Paladox)
[21:57:28] (CR) QChris: "This theme is now online on https://gerrit-test.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/606843 (https://phabricator.wikimedia.org/T227509) (owner: QChris)
[22:09:34] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [150.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:31:04] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[23:01:48] PROBLEM - Maps - OSM synchronization lag - codfw on icinga1001 is CRITICAL: 2.593e+05 ge 2.592e+05 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=12&fullscreen&orgId=1