[00:09:05] PROBLEM - Number of messages locally queued by purged for processing on cp2037 is CRITICAL: cluster=cache_text instance=cp2037 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037
[00:09:45] PROBLEM - Number of messages locally queued by purged for processing on cp2033 is CRITICAL: cluster=cache_text instance=cp2033 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033
[00:10:05] PROBLEM - Number of messages locally queued by purged for processing on cp1079 is CRITICAL: cluster=cache_text instance=cp1079 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079
[00:10:25] PROBLEM - Number of messages locally queued by purged for processing on cp5009 is CRITICAL: cluster=cache_text instance=cp5009 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009
[00:10:33] PROBLEM - Number of messages locally queued by purged for processing on cp2035 is CRITICAL: cluster=cache_text instance=cp2035 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035
[00:11:31] PROBLEM - Number of messages locally queued by purged for processing on cp3050 is CRITICAL: cluster=cache_text instance=cp3050 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050
[00:14:13] RECOVERY - Number of messages locally queued by purged for processing on cp5009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009
[00:15:21] RECOVERY - Number of messages locally queued by purged for processing on cp3050 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050
[00:15:47] RECOVERY - Number of messages locally queued by purged for processing on cp1079 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079
[00:16:15] RECOVERY - Number of messages locally queued by purged for processing on cp2035 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035
[00:17:21] RECOVERY - Number of messages locally queued by purged for processing on cp2033 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033
[00:18:37] RECOVERY - Number of messages locally queued by purged for processing on cp2037 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037
[02:07:17] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.8 [core] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/625745
[02:18:21] (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.8 [core] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/625745 (https://phabricator.wikimedia.org/T257976) (owner: TrainBranchBot)
[03:02:17] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=webperf_arclamp site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:07:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:17:31] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul) @Marostegui the Next time you have this problem, open the first 1GB NIC and change the setting from None to PXE and do t...
[03:47:11] (PS1) Gergő Tisza: Disable event logging in MediaViewer [mediawiki-config] - https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582)
[04:15:05] (CR) Nuria: [C: +1] "Thanks for cleaning up" [mediawiki-config] - https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: Gergő Tisza)
[04:37:42] (PS5) Andrew Bogott: Nova/Neutron: set dhcp_domain to eqiad1.wikimedia.cloud [puppet] - https://gerrit.wikimedia.org/r/620936 (https://phabricator.wikimedia.org/T260614)
[04:37:44] (PS1) Andrew Bogott: Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081)
[04:38:08] (PS2) Andrew Bogott: Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081)
[04:38:19] (CR) jerkins-bot: [V: -1] Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081) (owner: Andrew Bogott)
[04:38:36] (CR) jerkins-bot: [V: -1] Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081) (owner: Andrew Bogott)
[04:39:14] (PS1) Andrew Bogott: wmcs-novastats-capacity: reformat with black [puppet] - https://gerrit.wikimedia.org/r/625773
[04:40:36] (Abandoned) Andrew Bogott: wmcs-novastats-capacity: reformat with black [puppet] - https://gerrit.wikimedia.org/r/625773 (owner: Andrew Bogott)
[04:40:38] (PS3) Andrew Bogott: Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081)
[04:45:21] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:47:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:15:11] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es2026.codfw.wmnet...
[05:23:28] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Marostegui) >>! In T260373#6441588, @Papaul wrote: > @Marostegui the Next time you have this problem, > > open the first 1GB NIC...
[05:23:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:32:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[05:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:26] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[05:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:55:14] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2026.codfw.wmnet'] ` and were **ALL** successful.
[05:56:49] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Marostegui) es2026 got installed correctly: ` root@es2026:~# free -g ; df -hT /srv total used free...
[06:04:41] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Marostegui) I have given it most of the vg remaining size: ` root@es2026:~# pvs PV VG Fmt Attr PSize PFree /dev/...
[06:05:35] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Marostegui)
[06:14:43] !log Stop MySQL on db1106 for PDU maintenance T261452
[06:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:49] T261452: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452
[06:18:15] (PS1) Marostegui: install_server: Do not reimage es2026 [puppet] - https://gerrit.wikimedia.org/r/625778
[06:18:52] (CR) Marostegui: [C: +2] install_server: Do not reimage es2026 [puppet] - https://gerrit.wikimedia.org/r/625778 (owner: Marostegui)
[06:21:20] (PS1) Legoktm: [DNM] Dummy change to test CI [puppet] - https://gerrit.wikimedia.org/r/625779
[06:21:49] Operations, ops-eqiad, Discovery-Search (Current work), Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (elukey) @RKemper not sure how familiar are you with the magical world of serial consoles, I'll add a few links and then y...
[06:22:46] (PS2) Legoktm: [DNM] Dummy change to test CI [puppet] - https://gerrit.wikimedia.org/r/625779
[06:23:19] !log roll restart of Hadoop master daemons on an-master100[1,2] to pick up new openjdk settings
[06:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:58] (CR) ZPapierski: Multiple instances of msearch_daemon (8 comments) [puppet] - https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: ZPapierski)
[06:25:59] (PS3) Legoktm: build: Use at least commit-message-validator 0.7.0 [puppet] - https://gerrit.wikimedia.org/r/625779 (https://phabricator.wikimedia.org/T166066)
[06:29:33] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[06:31:27] !log Deploy schema change on s5 eqiad master - T253276
[06:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:34] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276
[06:37:27] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe)
[06:38:26] (CR) Jforrester: [C: +1] Disable event logging in MediaViewer [mediawiki-config] - https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: Gergő Tisza)
[06:40:55] (CR) Giuseppe Lavagetto: [C: +2] mobileapps: use the service proxy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876) (owner: Giuseppe Lavagetto)
[06:41:05] (CR) jerkins-bot: [V: -1] mobileapps: use the service proxy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876) (owner: Giuseppe Lavagetto)
[06:42:54] (PS3) Giuseppe Lavagetto: mobileapps: use the service proxy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876)
[06:43:10] (CR) Giuseppe Lavagetto: [V: +2 C: +2] mobileapps: use the service proxy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876) (owner: Giuseppe Lavagetto)
[06:44:09] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe)
[06:47:16] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[06:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:42] (PS1) ArielGlenn: update dumps web pages to note that recent versions of windows 7zip work [puppet] - https://gerrit.wikimedia.org/r/625780 (https://phabricator.wikimedia.org/T208647)
[06:50:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 for PDU maintenance', diff saved to https://phabricator.wikimedia.org/P12513 and previous config saved to /var/cache/conftool/dbconfig/20200908-065022-marostegui.json
[06:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:52] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[06:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:31] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[06:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:56] !log Deploy schema change on s2 eqiad master - T253276
[06:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:01] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276
[06:59:41] (PS3) Muehlenhoff: Remove now obsolete cas-graphite and cas-icinga DNS entries [dns] - https://gerrit.wikimedia.org/r/625635
[07:00:21] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[07:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:37] (CR) Muehlenhoff: [C: +2] Remove now obsolete cas-graphite and cas-icinga DNS entries [dns] - https://gerrit.wikimedia.org/r/625635 (owner: Muehlenhoff)
[07:14:31] (PS2) Muehlenhoff: reboot-groups (WIP) [cookbooks] - https://gerrit.wikimedia.org/r/625597
[07:15:29] (CR) jerkins-bot: [V: -1] reboot-groups (WIP) [cookbooks] - https://gerrit.wikimedia.org/r/625597 (owner: Muehlenhoff)
[07:20:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:26:02] (PS1) Elukey: Add cookbook to restart Hadoop master daemons. [cookbooks] - https://gerrit.wikimedia.org/r/625782
[07:26:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:26:31] (CR) Hashar: [C: +1] build: Use at least commit-message-validator 0.7.0 [puppet] - https://gerrit.wikimedia.org/r/625779 (https://phabricator.wikimedia.org/T166066) (owner: Legoktm)
[07:27:31] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (JMeybohm)
[07:30:38] (PS2) Elukey: Add cookbook to restart Hadoop master daemons. [cookbooks] - https://gerrit.wikimedia.org/r/625782
[07:37:17] (PS3) Muehlenhoff: reboot-groups (WIP) [cookbooks] - https://gerrit.wikimedia.org/r/625597
[07:37:52] (CR) Muehlenhoff: [C: +2] Retire the HTTP listener for debmonitor (along with ferm rules) [puppet] - https://gerrit.wikimedia.org/r/625658 (owner: Muehlenhoff)
[07:40:04] !log move HE from ix to transit BGP group on cr3-eqsin
[07:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:49] (CR) Elukey: "Pcc for the search-loader vms: https://puppet-compiler.wmflabs.org/compiler1002/24989/" [puppet] - https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: ZPapierski)
[07:42:03] jouncebot: now
[07:42:04] No deployments scheduled for the next 3 hour(s) and 17 minute(s)
[07:44:34] !log roll restart kafka daemons on kafka-jumbo100[7-9] to pick up openjdk upgrades
[07:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:43] PROBLEM - debmonitor.wikimedia.org:80 on debmonitor2002 is CRITICAL: connect to address 10.192.32.42 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor
[07:44:49] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Revert "Update T250887 mitigations" (T250887; T262242) (duration: 00m 59s)
[07:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:56] T262242: Save Timing regression on 2020-09-07 at 18:04 UTC - https://phabricator.wikimedia.org/T262242
[07:46:40] (CR) Alexandros Kosiaris: [C: +1] push-notifications: add proxy settings [deployment-charts] - https://gerrit.wikimedia.org/r/625709 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[07:50:18] PROBLEM - debmonitor.wikimedia.org:80 on debmonitor1002 is CRITICAL: connect to address 10.64.16.72 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor
[07:51:11] ^ expected, will be fixed with next Puppet run on icinga1001
[07:52:01] (CR) JMeybohm: [C: +1] push-notifications: add proxy settings [deployment-charts] - https://gerrit.wikimedia.org/r/625709 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[07:52:37] Operations, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services), git-protocol-v2: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 (hashar)
[07:58:12] PROBLEM - debmonitor.wikimedia.org:80 on debmonitor2001 is CRITICAL: connect to address 10.192.0.14 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor
[08:03:28] (PS10) JMeybohm: Switch eqiad calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[08:06:03] (CR) Alexandros Kosiaris: [C: +1] Switch eqiad calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[08:08:18] PROBLEM - debmonitor.wikimedia.org:80 on debmonitor1001 is CRITICAL: connect to address 10.64.32.62 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor
[08:11:15] (PS4) Jbond: build: Use at least commit-message-validator 0.7.0 [puppet] - https://gerrit.wikimedia.org/r/625779 (https://phabricator.wikimedia.org/T166066) (owner: Legoktm)
[08:12:02] Operations, Traffic: Create a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492 (ayounsi) 👍
[08:13:08] (CR) Jbond: [C: +2] "LGTM, merging thanks" [puppet] - https://gerrit.wikimedia.org/r/625779 (https://phabricator.wikimedia.org/T166066) (owner: Legoktm)
[08:16:29] !log installing 4.19.132 kernel on buster systems (only installing the deb, reboots separately)
[08:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:18] Operations, Puppet, Release-Engineering-Team-TODO, puppet-compiler, and 2 others: Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (jbond) >>! In T166066#5087807, @jbond wrote: >> In addition, Jenkins doesn't seem to like having more than Change-...
[08:20:50] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-main,name=eqiad
[08:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:46] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=restbase-async,name=codfw
[08:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:11] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad
[08:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:05] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=blubberoid,name=eqiad
[08:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:08] (PS1) Jgiannelos: Enable OpenAPI spec on push-notifications service [deployment-charts] - https://gerrit.wikimedia.org/r/625832 (https://phabricator.wikimedia.org/T261635)
[08:35:10] (PS1) Volans: cr: update debmonitor IPs in firewall rules [homer/public] - https://gerrit.wikimedia.org/r/625833 (https://phabricator.wikimedia.org/T261489)
[08:35:38] XioNoX: if you're around and have a second for a quick review ^^^ :)
[08:35:54] volans: what are you pointing to?
[08:35:56] (I know, you probably don't see it)
[08:36:00] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/625833 :D
[08:36:10] !!!
[08:36:25] (CR) Muehlenhoff: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/625833 (https://phabricator.wikimedia.org/T261489) (owner: Volans)
[08:37:33] (CR) Ayounsi: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/625833 (https://phabricator.wikimedia.org/T261489) (owner: Volans)
[08:37:39] volans: all good!
[08:37:57] ack, should I run it on all CRs or just eqiad/codfw?
[08:38:25] (CR) Volans: [C: +2] cr: update debmonitor IPs in firewall rules [homer/public] - https://gerrit.wikimedia.org/r/625833 (https://phabricator.wikimedia.org/T261489) (owner: Volans)
[08:39:13] (Merged) jenkins-bot: cr: update debmonitor IPs in firewall rules [homer/public] - https://gerrit.wikimedia.org/r/625833 (https://phabricator.wikimedia.org/T261489) (owner: Volans)
[08:39:44] XioNoX: also thanks for the super quick review :)
[08:41:06] volans: eqiad should be enough as it's for analytics iirc
[08:41:18] and cloud
[08:43:28] indeed, homer 'cr*codfw*' diff has no diffs
[08:45:01] !log running homer 'cr*eqiad*' commit "Update debmonitor IPs, T261489"
[08:45:01] thanks!
[08:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:07] T261489: Upgrade debmonitor to Buster - https://phabricator.wikimedia.org/T261489
[08:45:30] Operations, ops-codfw, DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Marostegui)
[08:46:05] Operations, ops-codfw, DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Marostegui) p:Triage→Medium
[08:46:44] XioNoX: I am wondering if we could create a script to check IPs in filters and spot the ones without a corresponding A/AAAA/PTR/etc.. set of records
[08:46:55] (CR) Jbond: [C: +2] pki: add vhosts for pki and ocsp which will proxy to the backend cfssl [puppet] - https://gerrit.wikimedia.org/r/625708 (https://phabricator.wikimedia.org/T259117) (owner: Jbond)
[08:47:18] if analytics is the only segment of the network affected I can create a custom one
[08:47:31] but every time that I check that list I find super stale things
[08:47:42] and people tend to forget about the vlan rules etc..
[08:48:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce db2127's weight', diff saved to https://phabricator.wikimedia.org/P12514 and previous config saved to /var/cache/conftool/dbconfig/20200908-084834-marostegui.json
[08:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:25] elukey: you mean in CI? it wouldn't have caught this one as the old hosts are still there for now
[08:49:30] they will be decommissioned in the next few days
[08:49:46] no no, I mean a regular icinga alert
[08:50:41] in this case: say you forgot about the analytics filters, and you decommed debmonitor1001 - I'd get an alert for the stale IP in the analytics filter right after it
[08:51:41] we already convert IPs in the config into ipaddress objects, to ensure they are valid IPs and simplify their usage; I'm wondering if we could add a DNS resolution there too
[08:51:56] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1008.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled: blubberoid-https_4666: Servers kubernetes1008.eqiad.wmnet, kubernet
[08:51:56] t, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled: api-gateway_8087: Servers kubernetes1001.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes101
[08:51:56] marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:52:01] given that we have the twice-a-day check that compares the live config against the repo
[08:52:12] akosiaris: expected I guess? %%%
[08:52:14] *^^^
[08:52:26] (PS1) Jbond: profile::pki: fix apache config [puppet] - https://gerrit.wikimedia.org/r/625837 (https://phabricator.wikimedia.org/T259117)
[08:52:28] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled: blubberoid-https_4666: Servers kubernetes1008.eqiad.wmnet, kubernet
[08:52:28] t, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled: api-gateway_8087: Servers kubernetes1008.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes101
[08:52:28] marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:52:43] volans: ah that would be great, I'd only need something that tells me about stale entries. It would be great for my mental sanity :D
[08:53:02] this is me
[08:53:15] volans: yes
[08:53:32] I'll schedule downtime for that as well
[08:53:43] (CR) Jbond: [C: +2] profile::pki: fix apache config [puppet] - https://gerrit.wikimedia.org/r/625837 (https://phabricator.wikimedia.org/T259117) (owner: Jbond)
[08:53:50] elukey: the only use case we need to solve is when running homer from a local env, so we might have to force the resolver to be one of our public NSes
[08:54:20] elukey: care to open a task about it? we can discuss there with arzhel and decide what to do
[08:54:32] volans: https://gerrit.wikimedia.org/r/c/operations/debs/wmf-sre-laptop/+/614787 ;)
[08:55:03] jbond42: lol, yeah but can't assume everyone has that :D
[08:55:18] yes was mostly in jest :)
[08:55:36] * jbond42 is reminded volans uses a mac
[08:55:43] !log Deploy schema change on s7 eqiad master - T253276
[08:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:48] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276
[08:56:47] ACKNOWLEDGEMENT - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1008.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled: blubberoid-https_4666: Servers kubernetes1008.eqiad.wmnet,
[08:56:47] iad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled: api-gateway_8087: Servers kubernetes1001.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kuber
[08:56:47] mnet are marked down but pooled alexandros kosiaris kubernetes etcd upgrade https://wikitech.wikimedia.org/wiki/PyBal
[08:56:47] ACKNOWLEDGEMENT - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled: blubberoid-https_4666: Servers kubernetes1008.eqiad.wmnet,
[08:56:47] iad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled: api-gateway_8087: Servers kubernetes1008.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kuber
[08:56:47] mnet are marked down but pooled alexandros kosiaris kubernetes etcd upgrade https://wikitech.wikimedia.org/wiki/PyBal
[08:57:00] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:57:32] volans, elukey, I think the answer would be to use a tool to manage the ACLs. I tested and liked capirca, but it has some downsides, like a new format for defining rules
[08:57:33] (PS1) Giuseppe Lavagetto: default-network-policy: allow restbase HTTPS port [deployment-charts] - https://gerrit.wikimedia.org/r/625839 (https://phabricator.wikimedia.org/T244843)
[08:59:36] XioNoX: mmm not sure, if people change hosts/DNS and we don't run the tool for a while, we don't get any hint that things are stale, no?
[09:00:48] (03PS1) 10Muehlenhoff: Remove debmonitor1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/625840 (https://phabricator.wikimedia.org/T261489) [09:00:53] (03PS1) 10Muehlenhoff: Remove debmonitor1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/625841 (https://phabricator.wikimedia.org/T261489) [09:01:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_cxserver_cluster_eqiad,swagger_check_echostore_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:02:42] this is a good example https://gerrit.wikimedia.org/r/625840 - the ips were in the analytics filters, and if Riccardo didn't update the term they'd become stale soon. Adding a quick icinga check that runs twice a day for example would have raised an alert to analytics [09:03:48] elukey: but the tool would run each day with the daily diffs [09:03:57] (just some thoughts of course) [09:03:58] prom alert is us as well akosiaris...but thats probably bad do ack/downtime [09:04:47] XioNoX: ah ok I didn't get this part, yes that would work too [09:07:08] elukey: anyway, please open a task, that's a usecase of managing/checking ACLs with a tool that I didn't think about [09:08:48] XioNoX: yes I know you don't want to talk with me, I'll open a task :D [09:10:06] elukey: I already exceeded my monthly allowed words with you [09:10:15] :) [09:10:57] (which was 0) [09:15:38] ahhahha [09:15:40] (03PS1) 10Marostegui: mariadb: Productionize es2026 [puppet] - 10https://gerrit.wikimedia.org/r/625842 (https://phabricator.wikimedia.org/T261717) [09:15:51] what a nice working environment [09:15:53] :D [09:17:21] XioNoX: it's possible that define in homer controls only the analytics side of it and not the cloud one? 
(cc moritzm ) [09:17:35] I still see failures on cloudvirt hosts for example [09:18:17] volans: I don't understand [09:18:44] cloudvirt hosts can't connect to the new debmonitor hosts [12]002 [09:19:06] and the rest of cloud-related hosts AFAICT [09:19:08] (03Restored) 10Jcrespo: profile::backup: remove helium from ferm directors [puppet] - 10https://gerrit.wikimedia.org/r/621042 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [09:19:12] (03PS1) 10Vgutierrez: 1.8: Bump version number [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/625843 (https://phabricator.wikimedia.org/T261632) [09:19:21] (03CR) 10Jcrespo: [C: 03+1] profile::backup: remove helium from ferm directors [puppet] - 10https://gerrit.wikimedia.org/r/621042 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [09:19:35] I was hoping that the homer's block was managing both :) [09:20:40] !log disabling puppet on argon.eqiad.wmnet,chlorine.eqiad.wmnet,kubernetes[1001-1016].eqiad.wmnet - Reinitialize eqiad k8s cluster with new etcd - T239835 [09:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:47] T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 [09:21:05] volans: I see what you mean, what you changed was only for analytics, not sure if there are similar rules for cloud [09:21:35] finishing up what I'm doing then I can have a look [09:22:06] oh, my bad, found them [09:22:17] yeah, it's still failing on cloudcephosd e.g.
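The staleness check elukey proposes above (an icinga check that compares IPs hardcoded in router ACL terms against what the hostnames currently resolve to) boils down to a set difference. A minimal sketch, assuming the ACL IPs have already been extracted from the Homer config — hostnames and IPs in any real run would come from that config, nothing here is the actual filter content:

```python
import socket

def resolve(hostnames):
    """Best-effort resolution of each hostname to its current A/AAAA records."""
    ips = set()
    for host in hostnames:
        try:
            for info in socket.getaddrinfo(host, None):
                ips.add(info[4][0])
        except socket.gaierror:
            # Host no longer exists in DNS: any ACL entry for it is stale.
            pass
    return ips

def acl_drift(acl_ips, current_ips):
    """Compare IPs hardcoded in an ACL term with what hosts resolve to now.

    Returns entries that should be removed ('stale') and entries the ACL
    is missing ('missing'); both empty means the term is up to date.
    """
    acl, current = set(acl_ips), set(current_ips)
    return {'stale': acl - current, 'missing': current - acl}
```

A twice-daily monitoring wrapper, as suggested above, would simply exit non-zero whenever either set is non-empty.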
[09:22:34] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes1008.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqi [09:22:34] tes1004.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1016.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [09:22:42] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes1008.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqi [09:22:42] tes1004.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1016.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [09:23:28] <_joe_> uh [09:23:45] <_joe_> "unknown to pybal"? 
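The "Hosts in IPVS but unknown to PyBal" alert above compares the kernel's IPVS realserver table against the pool state PyBal reports over its local HTTP API (the `curl localhost:9090/pools/...` seen in this log). The core of the check is a set difference; a sketch with hypothetical inputs:

```python
def ipvs_pybal_diff(ipvs_hosts, pybal_hosts):
    """Replicate the check's logic: realservers present in the kernel's
    IPVS table but absent from PyBal's pool state, and vice versa."""
    ipvs, pybal = set(ipvs_hosts), set(pybal_hosts)
    return {
        'in_ipvs_unknown_to_pybal': ipvs - pybal,
        'known_to_pybal_not_in_ipvs': pybal - ipvs,
    }
```

During a controlled reconfiguration like the etcd migration underway here, a non-empty diff may be a transient snapshot mismatch rather than a real fault, which matches the pushback on the alert below.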
[09:23:49] (03CR) 10Kormat: [C: 03+1] mariadb: Productionize es2026 [puppet] - 10https://gerrit.wikimedia.org/r/625842 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [09:24:15] that looks strange [09:24:26] (03PS1) 10Volans: cr: update debmonitor IPs in firewall rules (#2) [homer/public] - 10https://gerrit.wikimedia.org/r/625844 (https://phabricator.wikimedia.org/T261489) [09:24:40] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:25:00] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] Fix includes to build against Varnish 6 [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/625713 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [09:25:27] _joe_, jayme: do you need help with that? [09:25:48] <_joe_> vgutierrez: that alert is wrong [09:26:01] XioNoX: this should do (no hurry) https://gerrit.wikimedia.org/r/c/operations/homer/public/+/625844 [09:26:05] <_joe_> I'm not sure what it's measuring there, but I can tell you those hosts are well known to pybal :) [09:26:20] PROBLEM - kubelet operational latencies on kubernetes1013 is CRITICAL: instance=kubernetes1013.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:26:22] <_joe_> lvs1015:~$ curl localhost:9090/pools/proton_4030a [09:26:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:26:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/625844 (https://phabricator.wikimedia.org/T261489) (owner: 
10Volans) [09:26:44] <_joe_> uhm eqiad fataling out? [09:26:50] <_joe_> what's still pooled a/a ? [09:27:20] <_joe_> akosiaris, jayme please pause a sec [09:27:21] (03CR) 10Ayounsi: [C: 03+1] "LGTM #2" [homer/public] - 10https://gerrit.wikimedia.org/r/625844 (https://phabricator.wikimedia.org/T261489) (owner: 10Volans) [09:27:26] _joe_: ack [09:27:46] PROBLEM - kubelet operational latencies on kubernetes1009 is CRITICAL: instance=kubernetes1009.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:27:52] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:27:53] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:56] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12515 and previous config saved to /var/cache/conftool/dbconfig/20200908-092755-kormat.json [09:27:56] <_joe_> just timeouts [09:27:59] <_joe_> please proceed [09:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:04] ok [09:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:29:30] <_joe_> we're having some latency on the appservers cluster [09:29:34] <_joe_> not sure why [09:29:38] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:29:49] <_joe_> can we please 
downtime all the k8s hosts? [09:30:14] (03CR) 10Volans: [C: 03+2] cr: update debmonitor IPs in firewall rules (#2) [homer/public] - 10https://gerrit.wikimedia.org/r/625844 (https://phabricator.wikimedia.org/T261489) (owner: 10Volans) [09:30:30] <_joe_> kormat / marostegui it seems s3 in codfw is having some query latency [09:30:38] (03Merged) 10jenkins-bot: cr: update debmonitor IPs in firewall rules (#2) [homer/public] - 10https://gerrit.wikimedia.org/r/625844 (https://phabricator.wikimedia.org/T261489) (owner: 10Volans) [09:30:43] _joe_: did you see the task I created? [09:30:50] _joe_: I subscribed you, it is that I believe [09:30:50] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:30:54] <_joe_> nope [09:31:32] _joe_: https://phabricator.wikimedia.org/T262240 [09:31:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:31:46] <_joe_> yeah I'm reading [09:31:57] _joe_: essentially the same thing we saw on sunday night [09:32:18] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:33:06] RECOVERY - kubelet operational latencies on kubernetes1009 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:33:07] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [09:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:15] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:18] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:35:50] RECOVERY - kubelet operational latencies on kubernetes1013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:36:36] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on icinga1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [09:37:17] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize es2026 [puppet] - 10https://gerrit.wikimedia.org/r/625842 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [09:37:51] !log running homer 'cr*eqiad*' commit "Update debmonitor IPs (#2), T261489" [09:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:58] T261489: Upgrade debmonitor to Buster - https://phabricator.wikimedia.org/T261489 [09:39:18] (03PS1) 10Hashar: base: add basic spec for base::standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/625846 [09:39:20] (03PS1) 10Hashar: base: upgrade git on stretch to 2.20 [puppet] - 10https://gerrit.wikimedia.org/r/625847 
(https://phabricator.wikimedia.org/T262244) [09:39:22] (03PS1) 10Hashar: git: allow multiple calls to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) [09:39:24] (03PS1) 10Hashar: base: enable git protocol version2 fleet wide [puppet] - 10https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) [09:39:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2014 - T261717', diff saved to https://phabricator.wikimedia.org/P12517 and previous config saved to /var/cache/conftool/dbconfig/20200908-093957-marostegui.json [09:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:05] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [09:40:20] (03CR) 10jerkins-bot: [V: 04-1] git: allow multiple calls to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [09:40:29] (03CR) 10Muehlenhoff: [C: 04-1] "To upgrade git in stretch fleet-wide I'll simply upload it to the main component." 
[puppet] - 10https://gerrit.wikimedia.org/r/625847 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [09:41:04] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), 10git-protocol-v2: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 (10hashar) [09:43:26] !log Stop mysql on es2014 to clone es2026 T261717 [09:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:32] !log stopped calico-node and kube-apiserver on k8s nodes/masters T239835 [09:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:37] T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 [09:44:21] (03CR) 10JMeybohm: [C: 03+1] k8s: Migrate eqiad to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558355 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [09:45:20] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on icinga1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [09:46:00] (03CR) 10JMeybohm: [C: 03+2] k8s: Migrate eqiad to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558355 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [09:46:05] (03CR) 10JMeybohm: [C: 03+2] Switch eqiad calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [09:46:13] <_joe_> I think this ^^ is expected given no service is producing to eqiad [09:46:21] <_joe_> not even for test/health 
check queues [09:47:21] (03Merged) 10jenkins-bot: Switch eqiad calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [09:48:10] _joe_ yep I think so too [09:48:25] (03PS1) 10Marostegui: es2014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/625851 (https://phabricator.wikimedia.org/T261717) [09:48:33] <_joe_> yes, and it only happens when the services themselves are not running [09:49:45] PROBLEM - Prometheus k8s cache not updating on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [09:50:00] (03CR) 10Marostegui: [C: 03+2] es2014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/625851 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [09:50:51] prometheus alert expected as apiservers are down [09:52:56] !log enable puppet, run it on all k8s eqiad nodes and double check that calico-node is fine T239835 [09:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:02] T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 [09:53:38] (03CR) 10ArielGlenn: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/624420 (owner: 10Ryan Kemper) [09:54:15] 10Operations, 10serviceops, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) [09:56:11] 10Operations, 10serviceops, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) We let the first chunk of about 700k GET requests for about a week, but nothing stood up much. 
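A side note on the `!log ... Cookbook` entries that recur throughout this log: each cookbook run emits a START line and an END line carrying a status and exit code, which makes runs easy to pair up mechanically when auditing the Server Admin Log. A rough sketch — the regex is inferred from the log lines themselves and may not cover every variant:

```python
import re

LOG_RE = re.compile(
    r"\[(?P<time>[\d:]+)\] !log (?P<who>\S+) "
    r"(?P<event>START|END) (?:\((?P<status>\w+)\) )?- Cookbook (?P<name>\S+)"
    r"(?: \(exit_code=(?P<rc>\d+)\))?"
)

def parse_cookbook_runs(lines):
    """Pair START/END '!log' lines per (user, cookbook) and collect results."""
    open_runs, finished = {}, []
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        key = (m['who'], m['name'])
        if m['event'] == 'START':
            open_runs[key] = m['time']
        else:
            finished.append({
                'who': m['who'], 'cookbook': m['name'],
                'start': open_runs.pop(key, None), 'end': m['time'],
                'status': m['status'],
                'exit_code': int(m['rc']) if m['rc'] else None,
            })
    return finished
```

Against the `jynus@cumin1001` pair later in this log, this yields one finished run with status `PASS` and exit code 0.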
[09:59:57] 10Operations, 10ops-eqiad, 10netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10ayounsi) [10:00:28] 10Operations, 10ops-eqiad, 10netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10ayounsi) Postponed to Wednesday 16th, 11am UTC, 30min, the time the cables and optics arrive. [10:00:35] (03PS2) 10ArielGlenn: update dumps web pages to note that recent versions of windows 7zip work [puppet] - 10https://gerrit.wikimedia.org/r/625780 (https://phabricator.wikimedia.org/T208647) [10:01:28] (03CR) 10ArielGlenn: [C: 03+2] update dumps web pages to note that recent versions of windows 7zip work [puppet] - 10https://gerrit.wikimedia.org/r/625780 (https://phabricator.wikimedia.org/T208647) (owner: 10ArielGlenn) [10:02:34] 10Operations, 10Acme-chief, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): acme-chief shouldn't try to perform OCSP stapling of expired certs - https://phabricator.wikimedia.org/T262251 (10Vgutierrez) [10:02:36] 10Operations, 10serviceops, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) [10:03:49] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10ayounsi) [10:04:08] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10ayounsi) Postponed to Sept. 17th, 1pm Eastern, 17:00 UTC [10:06:18] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10Marostegui) Everything ok from the DB point of view. All the DB hosts in D4 can have a hard downtime, nothing will be impacted from our side. [10:08:52] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. 
T261389', diff saved to https://phabricator.wikimedia.org/P12519 and previous config saved to /var/cache/conftool/dbconfig/20200908-100852-kormat.json [10:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:55] (03PS1) 10Jbond: pki: install mod_ssl [puppet] - 10https://gerrit.wikimedia.org/r/625854 [10:11:13] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:11:14] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [10:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:27] (03CR) 10Jbond: [C: 03+2] pki: install mod_ssl [puppet] - 10https://gerrit.wikimedia.org/r/625854 (owner: 10Jbond) [10:13:22] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) [10:14:21] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] 1.8: Bump version number [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/625843 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [10:14:26] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:14:27] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [10:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:18] (03PS1) 10Jbond: pki: install sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/625856 [10:16:23] 10Operations, 10serviceops, 10Kubernetes, 10User-fsero: Upgrade calico in production to version 2.4+ - https://phabricator.wikimedia.org/T207804 (10jijiki) [10:17:44] (03CR) 10Jbond: [C: 03+2] pki: install sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/625856 (owner: 10Jbond) [10:20:21] 
10Operations, 10SRE-swift-storage, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10jijiki) [10:20:36] 10Operations, 10Performance-Team, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10jijiki) [10:20:36] !log Deploy schema change on s4 eqiad master - T253276 [10:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:43] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 [10:20:57] 10Operations, 10SRE-tools, 10User-Joe: Create cookbook to do `nodetool repair` across cassandra cluster - https://phabricator.wikimedia.org/T225694 (10jijiki) [10:21:17] PROBLEM - Prometheus k8s cache not updating on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1004&var-datasource=eqiad+prometheus/ops [10:21:30] 10Operations, 10serviceops, 10Patch-For-Review: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10jijiki) [10:21:42] 10Operations, 10Puppet, 10Traffic, 10Patch-For-Review, and 2 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10jijiki) [10:22:55] (03PS1) 10Jbond: pki: enable headers module [puppet] - 10https://gerrit.wikimedia.org/r/625857 [10:23:33] 10Operations: Redefine privileges and access for perf-roots group - https://phabricator.wikimedia.org/T207666 (10jijiki) [10:25:12] (03CR) 10Jbond: [C: 03+2] pki: enable headers module [puppet] - 10https://gerrit.wikimedia.org/r/625857 (owner: 10Jbond) [10:27:43] 10Operations, 10SRE-tools, 10User-Joe: Create a spicerack cookbook to empty a ganeti node from VMs - 
https://phabricator.wikimedia.org/T203964 (10jijiki) [10:29:37] (03PS1) 10Jcrespo: mariadb: Move db1133 from m5 to core-test (backup testing db) [puppet] - 10https://gerrit.wikimedia.org/r/625860 (https://phabricator.wikimedia.org/T253217) [10:38:12] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1001/24990/db1133.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/625860 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:39:04] (03CR) 10Jcrespo: [C: 03+2] mariadb: Move db1133 from m5 to core-test (backup testing db) [puppet] - 10https://gerrit.wikimedia.org/r/625860 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:39:08] (03CR) 10Marostegui: [C: 03+1] "Make sure to full-upgrade and reboot the host for https://phabricator.wikimedia.org/T261389, if you can, please mark it as done." [puppet] - 10https://gerrit.wikimedia.org/r/625860 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:40:56] (03CR) 10Jcrespo: "Thanks for the reminder. I think I was going to go for a full reimage into buster." 
[puppet] - 10https://gerrit.wikimedia.org/r/625860 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:41:43] (03CR) 10Marostegui: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/625860 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:43:28] 10Operations, 10Scap, 10Wikimedia-General-or-Unknown, 10serviceops, 10Release-Engineering-Team (Deployment services): "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (10jijiki) [10:44:04] (03PS1) 10Jcrespo: install_server: Reimage db1133 into buster [puppet] - 10https://gerrit.wikimedia.org/r/625863 (https://phabricator.wikimedia.org/T253217) [10:45:45] (03CR) 10Marostegui: [C: 03+1] install_server: Reimage db1133 into buster [puppet] - 10https://gerrit.wikimedia.org/r/625863 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:47:41] (03CR) 10Jcrespo: [C: 03+2] install_server: Reimage db1133 into buster [puppet] - 10https://gerrit.wikimedia.org/r/625863 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:53:16] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [10:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:29] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [10:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:40] !log Deploy schema change on s3 eqiad master - T253276 [10:53:42] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
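For readers unfamiliar with the `helmfile [eqiad] Ran 'sync' command on namespace 'kube-system'` entries above: helmfile drives a set of Helm releases per environment from a declarative file. A minimal illustrative fragment — chart names, values files, and layout are invented, not the actual deployment-charts repository:

```yaml
# Hypothetical helmfile.yaml fragment -- not the real deployment-charts config.
environments:
  eqiad: {}

releases:
  - name: coredns
    namespace: kube-system
    chart: wmf-stable/coredns      # invented chart reference
    values:
      - values-eqiad.yaml          # invented per-DC values file
```

Running `helmfile -e eqiad sync` then converges every listed release to the declared state, which is what the deploy1001 log lines record one release at a time.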
[10:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:46] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 [10:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:45] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (10jijiki) [10:56:51] (03PS1) 10Jbond: pki: only enable client auth for API directory and add fqdn to aliases [puppet] - 10https://gerrit.wikimedia.org/r/625866 [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:05:45] (03PS1) 10JMeybohm: admin: Patch system:node clusterrolebinding on initialize_cluster.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/625869 [11:08:56] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Replace nutcracker with mcrouter on thumbor* - https://phabricator.wikimedia.org/T221081 (10jijiki) [11:09:08] 10Operations, 10serviceops, 10User-jijiki: Move debugging symbols and tools to a new class - https://phabricator.wikimedia.org/T236048 (10jijiki) [11:09:29] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki) [11:11:40] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (10jijiki) [11:15:08] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime [11:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:18:20] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:29] (03CR) 10Muehlenhoff: [C: 03+2] graphite: Modernise Apache config [puppet] - 10https://gerrit.wikimedia.org/r/625609 (owner: 10Muehlenhoff) [11:33:50] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [11:33:51] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [11:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:08] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [11:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:23] (03PS2) 10JMeybohm: admin: Patch system:node clusterrolebinding on initialize_cluster.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/625869 [11:43:12] RECOVERY - Prometheus k8s cache not updating on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [11:43:13] RECOVERY - Prometheus k8s cache not updating on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1004&var-datasource=eqiad+prometheus/ops [12:04:16] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:04:17] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:04:19] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12520 and previous config saved to /var/cache/conftool/dbconfig/20200908-120419-kormat.json [12:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:03] 10Operations, 10Acme-chief, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): acme-chief shouldn't try to perform OCSP stapling of expired certs - https://phabricator.wikimedia.org/T262251 (10Vgutierrez) p:05Triage→03Medium [12:10:01] (03PS1) 10Filippo Giunchedi: pontoon: set postgres shared buffers [puppet] - 10https://gerrit.wikimedia.org/r/625881 [12:11:32] godog: hah. i cheated and commented out `tuning.conf` from postgres config. this might be a better approach ;) [12:11:40] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. 
T261389', diff saved to https://phabricator.wikimedia.org/P12521 and previous config saved to /var/cache/conftool/dbconfig/20200908-121139-kormat.json [12:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:25] kormat: heheh I think I might have done sth similar, and then promptly came back to bite me (currently changing domains) [12:18:06] (03CR) 10Kormat: [C: 03+1] pontoon: set postgres shared buffers [puppet] - 10https://gerrit.wikimedia.org/r/625881 (owner: 10Filippo Giunchedi) [12:19:20] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: set postgres shared buffers [puppet] - 10https://gerrit.wikimedia.org/r/625881 (owner: 10Filippo Giunchedi) [12:20:15] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (10Jclark-ctr) starting maintenance do not expect any outages will be disconnecting pdu`s in about 1 hour [12:25:18] (03CR) 10Ppchelko: [C: 03+1] "+2 actually, but I think it's better you merge before deploying and I donno your schedule" [deployment-charts] - 10https://gerrit.wikimedia.org/r/625626 (owner: 10Hnowlan) [12:26:01] (03CR) 10Ppchelko: [C: 03+1] changeprop-jobqueue: convert to new helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/625632 (owner: 10Hnowlan) [12:26:32] (03PS3) 10Giuseppe Lavagetto: citoid: add TLS LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625600 (https://phabricator.wikimedia.org/T255868) [12:26:34] (03PS3) 10Giuseppe Lavagetto: citoid: promote https lvs to production status [puppet] - 10https://gerrit.wikimedia.org/r/625601 (https://phabricator.wikimedia.org/T255868) [12:26:36] (03PS3) 10Giuseppe Lavagetto: citoid: remove unencrypted LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625603 (https://phabricator.wikimedia.org/T255868) [12:27:00] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:27:00] !log kormat@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) [12:27:03] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12522 and previous config saved to /var/cache/conftool/dbconfig/20200908-122702-kormat.json [12:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:46] (03PS1) 10Filippo Giunchedi: pontoon: add alertmanager hiera variables for o11y [puppet] - 10https://gerrit.wikimedia.org/r/625885 [12:28:48] (03PS1) 10Filippo Giunchedi: pontoon: switch observability stack to wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/625886 [12:28:57] (03PS1) 10Vgutierrez: Merge remote-tracking branch 'origin/master' into debian [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625887 [12:29:24] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] Merge remote-tracking branch 'origin/master' into debian [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625887 (owner: 10Vgutierrez) [12:30:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] default-network-policy: allow restbase HTTPS port [deployment-charts] - 10https://gerrit.wikimedia.org/r/625839 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:31:32] (03Merged) 10jenkins-bot: default-network-policy: allow restbase HTTPS port [deployment-charts] - 10https://gerrit.wikimedia.org/r/625839 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:32:34] (03PS6) 10Hashar: Explicitly mentions the repository in scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) [12:33:00] (03CR) 10jerkins-bot: [V: 04-1] Explicitly mentions the repository in scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/610254 
(https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [12:33:19] (03PS2) 10Vgutierrez: 1.8-1: Rebuild against Varnish 6 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625659 (https://phabricator.wikimedia.org/T261632) [12:33:42] (03PS7) 10Hashar: Explicitly mentions the repository in scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) [12:33:49] (03CR) 10jerkins-bot: [V: 04-1] 1.8-1: Rebuild against Varnish 6 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625659 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [12:34:47] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [12:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:46] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. T261389', diff saved to https://phabricator.wikimedia.org/P12523 and previous config saved to /var/cache/conftool/dbconfig/20200908-123546-kormat.json [12:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:49] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add alertmanager hiera variables for o11y [puppet] - 10https://gerrit.wikimedia.org/r/625885 (owner: 10Filippo Giunchedi) [12:39:09] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: switch observability stack to wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/625886 (owner: 10Filippo Giunchedi) [12:39:31] (03PS4) 10Hashar: Add basic doc for python-build* images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605649 [12:39:33] (03PS3) 10Hashar: .gitignore docker-pkg-build.log [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619759 [12:39:35] (03PS5) 10Hashar: python-build: reuse previously built wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 
(https://phabricator.wikimedia.org/T259611) [12:40:32] (03PS1) 10Muehlenhoff: Switch debmonitor to Envoy (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/625890 [12:41:11] (03PS2) 10Hashar: python-build: do not archive previously built wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619779 [12:42:47] (03PS3) 10Vgutierrez: 1.8-1: Rebuild against Varnish 6 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625659 (https://phabricator.wikimedia.org/T261632) [12:43:15] (03CR) 10jerkins-bot: [V: 04-1] 1.8-1: Rebuild against Varnish 6 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625659 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [12:47:50] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [12:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:57] (03PS2) 10Muehlenhoff: Switch debmonitor to Envoy (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/625890 [12:52:04] (03PS3) 10Elukey: Add cookbook to restart Hadoop master daemons. [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 [12:53:22] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook to restart Hadoop master daemons. [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 (owner: 10Elukey) [12:53:47] ah! [12:54:14] of course I made a change without running tox and I get punished [12:54:23] *make [12:54:40] (03PS1) 10Holger Knust: WIP: Add new watchlist job [dumps] - 10https://gerrit.wikimedia.org/r/625895 (https://phabricator.wikimedia.org/T51133) [12:56:01] (03CR) 10Volans: [C: 04-1] "One possible small issue and a couple of nits, the rest LGTM." 
(033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 (owner: 10Elukey) [12:57:29] (03CR) 10Holger Knust: "The python portion (incomplete)" [dumps] - 10https://gerrit.wikimedia.org/r/625895 (https://phabricator.wikimedia.org/T51133) (owner: 10Holger Knust) [13:00:55] (03CR) 10Elukey: Add cookbook to restart Hadoop master daemons. (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 (owner: 10Elukey) [13:01:12] (03PS4) 10Elukey: Add cookbook to restart Hadoop master daemons. [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 [13:01:17] PROBLEM - Host ps1-d4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:02:01] PROBLEM - Juniper alarms on asw2-d-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:02:53] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 (owner: 10Elukey) [13:02:55] (03PS1) 10Giuseppe Lavagetto: Allow restbase https from the default policy too [deployment-charts] - 10https://gerrit.wikimedia.org/r/625900 [13:03:51] (03CR) 10Ottomata: [C: 03+1] "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: 10Gergő Tisza) [13:04:28] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [13:04:28] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [13:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:11] (03CR) 10Ottomata: "Ok! Might be next week..." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [13:08:09] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'staging' . [13:08:09] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [13:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:08] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [13:09:08] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [13:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:25] (03CR) 10Andrew Bogott: [C: 03+2] designate: stop creating 'legacy' entries (that is, things under wmflabs) [puppet] - 10https://gerrit.wikimedia.org/r/620937 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [13:10:23] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:10:27] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:10:57] (03CR) 10Elukey: [C: 03+2] Add cookbook to restart Hadoop master daemons. [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 (owner: 10Elukey) [13:11:32] 10Operations, 10ops-codfw, 10DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (10Papaul) The log on says "It has been corrected by h/w and requires no further action" so i don't think this will be enough to replace the memory because it is not saying that there is an error but there were... 
[13:12:39] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' . [13:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:52] 10Operations, 10ops-codfw, 10DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (10Marostegui) Excellent, makes sense @Papaul Right now it is not a good moment to depool an s3 host due to some on-going investigations. I will ping you once we are ready to depool this host and get it upgrad... [13:13:51] PROBLEM - Host ps1-d3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:13:56] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [13:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:32] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters [13:14:32] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) [13:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'staging' . [13:16:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' . [13:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:22] 10Operations, 10Developer-Advocacy, 10Discourse, 10Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) [13:16:25] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[13:16:25] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [13:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:43] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [13:16:43] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [13:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:49] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on icinga1001 is OK: (C)0 le (W)25 le 30.28 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:07] RECOVERY - Juniper alarms on asw2-d-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:17:10] 04Critical Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Juniper alarm active [13:17:25] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' . [13:17:25] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'staging' . 
[13:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:35] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on icinga1001 is OK: (C)0 le (W)25 le 31.98 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:17:48] (03PS1) 10Elukey: sre.hadoop.roll-restart-masters.py: fix cumin aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/625905 [13:18:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [13:18:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [13:18:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [13:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:04] XioNoX: those 2 seems to be saying contradictory things. Juniper alarms on asw2-d-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms vs Critical Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Juniper alarm active [13:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:20] which made me ask "Is there an alarm or not after all?" 
[13:18:33] jclark-ctr, cmjohnson1 please !log when starting a maintenance :) [13:18:36] !log swapping pdu's in eqiad, mgmt for racks d3 and d4 will go down [13:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:45] just doing that now [13:18:47] cmjohnson1: they're already down :) [13:19:00] (03CR) 10Elukey: [C: 03+2] sre.hadoop.roll-restart-masters.py: fix cumin aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/625905 (owner: 10Elukey) [13:19:03] 10Operations, 10Wikimedia-Mailing-lists: Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Aklapper) [13:19:51] akosiaris: Icinga and LibreNMS have their own latency [13:20:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [13:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:24] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . [13:20:24] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [13:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:32] akosiaris: but yeah, I think it's enough in normal time to double check, and those are uncommon enough [13:20:36] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:20:36] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [13:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:43] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'staging' . 
[13:20:43] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'test' . [13:20:43] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' . [13:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:56] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters [13:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:06] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [13:21:06] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [13:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:32] (03PS1) 10Hashar: Add CI entry point to run tox [deployment-charts] - 10https://gerrit.wikimedia.org/r/625909 [13:21:34] (03PS1) 10Hashar: update_version: improve tox.ini [deployment-charts] - 10https://gerrit.wikimedia.org/r/625910 [13:21:40] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'staging' . [13:21:40] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . 
[13:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:10] Critical Device asw2-d-eqiad.mgmt.eqiad.wmnet recovered from Juniper alarm active [13:22:48] (03CR) 10jerkins-bot: [V: 04-1] Add CI entry point to run tox [deployment-charts] - 10https://gerrit.wikimedia.org/r/625909 (owner: 10Hashar) [13:22:53] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) @Jclark-ctr I understand that is hpe's response. What is //your// advice regarding followup steps, close this due to "no actionable"? [13:23:47] (03PS3) 10Muehlenhoff: Switch debmonitor to Envoy (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/625890 [13:24:48] (03CR) 10Hashar: [C: 03+2] "We would need CI to be setup for that update_version script. I have proposed the boilerplate config in two follow up changes:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/624963 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [13:24:49] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:24:49] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[13:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:49] !log Restarted puppetdb on deployment-puppetdb03 (T248041) [13:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:55] T248041: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 [13:26:05] (03Merged) 10jenkins-bot: Make update_version.py work with python 3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/624963 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [13:26:22] (03CR) 10Hashar: "recheck unrelated issue with eventgate-analytics" [deployment-charts] - 10https://gerrit.wikimedia.org/r/625909 (owner: 10Hashar) [13:26:46] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:26:46] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [13:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:04] (03CR) 10Hashar: "Locally one would have basepython=python3:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/625910 (owner: 10Hashar) [13:28:25] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [13:28:25] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . 
[13:28:27] 10Operations, 10Developer-Advocacy, 10Discourse, 10Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) [13:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:34] 10Operations, 10Developer-Advocacy, 10Discourse: Migration of discourse-mediawiki.wmflabs.org from wmflabs to production - https://phabricator.wikimedia.org/T184461 (10Aklapper) 05Stalled→03Declined There are no active Discourse instances in Wikimedia currently (discourse-mediawiki.wmflabs.org and #Spac... [13:30:09] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:30:33] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Neutron config: remove the 'tld' variable [puppet] - 10https://gerrit.wikimedia.org/r/624763 (owner: 10Andrew Bogott) [13:30:36] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [13:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:58] (03CR) 10Andrew Bogott: [C: 03+2] Nova/Neutron: set dhcp_domain to eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/620936 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [13:32:23] 10Operations, 10Developer-Advocacy, 10Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) 05Stalled→03Declined Declining for the time being, as there are no active Discourse instances in Wikimedia currently (discourse-mediawiki.w... 
[13:32:25] 10Operations, 10Developer-Advocacy, 10Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) [13:33:15] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:33:30] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [13:33:57] PROBLEM - IPMI Sensor Status on pc1010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:34:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) [13:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:12] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:34:29] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) Dell Tech Support via r1hhmgz5xjn6.0b-gampeak.na98.bnc.salesforce.com 8:30 AM (3 minutes ago) to me ** Please Do Not Change Subj... [13:34:55] PROBLEM - Host kubernetes1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:01] PROBLEM - Host wdqs1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:03] PROBLEM - Host stat1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:13] PROBLEM - Host kubernetes1013 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:19] PROBLEM - Host wtp1045 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:55] !log the power cable was not properly seated and lost power to asw2-d3-eqiad [13:35:56] ehm what? 
[13:35:57] PROBLEM - Host elastic1063 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:03] ah that explains it [13:36:09] PROBLEM - Host dbproxy1017 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:13] PROBLEM - Host maps1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:32] PROBLEM - Host elastic1062 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:43] XioNox it powering up now, I double checked it but was still wrong [13:36:45] PROBLEM - Host mw1365 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:45] PROBLEM - Host wtp1043 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:45] PROBLEM - Host wtp1044 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:47] PROBLEM - Host mw1364 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:47] PROBLEM - Host aqs1009 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:47] PROBLEM - Host restbase1018 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:49] PROBLEM - Host restbase1025 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:53] PROBLEM - Host eventlog1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:55] PROBLEM - Host ganeti1019 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:59] (03PS1) 10Elukey: sre.hadoop.roll-restart-masters.py: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/625913 [13:36:59] PROBLEM - Host ores1007 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:59] PROBLEM - Host scb1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:59] PROBLEM - Host rdb1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:01] PROBLEM - Host sessionstore1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:01] PROBLEM - Host mw1363 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:09] PROBLEM - Host pc1010 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:09] PROBLEM - Host schema1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:09] PROBLEM - Host releases1002 is DOWN: PING CRITICAL - 
Packet loss = 100% [13:37:11] <_joe_> wow that many systems [13:37:24] <_joe_> sessionstore1003 is cassandra, not sure if something will be needed [13:37:25] full rack [13:37:27] PROBLEM - Host thorium is DOWN: PING CRITICAL - Packet loss = 100% [13:37:48] not worried about eqiad nodes, but should we be worried about restbase on codfw? [13:37:50] list of hosts: https://netbox.wikimedia.org/dcim/racks/37/ [13:37:52] cmjohnson1: is this just networking? or did all those hosts also lost power? [13:38:06] I am gather the former, just double checking [13:38:07] PROBLEM - Juniper alarms on asw2-d-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:38:17] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:17] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:17] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:17] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:23] I guess it is just the switch because of the PDU maintenance akosiaris ? 
[13:38:29] PROBLEM - Host cp1088 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:29] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:29] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:29] is the rack switch booting up though?
[13:38:33] PROBLEM - Host logstash1012 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:33] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:37] PROBLEM - Host mw1357 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:37] PROBLEM - Host mw1356 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:37] marostegui: my guess as well
[13:38:39] PROBLEM - Host kafka-jumbo1009 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:39] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:41] PROBLEM - Host kafka-jumbo1006 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:41] PROBLEM - Host ms-be1039 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:41] PROBLEM - Host mw1350 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:43] PROBLEM - Host dumpsdata1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:43] <_joe_> akosiaris: it seems mobileapps is having issue in codfw
[13:38:45] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:47] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:47] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:47] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:47] PROBLEM - Host mw1354 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:49] PROBLEM - Host labweb1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:50] PROBLEM - Host mc1035 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:50] PROBLEM - Host ms-be1037 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:51] PROBLEM - Host mw1355 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:51] PROBLEM - Host mw1352 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:55] PROBLEM - Host snapshot1009 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:57] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:38:57] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:57] PROBLEM - Host mc1034 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:57] PROBLEM - Host wdqs1008 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:57] PROBLEM - Host wtp1047 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:01] PROBLEM - Host aqs1006 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:01] PROBLEM - Host an-presto1003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:03] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[13:39:03] PROBLEM - Host mw1351 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:03] PROBLEM - Host mw1359 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:03] PROBLEM - Host mw1362 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:05] PROBLEM - Host es1018 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:09] PROBLEM - Host snapshot1007 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:10] this is weird...
[13:39:11] PROBLEM - Host elastic1060 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:17] PROBLEM - Host mw1361 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:18] I did not expect codfw to be impacted by this
[13:39:19] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:21] PROBLEM - Host elastic1061 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:21] 👋
[13:39:26] <_joe_> akosiaris: probably it's not
[13:39:29] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:29] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:31] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:32] is this for us?
[13:39:33] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:33] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:34] <_joe_> let's focus on mobileapps in codfw
[13:39:37] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[13:39:37] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:41] (I am in a meeting but will drop if needed)
[13:39:43] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:43] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:43] <_joe_> that's what generates the featured endpoint
[13:39:44] _joe_: thinking a monitoring issue?
[13:39:48] <_joe_> no.
[13:39:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={redis_maps,swagger_check_restbase_cluster_codfw,swagger_check_wikifeeds_codfw,swagger_check_wikifeeds_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:39:54] ah, so real then
[13:39:59] <_joe_> oh wikifeeds
[13:39:59] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:59] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:40:00] elastic1061 is in D1
[13:40:00] D1: Initial commit - https://phabricator.wikimedia.org/D1
[13:40:01] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:40:04] apergos: rzl power failure in rack switch
[13:40:05] who is affected?
[13:40:07] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:40:08] _joe_: thinking coincidence?
[13:40:09] ah
[13:40:11] RECOVERY - Juniper alarms on asw2-d-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[13:40:12] <_joe_> akosiaris: not sure
[13:40:15] I think it's related to the lost restbase nodes
[13:40:17] jayme: ack, saw
[13:40:19] PROBLEM - Host mw1358 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:20] host list and rack don't match for me
[13:40:21] PROBLEM - Host centrallog1001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:21] PROBLEM - Host mc1036 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:23] <_joe_> akosiaris: oh possibly
[13:40:25] <_joe_> yes
[13:40:26] wikifeeds AND mobileapps talk to restbase
[13:40:29] PROBLEM - Host wtp1046 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:33] and both have issues
[13:40:35] PROBLEM - Host wtp1048 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:36] <_joe_> yes this is way more than we expected
[13:40:37] PROBLEM - Host puppetmaster1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:41] rack d1 affected too?
[13:40:43] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[13:40:47] PROBLEM - Host ores1008 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:49] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:52] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:53] cmjohnson1: please focus on communication first - we'd like to understand what went down
[13:40:55] PROBLEM - Host mc1033 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:57] PROBLEM - Host druid1003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:59] volans: one of the hosts was in D3 too
[13:41:00] D3: test - ignore - https://phabricator.wikimedia.org/D3
[13:41:01] PROBLEM - PHP7 rendering on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:41:05] and d4 too
[13:41:05] PROBLEM - Host restbase-dev1006 is DOWN: PING CRITICAL - Packet loss = 100%
[13:41:25] so the switches are interconnected with the rest of the row, it /can/ have impact on others, although they should route around problems
[13:41:26] ok I am here if I can help
[13:41:34] volans: so d3 and d4 are the PDU work that was going on, so that is "expected"
[13:41:34] maybe let's move to #-sre ?
[13:41:44] elukey: yes please
[13:41:50] too many things at once yeah, indeed moving
[13:42:11] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5183 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:42:11] PROBLEM - SSH on conf1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:42:13] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:42:13] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:42:13] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:42:13] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:42:19] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:42:33] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-s
[13:42:33] }/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was receive
[13:42:33] imedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:42:33] i sent chris a text to come talk to us on irc
[13:42:35] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:42:42] XioNoX it's up but not seeing any traffic
[13:42:47] PROBLEM - SSH on ms-be1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:42:51] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:42:52] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:42:59] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:43:00] mark see message above
[13:43:03] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:43:07] cmjohnson1: so
[13:43:07] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:43:13] cmjohnson1: is just the switch down, or the entire rack?
[13:43:20] just the swith
[13:43:22] switch
[13:43:25] we are seeing impact in other racks
[13:43:27] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-s
[13:43:27] }/{start}/{end} (Get pagecounts) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:43:27] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[13:43:30] is there any chance they are also impacted by power?
[13:43:54] no
[13:43:55] PROBLEM - SSH on backup1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:43:57] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/m
[13:43:57] file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:43:59] and is the switch that was affected powered again and booting back up?
[13:44:01] PROBLEM - Auth DNS on dns1002 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[13:44:09] PROBLEM - puppetmaster backend https on puppetmaster1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 403 Forbidden https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[13:44:11] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1287.eqiad.wmnet, mw1386.eqiad.wmnet, mw1284.eqiad.wmnet, mw1412.eqiad.wmnet, mw1327.eqiad.wmnet, mw1380.eqiad.wmnet, mw1371.eqiad.wmnet, mw1321.eqiad.wmnet, mw1401.eqiad.wmnet, mw1274.eqiad.wmnet, mw1405.eqiad.wmnet, mw1395.eqiad.wmnet, mw1320.eqiad.wmnet, mw1399.eqiad.wmnet, mw1290.eqiad.wmnet]) https://wikitech.wikimedi
[13:44:13] PROBLEM - Host an-worker1093 is DOWN: PING CRITICAL - Packet loss = 100%
[13:44:13] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1267.eqiad.wmnet, mw1268.eqiad.wmnet, mw1327.eqiad.wmnet, mw1395.eqiad.wmnet, mw1411.eqiad.wmnet, mw1371.eqiad.wmnet, mw1369.eqiad.wmnet, mw1270.eqiad.wmnet, mw1331.eqiad.wmnet, mw1399.eqiad.wmnet, mw1266.eqiad.wmnet are marked down but pooled: api_80: Servers mw1339.eqiad.wmnet are marked down but pooled: apa
[13:44:13] mw1333.eqiad.wmnet, mw1274.eqiad.wmnet, mw1372.eqiad.wmnet, mw1331.eqiad.wmnet, mw1322.eqiad.wmnet, mw1395.eqiad.wmnet, mw1261.eqiad.wmnet, mw1369.eqiad.wmnet, mw1370.eqiad.wmnet, mw1328.eqiad.wmnet, mw1270.eqiad.wmnet, mw1399.eqiad.wmnet, mw1266.eqiad.wmnet, mw1405.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1400.eqiad.wmnet, mw1276.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/P
[13:44:19] PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:44:34] The switch lost power, there is a chance that there was a power surge when plugging it back in
[13:44:35] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:44:38] the switch
[13:44:45] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:44:46] cmjohnson1: does it look like it is powered right now?
[13:44:48] XioNoX: ping
[13:44:51] mark I want to pull power and reboot
[13:44:51] PROBLEM - SSH on ms-be1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:44:57] PROBLEM - Host ms-be1026 is DOWN: PING CRITICAL - Packet loss = 100%
[13:44:57] it is powered on
[13:44:59] PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[13:45:02] PROBLEM - Check systemd state on es1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:02] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and
[13:45:02] returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 504 (expecting: 404): /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected st
[13:45:02] ng: 200): /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected stat https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:45:03] no link lights
[13:45:08] cmjohnson1: let's coordinate
[13:45:17] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1319.eqiad.wmnet, mw1395.eqiad.wmnet, mw1371.eqiad.wmnet, mw1329.eqiad.wmnet, mw1367.eqiad.wmnet, mw1274.eqiad.wmnet, mw1322.eqiad.wmnet, mw1331.eqiad.wmnet, mw1321.eqiad.wmnet, mw1327.eqiad.wmnet, mw1413.eqiad.wmnet, mw1369.eqiad.wmnet, mw1272.eqiad.wmnet, mw1399.eqiad.wmnet, mw1266.eqiad.wmnet, mw1326.eqiad.
[13:45:17] down but pooled: api_80: Servers mw1386.eqiad.wmnet, mw1284.eqiad.wmnet, mw1378.eqiad.wmnet, mw1383.eqiad.wmnet, mw1388.eqiad.wmnet, mw1339.eqiad.wmnet are marked down but pooled: apaches_80: Servers mw1401.eqiad.wmnet, mw1331.eqiad.wmnet, mw1267.eqiad.wmnet, mw1322.eqiad.wmnet, mw1321.eqiad.wmnet, mw1269.eqiad.wmnet, mw1403.eqiad.wmnet, mw1325.eqiad.wmnet, mw1274.eqiad.wmnet, mw1261.eqiad.wmnet, mw1413.eqiad.wmnet, mw1369.eqiad.
[13:45:17] ad.wmnet, mw1320.eqiad.wmnet, mw1399.eqiad.wmnet, mw1326.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1276.eqiad.wmnet, mw1290.eqiad.wmnet, mw1383.eqiad.wmnet are ma https://wikitech.wikimedia.org/wiki/PyBal
[13:45:31] PROBLEM - Apache HTTP on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:45:35] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[13:45:35] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:45:39] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.02264 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:45:45] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:45:47] PROBLEM - Host ms-be1056 is DOWN: PING CRITICAL - Packet loss = 100%
[13:45:47] PROBLEM - SSH on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:45:47] RECOVERY - SSH on backup1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:45:49] PROBLEM - Check systemd state on db1103 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:49] PROBLEM - PHP7 rendering on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:46:03] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:46:07] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:46:13] PROBLEM - SSH on analytics1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:46:17] PROBLEM - Check systemd state on db1143 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:23] PROBLEM - Host mc-gp1003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:46:57] PROBLEM - Host 2620:0:861:4:208:80:155:108 is DOWN: PING CRITICAL - Packet loss = 100%
[13:46:57] <_joe_> wat
[13:47:01] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[13:47:02] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01429 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:47:11] PROBLEM - Check systemd state on restbase1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:12] PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:12] PROBLEM - PHP7 rendering on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:47:15] PROBLEM - SSH on kafka-jumbo1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:47:23] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:47:25] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[13:47:34] _joe_: a switch is down
[13:47:35] PROBLEM - Apache HTTP on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:47:39] PROBLEM - Check systemd state on analytics1058 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:39] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:47:39] RECOVERY - SSH on cp1089 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:47:43] PROBLEM - Check systemd state on mw1320 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:43] RECOVERY - PHP7 rendering on mw1274 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:47:55] XioNoX, mark: next step?
[13:47:56] PROBLEM - Auth DNS #page on nsa-v4 is CRITICAL: DNS_QUERY CRITICAL - no socket TCP[198.35.27.27] Connection timed out https://wikitech.wikimedia.org/wiki/DNS
[13:47:57] PROBLEM - PHP7 rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:48:01] cmjohnson1: please wait
[13:48:02] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={mc2034,mc2035} site=codfw tunnel={mc1034_v4,mc1035_v4} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[13:48:05] PROBLEM - Apache HTTP on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:48:05] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:48:09] PROBLEM - Check systemd state on ores1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:11] i am looking at the switches, XioNoX not around
[13:48:15] PROBLEM - PHP7 rendering on mw1395 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:48:16] <_joe_> we're losing dns too
[13:48:23] PROBLEM - PHP7 rendering on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:48:27] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:48:35] PROBLEM - SSH on dns1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:48:35] PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:48:42] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 3 https://wikitech.wikimedia.org/wiki/HAProxy
[13:48:45] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance=mc1035 site=eqiad tunnel=mc2035_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[13:48:47] RECOVERY - SSH on ms-be1043 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:48:48] (PS1) Andrew Bogott: Revert "Nova/Neutron: set dhcp_domain to eqiad1.wikimedia.cloud" [puppet] - https://gerrit.wikimedia.org/r/625757
[13:48:49] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:48:55] RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 415 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[13:48:57] PROBLEM - SSH on cloudelastic1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:48:57] PROBLEM - SSH on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:48:59] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:49:03] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:49:05] PROBLEM - Check systemd state on mc1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:09] PROBLEM - SSH on rdb1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:49:11] RECOVERY - PHP7 rendering on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 6.083 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:49:11] PROBLEM - SSH on analytics1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:49:12] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:49:14] (CR) jerkins-bot: [V: -1] Revert "Nova/Neutron: set dhcp_domain to eqiad1.wikimedia.cloud" [puppet] - https://gerrit.wikimedia.org/r/625757 (owner: Andrew Bogott)
[13:49:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:23] PROBLEM - Check systemd state on kubernetes1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:27] PROBLEM - Check systemd state on mw1321 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:27] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:49:29] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:49:35] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:49:39] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:49:41] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.1 200 Ok - 32228 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:49:51] PROBLEM - SSH on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:49:55] PROBLEM - SSH on lvs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:49:57] PROBLEM - Apache HTTP on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:49:59] PROBLEM - Apache HTTP on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:50:01] RECOVERY - Apache HTTP on mw1401 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 3.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:50:07] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:50:17] RECOVERY - PHP7 rendering on mw1395 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 9.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:50:17] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:50:19] PROBLEM - PHP7 rendering on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:50:21] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1321.eqiad.wmnet, mw1401.eqiad.wmnet, mw1371.eqiad.wmnet, mw1372.eqiad.wmnet, mw1267.eqiad.wmnet, mw1322.eqiad.wmnet, mw1331.eqiad.wmnet, mw1272.eqiad.wmnet, mw1274.eqiad.wmnet, mw1261.eqiad.wmnet, mw1413.eqiad.wmnet, mw1328.eqiad.wmnet, mw1405.eqiad.wmnet, mw1320.eqiad.wmnet, mw1399.eqiad.wmnet are marked dow
[13:50:22] ches_80: Servers mw1321.eqiad.wmnet, mw1328.eqiad.wmnet, mw1265.eqiad.wmnet, mw1395.eqiad.wmnet, mw1371.eqiad.wmnet, mw1372.eqiad.wmnet, mw1367.eqiad.wmnet, mw1331.eqiad.wmnet, mw1322.eqiad.wmnet, mw1268.eqiad.wmnet, mw1403.eqiad.wmnet, mw1413.eqiad.wmnet, mw1411.eqiad.wmnet,
mw1261.eqiad.wmnet, mw1270.eqiad.wmnet, mw1399.eqiad.wmnet, mw1326.eqiad.wmnet are marked down but pooled: api_80: Servers mw1290.eqiad.wmnet, mw1284.eqiad. [13:50:22] ad.wmnet, mw1276.eqiad.wmnet, mw1412.eqiad.wmnet, mw1396.eqiad.wmnet, mw1404.eqiad.wmnet, mw1388.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:50:23] PROBLEM - Check systemd state on mw1269 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:25] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [13:50:27] RECOVERY - SSH on dns1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:50:27] RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:50:35] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [13:50:37] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqiad.wikimedia.org, port=443): Read timed out. 
(read timeout=15),): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [13:50:41] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [13:50:49] RECOVERY - SSH on cloudelastic1004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:50:51] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:01] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:02] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:07] PROBLEM - PHP7 rendering on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:51:11] PROBLEM - PHP7 rendering on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:51:11] RECOVERY - SSH on kafka-jumbo1008 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:51:17] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:51:25] PROBLEM - SSH on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:51:27] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 6.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:51:31] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1087 
is OK: HTTP OK: HTTP/1.0 200 OK - 23583 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:51:45] PROBLEM - Check systemd state on mw1333 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:45] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:51] RECOVERY - Apache HTTP on mw1327 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:51:51] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:55] PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:55] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:01] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:03] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:52:09] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 6.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:52:17] PROBLEM - AuthDNS-over-TLS Works on authdns1001 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [13:52:19] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:21] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:21] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:27] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:31] PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:33] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:33] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:33] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:37] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:41] PROBLEM - Check systemd state on elastic1053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:43] PROBLEM - Apache HTTP on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:52:43] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:52:45] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:45] PROBLEM - PHP7 rendering on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:52:49] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1087 is OK: HTTP OK: HTTP/1.0 200 OK - 26031 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:52:49] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:52:49] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:49] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:51] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:52:55] RECOVERY - SSH on ms-be1056 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:52:57] PROBLEM - PHP7 rendering on mw1267 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:53:01] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:03] RECOVERY - PHP7 rendering on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 2.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:53:05] PROBLEM - Apache HTTP on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:53:06] <_joe_> we seem to be losing more servers [13:53:09] RECOVERY - SSH on analytics1077 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:53:11] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [13:53:12] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:29] PROBLEM - Apache HTTP on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:53:35] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [13:53:36] (03PS1) 10Ppchelko: Enable OAuthRateLimiter in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625914 (https://phabricator.wikimedia.org/T258423) [13:53:39] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:53:41] PROBLEM - Recursive DNS on 208.80.155.108 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [13:53:45] RECOVERY - Apache HTTP on mw1267 is OK: HTTP OK: HTTP/1.1 302 Found 
- 631 bytes in 8.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:53:45] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:45] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:45] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [13:53:49] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:51] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:53] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:53:57] RECOVERY - SSH on lvs1016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:53:57] (03CR) 10Ppchelko: [C: 04-2] "Depends on the train." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/625914 (https://phabricator.wikimedia.org/T258423) (owner: 10Ppchelko) [13:54:02] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:54:09] RECOVERY - Apache HTTP on mw1321 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 9.377 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:54:11] RECOVERY - PHP7 rendering on mw1281 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 9.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:54:11] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:54:13] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:54:19] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:54:19] RECOVERY - Auth DNS on dns1002 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [13:54:21] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:54:21] PROBLEM - PHP7 rendering on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:54:21] RECOVERY - SSH on analytics1076 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:54:22] RECOVERY - PHP7 rendering on mw1331 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.083 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:54:25] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [13:54:35] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:54:35] RECOVERY - PHP7 rendering on mw1287 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 9.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:54:39] RECOVERY - Apache HTTP on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:54:39] RECOVERY - PHP7 rendering on mw1403 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:54:41] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [13:54:45] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1090 is OK: HTTP OK: HTTP/1.0 200 OK - 25835 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:54:45] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.0 200 OK - 26020 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:54:55] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 0 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [13:54:59] RECOVERY - PHP7 rendering on mw1267 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 8.079 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:55:05] RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 3.110 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:55:09] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:55:09] PROBLEM - SSH on ms-be1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:55:13] RECOVERY - SSH on rdb1010 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:55:15] ag [13:55:17] RECOVERY - PHP7 rendering on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 6.086 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:55:18] ic [13:55:25] PROBLEM - SSH on an-worker1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:55:25] RECOVERY - SSH on mw1353 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:55:25] RECOVERY - Apache HTTP on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:55:35] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [13:55:37] PROBLEM - PHP7 rendering on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:55:43] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:55:52] RECOVERY - Apache HTTP on mw1400 is OK: HTTP OK: 
HTTP/1.1 302 Found - 631 bytes in 6.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:55:59] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:55:59] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 37 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [13:56:05] PROBLEM - SSH on an-worker1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:56:14] 08Warning Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Processor usage over 85% [13:56:15] RECOVERY - PHP7 rendering on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:56:21] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1087 is OK: HTTP OK: HTTP/1.1 200 Ok - 32219 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:56:25] PROBLEM - Kafka Broker Under Replicated Partitions on logstash1011 is CRITICAL: 182 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [13:56:25] RECOVERY - AuthDNS-over-TLS Works on authdns1001 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [13:56:39] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:56:41] 
PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 41 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [13:56:41] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 27 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [13:56:57] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:57:02] PROBLEM - Kafka Broker Under Replicated Partitions on logstash1010 is CRITICAL: 198 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010 [13:57:09] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:57:21] RECOVERY - SSH on an-worker1094 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:57:22] PROBLEM - Too many messages in kafka logging-eqiad #o11y on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1002 job=burrow partition={0,1,2} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-dataso [13:57:22] 
luster=logging-eqiad&var-topic=All&var-consumer_group=All [13:57:25] PROBLEM - Check systemd state on mw1372 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:31] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 67 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [13:57:35] RECOVERY - PHP7 rendering on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:57:39] PROBLEM - SSH on kafka-jumbo1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:57:41] PROBLEM - Apache HTTP on mw1405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:57:47] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 32 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [13:57:53] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [13:57:55] PROBLEM - Check systemd state on elastic1054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:57] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:58:05] RECOVERY - SSH on cp1090 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:58:13] PROBLEM - SSH on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:58:32] RECOVERY - Host wdqs1008 is UP: PING WARNING - Packet loss = 60%, RTA = 0.24 ms
[13:58:32] RECOVERY - Host mw1355 is UP: PING WARNING - Packet loss = 77%, RTA = 0.22 ms
[13:58:33] RECOVERY - Host logstash1012 is UP: PING WARNING - Packet loss = 77%, RTA = 0.23 ms
[13:58:33] RECOVERY - Host mw1361 is UP: PING WARNING - Packet loss = 71%, RTA = 0.22 ms
[13:58:33] RECOVERY - Host kafka-jumbo1009 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:58:33] RECOVERY - Host an-worker1093 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[13:58:33] RECOVERY - Host dumpsdata1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[13:58:34] RECOVERY - Host mw1356 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[13:58:34] RECOVERY - Host wtp1047 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:58:35] RECOVERY - Host ores1008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:58:35] RECOVERY - Host labweb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[13:58:36] RECOVERY - Host snapshot1007 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[13:58:36] RECOVERY - Host cp1088 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[13:58:37] Operations, Analytics-Radar, Traffic, Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (Vgutierrez)
[13:58:37] RECOVERY - Host aqs1006 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[13:58:37] RECOVERY - Host kafka-jumbo1006 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms
[13:58:38] RECOVERY - Host mw1362 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[13:58:38] RECOVERY - Host mw1357 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[13:58:39] RECOVERY - Host mw1350 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[13:58:39] RECOVERY - Host mw1351 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[13:58:40] RECOVERY - Host druid1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:58:41] RECOVERY - Host snapshot1009 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[13:58:41] RECOVERY - Host mw1354 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[13:58:42] RECOVERY - Host wtp1046 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[13:58:42] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[13:58:42] RECOVERY - Host mw1359 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[13:58:43] RECOVERY - puppetmaster backend https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[13:58:43] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:58:44] RECOVERY - Host an-presto1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[13:58:44] RECOVERY - Host ms-be1039 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[13:58:45] RECOVERY - Host ms-be1037 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[13:58:45] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:58:46] RECOVERY - SSH on conf1006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:58:46] RECOVERY - Host elastic1061 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[13:58:47] RECOVERY - Host mc1035 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[13:58:48] RECOVERY - Host mw1352 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[13:58:49] RECOVERY - Host wtp1048 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:58:49] RECOVERY - Host ms-be1056 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[13:58:50] RECOVERY - Host mc-gp1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[13:58:50] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 3.212 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:58:50] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:58:51] RECOVERY - Host ms-be1026 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[13:58:51] RECOVERY - Host es1018 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[13:58:52] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[13:58:52] RECOVERY - Host puppetmaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[13:58:53] RECOVERY - Host mw1358 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[13:58:53] RECOVERY - Host centrallog1001 is UP: PING OK - Packet loss = 0%, RTA = 6.12 ms
[13:58:55] RECOVERY - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 505 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[13:58:55] RECOVERY - Host mc1034 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[13:58:59] RECOVERY - Host mc1036 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[13:59:03] RECOVERY - Host elastic1060 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[13:59:07] RECOVERY - Host mc1033 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[13:59:09] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.0 200 OK - 23583 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:59:09] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp108[789].eqiad.wmnet
[13:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:15] RECOVERY - SSH on ms-be1043 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:59:17] RECOVERY - SSH on cp1087 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:59:18] !log depooling cp1087-1090
[13:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:25] RECOVERY - Host 2620:0:861:4:208:80:155:108 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:59:25] RECOVERY - Host restbase-dev1006 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:59:25] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp1090.eqiad.wmnet
[13:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:29] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:59:31] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:59:31] RECOVERY - Too many messages in kafka logging-eqiad #o11y on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[13:59:33] RECOVERY - PHP7 rendering on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:59:35] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:59:37] RECOVERY - SSH on kafka-jumbo1008 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:59:42] RECOVERY - Apache HTTP on mw1405 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:59:42] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:59:45] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 56, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:59:53] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:59:59] RECOVERY - Recursive DNS on 208.80.155.108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[14:00:05] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1090 is OK: HTTP OK: HTTP/1.0 200 OK - 23423 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:00:12] RECOVERY - SSH on an-worker1095 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:00:13] RECOVERY - SSH on cp1089 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:00:19] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:00:30] RECOVERY - Auth DNS #page on nsa-v4 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[14:00:32] PROBLEM - Check systemd state on kafka-jumbo1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:39] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[14:00:43] RECOVERY - Kafka Broker Under Replicated Partitions on logstash1011 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011
[14:01:13] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[14:01:23] RECOVERY - Kafka Broker Under Replicated Partitions on logstash1010 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010
[14:01:25] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[14:01:32] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:01:39] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[14:01:45] PROBLEM - Apache HTTP on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:01:49] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[14:01:59] PROBLEM - Host cp1088 is DOWN: PING CRITICAL - Packet loss = 100%
[14:01:59] PROBLEM - Host mw1362 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:01] PROBLEM - Host elastic1060 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:01] PROBLEM - Host es1018 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:05] PROBLEM - Host mw1352 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:07] PROBLEM - Host dumpsdata1002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:07] PROBLEM - Check systemd state on ms-be1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:02:07] PROBLEM - Host mc1035 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:07] PROBLEM - Host mw1359 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:09] PROBLEM - Host centrallog1001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:11] PROBLEM - Host wtp1047 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:11] PROBLEM - Host mw1355 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:13] PROBLEM - Host wdqs1008 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:15] Warning Device asw2-d-eqiad.mgmt.eqiad.wmnet recovered from Processor usage over 85%
[14:02:19] PROBLEM - Host ms-be1037 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:23] PROBLEM - Host snapshot1007 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:27] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:02:27] PROBLEM - Host mw1354 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:31] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[14:02:37] PROBLEM - Host aqs1006 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:37] PROBLEM - Host mc1034 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:39] PROBLEM - Host puppetmaster1002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:43] PROBLEM - Host mw1350 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:45] PROBLEM - Host elastic1061 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:51] PROBLEM - Host mw1357 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:51] PROBLEM - Host logstash1012 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:52] <_joe_> there we go again
[14:02:57] PROBLEM - Host mw1361 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:59] PROBLEM - Host an-presto1003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:03] PROBLEM - Host mw1356 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:05] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 876.3 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:03:07] PROBLEM - Host ms-be1039 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:27] PROBLEM - Host restbase-dev1006 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:27] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 3 https://wikitech.wikimedia.org/wiki/HAProxy
[14:03:28] <_joe_> oh sigh
[14:03:37] PROBLEM - Host kafka-jumbo1009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:45] <_joe_> elukey: we might need to disable replication in mcrouter
[14:03:48] PROBLEM - Host labweb1002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:53] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[14:03:53] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:03:57] PROBLEM - Host wtp1048 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:01] PROBLEM - Host druid1003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:01] PROBLEM - Host mc1036 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:01] PROBLEM - Host mw1358 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:01] PROBLEM - Host kafka-jumbo1006 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:07] PROBLEM - Host snapshot1009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:07] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:07] PROBLEM - Host wtp1046 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:09] PROBLEM - Host mw1351 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:15] PROBLEM - Host ores1008 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:15] PROBLEM - Host mc1033 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:33] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100%
[14:05:01] PROBLEM - PHP7 rendering on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:05:41] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/unique-devices/{proj
[14:05:41] }/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) timed out before a response was received: /analytics.w
[14:05:41] dits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipe https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:05:47] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[14:05:55] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:05:59] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
[14:05:59] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:06:02] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:06:07] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:06:09] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:06:19] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:06:33] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[14:06:33] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:06:37] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:06:51] PROBLEM - SSH on dbproxy1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:06:55] PROBLEM - Check systemd state on kafka-jumbo1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:58] PROBLEM - Auth DNS #page on nsa-v4 is CRITICAL: DNS_QUERY CRITICAL - no socket TCP[198.35.27.27] Connection timed out https://wikitech.wikimedia.org/wiki/DNS
[14:06:59] PROBLEM - Check systemd state on thanos-be1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:59] PROBLEM - SSH on backup1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:07:03] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/pe
[14:07:03] t}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:07:15] PROBLEM - puppetmaster backend https on puppetmaster1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 403 Forbidden https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[14:07:19] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[14:07:25] PROBLEM - SSH on conf1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:07:29] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:35] PROBLEM - SSH on elastic1064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:07:35] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:07:57] PROBLEM - Host mc-gp1003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:07:59] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:07:59] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:08:05] PROBLEM - Check systemd state on restbase1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:05] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:08:07] PROBLEM - Check systemd state on an-druid1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:13] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:08:19] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[14:08:22] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1319.eqiad.wmnet, mw1395.eqiad.wmnet, mw1324.eqiad.wmnet, mw1367.eqiad.wmnet, mw1322.eqiad.wmnet, mw1333.eqiad.wmnet, mw1401.eqiad.wmnet, mw1403.eqiad.wmnet, mw1327.eqiad.wmnet, mw1328.eqiad.wmnet, mw1413.eqiad.wmnet, mw1369.eqiad.wmnet, mw1261.eqiad.wmnet, mw1405.eqiad.wmnet, mw1265.eqiad.wmnet are marked dow
[14:08:22] _80: Servers mw1342.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:08:35] PROBLEM - Check systemd state on es1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:47] RECOVERY - SSH on dbproxy1016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:08:51] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[14:08:51] PROBLEM - Host an-worker1093 is DOWN: PING CRITICAL - Packet loss = 100%
[14:08:53] PROBLEM - PHP7 rendering on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:09:01] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:17] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:09:19] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:19] PROBLEM - Check systemd state on kafka-main1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:27] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:28] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) @Marostegui please see below Hello Papaul, After looking over the TSR and the link you'd sent me regarding the troubleshooting for t...
[14:09:31] RECOVERY - SSH on elastic1064 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:09:35] PROBLEM - Host ms-be1056 is DOWN: PING CRITICAL - Packet loss = 100%
[14:09:39] PROBLEM - PHP7 rendering on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:09:41] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:42] PROBLEM - Check systemd state on mw1287 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:42] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:49] PROBLEM - Check systemd state on mw1274 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:52] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:55] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:09:57] PROBLEM - Check systemd state on mw1371 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:03] PROBLEM - SSH on cloudelastic1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:10:11] PROBLEM - Check systemd state on elastic1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:11] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:11] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:12] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:15] PROBLEM - SSH on rdb1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:10:25] PROBLEM - Apache HTTP on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:27] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:31] PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:31] PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:35] PROBLEM - Check systemd state on etcd1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:41] PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:43] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:10:45] PROBLEM - Check systemd state on dbproxy1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:45] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:51] PROBLEM - Check systemd state on db1137 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:51] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:51] RECOVERY - PHP7 rendering on mw1274 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:57] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[14:10:57] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:59] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:59] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:59] PROBLEM - SSH on an-worker1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:11:01] PROBLEM - PHP7 rendering on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:11:11] PROBLEM - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:13] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:11:25] PROBLEM - PHP7 rendering on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:11:25] PROBLEM - Check systemd state on cloudelastic1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:25] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[14:11:27] RECOVERY - puppetmaster backend https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[14:11:31] PROBLEM - Host ms-be1026 is DOWN: PING CRITICAL - Packet loss = 100%
[14:11:31] PROBLEM - Apache HTTP on mw1399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:11:32] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:32] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:35] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[14:11:37] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 30 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[14:11:39] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 19 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[14:11:39] PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:41] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:11:45] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:47] RECOVERY - PHP7 rendering on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 9.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:11:51] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [14:11:52] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:53] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:57] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:57] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [14:11:59] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:12:01] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance=mc1034 site=eqiad tunnel=mc2034_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:12:03] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:12:09] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:12:17] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:12:17] RECOVERY - SSH on rdb1010 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:12:22] PROBLEM - Check systemd state on mw1369 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:23] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:12:29] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 67 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [14:12:29] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:12:31] PROBLEM - Apache HTTP on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:12:33] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 3.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:12:33] RECOVERY - Apache HTTP on mw1403 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 6.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:12:33] PROBLEM - Check systemd state on mw1400 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:36] PROBLEM - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:12:37] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [14:12:45] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 16 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [14:12:49] RECOVERY - PHP7 rendering on mw1321 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 3.085 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:12:51] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1395.eqiad.wmnet, mw1401.eqiad.wmnet, mw1411.eqiad.wmnet, mw1274.eqiad.wmnet, mw1388.eqiad.wmnet, mw1372.eqiad.wmnet, mw1328.eqiad.wmnet, mw1267.eqiad.wmnet, mw1399.eqiad.wmnet, mw1326.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [14:12:51] PROBLEM - Recursive DNS on 208.80.155.108 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [14:12:57] PROBLEM - PHP7 rendering on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:01] RECOVERY - SSH on an-worker1095 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:13:03] PROBLEM 
- Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 13 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [14:13:05] RECOVERY - PHP7 rendering on mw1327 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 3.593 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:09] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [14:13:09] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:13:11] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:13:15] RECOVERY - SSH on backup1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:13:19] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:13:20] RECOVERY - Auth DNS #page on nsa-v4 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [14:13:23] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 9.104 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [14:13:25] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={mc2034,mc2035} site=codfw tunnel={mc1034_v4,mc1035_v4} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:13:29] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1087 is OK: HTTP OK: HTTP/1.1 200 Ok - 32220 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:13:29] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [14:13:32] PROBLEM - SSH on ms-be1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:13:33] RECOVERY - Apache HTTP on mw1399 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:38] !log dns1002 - disable puppet + bird service (stop advertising recdns from row D) [14:13:38] !log drain kubernetes1013, kubernetes1004. 
They are on row D [14:13:39] PROBLEM - Apache HTTP on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:13:39] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:45] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.584 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:45] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6145 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:52] RECOVERY - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 505 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [14:13:53] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:13:57] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [14:14:07] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:14:07] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1087 is OK: 
HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:14:19] PROBLEM - Host 2620:0:861:4:208:80:155:108 is DOWN: PING CRITICAL - Packet loss = 100% [14:14:27] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [14:14:35] RECOVERY - Apache HTTP on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:14:35] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:38] RECOVERY - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:14:41] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:59] RECOVERY - PHP7 rendering on mw1267 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:15:05] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:15:15] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:15:19] RECOVERY - Host logstash1012 is UP: PING WARNING - Packet loss = 33%, RTA = 0.19 ms [14:15:19] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.1 200 Ok - 32228 bytes in 7.420 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:15:21] RECOVERY - Host es1018 is UP: PING WARNING - Packet loss = 71%, RTA = 0.22 ms [14:15:21] RECOVERY - Host mc1035 is UP: PING WARNING - Packet loss = 33%, RTA = 0.28 ms [14:15:21] RECOVERY - Host cp1088 is UP: PING WARNING - Packet loss = 60%, RTA = 0.19 ms [14:15:21] RECOVERY - Host centrallog1001 is UP: PING WARNING - Packet loss = 33%, RTA = 4.19 ms [14:15:21] RECOVERY - Host kafka-jumbo1009 is UP: PING WARNING - Packet loss = 71%, RTA = 0.28 ms [14:15:22] RECOVERY - Host an-worker1093 is UP: PING WARNING - Packet loss = 90%, RTA = 0.23 ms [14:15:22] RECOVERY - Host mw1361 is UP: PING WARNING - Packet loss = 90%, RTA = 0.19 ms [14:15:23] RECOVERY - Host mw1355 is UP: PING WARNING - Packet loss = 33%, RTA = 0.23 ms [14:15:23] RECOVERY - Host stat1005 is UP: PING WARNING - Packet loss = 33%, RTA = 0.25 ms [14:15:24] RECOVERY - Host mw1359 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:15:25] RECOVERY - Host mw1356 is UP: PING OK - Packet loss = 
0%, RTA = 0.21 ms [14:15:25] RECOVERY - Host aqs1006 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:15:25] RECOVERY - Host mw1352 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [14:15:26] RECOVERY - Host ms-be1056 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:15:27] RECOVERY - Host elastic1061 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:15:27] RECOVERY - Host wtp1046 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:15:27] RECOVERY - Host mc1033 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:15:28] RECOVERY - Host mw1362 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [14:15:29] RECOVERY - Host mw1358 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [14:15:29] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [14:15:29] RECOVERY - Host wtp1048 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:15:30] RECOVERY - Host mw1350 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [14:15:30] RECOVERY - Host ms-be1026 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [14:15:31] RECOVERY - Host snapshot1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:15:31] RECOVERY - Host mw1354 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [14:15:32] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:15:33] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:15:33] RECOVERY - Host mc1034 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [14:15:33] RECOVERY - Host druid1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:15:34] RECOVERY - Host elastic1060 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:15:35] RECOVERY - Host restbase-dev1006 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:15:35] <_joe_> here we go XioNoX [14:15:35] RECOVERY - Host mw1357 
is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:15:36] PROBLEM - Check systemd state on kafka-jumbo1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:36] RECOVERY - SSH on ms-be1059 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:15:37] RECOVERY - Host mc-gp1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [14:15:37] RECOVERY - Host puppetmaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:15:38] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:15:39] RECOVERY - PHP7 rendering on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:15:39] RECOVERY - PHP7 rendering on mw1331 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:15:39] RECOVERY - Host mw1351 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:15:40] RECOVERY - Host kafka-jumbo1006 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [14:15:40] RECOVERY - Host labweb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:15:41] RECOVERY - Host ores1008 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:15:42] RECOVERY - Apache HTTP on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 630 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:15:42] RECOVERY - Host wdqs1008 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:15:45] RECOVERY - Host dumpsdata1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:15:47] PROBLEM - Check systemd state on cloudelastic1004 is CRITICAL: CRITICAL - degraded: The 
system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:47] RECOVERY - Host wtp1047 is UP: PING OK - Packet loss = 0%, RTA = 2.90 ms [14:15:47] RECOVERY - Host ms-be1037 is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms [14:15:47] RECOVERY - Host ms-be1039 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [14:15:47] RECOVERY - Host an-presto1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:15:51] RECOVERY - SSH on conf1006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:15:55] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1090 is OK: HTTP OK: HTTP/1.1 200 Ok - 32004 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:15:57] RECOVERY - Host snapshot1009 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:15:59] PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:09] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1090 is OK: HTTP OK: HTTP/1.0 200 OK - 25825 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:16:09] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.0 200 OK - 26012 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:16:11] PROBLEM - Check systemd state on flerovium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:11] RECOVERY - Host mc1036 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:16:19] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1087 is OK: HTTP OK: HTTP/1.0 200 OK - 26029 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:16:19] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:16:25] RECOVERY - SSH on cloudelastic1004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:16:35] RECOVERY - Apache HTTP on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:53] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:55] RECOVERY - Host 2620:0:861:4:208:80:155:108 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:16:57] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:16:57] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:17:01] PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:01] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.1 200 Ok - 35308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:17:05] RECOVERY - Recursive DNS on 208.80.155.108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [14:17:13] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1087 is OK: HTTP OK: HTTP/1.1 200 Ok - 35323 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:17:13] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:13] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:19] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:19] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:19] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:19] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:17:19] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:19] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [14:17:25] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:17:29] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:31] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:41] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:17:41] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:17:51] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:55] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:55] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:17:55] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:57] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:05] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: 
Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:18:07] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:07] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [14:18:09] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:18:15] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:15] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:15] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:15] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:25] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:18:25] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:18:25] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:18:27] 
RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:27] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:31] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:37] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:18:39] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:45] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:47] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:47] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:47] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:49] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [14:18:52] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:59] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:19:11] PROBLEM - Bird Internet Routing Daemon on dns1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:19:17] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:20:03] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 593 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:20:07] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [14:20:11] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [14:20:25] godog: --^ [14:20:26] PROBLEM - Kafka Broker Under Replicated Partitions on logstash1010 is CRITICAL: 219 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010 [14:20:27] PROBLEM - Druid middlemanager on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:20:43] PROBLEM - 
Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:45] PROBLEM - Druid overlord on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:20:47] PROBLEM - Too many messages in kafka logging-eqiad #o11y on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={logback-info,rsyslog-err,rsyslog-info,rsyslog-notice,rsyslog-warning,udp_localhost-err,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [14:20:49] PROBLEM - Druid broker on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:20:56] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [14:20:59] PROBLEM - Check systemd state on druid1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:08] PROBLEM - Kafka Broker Server #page on logstash1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:21:11] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [14:21:17] PROBLEM - Druid coordinator on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:21:29] elukey: sadly known :( centrallog1001 is one of the hosts affected [14:21:39] PROBLEM - Druid historical on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:21:41] ah snap sorry for the ping [14:21:47] PROBLEM - Check systemd state on logstash1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:19] <_joe_> ferm seems to be failing almost everywhere [14:22:45] RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:13] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1386 days) https://wikitech.wikimedia.org/wiki/Logs [14:24:41] RECOVERY - Check systemd state on mw1371 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:14] I'm bouncing ferm on affected hosts, on e.g. mw1371 it was caused by prom1004 failing to resolve [14:25:16] RECOVERY - Kafka Broker Server #page on logstash1012 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:25:53] RECOVERY - Check systemd state on logstash1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:35] RECOVERY - Kafka Broker Under Replicated Partitions on logstash1010 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010 [14:26:57] RECOVERY - Check systemd state on es1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:25] RECOVERY - Check systemd state on es1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:31] PROBLEM - Host mc1033 is DOWN: PING CRITICAL - Packet loss = 100% [14:27:47] RECOVERY - Check systemd 
state on gerrit1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:15] <_joe_> uh [14:28:22] <_joe_> what happened to mc1033? [14:28:22] RECOVERY - Check systemd state on elastic1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:52] RECOVERY - Druid overlord on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:28:55] RECOVERY - Check systemd state on restbase1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:55] RECOVERY - Druid broker on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:28:57] RECOVERY - Check systemd state on elastic1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:03] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:03] RECOVERY - Check systemd state on druid1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:15] RECOVERY - Check systemd state on elastic1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:19] RECOVERY - Check systemd state on elastic1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:21] RECOVERY - Druid coordinator on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server 
coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:29:27] can't even connect to mc1033's mgmt [14:29:39] RECOVERY - Host mc1033 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:29:41] RECOVERY - Druid historical on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:29:54] moritzm: well, it's up now [14:30:01] yeah, it rebooted apparently [14:30:11] 0 uptime, checking logs [14:30:15] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) @Papaul the host is depooled, we can power it off for you whenever you like [14:30:31] RECOVERY - Druid middlemanager on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:30:51] RECOVERY - Check systemd state on mw1372 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:51] RECOVERY - Check systemd state on mw1369 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:03] RECOVERY - Check systemd state on mw1400 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:11] RECOVERY - Check systemd state on mw1321 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:13] !log restarted ssh on mc1033 from console [14:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:17] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 
+0000 (expires in 1530 days) https://wikitech.wikimedia.org/wiki/Logs [14:31:23] RECOVERY - Check systemd state on mw1333 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:29] RECOVERY - Check systemd state on mw1320 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:05] RECOVERY - Check systemd state on mw1269 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:13] per SEL mc1033 went down with "power supply unplugged" for PS1 [14:32:23] RECOVERY - Check systemd state on mw1287 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:31] RECOVERY - Check systemd state on mw1274 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:36] but those should be redundant I'd think? 
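The icinga-wm bot lines throughout this log share one fixed shape: `[HH:MM:SS] PROBLEM|RECOVERY - <check name> on <host> is <STATE>: <details> <runbook URL>`. A minimal parsing sketch of that shape (the regex is inferred from the lines above, not from icinga-wm's actual source; `Host X is UP/DOWN` lines use a different layout and are deliberately not matched here):

```python
import re

# Grammar inferred from the alert lines in this log:
#   [HH:MM:SS] PROBLEM|RECOVERY - <check> on <host> is <STATE>: <details>
# The non-greedy <check> plus the "is <STATE>" anchor lets check names that
# themselves contain "on" (e.g. "rsyslog TLS listener on port 6514") parse
# correctly, because backtracking keeps extending the check name until a
# "<host> is <STATE>" tail fits.
ALERT_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\]\s+"
    r"(?P<kind>PROBLEM|RECOVERY)\s+-\s+"
    r"(?P<check>.+?)\s+on\s+(?P<host>\S+)\s+is\s+"
    r"(?P<state>OK|WARNING|CRITICAL|UNKNOWN):\s*"
    r"(?P<details>.*)"
)

def parse_alert(line: str):
    """Return a dict of alert fields, or None for non-alert chatter."""
    m = ALERT_RE.search(line)
    return m.groupdict() if m else None
```

For example, feeding it the druid1003 middlemanager line from above yields `kind="PROBLEM"`, `check="Druid middlemanager"`, `host="druid1003"`, `state="CRITICAL"`.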
[14:32:49] moritzm: yes got rebooted [14:32:49] a reminder that conversation and work about the incident is happening in the wikimedia-sre channel [14:33:56] !log bouncing ferm on hosts where ferm.service failed due to DNS resolution issues for prometheus hosts [14:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:15] <_joe_> moritzm: I did it already for the mw* [14:34:47] RECOVERY - Check systemd state on an-druid1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:01] RECOVERY - Check systemd state on ms-be1038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:06] ack, there's a few more for swift and more I'm currently doing [14:36:15] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:15] I am doing some analytics ones [14:36:57] RECOVERY - Check systemd state on ms-be1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:57] RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:01] RECOVERY - Check systemd state on etcd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:07] RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:11] RECOVERY - Check systemd state on analytics1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:45] RECOVERY - Check systemd state on kafka-main1002 is OK:
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:07] RECOVERY - Check systemd state on flerovium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:31] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.00445 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:38:33] RECOVERY - Check systemd state on mc1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:37] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:15] RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:19] RECOVERY - Check systemd state on db1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:27] RECOVERY - Check systemd state on kafka-jumbo1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:35] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:45] RECOVERY - Check systemd state on db1143 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:45] RECOVERY - Check systemd state on kubernetes1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:01] RECOVERY - Check systemd state on db1137 is OK: OK - running: The system 
is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:06] (03Abandoned) 10Andrew Bogott: Revert "Nova/Neutron: set dhcp_domain to eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/625757 (owner: 10Andrew Bogott) [14:41:23] RECOVERY - Check systemd state on thanos-be1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:45] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [14:41:49] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:29] RECOVERY - Check systemd state on restbase1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:29] RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:51] RECOVERY - Check systemd state on dbproxy1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:25] RECOVERY - Check systemd state on cloudelastic1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:00] !log reboot asw2-d3-eqiad [14:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:01] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [14:45:28] (03PS1) 10Jbond: icinga: add bblack alternate case for correct authorisation [puppet] - 10https://gerrit.wikimedia.org/r/625925 [14:45:44] !log restarting bacula-dir @ backup1001 [14:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:24] yeah, some ongoing full backups got errors, no big deal [14:46:36] (03CR) 10Jbond: [C: 03+2] icinga: add bblack alternate case for correct authorisation [puppet] - 10https://gerrit.wikimedia.org/r/625925 (owner: 10Jbond) [14:46:37] I will check all daemons look healthy, it needed a restart [14:50:06] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.0006357 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:53:33] (03CR) 10Elukey: [C: 03+2] sre.hadoop.roll-restart-masters.py: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/625913 (owner: 10Elukey) [14:53:37] !log Reload dbproxy1016 to recover the alert [14:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:45] <_joe_> jouncebot: next [14:57:45] In 1 hour(s) and 2 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T1600) [15:02:56] !log request virtual-chassis vc-port set pic-slot 1 member 1 port 1 [15:02:56] !log request virtual-chassis vc-port set pic-slot 0 member 2 port 50 [15:02:56] !log request virtual-chassis vc-port set pic-slot 1 member 4 port 0 [15:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:00] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=kubernetes1004.* [15:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:06] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:42] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: service=kubesvc,name=kubernetes1013.* [15:03:44] RECOVERY - Host dbproxy1017 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [15:03:44] RECOVERY - Host pc1010 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:03:44] RECOVERY - Host scb1004 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:03:44] RECOVERY - Host wdqs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:03:44] RECOVERY - Host kubernetes1004 is UP: PING WARNING - Packet loss = 75%, RTA = 0.25 ms [15:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:46] RECOVERY - Host sessionstore1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:03:46] RECOVERY - Host wtp1044 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [15:03:46] RECOVERY - Host restbase1018 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [15:03:46] RECOVERY - Host rdb1006 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:03:47] RECOVERY - Host thorium is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:03:47] RECOVERY - Host stat1006 is UP: PING WARNING - Packet loss = 50%, RTA = 0.20 ms [15:03:48] RECOVERY - Host mw1363 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:03:48] RECOVERY - Host releases1002 is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms [15:03:48] RECOVERY - Host ores1007 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:03:49] RECOVERY - Host wtp1043 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:03:50] RECOVERY - Host mw1364 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:03:50] RECOVERY - Host mw1365 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:03:50] RECOVERY - Host wtp1045 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [15:03:51] RECOVERY - Host restbase1025 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [15:03:52] 
RECOVERY - Host kubernetes1013 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:03:58] RECOVERY - Host elastic1062 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [15:04:00] RECOVERY - Host maps1004 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:04:00] RECOVERY - Host elastic1063 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [15:04:00] RECOVERY - Host ganeti1019 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [15:04:08] RECOVERY - Host aqs1009 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [15:04:12] RECOVERY - Host eventlog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [15:04:12] RECOVERY - Host schema1004 is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [15:05:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:54] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [15:06:20] RECOVERY - IPMI Sensor Status on pc1010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:07:32] 04Critical Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Emergency syslog message [15:07:48] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:56] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.02056 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:08:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:08:44] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:08:44] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 60, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:08:50] RECOVERY - Bird Internet Routing Daemon on dns1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:08:54] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 250, down: 5, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:08:56] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:10:22] !log Start mysql on db1106 after PDU maintenance is done [15:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:32] Critical (cleared): Device asw2-d-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message [15:13:10] !log rolling restart of elk5 logstashes [15:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:36] !log repool cp1087-90 (eqiad row D) [15:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:52] nice :) [15:14:58] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp108[789].eqiad.wmnet [15:14:59] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp1090.eqiad.wmnet [15:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:20] <_joe_> !log starting wdqs-updater on wdqs1005 [15:16:24] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:40] (03PS1) 10Jbond: bird: ensure bird service is running [puppet] - 10https://gerrit.wikimedia.org/r/625926 [15:17:34] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:16] PROBLEM - ores_workers_running on ores1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [15:18:32] PROBLEM - Check systemd state on ores1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:34] PROBLEM - MariaDB Replica Lag: pc1 on pc1010 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4819.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:18:49] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters [15:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:55] checking pc1010, I am sure it is expected [15:19:17] (03CR) 10Jbond: "ready for review - PCC: https://puppet-compiler.wmflabs.org/compiler1002/24994/" [puppet] - 10https://gerrit.wikimedia.org/r/625926 (owner: 10Jbond) [15:19:31] <_joe_> !log restarted ferm on wdqs1011 [15:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:49] pc1010 recovering nicely, catching up from pc1007 [15:19:59] I will ack it [15:20:48] <_joe_> !log restarted celery-ores-worker.service on ores1007 [15:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:26] trying to remove noise right now to focus on possible leftovers [15:22:12] RECOVERY - Check systemd state on ores1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:48] RECOVERY - ores_workers_running on 
ores1007 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [15:26:38] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) [15:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:53] (03PS5) 10Ottomata: Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) [15:30:03] 10Operations, 10Advanced-Search, 10Discovery-Search, 10Traffic, and 2 others: Strange URL pattern after search https://en.wikipedia.org/w/index.php?sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance ... - https://phabricator.wikimedia.org/T243884 (10jcrespo) I am CCin... [15:30:03] !log roll restart of hadoop master daemons on an-master100[1,2] after the cookbook failed [15:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:32] (03CR) 10jerkins-bot: [V: 04-1] Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:31:44] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24995/" [puppet] - 10https://gerrit.wikimedia.org/r/625890 (owner: 10Muehlenhoff) [15:32:30] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: service=kubesvc,name=kubernetes1013.* [15:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:08] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (10Jclark-ctr) [15:33:31] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (10Jclark-ctr) [15:34:34] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge 
(W)0.006 ge 0.002491 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:34:43] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes1004.* [15:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:03] (03PS6) 10Hnowlan: api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) [15:39:46] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Add $site parameter to wmflib::service:get_url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:43:44] 10Operations, 10netops: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290 (10ayounsi) p:05Triage→03High [15:45:47] (03CR) 10Ebernhardson: [C: 03+1] Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [15:47:58] (03PS6) 10Ottomata: Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) [15:48:34] (03CR) 10jerkins-bot: [V: 04-1] Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:49:23] (03PS1) 10Jbond: pki: drop ocsp vhost and serve over http [puppet] - 10https://gerrit.wikimedia.org/r/625929 (https://phabricator.wikimedia.org/T259117) [15:49:45] (03PS7) 10Ottomata: Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) [15:51:59] (03CR) 10JMeybohm: [C: 03+1] Allow restbase https from the default policy too [deployment-charts] - 10https://gerrit.wikimedia.org/r/625900 (owner: 10Giuseppe Lavagetto) [15:52:38] (03CR) 10Ayounsi: "From 
https://wikitech.wikimedia.org/wiki/Anycast" [puppet] - 10https://gerrit.wikimedia.org/r/625926 (owner: 10Jbond) [15:53:42] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/624732 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:54:07] (03PS1) 10ArielGlenn: add a sample job for illustration purposes [dumps] - 10https://gerrit.wikimedia.org/r/625930 [15:55:08] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/24997/" [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:56:00] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM but check the puppet compiler ofc." [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:58:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:58:08] (03CR) 10Jbond: [C: 04-1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/625926 (owner: 10Jbond) [15:58:41] (03CR) 10Ottomata: [C: 03+2] Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:58:53] (03CR) 10Ottomata: [C: 03+2] Add $site parameter to wmflib::service:get_url (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:59:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] Allow restbase https from the default policy too [deployment-charts] - 10https://gerrit.wikimedia.org/r/625900 (owner: 10Giuseppe Lavagetto) [16:00:05] jbond42 and cdanis: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T1600). 
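The PROBLEM/RECOVERY pairs scattered through this log (the purged queue alerts at the top, the wdqs1005 and ores1007 systemd alerts above) can be matched up to measure time-to-recovery. A small sketch, assuming events have already been parsed into `(time, kind, key)` tuples, arrive in timestamp order, and that at most one alert is open per key at a time (in practice `key` would be a `(check, host)` pair):

```python
from datetime import datetime, timedelta

def outage_durations(events):
    """Pair each PROBLEM with the next RECOVERY for the same key.

    events: iterable of (time 'HH:MM:SS', kind 'PROBLEM'|'RECOVERY', key).
    Returns {key: seconds}. Timestamps carry no date, so a negative delta
    is assumed to mean the alert crossed midnight once.
    """
    open_at = {}      # key -> datetime of the first unresolved PROBLEM
    durations = {}    # key -> seconds from PROBLEM to RECOVERY
    for t, kind, key in events:
        ts = datetime.strptime(t, "%H:%M:%S")
        if kind == "PROBLEM":
            open_at.setdefault(key, ts)   # keep the earliest open PROBLEM
        elif kind == "RECOVERY" and key in open_at:
            delta = ts - open_at.pop(key)
            if delta < timedelta(0):      # crossed midnight
                delta += timedelta(days=1)
            durations[key] = int(delta.total_seconds())
    return durations
```

Applied to the purged alerts at the top of this log, cp5009 (PROBLEM 00:10:25, RECOVERY 00:14:13) comes out at 228 seconds and cp3050 (00:11:31 to 00:15:21) at 230.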
[16:00:17] (Merged) jenkins-bot: Allow restbase https from the default policy too [deployment-charts] - https://gerrit.wikimedia.org/r/625900 (owner: Giuseppe Lavagetto)
[16:02:19] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[16:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:51] (PS1) Elukey: sre.hadoop.roll-restart-masters.py: improve logging and sleep times [cookbooks] - https://gerrit.wikimedia.org/r/625932
[16:03:08] (CR) Jbond: [C: +2] pki: drop ocsp vhost and serve over http [puppet] - https://gerrit.wikimedia.org/r/625929 (https://phabricator.wikimedia.org/T259117) (owner: Jbond)
[16:03:13] Operations, Advanced-Search, Discovery-Search, Traffic, and 2 others: Strange URL pattern after search https://en.wikipedia.org/w/index.php?sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance ... - https://phabricator.wikimedia.org/T243884 (Jdlrobson) It only...
[16:03:13] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[16:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:46] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[16:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:04] Operations, Wikidata, Wikidata-Query-Service, User-Smalyshev, cloud-services-team (Kanban): Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (dcausse) t206636 and wcqs-beta-01 are behind the se...
[16:04:12] Operations, fundraising-tech-ops, netops, observability: update nagios_nsca configuration in frack for new nsca servers - https://phabricator.wikimedia.org/T262291 (Jgreen)
[16:08:17] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) Dell mentioned that it is something to do with the OS and requested the sosreport. since we can not share that information with them i...
[16:10:52] (PS1) Cwhite: parse_service_problem doesn't need instance-local data move parse_service problem to global function and have am import it clean up imports [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625934
[16:11:42] (CR) Elukey: [C: +2] sre.hadoop.roll-restart-masters.py: improve logging and sleep times [cookbooks] - https://gerrit.wikimedia.org/r/625932 (owner: Elukey)
[16:11:49] (CR) Cwhite: Add Icinga AM client (1 comment) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625661 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[16:12:03] !log 1.36.0-wmf.8 was branched at e81e81e91473cc8259c473165863aca8ecea2784 for T257976
[16:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:11] T257976: 1.36.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T257976
[16:12:48] (CR) Jeena Huneidi: [C: +2] Branch commit for wmf/1.36.0-wmf.8 [core] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/625745 (https://phabricator.wikimedia.org/T257976) (owner: TrainBranchBot)
[16:15:42] (PS1) Giuseppe Lavagetto: cxserver: enable the service proxy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625935 (https://phabricator.wikimedia.org/T255879)
[16:15:44] (PS1) Giuseppe Lavagetto: mobileapps: make template for the restbase uri configurable [deployment-charts] - https://gerrit.wikimedia.org/r/625936 (https://phabricator.wikimedia.org/T255876)
[16:15:46] (PS1) Giuseppe Lavagetto: mobileapps: use the service proxy for all calls in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625937 (https://phabricator.wikimedia.org/T255876)
[16:16:37] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) @Papaul there is nothing really on the OS that we've seen that could cause these crashes. What we did on both crashes is the same:...
[16:22:46] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul)
[16:24:33] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul) @Gehel are you the right person for those servers? If yes i need to know what hardware raid type we are going to use. Thanks
[16:24:42] (PS2) Filippo Giunchedi: Add Icinga AM client [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625661 (https://phabricator.wikimedia.org/T258948)
[16:25:03] (CR) Filippo Giunchedi: Add Icinga AM client (1 comment) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625661 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[16:25:51] (PS19) Ottomata: Canary events refinery job [puppet] - https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609)
[16:25:53] (PS2) Ottomata: wgEventStreams - Set canary_events_enabled: true for eventgate test streams [mediawiki-config] - https://gerrit.wikimedia.org/r/622876 (https://phabricator.wikimedia.org/T251609)
[16:26:05] (CR) Filippo Giunchedi: [C: +1] "LGTM, eventually sending out problems will live in am.py exclusively (i.e. parse_service_problem)" [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625934 (owner: Cwhite)
[16:26:53] (CR) jerkins-bot: [V: -1] Canary events refinery job [puppet] - https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[16:28:02] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.861e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:28:13] (PS20) Ottomata: Canary events refinery job [puppet] - https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609)
[16:28:37] (PS3) Ottomata: wgEventStreams - Set canary_events_enabled: true for eventgate test streams [mediawiki-config] - https://gerrit.wikimedia.org/r/622876 (https://phabricator.wikimedia.org/T251609)
[16:31:54] (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.8 [core] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/625745 (https://phabricator.wikimedia.org/T257976) (owner: TrainBranchBot)
[16:32:23] 40K memcache errors/minute on eqiad
[16:32:40] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops&viewPanel=1
[16:33:08] (CR) Ottomata: [C: +2] wgEventStreams - Set canary_events_enabled: true for eventgate test streams [mediawiki-config] - https://gerrit.wikimedia.org/r/622876 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[16:33:28] (PS21) Ottomata: Canary events refinery job [puppet] - https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609)
[16:34:13] !log increased elk5 logstash JVM heaps to 2g (to help decrease kafka-logging consumer lag)
[16:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:38] maybe just backlog being counted now?
[16:35:03] logstash seems to have 2 hours of delay
[16:35:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:35:31] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgEventStreams: Set canary_events_enabled: true for eventgate test streams and eventlogging_Test - T251609 (duration: 00m 58s)
[16:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:41] T251609: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events - https://phabricator.wikimedia.org/T251609
[16:35:59] jynus: yeah logstash is lagging and catching up now
[16:36:13] so some rate-based alerts are only alerting now
[16:39:26] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 815.7 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:39:42] at least it spits out the recoveries pretty fast too
[16:40:25] I wonder if there is any theoretical way to avoid or mitigate this when a network partition happens?
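(Editorial note: the elk5 heap bump and the "Too many messages in kafka logging-eqiad" alert later in this log both revolve around consumer lag: how far the logstash consumer group's committed offsets trail the partitions' log-end offsets. A toy computation of that quantity, with partition numbers and offsets invented for illustration; in production the same difference is exported by a Kafka lag exporter and graphed on the kafka-consumer-lag dashboard, not computed client-side like this.)

```python
def consumer_lag(end_offsets, committed):
    """Per-partition lag: log-end offset minus the consumer
    group's committed offset (0 if nothing committed yet).
    A backlog like the 2-hour logstash delay above shows up
    as a large, slowly draining value here."""
    return {
        partition: end_offsets[partition] - committed.get(partition, 0)
        for partition in end_offsets
    }
```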
[16:46:17] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (Papaul)
[16:46:34] jynus: yeah it is a WIP, "mitigation" in this case will be to calculate the metrics from elasticsearch queries, as opposed to logstash so the metrics will reflect what's been indexed as opposed to what's been ingested ATM
[16:48:13] T256418 that is
[16:48:14] T256418: Evaluate alternative to Logstash StatsD outputs - https://phabricator.wikimedia.org/T256418
[16:48:34] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul)
[16:50:01] (CR) Cwhite: "> Patch Set 1: Code-Review+1" [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625934 (owner: Cwhite)
[16:52:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:55:21] Operations, ops-eqiad, DC-Ops, netops: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (RobH) > A-Side Information > Customer > WIKIMEDIA FOUNDATION INC. > IBX > DC6 > Cage > DC6:01:061130 > Cabinet > 0000 > Space ID > DC6:01:061130 > System Name > DC6:1:61130:WIKIMEDIA FOUNDA...
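(Editorial note: the T256418 mitigation cwhite describes above, counting what Elasticsearch has actually indexed instead of incrementing StatsD counters at ingest time, can be sketched roughly as below. This is a hypothetical illustration, not the actual exporter: the `@timestamp`/`level` field names are assumptions, and the real thing would issue a date_histogram aggregation against Elasticsearch rather than iterate in Python.)

```python
def errors_per_minute(events):
    """Count level=ERROR log events per minute bucket.

    Because the count is taken over *indexed* documents, a
    two-hour Logstash ingest backlog lands in old buckets
    instead of looking like a sudden spike in the current
    minute -- the false-alert failure mode discussed above.
    """
    buckets = {}
    for event in events:
        if event.get("level") != "ERROR":
            continue
        minute = event["@timestamp"][:16]  # "YYYY-MM-DDTHH:MM"
        buckets[minute] = buckets.get(minute, 0) + 1
    return buckets
```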
[16:56:42] RECOVERY - MariaDB Replica Lag: pc1 on pc1010 is OK: OK slave_sql_lag Replication lag: 22.71 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:58:49] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (Papaul)
[16:59:31] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul)
[17:00:04] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T1700).
[17:01:35] (PS1) Hnowlan: api-proxy: Set password for ratelimit [deployment-charts] - https://gerrit.wikimedia.org/r/625942 (https://phabricator.wikimedia.org/T235277)
[17:02:29] (CR) jerkins-bot: [V: -1] api-proxy: Set password for ratelimit [deployment-charts] - https://gerrit.wikimedia.org/r/625942 (https://phabricator.wikimedia.org/T235277) (owner: Hnowlan)
[17:03:28] !log attempted to add rock-dkms_3.3-19_all.deb to thirdparty/amd-rocm33 for use on analytics servers with GPUs
[17:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:13] (PS2) Hnowlan: api-proxy: Set password for ratelimit [deployment-charts] - https://gerrit.wikimedia.org/r/625942 (https://phabricator.wikimedia.org/T235277)
[17:09:28] Operations, fundraising-tech-ops, netops, observability: update nagios_nsca configuration in frack for new nsca servers - https://phabricator.wikimedia.org/T262291 (Jgreen)
[17:18:42] PROBLEM - ensure kvm processes are running on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:22:47] RECOVERY - Too many messages in kafka logging-eqiad #o11y on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[17:23:00] Operations, serviceops, User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (jijiki)
[17:23:02] !log rebooting cloudvirt1033
[17:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:07] RECOVERY - ensure kvm processes are running on cloudvirt1033 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:28:33] Operations, ops-eqiad, DC-Ops, netops: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (RobH)
[17:36:32] Operations, serviceops, User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (jijiki)
[17:37:32] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul)
[17:45:40] (PS1) Bstorm: cloudceph: Add cpufreq tools to set cpu governor [puppet] - https://gerrit.wikimedia.org/r/625947
[17:47:09] Stealing the services window, deploying a security fix
[17:50:32] (PS4) Effie Mouzeli: php::admin: export additional opcache metrics [puppet] - https://gerrit.wikimedia.org/r/625224 (https://phabricator.wikimedia.org/T261009)
[17:50:39] (PS5) Effie Mouzeli: php::admin: export additional opcache metrics [puppet] - https://gerrit.wikimedia.org/r/625224 (https://phabricator.wikimedia.org/T261009)
[17:53:19] Operations, Advanced-Search, Discovery-Search, Traffic, and 2 others: Strange URL pattern after search https://en.wikipedia.org/w/index.php?sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance ... - https://phabricator.wikimedia.org/T243884 (Amorymeltzer) That...
[17:53:37] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[17:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:25] !log Deployed patch for T262240
[17:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:08] (PS1) Jeena Huneidi: Increase the DPL cache time from 1 day to 7 days [mediawiki-config] - https://gerrit.wikimedia.org/r/625950 (https://phabricator.wikimedia.org/T262240)
[17:57:10] (CR) Jeena Huneidi: [C: +2] Increase the DPL cache time from 1 day to 7 days [mediawiki-config] - https://gerrit.wikimedia.org/r/625950 (https://phabricator.wikimedia.org/T262240) (owner: Jeena Huneidi)
[17:57:12] (PS1) Jeena Huneidi: testwikis wikis to 1.36.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/625951
[17:57:14] (CR) Jeena Huneidi: [C: +2] testwikis wikis to 1.36.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/625951 (owner: Jeena Huneidi)
[17:57:45] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul)
[17:57:52] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[17:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:58] (Merged) jenkins-bot: Increase the DPL cache time from 1 day to 7 days [mediawiki-config] - https://gerrit.wikimedia.org/r/625950 (https://phabricator.wikimedia.org/T262240) (owner: Jeena Huneidi)
[17:58:02] (Merged) jenkins-bot: testwikis wikis to 1.36.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/625951 (owner: Jeena Huneidi)
[17:58:24] ^ manual ctrl+c'd that cookbook since I'd forgotten to run it inside a tmux session
[17:58:30] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[17:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:03] ryankemper: you can add a call to https://doc.wikimedia.org/spicerack/master/api/spicerack.interactive.html#spicerack.interactive.ensure_shell_is_durable
[17:59:12] to the cookbook if that's needed
[18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T1800)
[18:00:04] needed as in it's a long operation that should be run in a durable session (tmux/screen)
[18:00:20] volans: neat! yup this is a long-running operation so I will definitely add that, thanks for the tip
[18:00:29] !log jhuneidi@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.8
[18:00:30] np :)
[18:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:42] (CR) Elukey: [C: +2] Multiple instances of msearch_daemon [puppet] - https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: ZPapierski)
[18:15:22] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.0137 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[18:17:07] (CR) Ppchelko: [C: +1] "gosh this thing is annoying. Trying to survive armageddon" [deployment-charts] - https://gerrit.wikimedia.org/r/625942 (https://phabricator.wikimedia.org/T235277) (owner: Hnowlan)
[18:19:19] (CR) Andrew Bogott: [C: +2] Convert wmcs-novastats scripts to python3 [puppet] - https://gerrit.wikimedia.org/r/624805 (https://phabricator.wikimedia.org/T218426) (owner: Nskaggs)
[18:21:10] (CR) Andrew Bogott: [C: +2] OpenStack nova: increase live_migration_completion_timeout [puppet] - https://gerrit.wikimedia.org/r/625943 (owner: Andrew Bogott)
[18:22:35] !log rm /srv/prometheus/ops/targets/mjolnir_msearch_eqiad.yaml on prometheus100[3,4] as cleanup after https://gerrit.wikimedia.org/r/621988 - T260305
[18:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:42] T260305: mjolnir-kafka-msearch-daemon dropping produced messages after move to search-loader[12]001 - https://phabricator.wikimedia.org/T260305
[18:24:37] (PS2) Andrew Bogott: OpenStack nova: increase live_migration_completion_timeout [puppet] - https://gerrit.wikimedia.org/r/625943
[18:24:39] (PS5) Andrew Bogott: Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081)
[18:26:04] (PS6) Andrew Bogott: Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081)
[18:27:57] (CR) Andrew Bogott: [C: +2] Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081) (owner: Andrew Bogott)
[18:28:01] (PS1) Elukey: mjolnir: fix syslog identifier in the msearch systemd unit template [puppet] - https://gerrit.wikimedia.org/r/625952 (https://phabricator.wikimedia.org/T260305)
[18:29:44] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (RKemper)
[18:30:26] (CR) Alex Paskulin: [C: +1] "Reviewed from a requirements perspective and looks good! Thanks, Hugh!" [mediawiki-config] - https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) (owner: Hnowlan)
[18:32:07] (CR) Elukey: [C: +2] mjolnir: fix syslog identifier in the msearch systemd unit template [puppet] - https://gerrit.wikimedia.org/r/625952 (https://phabricator.wikimedia.org/T260305) (owner: Elukey)
[18:35:45] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (RKemper) @Papaul We'd like to use RAID10 as our hardware RAID. (There's a bit of context [[ https://phabricator.wikimedia.org/T257950#6338944 | here ]] wh...
[18:48:56] (PS1) Elukey: mjolnir: fix syslog identifier of the msearch instances [puppet] - https://gerrit.wikimedia.org/r/625954 (https://phabricator.wikimedia.org/T260305)
[18:50:14] (CR) Elukey: [C: +2] mjolnir: fix syslog identifier of the msearch instances [puppet] - https://gerrit.wikimedia.org/r/625954 (https://phabricator.wikimedia.org/T260305) (owner: Elukey)
[18:57:34] (CR) Bstorm: "When this restarts via exec in puppet, it should set the governor. You can validate that with the command it uses to check (cpufreq-info -" [puppet] - https://gerrit.wikimedia.org/r/625947 (owner: Bstorm)
[19:00:05] longma and liw: That opportune time is upon us again. Time for a Mediawiki train - American+European Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T1900).
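(Editorial note: the spicerack helper volans points ryankemper to above guards long-running cookbooks like the wdqs data-reload against exactly the mistake logged at 17:58:24. A minimal sketch of the idea, taking the environment mapping as a parameter for testability; the real `ensure_shell_is_durable()` raises an exception rather than returning a bool and may perform additional checks, so treat this as an approximation, not its implementation.)

```python
def shell_is_durable(env) -> bool:
    """Heuristic: is this shell inside a tmux/screen session?

    screen sets STY, tmux sets TMUX, and TERM is usually a
    screen* variant inside either. A cookbook would call the
    real spicerack helper at the top of run() so a plain-SSH
    invocation fails immediately instead of dying mid-reload
    when the connection drops."""
    return bool(
        env.get("STY")
        or env.get("TMUX")
        or "screen" in env.get("TERM", "")
    )
```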
[19:00:24] Still deploying to testwikis
[19:09:47] Operations, Discovery, Discovery-Search, Elasticsearch, good first task: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844 (Gehel) Open→Declined
[19:12:15] !log jhuneidi@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.8 (duration: 71m 45s)
[19:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:01] Deploying 1.36.0-wmf.8 to group0
[19:17:31] (PS1) Jeena Huneidi: group0 wikis to 1.36.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/625960
[19:17:33] (CR) Jeena Huneidi: [C: +2] group0 wikis to 1.36.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/625960 (owner: Jeena Huneidi)
[19:18:14] (Merged) jenkins-bot: group0 wikis to 1.36.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/625960 (owner: Jeena Huneidi)
[19:19:40] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.8
[19:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:30] Operations, MediaWiki-Uploading, SRE-swift-storage, Structured Data Engineering, and 2 others: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (Krinkle)
[19:30:42] Operations, MediaWiki-Uploading, SRE-swift-storage, Structured Data Engineering, and 2 others: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (Krinkle) This is a 1y+ production error still waiting to...
[19:37:12] (PS1) Catrope: Enable and configure GrowthExperiments on plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/625963 (https://phabricator.wikimedia.org/T254239)
[19:39:29] Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (Mraish)
[19:41:01] PROBLEM - SSH on wtp1047.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:46:58] (CR) Ottomata: "Looks good:" [puppet] - https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[19:55:47] (PS1) Ottomata: eventgate-logging-external - set cors: '*' [deployment-charts] - https://gerrit.wikimedia.org/r/625965 (https://phabricator.wikimedia.org/T262087)
[19:58:00] (PS2) Ottomata: eventgate-logging-external - set cors: '*' [deployment-charts] - https://gerrit.wikimedia.org/r/625965 (https://phabricator.wikimedia.org/T262087)
[20:10:09] (PS2) Cwhite: parse_service_problem doesn't need instance-local data move parse_service problem to global function and have am import it clean up imports [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625934
[20:14:21] (Abandoned) Cwhite: prometheus: add apache2 es-exporter config [puppet] - https://gerrit.wikimedia.org/r/621597 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite)
[20:26:40] (PS3) Ottomata: eventgate-logging-external - set cors: '*' [deployment-charts] - https://gerrit.wikimedia.org/r/625965 (https://phabricator.wikimedia.org/T262087)
[20:33:19] (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/624733 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[20:34:30] (PS1) Andrew Bogott: labspuppetbackend: support requests for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625969 (https://phabricator.wikimedia.org/T260614)
[20:34:54] (CR) jerkins-bot: [V: -1] labspuppetbackend: support requests for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625969 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[20:35:52] (PS2) Andrew Bogott: labspuppetbackend: support requests for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625969 (https://phabricator.wikimedia.org/T260614)
[20:41:52] RECOVERY - SSH on wtp1047.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:45:38] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (Papaul)
[20:49:41] (CR) BryanDavis: labspuppetbackend: support requests for VMs/prefixes under .wmflabs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/625969 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[21:12:06] (PS3) Andrew Bogott: labspuppetbackend: support requests for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625969 (https://phabricator.wikimedia.org/T260614)
[21:20:42] (PS1) Cwhite: profile: remove usage of logstash statsd outputs [puppet] - https://gerrit.wikimedia.org/r/625975 (https://phabricator.wikimedia.org/T256418)
[21:37:45] (CR) Andrew Bogott: [C: +2] labspuppetbackend: support requests for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625969 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[21:40:37] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul)
[21:47:04] (PS1) Bstorm: tools-grid: Install correct version of php-igbinary [puppet] - https://gerrit.wikimedia.org/r/625979 (https://phabricator.wikimedia.org/T262186)
[21:49:39] (PS1) Andrew Bogott: labspuppetbackend: fix support for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625980 (https://phabricator.wikimedia.org/T260614)
[21:50:22] (CR) Andrew Bogott: [C: +2] labspuppetbackend: fix support for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625980 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[21:55:09] (PS1) Cwhite: profile: update alerts on mediawiki logs [puppet] - https://gerrit.wikimedia.org/r/625982 (https://phabricator.wikimedia.org/T256418)
[21:55:22] (CR) BryanDavis: [C: +1] "This is a flashback to https://phabricator.wikimedia.org/T213666#4893465." [puppet] - https://gerrit.wikimedia.org/r/625979 (https://phabricator.wikimedia.org/T262186) (owner: Bstorm)
[21:56:30] (CR) Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/625979 (https://phabricator.wikimedia.org/T262186) (owner: Bstorm)
[21:56:36] (CR) Bstorm: [C: +2] tools-grid: Install correct version of php-igbinary [puppet] - https://gerrit.wikimedia.org/r/625979 (https://phabricator.wikimedia.org/T262186) (owner: Bstorm)
[21:57:22] !log andrew@deploy1001 Started deploy [horizon/deploy@7a3221d]: refreshing to clobber local hacks
[21:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:34] !log andrew@deploy1001 Finished deploy [horizon/deploy@7a3221d]: refreshing to clobber local hacks (duration: 00m 13s)
[21:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:30] (PS1) Cwhite: prometheus: update mediawiki query timestamp filter [puppet] - https://gerrit.wikimedia.org/r/625984 (https://phabricator.wikimedia.org/T256418)
[22:02:39] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[22:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:38] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul) @RKemper thank you for the info. What stripe size for the RAID 10?
[22:08:31] !log andrew@deploy1001 Started deploy [horizon/deploy@7d727eb]: very minor wmf-puppet-dashboard update
[22:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:06] !log andrew@deploy1001 Finished deploy [horizon/deploy@7d727eb]: very minor wmf-puppet-dashboard update (duration: 03m 35s)
[22:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:34] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:32] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul)
[22:34:19] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul)
[22:39:48] (PS1) Andrew Bogott: labspuppetbackend: rearrange args to re.sub [puppet] - https://gerrit.wikimedia.org/r/625992 (https://phabricator.wikimedia.org/T260614)
[22:40:44] (CR) BryanDavis: [C: +1] labspuppetbackend: rearrange args to re.sub [puppet] - https://gerrit.wikimedia.org/r/625992 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[22:40:47] (CR) Andrew Bogott: [C: +2] labspuppetbackend: rearrange args to re.sub [puppet] - https://gerrit.wikimedia.org/r/625992 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[22:59:37] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) (owner: Southparkfan)
[23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:05:36] PROBLEM - SSH on wdqs1005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:08:30] (PS1) Cwhite: update debian files to handle new prometheus-icinga-am service [debs/prometheus-icinga-exporter] (debian/sid) - https://gerrit.wikimedia.org/r/626001
[23:11:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:15:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:44:32] PROBLEM - SSH on wtp1047.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:46:51] jouncebot: refresh
[23:46:53] I refreshed my knowledge about deployments.
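(Editorial note: the "rearrange args to re.sub" patch above points at a classic Python pitfall: `re.sub` takes `(pattern, repl, string)`, and because `repl` and `string` are both strings, swapping them is accepted silently and simply produces the wrong result. A hypothetical illustration; the hostname and substitution are invented, not taken from the actual labspuppetbackend patch.)

```python
import re

host = "proxy-01.project.wmflabs"

# Intended: strip the ".wmflabs" suffix from the hostname.
right = re.sub(r"\.wmflabs$", "", host)

# repl/string swapped: the pattern is applied to "" instead of
# host, so this silently returns "" with no error raised --
# exactly the kind of bug a jerkins-bot run won't flag unless
# a test exercises the path.
wrong = re.sub(r"\.wmflabs$", host, "")
```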
[23:48:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:57:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops