[00:09:05] PROBLEM - Number of messages locally queued by purged for processing on cp2037 is CRITICAL: cluster=cache_text instance=cp2037 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037
[00:09:45] PROBLEM - Number of messages locally queued by purged for processing on cp2033 is CRITICAL: cluster=cache_text instance=cp2033 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033
[00:10:05] PROBLEM - Number of messages locally queued by purged for processing on cp1079 is CRITICAL: cluster=cache_text instance=cp1079 job=purged layer=backend site=eqiad https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079
[00:10:25] PROBLEM - Number of messages locally queued by purged for processing on cp5009 is CRITICAL: cluster=cache_text instance=cp5009 job=purged layer=backend site=eqsin https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009
[00:10:33] PROBLEM - Number of messages locally queued by purged for processing on cp2035 is CRITICAL: cluster=cache_text instance=cp2035 job=purged layer=backend site=codfw https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035
[00:11:31] PROBLEM - Number of messages locally queued by purged for processing on cp3050 is CRITICAL: cluster=cache_text instance=cp3050 job=purged layer=backend site=esams https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050
[00:14:13] RECOVERY - Number of messages locally queued by purged for processing on cp5009 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqsin+prometheus/ops&var-instance=cp5009
[00:15:21] RECOVERY - Number of messages locally queued by purged for processing on cp3050 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=esams+prometheus/ops&var-instance=cp3050
[00:15:47] RECOVERY - Number of messages locally queued by purged for processing on cp1079 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=eqiad+prometheus/ops&var-instance=cp1079
[00:16:15] RECOVERY - Number of messages locally queued by purged for processing on cp2035 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2035
[00:17:21] RECOVERY - Number of messages locally queued by purged for processing on cp2033 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2033
[00:18:37] RECOVERY - Number of messages locally queued by purged for processing on cp2037 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Purged%23Alerts https://grafana.wikimedia.org/dashboard/db/purged?var-datasource=codfw+prometheus/ops&var-instance=cp2037
[02:07:17] (PS1) TrainBranchBot: Branch commit for wmf/1.36.0-wmf.8 [core] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/625745
[02:18:21] (PS2) DannyS712: Branch commit for wmf/1.36.0-wmf.8 [core] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/625745 (https://phabricator.wikimedia.org/T257976) (owner: TrainBranchBot)
[03:02:17] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=webperf_arclamp site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:07:57] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[03:17:31] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul) @Marostegui the Next time you have this problem, open the first 1GB NIC and change the setting from None to PXE and do t...
[03:47:11] (PS1) Gergő Tisza: Disable event logging in MediaViewer [mediawiki-config] - https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582)
[04:15:05] (CR) Nuria: [C: +1] "Thanks for cleaning up" [mediawiki-config] - https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: Gergő Tisza)
[04:37:42] (PS5) Andrew Bogott: Nova/Neutron: set dhcp_domain to eqiad1.wikimedia.cloud [puppet] - https://gerrit.wikimedia.org/r/620936 (https://phabricator.wikimedia.org/T260614)
[04:37:44] (PS1) Andrew Bogott: Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081)
[04:38:08] (PS2) Andrew Bogott: Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081)
[04:38:19] (CR) jerkins-bot: [V: -1] Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081) (owner: Andrew Bogott)
[04:38:36] (CR) jerkins-bot: [V: -1] Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081) (owner: Andrew Bogott)
[04:39:14] (PS1) Andrew Bogott: wmcs-novastats-capacity: reformat with black [puppet] - https://gerrit.wikimedia.org/r/625773
[04:40:36] (Abandoned) Andrew Bogott: wmcs-novastats-capacity: reformat with black [puppet] - https://gerrit.wikimedia.org/r/625773 (owner: Andrew Bogott)
[04:40:38] (PS3) Andrew Bogott: Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081)
[04:45:21] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[04:47:17] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[05:15:11] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['es2026.codfw.wmnet...
[05:23:28] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Marostegui) >>! In T260373#6441588, @Papaul wrote: > @Marostegui the Next time you have this problem, > > open the first 1GB NIC...
[05:23:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:32:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime
[05:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:34:26] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99)
[05:34:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:35:25] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[05:55:14] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['es2026.codfw.wmnet'] ` and were **ALL** successful.
[05:56:49] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Marostegui) es2026 got installed correctly: ` root@es2026:~# free -g ; df -hT /srv total used free...
[06:04:41] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Marostegui) I have given it most of the vg remaining size: ` root@es2026:~# pvs PV VG Fmt Attr PSize PFree /dev/...
[06:05:35] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Marostegui)
[06:14:43] !log Stop MySQL on db1106 for PDU maintenance T261452
[06:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:49] T261452: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452
[06:18:15] (PS1) Marostegui: install_server: Do not reimage es2026 [puppet] - https://gerrit.wikimedia.org/r/625778
[06:18:52] (CR) Marostegui: [C: +2] install_server: Do not reimage es2026 [puppet] - https://gerrit.wikimedia.org/r/625778 (owner: Marostegui)
[06:21:20] (PS1) Legoktm: [DNM] Dummy change to test CI [puppet] - https://gerrit.wikimedia.org/r/625779
[06:21:49] Operations, ops-eqiad, Discovery-Search (Current work), Patch-For-Review: (Need by: 2020-04-02) rack/setup/install relforge100[34] - https://phabricator.wikimedia.org/T241791 (elukey) @RKemper not sure how familiar are you with the magical world of serial consoles, I'll add a few links and then y...
[06:22:46] (PS2) Legoktm: [DNM] Dummy change to test CI [puppet] - https://gerrit.wikimedia.org/r/625779
[06:23:19] !log roll restart of Hadoop master daemons on an-master100[1,2] to pick up new openjdk settings
[06:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:58] (CR) ZPapierski: Multiple instances of msearch_daemon (8 comments) [puppet] - https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: ZPapierski)
[06:25:59] (PS3) Legoktm: build: Use at least commit-message-validator 0.7.0 [puppet] - https://gerrit.wikimedia.org/r/625779 (https://phabricator.wikimedia.org/T166066)
[06:29:33] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui)
[06:31:27] !log Deploy schema change on s5 eqiad master - T253276
[06:31:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:31:34] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276
[06:37:27] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe)
[06:38:26] (CR) Jforrester: [C: +1] Disable event logging in MediaViewer [mediawiki-config] - https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: Gergő Tisza)
[06:40:55] (CR) Giuseppe Lavagetto: [C: +2] mobileapps: use the service proxy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876) (owner: Giuseppe Lavagetto)
[06:41:05] (CR) jerkins-bot: [V: -1] mobileapps: use the service proxy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876) (owner: Giuseppe Lavagetto)
[06:42:54] (PS3) Giuseppe Lavagetto: mobileapps: use the service proxy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876)
[06:43:10] (CR) Giuseppe Lavagetto: [V: +2 C: +2] mobileapps: use the service proxy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625619 (https://phabricator.wikimedia.org/T255876) (owner: Giuseppe Lavagetto)
[06:44:09] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (Joe)
[06:47:16] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[06:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:42] (PS1) ArielGlenn: update dumps web pages to note that recent versions of windows 7zip work [puppet] - https://gerrit.wikimedia.org/r/625780 (https://phabricator.wikimedia.org/T208647)
[06:50:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1106 for PDU maintenance', diff saved to https://phabricator.wikimedia.org/P12513 and previous config saved to /var/cache/conftool/dbconfig/20200908-065022-marostegui.json
[06:50:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:52] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[06:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:31] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[06:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:56] !log Deploy schema change on s2 eqiad master - T253276
[06:59:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:59:01] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276
[06:59:41] (PS3) Muehlenhoff: Remove now obsolete cas-graphite and cas-icinga DNS entries [dns] - https://gerrit.wikimedia.org/r/625635
[07:00:21] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' .
[07:00:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:37] (CR) Muehlenhoff: [C: +2] Remove now obsolete cas-graphite and cas-icinga DNS entries [dns] - https://gerrit.wikimedia.org/r/625635 (owner: Muehlenhoff)
[07:14:31] (PS2) Muehlenhoff: reboot-groups (WIP) [cookbooks] - https://gerrit.wikimedia.org/r/625597
[07:15:29] (CR) jerkins-bot: [V: -1] reboot-groups (WIP) [cookbooks] - https://gerrit.wikimedia.org/r/625597 (owner: Muehlenhoff)
[07:20:45] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:26:02] (PS1) Elukey: Add cookbook to restart Hadoop master daemons. [cookbooks] - https://gerrit.wikimedia.org/r/625782
[07:26:23] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[07:26:31] (CR) Hashar: [C: +1] build: Use at least commit-message-validator 0.7.0 [puppet] - https://gerrit.wikimedia.org/r/625779 (https://phabricator.wikimedia.org/T166066) (owner: Legoktm)
[07:27:31] Operations, MediaWiki-General, serviceops, Patch-For-Review, Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (JMeybohm)
[07:30:38] (PS2) Elukey: Add cookbook to restart Hadoop master daemons. [cookbooks] - https://gerrit.wikimedia.org/r/625782
[07:37:17] (PS3) Muehlenhoff: reboot-groups (WIP) [cookbooks] - https://gerrit.wikimedia.org/r/625597
[07:37:52] (CR) Muehlenhoff: [C: +2] Retire the HTTP listener for debmonitor (along with ferm rules) [puppet] - https://gerrit.wikimedia.org/r/625658 (owner: Muehlenhoff)
[07:40:04] !log move HE from ix to transit BGP group on cr3-eqsin
[07:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:49] (CR) Elukey: "Pcc for the search-loader vms: https://puppet-compiler.wmflabs.org/compiler1002/24989/" [puppet] - https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: ZPapierski)
[07:42:03] jouncebot: now
[07:42:04] No deployments scheduled for the next 3 hour(s) and 17 minute(s)
[07:44:34] !log roll restart kafka daemons on kafka-jumbo100[7-9] to pick up openjdk upgrades
[07:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:43] PROBLEM - debmonitor.wikimedia.org:80 on debmonitor2002 is CRITICAL: connect to address 10.192.32.42 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor
[07:44:49] !log urbanecm@deploy1001 Synchronized private/PrivateSettings.php: Revert "Update T250887 mitigations" (T250887; T262242) (duration: 00m 59s)
[07:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:56] T262242: Save Timing regression on 2020-09-07 at 18:04 UTC - https://phabricator.wikimedia.org/T262242
[07:46:40] (CR) Alexandros Kosiaris: [C: +1] push-notifications: add proxy settings [deployment-charts] - https://gerrit.wikimedia.org/r/625709 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[07:50:18] PROBLEM - debmonitor.wikimedia.org:80 on debmonitor1002 is CRITICAL: connect to address 10.64.16.72 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor
[07:51:11] ^ expected, will be fixed with next Puppet run on icinga1001
[07:52:01] (CR) JMeybohm: [C: +1] push-notifications: add proxy settings [deployment-charts] - https://gerrit.wikimedia.org/r/625709 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[07:52:37] Operations, Release-Engineering-Team-TODO, Release-Engineering-Team (Deployment services), git-protocol-v2: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 (hashar)
[07:58:12] PROBLEM - debmonitor.wikimedia.org:80 on debmonitor2001 is CRITICAL: connect to address 10.192.0.14 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor
[08:03:28] (PS10) JMeybohm: Switch eqiad calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[08:06:03] (CR) Alexandros Kosiaris: [C: +1] Switch eqiad calico controller to the new etcd cluster [deployment-charts] - https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835) (owner: Alexandros Kosiaris)
[08:08:18] PROBLEM - debmonitor.wikimedia.org:80 on debmonitor1001 is CRITICAL: connect to address 10.64.32.62 and port 80: Connection refused https://wikitech.wikimedia.org/wiki/Debmonitor
[08:11:15] (PS4) Jbond: build: Use at least commit-message-validator 0.7.0 [puppet] - https://gerrit.wikimedia.org/r/625779 (https://phabricator.wikimedia.org/T166066) (owner: Legoktm)
[08:12:02] Operations, Traffic: Create a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492 (ayounsi) 👍
[08:13:08] (CR) Jbond: [C: +2] "LGTM, merging thanks" [puppet] - https://gerrit.wikimedia.org/r/625779 (https://phabricator.wikimedia.org/T166066) (owner: Legoktm)
[08:16:29] !log installing 4.19.132 kernel on buster systems (only installing the deb, reboots separately)
[08:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:18] Operations, Puppet, Release-Engineering-Team-TODO, puppet-compiler, and 2 others: Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (jbond) >>! In T166066#5087807, @jbond wrote: >> In addition, Jenkins doesn't seem to like having more than Change-...
[08:20:50] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=eventgate-main,name=eqiad
[08:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:46] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=restbase-async,name=codfw
[08:21:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:11] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=eqiad
[08:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:05] !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=blubberoid,name=eqiad
[08:23:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:08] (PS1) Jgiannelos: Enable OpenAPI spec on push-notifications service [deployment-charts] - https://gerrit.wikimedia.org/r/625832 (https://phabricator.wikimedia.org/T261635)
[08:35:10] (PS1) Volans: cr: update debmonitor IPs in firewall rules [homer/public] - https://gerrit.wikimedia.org/r/625833 (https://phabricator.wikimedia.org/T261489)
[08:35:38] XioNoX: if you're around and have a second for a quick review ^^^ :)
[08:35:54] volans: what are you pointing to?
[08:35:56] (I know, you probably don't see it)
[08:36:00] https://gerrit.wikimedia.org/r/c/operations/homer/public/+/625833 :D
[08:36:10] !!!
[08:36:25] (CR) Muehlenhoff: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/625833 (https://phabricator.wikimedia.org/T261489) (owner: Volans)
[08:37:33] (CR) Ayounsi: [C: +1] "LGTM" [homer/public] - https://gerrit.wikimedia.org/r/625833 (https://phabricator.wikimedia.org/T261489) (owner: Volans)
[08:37:39] volans: all good!
[08:37:57] ack, should I run it on all CRs or just eqiad/codfw?
[08:38:25] (CR) Volans: [C: +2] cr: update debmonitor IPs in firewall rules [homer/public] - https://gerrit.wikimedia.org/r/625833 (https://phabricator.wikimedia.org/T261489) (owner: Volans)
[08:39:13] (Merged) jenkins-bot: cr: update debmonitor IPs in firewall rules [homer/public] - https://gerrit.wikimedia.org/r/625833 (https://phabricator.wikimedia.org/T261489) (owner: Volans)
[08:39:44] XioNoX: also thanks for the super quick review :)
[08:41:06] volans: eqiad should be enough as it's for analytics iirc
[08:41:18] and cloud
[08:43:28] indeed, homer 'cr*codfw*' diff has no diffs
[08:45:01] !log running homer 'cr*eqiad*' commit "Update debmonitor IPs, T261489"
[08:45:01] thanks!
[08:45:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:45:07] T261489: Upgrade debmonitor to Buster - https://phabricator.wikimedia.org/T261489
[08:45:30] Operations, ops-codfw, DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Marostegui)
[08:46:05] Operations, ops-codfw, DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (Marostegui) p:Triage→Medium
[08:46:44] XioNoX: I am wondering if we could create a script to check IPs in filters and spot the ones without a corresponding A/AAAA/PTR/etc.. set of records
[08:46:55] (CR) Jbond: [C: +2] pki: add vhosts for pki and ocsp which will proxy to the backend cfssl [puppet] - https://gerrit.wikimedia.org/r/625708 (https://phabricator.wikimedia.org/T259117) (owner: Jbond)
[08:47:18] if analytics is the only segment of the network affected I can create a custom one
[08:47:31] but every time that I check that list I find super stale things
[08:47:42] and people tend to forget about the vlan rules etc..
[08:48:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Reduce db2127's weight', diff saved to https://phabricator.wikimedia.org/P12514 and previous config saved to /var/cache/conftool/dbconfig/20200908-084834-marostegui.json
[08:48:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:49:25] elukey: you mean in CI? it wouldn't have caught this one as the old hosts are still there for now
[08:49:30] they will be decommissioned in the next few days
[08:49:46] no no, I mean a regular icinga alert
[08:50:41] in this case: say you forgot about the analytics filters, and you decommed debmonitor1001 - I'd get an alert for the stale IP in the analytics filter right after it
[08:51:41] we already convert IPs in the config into ipaddress objects, to ensure they are valid IPs and simplify their usage; I'm wondering if we could add a DNS resolution there too
[08:51:56] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1008.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled: blubberoid-https_4666: Servers kubernetes1008.eqiad.wmnet, kubernet
[08:51:56] t, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled: api-gateway_8087: Servers kubernetes1001.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes101
[08:51:56] marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:52:01] given that we have the twice-a-day check that compares the live config against the repo
[08:52:12] akosiaris: expected I guess? %%%
[08:52:14] *^^^
[08:52:26] (PS1) Jbond: profile::pki: fix apache config [puppet] - https://gerrit.wikimedia.org/r/625837 (https://phabricator.wikimedia.org/T259117)
[08:52:28] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled: blubberoid-https_4666: Servers kubernetes1008.eqiad.wmnet, kubernet
[08:52:28] t, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled: api-gateway_8087: Servers kubernetes1008.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes101
[08:52:28] marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:52:43] volans: ah that would be great, I'd only need something that tells me about stale entries. It would be great for my mental sanity :D
[08:53:02] this is me
[08:53:15] volans: yes
[08:53:32] I'll schedule downtime for that as well
[08:53:43] (CR) Jbond: [C: +2] profile::pki: fix apache config [puppet] - https://gerrit.wikimedia.org/r/625837 (https://phabricator.wikimedia.org/T259117) (owner: Jbond)
[08:53:50] elukey: the only use case we need to solve is when running homer from a local env, so we might have to force the resolver to be one of our public NSes
[08:54:20] elukey: care to open a task about it? we can discuss there with arzhel and decide what to do
[08:54:32] volans: https://gerrit.wikimedia.org/r/c/operations/debs/wmf-sre-laptop/+/614787 ;)
[08:55:03] jbond42: lol, yeah but can't assume everyone has that :D
[08:55:18] yes was mostly in jest :)
[08:55:36] * jbond42 is reminded volans uses a mac
[08:55:43] !log Deploy schema change on s7 eqiad master - T253276
[08:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:48] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276
[08:56:47] ACKNOWLEDGEMENT - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1008.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet are marked down but pooled: blubberoid-https_4666: Servers kubernetes1008.eqiad.wmnet,
[08:56:47] iad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1004.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled: api-gateway_8087: Servers kubernetes1001.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kuber
[08:56:47] mnet are marked down but pooled alexandros kosiaris kubernetes etcd upgrade https://wikitech.wikimedia.org/wiki/PyBal
[08:56:47] ACKNOWLEDGEMENT - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - blubberoid_8748: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled: blubberoid-https_4666: Servers kubernetes1008.eqiad.wmnet,
[08:56:47] iad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet are marked down but pooled: api-gateway_8087: Servers kubernetes1008.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1002.eqiad.wmnet, kuber
[08:56:47] mnet are marked down but pooled alexandros kosiaris kubernetes etcd upgrade https://wikitech.wikimedia.org/wiki/PyBal
[08:57:00] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:57:32] volans, elukey, I think the answer would be to use a tool to manage the ACLs. I tested and liked capirca, but it has some downsides, like a new format for defining rules
[08:57:33] (PS1) Giuseppe Lavagetto: default-network-policy: allow restbase HTTPS port [deployment-charts] - https://gerrit.wikimedia.org/r/625839 (https://phabricator.wikimedia.org/T244843)
[08:59:36] XioNoX: mmm not sure, if people change hosts/DNS and we don't run the tool for a while, we don't get any hint that things are stale, no?
[09:00:48] (03PS1) 10Muehlenhoff: Remove debmonitor1001/2001 [dns] - 10https://gerrit.wikimedia.org/r/625840 (https://phabricator.wikimedia.org/T261489) [09:00:53] (03PS1) 10Muehlenhoff: Remove debmonitor1001/2001 [puppet] - 10https://gerrit.wikimedia.org/r/625841 (https://phabricator.wikimedia.org/T261489) [09:01:16] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={swagger_check_citoid_cluster_eqiad,swagger_check_cxserver_cluster_eqiad,swagger_check_echostore_eqiad} site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:02:42] this is a good example https://gerrit.wikimedia.org/r/625840 - the ips were in the analytics filters, and if Riccardo didn't update the term they'd become stale soon. Adding a quick icinga check that runs twice a day for example would have raised an alert to analytics [09:03:48] elukey: but the tool would run each day with the daily diffs [09:03:57] (just some thoughts of course) [09:03:58] prom alert is us as well akosiaris...but thats probably bad do ack/downtime [09:04:47] XioNoX: ah ok I didn't get this part, yes that would work too [09:07:08] elukey: anyway, please open a task, that's a usecase of managing/checking ACLs with a tool that I didn't think about [09:08:48] XioNoX: yes I know you don't want to talk with me, I'll open a task :D [09:10:06] elukey: I already exceeded my monthly allowed words with you [09:10:15] :) [09:10:57] (which was 0) [09:15:38] ahhahha [09:15:40] (03PS1) 10Marostegui: mariadb: Productionize es2026 [puppet] - 10https://gerrit.wikimedia.org/r/625842 (https://phabricator.wikimedia.org/T261717) [09:15:51] what a nice working environment [09:15:53] :D [09:17:21] XioNoX: it's possible that define in homer controls only the analytics side of it and not the cloud one? 
(cc moritzm ) [09:17:35] I still see failures on cloudvirt hosts for example [09:18:17] volans: I don't understand [09:18:44] cloudvirt hosts can't connect to the new debmonitor hosts [12]002 [09:19:06] and the rest of cloud-related hosts AFAICT [09:19:08] (03Restored) 10Jcrespo: profile::backup: remove helium from ferm directors [puppet] - 10https://gerrit.wikimedia.org/r/621042 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [09:19:12] (03PS1) 10Vgutierrez: 1.8: Bump version number [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/625843 (https://phabricator.wikimedia.org/T261632) [09:19:21] (03CR) 10Jcrespo: [C: 03+1] profile::backup: remove helium from ferm directors [puppet] - 10https://gerrit.wikimedia.org/r/621042 (https://phabricator.wikimedia.org/T260717) (owner: 10Dzahn) [09:19:35] I was hoping that the homer's block was managing both :) [09:20:40] !log disabling puppet on argon.eqiad.wmnet,chlorine.eqiad.wmnet,kubernetes[1001-1016].eqiad.wmnet - Reinitialize eqiad k8s cluster with new etcd - T239835 [09:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:47] T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 [09:21:05] volans: I see what you mean, what you changed was only for analytics, not sure if there are similar rules for cloud [09:21:35] finishing up what I'm doing then I can have a look [09:22:06] oh, my bad, found them [09:22:17] yeah, it's still failing on cloudcephosd e.g.
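The staleness check elukey proposes above (an icinga check that compares IPs hardcoded in router ACL terms against what the hostnames currently resolve to) boils down to a set difference. A minimal sketch, assuming the ACL IPs have already been extracted from the Homer config — hostnames and IPs in any real run would come from that config, nothing here is the actual filter content:

```python
import socket

def resolve(hostnames):
    """Best-effort resolution of each hostname to its current A/AAAA records."""
    ips = set()
    for host in hostnames:
        try:
            for info in socket.getaddrinfo(host, None):
                ips.add(info[4][0])
        except socket.gaierror:
            # Host no longer exists in DNS: any ACL entry for it is stale.
            pass
    return ips

def acl_drift(acl_ips, current_ips):
    """Compare IPs hardcoded in an ACL term with what hosts resolve to now.

    Returns entries that should be removed ('stale') and entries the ACL
    is missing ('missing'); both empty means the term is up to date.
    """
    acl, current = set(acl_ips), set(current_ips)
    return {'stale': acl - current, 'missing': current - acl}
```

A twice-daily monitoring wrapper, as suggested above, would simply exit non-zero whenever either set is non-empty.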
[09:22:34] PROBLEM - PyBal IPVS diff check on lvs1016 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes1008.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqi [09:22:34] tes1004.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1016.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [09:22:42] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([kubernetes1008.eqiad.wmnet, kubernetes1001.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1009.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1003.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1002.eqi [09:22:42] tes1004.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1016.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [09:23:28] <_joe_> uh [09:23:45] <_joe_> "unknown to pybal"? 
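The "Hosts in IPVS but unknown to PyBal" alert above compares the kernel's IPVS realserver table against the pool state PyBal reports over its local HTTP API (the `curl localhost:9090/pools/...` seen in this log). The core of the check is a set difference; a sketch with hypothetical inputs:

```python
def ipvs_pybal_diff(ipvs_hosts, pybal_hosts):
    """Replicate the check's logic: realservers present in the kernel's
    IPVS table but absent from PyBal's pool state, and vice versa."""
    ipvs, pybal = set(ipvs_hosts), set(pybal_hosts)
    return {
        'in_ipvs_unknown_to_pybal': ipvs - pybal,
        'known_to_pybal_not_in_ipvs': pybal - ipvs,
    }
```

During a controlled reconfiguration like the etcd migration underway here, a non-empty diff may be a transient snapshot mismatch rather than a real fault, which matches the pushback on the alert below.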
[09:23:49] (03CR) 10Kormat: [C: 03+1] mariadb: Productionize es2026 [puppet] - 10https://gerrit.wikimedia.org/r/625842 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [09:24:15] that looks strange [09:24:26] (03PS1) 10Volans: cr: update debmonitor IPs in firewall rules (#2) [homer/public] - 10https://gerrit.wikimedia.org/r/625844 (https://phabricator.wikimedia.org/T261489) [09:24:40] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: instance=kubernetes1001.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:25:00] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] Fix includes to build against Varnish 6 [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/625713 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [09:25:27] _joe_, jayme: do you need help with that? [09:25:48] <_joe_> vgutierrez: that alert is wrong [09:26:01] XioNoX: this should do (no hurry) https://gerrit.wikimedia.org/r/c/operations/homer/public/+/625844 [09:26:05] <_joe_> I'm not sure what it's measuring there, but I can tell you those hosts are well known to pybal :) [09:26:20] PROBLEM - kubelet operational latencies on kubernetes1013 is CRITICAL: instance=kubernetes1013.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:26:22] <_joe_> lvs1015:~$ curl localhost:9090/pools/proton_4030a [09:26:26] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:26:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/625844 (https://phabricator.wikimedia.org/T261489) (owner: 
10Volans) [09:26:44] <_joe_> uhm eqiad fataling out? [09:26:50] <_joe_> what's still pooled a/a ? [09:27:20] <_joe_> akosiaris, jayme please pause a sec [09:27:21] (03CR) 10Ayounsi: [C: 03+1] "LGTM #2" [homer/public] - 10https://gerrit.wikimedia.org/r/625844 (https://phabricator.wikimedia.org/T261489) (owner: 10Volans) [09:27:26] _joe_: ack [09:27:46] PROBLEM - kubelet operational latencies on kubernetes1009 is CRITICAL: instance=kubernetes1009.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:27:52] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [09:27:53] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:56] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12515 and previous config saved to /var/cache/conftool/dbconfig/20200908-092755-kormat.json [09:27:56] <_joe_> just timeouts [09:27:59] <_joe_> please proceed [09:27:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:04] ok [09:28:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:56] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:29:30] <_joe_> we're having some latency on the appservers cluster [09:29:34] <_joe_> not sure why [09:29:38] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:29:49] <_joe_> can we please 
downtime all the k8s hosts? [09:30:14] (03CR) 10Volans: [C: 03+2] cr: update debmonitor IPs in firewall rules (#2) [homer/public] - 10https://gerrit.wikimedia.org/r/625844 (https://phabricator.wikimedia.org/T261489) (owner: 10Volans) [09:30:30] <_joe_> kormat / marostegui it seems s3 in codfw is having some query latency [09:30:38] (03Merged) 10jenkins-bot: cr: update debmonitor IPs in firewall rules (#2) [homer/public] - 10https://gerrit.wikimedia.org/r/625844 (https://phabricator.wikimedia.org/T261489) (owner: 10Volans) [09:30:43] _joe_: did you see the task I created? [09:30:50] _joe_: I subscribed you, it is that I believe [09:30:50] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:30:54] <_joe_> nope [09:31:32] _joe_: https://phabricator.wikimedia.org/T262240 [09:31:36] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:31:46] <_joe_> yeah I'm reading [09:31:57] _joe_: essentially the same thing we saw on sunday night [09:32:18] PROBLEM - kubelet operational latencies on kubernetes1002 is CRITICAL: instance=kubernetes1002.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:33:06] RECOVERY - kubelet operational latencies on kubernetes1009 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:33:07] !log jayme@cumin1001 START - Cookbook sre.hosts.downtime [09:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:15] !log jayme@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:18] RECOVERY - kubelet operational latencies on kubernetes1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:35:50] RECOVERY - kubelet operational latencies on kubernetes1013 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [09:36:36] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on icinga1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [09:37:17] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize es2026 [puppet] - 10https://gerrit.wikimedia.org/r/625842 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [09:37:51] !log running homer 'cr*eqiad*' commit "Update debmonitor IPs (#2), T261489" [09:37:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:58] T261489: Upgrade debmonitor to Buster - https://phabricator.wikimedia.org/T261489 [09:39:18] (03PS1) 10Hashar: base: add basic spec for base::standard_packages [puppet] - 10https://gerrit.wikimedia.org/r/625846 [09:39:20] (03PS1) 10Hashar: base: upgrade git on stretch to 2.20 [puppet] - 10https://gerrit.wikimedia.org/r/625847 
(https://phabricator.wikimedia.org/T262244) [09:39:22] (03PS1) 10Hashar: git: allow multiple calls to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) [09:39:24] (03PS1) 10Hashar: base: enable git protocol version2 fleet wide [puppet] - 10https://gerrit.wikimedia.org/r/625849 (https://phabricator.wikimedia.org/T262244) [09:39:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool es2014 - T261717', diff saved to https://phabricator.wikimedia.org/P12517 and previous config saved to /var/cache/conftool/dbconfig/20200908-093957-marostegui.json [09:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:40:05] T261717: Productionize es20[26-34] and es10[26-34] - https://phabricator.wikimedia.org/T261717 [09:40:20] (03CR) 10jerkins-bot: [V: 04-1] git: allow multiple calls to git::systemconfig [puppet] - 10https://gerrit.wikimedia.org/r/625848 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [09:40:29] (03CR) 10Muehlenhoff: [C: 04-1] "To upgrade git in stretch fleet-wide I'll simply upload it to the main component." 
[puppet] - 10https://gerrit.wikimedia.org/r/625847 (https://phabricator.wikimedia.org/T262244) (owner: 10Hashar) [09:41:04] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Deployment services), 10Release-Engineering-Team-TODO (2020-07-01 to 2020-09-30 (Q1)), 10git-protocol-v2: Upgrade git fleet wide to git 2.20 - https://phabricator.wikimedia.org/T262244 (10hashar) [09:43:26] !log Stop mysql on es2014 to clone es2026 T261717 [09:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:32] !log stopped calico-node and kube-apiserver on k8s nodes/masters T239835 [09:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:37] T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 [09:44:21] (03CR) 10JMeybohm: [C: 03+1] k8s: Migrate eqiad to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558355 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [09:45:20] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on icinga1001 is CRITICAL: 0 le 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [09:46:00] (03CR) 10JMeybohm: [C: 03+2] k8s: Migrate eqiad to the new etcd cluster [puppet] - 10https://gerrit.wikimedia.org/r/558355 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [09:46:05] (03CR) 10JMeybohm: [C: 03+2] Switch eqiad calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [09:46:13] <_joe_> I think this ^^ is expected given no service is producing to eqiad [09:46:21] <_joe_> not even for test/health 
check queues [09:47:21] (03Merged) 10jenkins-bot: Switch eqiad calico controller to the new etcd cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/558473 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [09:48:10] _joe_ yep I think so too [09:48:25] (03PS1) 10Marostegui: es2014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/625851 (https://phabricator.wikimedia.org/T261717) [09:48:33] <_joe_> yes, and it only happens when the services themselves are not running [09:49:45] PROBLEM - Prometheus k8s cache not updating on prometheus1003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [09:50:00] (03CR) 10Marostegui: [C: 03+2] es2014: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/625851 (https://phabricator.wikimedia.org/T261717) (owner: 10Marostegui) [09:50:51] prometheus alert expected as apiservers are down [09:52:56] !log enable puppet, run it on all k8s eqiad nodes and double check that calico-node is fine T239835 [09:53:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:02] T239835: setup new, buster based, kubernetes etcd servers for staging/codfw/eqiad cluster - https://phabricator.wikimedia.org/T239835 [09:53:38] (03CR) 10ArielGlenn: [C: 03+1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/624420 (owner: 10Ryan Kemper) [09:54:15] 10Operations, 10serviceops, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) [09:56:11] 10Operations, 10serviceops, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) We let the first chunk of about 700k GET requests for about a week, but nothing stood up much. 
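A side note on the `!log ... Cookbook` entries that recur throughout this log: each cookbook run emits a START line and an END line carrying a status and exit code, which makes runs easy to pair up mechanically when auditing the Server Admin Log. A rough sketch — the regex is inferred from the log lines themselves and may not cover every variant:

```python
import re

LOG_RE = re.compile(
    r"\[(?P<time>[\d:]+)\] !log (?P<who>\S+) "
    r"(?P<event>START|END) (?:\((?P<status>\w+)\) )?- Cookbook (?P<name>\S+)"
    r"(?: \(exit_code=(?P<rc>\d+)\))?"
)

def parse_cookbook_runs(lines):
    """Pair START/END '!log' lines per (user, cookbook) and collect results."""
    open_runs, finished = {}, []
    for line in lines:
        m = LOG_RE.search(line)
        if not m:
            continue
        key = (m['who'], m['name'])
        if m['event'] == 'START':
            open_runs[key] = m['time']
        else:
            finished.append({
                'who': m['who'], 'cookbook': m['name'],
                'start': open_runs.pop(key, None), 'end': m['time'],
                'status': m['status'],
                'exit_code': int(m['rc']) if m['rc'] else None,
            })
    return finished
```

Against the `jynus@cumin1001` pair later in this log, this yields one finished run with status `PASS` and exit code 0.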
[09:59:57] 10Operations, 10ops-eqiad, 10netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10ayounsi) [10:00:28] 10Operations, 10ops-eqiad, 10netops: eqiad row D switch fabric recabling - https://phabricator.wikimedia.org/T256112 (10ayounsi) Postponed to Wednesday 16th, 11am UTC, 30min, the time the cables and optics arrive. [10:00:35] (03PS2) 10ArielGlenn: update dumps web pages to note that recent versions of windows 7zip work [puppet] - 10https://gerrit.wikimedia.org/r/625780 (https://phabricator.wikimedia.org/T208647) [10:01:28] (03CR) 10ArielGlenn: [C: 03+2] update dumps web pages to note that recent versions of windows 7zip work [puppet] - 10https://gerrit.wikimedia.org/r/625780 (https://phabricator.wikimedia.org/T208647) (owner: 10ArielGlenn) [10:02:34] 10Operations, 10Acme-chief, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): acme-chief shouldn't try to perform OCSP stapling of expired certs - https://phabricator.wikimedia.org/T262251 (10Vgutierrez) [10:02:36] 10Operations, 10serviceops, 10User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (10jijiki) [10:03:49] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10ayounsi) [10:04:08] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10ayounsi) Postponed to Sept. 17th, 1pm Eastern, 17:00 UTC [10:06:18] 10Operations, 10ops-eqiad, 10DBA, 10netops, 10User-Kormat: Upgrade eqiad rack D4 to 10G switch - https://phabricator.wikimedia.org/T196487 (10Marostegui) Everything ok from the DB point of view. All the DB hosts in D4 can have a hard downtime, nothing will be impacted from our side. [10:08:52] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. 
T261389', diff saved to https://phabricator.wikimedia.org/P12519 and previous config saved to /var/cache/conftool/dbconfig/20200908-100852-kormat.json [10:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:55] (03PS1) 10Jbond: pki: install mod_ssl [puppet] - 10https://gerrit.wikimedia.org/r/625854 [10:11:13] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:11:14] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [10:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:27] (03CR) 10Jbond: [C: 03+2] pki: install mod_ssl [puppet] - 10https://gerrit.wikimedia.org/r/625854 (owner: 10Jbond) [10:13:22] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Push-Notification-Service, 10serviceops, and 2 others: Deploy push-notifications service to Kubernetes - https://phabricator.wikimedia.org/T256973 (10jijiki) [10:14:21] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] 1.8: Bump version number [software/varnish/libvmod-netmapper] - 10https://gerrit.wikimedia.org/r/625843 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [10:14:26] !log jmm@cumin2001 START - Cookbook sre.hosts.reboot-single [10:14:27] !log jmm@cumin2001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) [10:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:18] (03PS1) 10Jbond: pki: install sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/625856 [10:16:23] 10Operations, 10serviceops, 10Kubernetes, 10User-fsero: Upgrade calico in production to version 2.4+ - https://phabricator.wikimedia.org/T207804 (10jijiki) [10:17:44] (03CR) 10Jbond: [C: 03+2] pki: install sslcert::dhparam [puppet] - 10https://gerrit.wikimedia.org/r/625856 (owner: 10Jbond) [10:20:21] 
10Operations, 10SRE-swift-storage, 10serviceops, 10Patch-For-Review, 10Performance-Team (Radar): Swift object servers become briefly unresponsive on a regular basis - https://phabricator.wikimedia.org/T226373 (10jijiki) [10:20:36] 10Operations, 10Performance-Team, 10SRE-swift-storage, 10Traffic, 10Patch-For-Review: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661 (10jijiki) [10:20:36] !log Deploy schema change on s4 eqiad master - T253276 [10:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:43] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 [10:20:57] 10Operations, 10SRE-tools, 10User-Joe: Create cookbook to do `nodetool repair` across cassandra cluster - https://phabricator.wikimedia.org/T225694 (10jijiki) [10:21:17] PROBLEM - Prometheus k8s cache not updating on prometheus1004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1004&var-datasource=eqiad+prometheus/ops [10:21:30] 10Operations, 10serviceops, 10Patch-For-Review: Create a mediawiki::cronjob define - https://phabricator.wikimedia.org/T211250 (10jijiki) [10:21:42] 10Operations, 10Puppet, 10Traffic, 10Patch-For-Review, and 2 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10jijiki) [10:22:55] (03PS1) 10Jbond: pki: enable headers module [puppet] - 10https://gerrit.wikimedia.org/r/625857 [10:23:33] 10Operations: Redefine privileges and access for perf-roots group - https://phabricator.wikimedia.org/T207666 (10jijiki) [10:25:12] (03CR) 10Jbond: [C: 03+2] pki: enable headers module [puppet] - 10https://gerrit.wikimedia.org/r/625857 (owner: 10Jbond) [10:27:43] 10Operations, 10SRE-tools, 10User-Joe: Create a spicerack cookbook to empty a ganeti node from VMs - 
https://phabricator.wikimedia.org/T203964 (10jijiki) [10:29:37] (03PS1) 10Jcrespo: mariadb: Move db1133 from m5 to core-test (backup testing db) [puppet] - 10https://gerrit.wikimedia.org/r/625860 (https://phabricator.wikimedia.org/T253217) [10:38:12] (03CR) 10Jcrespo: "https://puppet-compiler.wmflabs.org/compiler1001/24990/db1133.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/625860 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:39:04] (03CR) 10Jcrespo: [C: 03+2] mariadb: Move db1133 from m5 to core-test (backup testing db) [puppet] - 10https://gerrit.wikimedia.org/r/625860 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:39:08] (03CR) 10Marostegui: [C: 03+1] "Make sure to full-upgrade and reboot the host for https://phabricator.wikimedia.org/T261389, if you can, please mark it as done." [puppet] - 10https://gerrit.wikimedia.org/r/625860 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:40:56] (03CR) 10Jcrespo: "Thanks for the reminder. I think I was going to go for a full reimage into buster." 
[puppet] - 10https://gerrit.wikimedia.org/r/625860 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:41:43] (03CR) 10Marostegui: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/625860 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:43:28] 10Operations, 10Scap, 10Wikimedia-General-or-Unknown, 10serviceops, 10Release-Engineering-Team (Deployment services): "Currently active MediaWiki versions:" broken on noc/conf - https://phabricator.wikimedia.org/T235338 (10jijiki) [10:44:04] (03PS1) 10Jcrespo: install_server: Reimage db1133 into buster [puppet] - 10https://gerrit.wikimedia.org/r/625863 (https://phabricator.wikimedia.org/T253217) [10:45:45] (03CR) 10Marostegui: [C: 03+1] install_server: Reimage db1133 into buster [puppet] - 10https://gerrit.wikimedia.org/r/625863 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:47:41] (03CR) 10Jcrespo: [C: 03+2] install_server: Reimage db1133 into buster [puppet] - 10https://gerrit.wikimedia.org/r/625863 (https://phabricator.wikimedia.org/T253217) (owner: 10Jcrespo) [10:53:16] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [10:53:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:29] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [10:53:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:40] !log Deploy schema change on s3 eqiad master - T253276 [10:53:42] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . 
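For readers unfamiliar with the `helmfile [eqiad] Ran 'sync' command on namespace 'kube-system'` entries above: helmfile drives a set of Helm releases per environment from a declarative file. A minimal illustrative fragment — chart names, values files, and layout are invented, not the actual deployment-charts repository:

```yaml
# Hypothetical helmfile.yaml fragment -- not the real deployment-charts config.
environments:
  eqiad: {}

releases:
  - name: coredns
    namespace: kube-system
    chart: wmf-stable/coredns      # invented chart reference
    values:
      - values-eqiad.yaml          # invented per-DC values file
```

Running `helmfile -e eqiad sync` then converges every listed release to the declared state, which is what the deploy1001 log lines record one release at a time.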
[10:53:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:46] T253276: Normalise MW Core database language fields length - https://phabricator.wikimedia.org/T253276 [10:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:45] 10Operations, 10serviceops, 10Patch-For-Review, 10User-jijiki: systemd-coredump can make a system unresponsive - https://phabricator.wikimedia.org/T236253 (10jijiki) [10:56:51] (03PS1) 10Jbond: pki: only enable client auth for API directory and add fqdn to aliases [puppet] - 10https://gerrit.wikimedia.org/r/625866 [11:00:05] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:05:45] (03PS1) 10JMeybohm: admin: Patch system:node clusterrolebinding on initialize_cluster.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/625869 [11:08:56] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Replace nutcracker with mcrouter on thumbor* - https://phabricator.wikimedia.org/T221081 (10jijiki) [11:09:08] 10Operations, 10serviceops, 10User-jijiki: Move debugging symbols and tools to a new class - https://phabricator.wikimedia.org/T236048 (10jijiki) [11:09:29] 10Operations, 10Performance-Team, 10serviceops, 10Patch-For-Review, and 2 others: Reduce read pressure on mc* servers by adding a machine-local Memcached instance (on-host memcached) - https://phabricator.wikimedia.org/T244340 (10jijiki) [11:11:40] 10Operations, 10Thumbor, 10serviceops, 10Patch-For-Review, and 2 others: Build python-thumbor-wikimedia 2.9 Debian package and upload to apt.wikimedia.org - https://phabricator.wikimedia.org/T254845 (10jijiki) [11:15:08] !log jynus@cumin1001 START - Cookbook sre.hosts.downtime [11:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:18:20] !log jynus@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [11:18:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:29] (03CR) 10Muehlenhoff: [C: 03+2] graphite: Modernise Apache config [puppet] - 10https://gerrit.wikimedia.org/r/625609 (owner: 10Muehlenhoff) [11:33:50] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'rbac-deploy-clusterrole' . [11:33:51] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'coredns' . [11:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:08] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' . [11:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:23] (03PS2) 10JMeybohm: admin: Patch system:node clusterrolebinding on initialize_cluster.sh [deployment-charts] - 10https://gerrit.wikimedia.org/r/625869 [11:43:12] RECOVERY - Prometheus k8s cache not updating on prometheus1003 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1003&var-datasource=eqiad+prometheus/ops [11:43:13] RECOVERY - Prometheus k8s cache not updating on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23k8s_cache_not_updating https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=prometheus1004&var-datasource=eqiad+prometheus/ops [12:04:16] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:04:17] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [12:04:19] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12520 and previous config saved to /var/cache/conftool/dbconfig/20200908-120419-kormat.json [12:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:04:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:03] 10Operations, 10Acme-chief, 10Cloud-VPS, 10Traffic, 10cloud-services-team (Kanban): acme-chief shouldn't try to perform OCSP stapling of expired certs - https://phabricator.wikimedia.org/T262251 (10Vgutierrez) p:05Triage→03Medium [12:10:01] (03PS1) 10Filippo Giunchedi: pontoon: set postgres shared buffers [puppet] - 10https://gerrit.wikimedia.org/r/625881 [12:11:32] godog: hah. i cheated and commented out `tuning.conf` from postgres config. this might be a better approach ;) [12:11:40] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. 
T261389', diff saved to https://phabricator.wikimedia.org/P12521 and previous config saved to /var/cache/conftool/dbconfig/20200908-121139-kormat.json [12:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:25] kormat: heheh I think I might have done sth similar, and then promptly came back to bite me (currently changing domains) [12:18:06] (03CR) 10Kormat: [C: 03+1] pontoon: set postgres shared buffers [puppet] - 10https://gerrit.wikimedia.org/r/625881 (owner: 10Filippo Giunchedi) [12:19:20] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: set postgres shared buffers [puppet] - 10https://gerrit.wikimedia.org/r/625881 (owner: 10Filippo Giunchedi) [12:20:15] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (10Jclark-ctr) starting maintenance do not expect any outages will be disconnecting pdu`s in about 1 hour [12:25:18] (03CR) 10Ppchelko: [C: 03+1] "+2 actually, but I think it's better you merge before deploying and I donno your schedule" [deployment-charts] - 10https://gerrit.wikimedia.org/r/625626 (owner: 10Hnowlan) [12:26:01] (03CR) 10Ppchelko: [C: 03+1] changeprop-jobqueue: convert to new helmfile format [deployment-charts] - 10https://gerrit.wikimedia.org/r/625632 (owner: 10Hnowlan) [12:26:32] (03PS3) 10Giuseppe Lavagetto: citoid: add TLS LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625600 (https://phabricator.wikimedia.org/T255868) [12:26:34] (03PS3) 10Giuseppe Lavagetto: citoid: promote https lvs to production status [puppet] - 10https://gerrit.wikimedia.org/r/625601 (https://phabricator.wikimedia.org/T255868) [12:26:36] (03PS3) 10Giuseppe Lavagetto: citoid: remove unencrypted LVS endpoint [puppet] - 10https://gerrit.wikimedia.org/r/625603 (https://phabricator.wikimedia.org/T255868) [12:27:00] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime [12:27:00] !log kormat@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) [12:27:03] !log kormat@cumin1001 dbctl commit (dc=all): 'Rebooting for T261389', diff saved to https://phabricator.wikimedia.org/P12522 and previous config saved to /var/cache/conftool/dbconfig/20200908-122702-kormat.json [12:27:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:28:46] (03PS1) 10Filippo Giunchedi: pontoon: add alertmanager hiera variables for o11y [puppet] - 10https://gerrit.wikimedia.org/r/625885 [12:28:48] (03PS1) 10Filippo Giunchedi: pontoon: switch observability stack to wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/625886 [12:28:57] (03PS1) 10Vgutierrez: Merge remote-tracking branch 'origin/master' into debian [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625887 [12:29:24] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] Merge remote-tracking branch 'origin/master' into debian [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625887 (owner: 10Vgutierrez) [12:30:09] (03CR) 10Giuseppe Lavagetto: [C: 03+2] default-network-policy: allow restbase HTTPS port [deployment-charts] - 10https://gerrit.wikimedia.org/r/625839 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:31:32] (03Merged) 10jenkins-bot: default-network-policy: allow restbase HTTPS port [deployment-charts] - 10https://gerrit.wikimedia.org/r/625839 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [12:32:34] (03PS6) 10Hashar: Explicitly mentions the repository in scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) [12:33:00] (03CR) 10jerkins-bot: [V: 04-1] Explicitly mentions the repository in scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/610254 
(https://phabricator.wikimedia.org/T257413) (owner: 10Hashar) [12:33:19] (03PS2) 10Vgutierrez: 1.8-1: Rebuild against Varnish 6 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625659 (https://phabricator.wikimedia.org/T261632) [12:33:42] (03PS7) 10Hashar: Explicitly mentions the repository in scap::sources [puppet] - 10https://gerrit.wikimedia.org/r/610254 (https://phabricator.wikimedia.org/T257413) [12:33:49] (03CR) 10jerkins-bot: [V: 04-1] 1.8-1: Rebuild against Varnish 6 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625659 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [12:34:47] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [12:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:46] !log kormat@cumin1001 dbctl commit (dc=all): 'Repooling after reboot. T261389', diff saved to https://phabricator.wikimedia.org/P12523 and previous config saved to /var/cache/conftool/dbconfig/20200908-123546-kormat.json [12:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:49] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: add alertmanager hiera variables for o11y [puppet] - 10https://gerrit.wikimedia.org/r/625885 (owner: 10Filippo Giunchedi) [12:39:09] (03CR) 10Filippo Giunchedi: [C: 03+2] pontoon: switch observability stack to wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/625886 (owner: 10Filippo Giunchedi) [12:39:31] (03PS4) 10Hashar: Add basic doc for python-build* images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605649 [12:39:33] (03PS3) 10Hashar: .gitignore docker-pkg-build.log [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619759 [12:39:35] (03PS5) 10Hashar: python-build: reuse previously built wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/605653 
(https://phabricator.wikimedia.org/T259611) [12:40:32] (03PS1) 10Muehlenhoff: Switch debmonitor to Envoy (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/625890 [12:41:11] (03PS2) 10Hashar: python-build: do not archive previously built wheels [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/619779 [12:42:47] (03PS3) 10Vgutierrez: 1.8-1: Rebuild against Varnish 6 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625659 (https://phabricator.wikimedia.org/T261632) [12:43:15] (03CR) 10jerkins-bot: [V: 04-1] 1.8-1: Rebuild against Varnish 6 [software/varnish/libvmod-netmapper] (debian) - 10https://gerrit.wikimedia.org/r/625659 (https://phabricator.wikimedia.org/T261632) (owner: 10Vgutierrez) [12:47:50] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [12:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:57] (03PS2) 10Muehlenhoff: Switch debmonitor to Envoy (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/625890 [12:52:04] (03PS3) 10Elukey: Add cookbook to restart Hadoop master daemons. [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 [12:53:22] (03CR) 10jerkins-bot: [V: 04-1] Add cookbook to restart Hadoop master daemons. [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 (owner: 10Elukey) [12:53:47] ah! [12:54:14] of course I made a change without running tox and I get punished [12:54:23] *make [12:54:40] (03PS1) 10Holger Knust: WIP: Add new watchlist job [dumps] - 10https://gerrit.wikimedia.org/r/625895 (https://phabricator.wikimedia.org/T51133) [12:56:01] (03CR) 10Volans: [C: 04-1] "One possible small issue and a couple of nits, the rest LGTM." 
(033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 (owner: 10Elukey) [12:57:29] (03CR) 10Holger Knust: "The python portion (incomplete)" [dumps] - 10https://gerrit.wikimedia.org/r/625895 (https://phabricator.wikimedia.org/T51133) (owner: 10Holger Knust) [13:00:55] (03CR) 10Elukey: Add cookbook to restart Hadoop master daemons. (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 (owner: 10Elukey) [13:01:12] (03PS4) 10Elukey: Add cookbook to restart Hadoop master daemons. [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 [13:01:17] PROBLEM - Host ps1-d4-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:02:01] PROBLEM - Juniper alarms on asw2-d-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:02:53] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 (owner: 10Elukey) [13:02:55] (03PS1) 10Giuseppe Lavagetto: Allow restbase https from the default policy too [deployment-charts] - 10https://gerrit.wikimedia.org/r/625900 [13:03:51] (03CR) 10Ottomata: [C: 03+1] "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625771 (https://phabricator.wikimedia.org/T260582) (owner: 10Gergő Tisza) [13:04:28] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . [13:04:28] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [13:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:11] (03CR) 10Ottomata: "Ok! Might be next week..." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [13:08:09] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'staging' . [13:08:09] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'citoid' for release 'production' . [13:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:08] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'production' . [13:09:08] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'cxserver' for release 'staging' . [13:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:25] (03CR) 10Andrew Bogott: [C: 03+2] designate: stop creating 'legacy' entries (that is, things under wmflabs) [puppet] - 10https://gerrit.wikimedia.org/r/620937 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [13:10:23] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:10:27] RECOVERY - PyBal IPVS diff check on lvs1016 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [13:10:57] (03CR) 10Elukey: [C: 03+2] Add cookbook to restart Hadoop master daemons. [cookbooks] - 10https://gerrit.wikimedia.org/r/625782 (owner: 10Elukey) [13:11:32] 10Operations, 10ops-codfw, 10DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (10Papaul) The log on says "It has been corrected by h/w and requires no further action" so i don't think this will be enough to replace the memory because it is not saying that there is an error but there were... 
[13:12:39] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop' for release 'production' . [13:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:52] 10Operations, 10ops-codfw, 10DBA: db2127 memory errors - https://phabricator.wikimedia.org/T262247 (10Marostegui) Excellent, makes sense @Papaul Right now it is not a good moment to depool an s3 host due to some on-going investigations. I will ping you once we are ready to depool this host and get it upgrad... [13:13:51] PROBLEM - Host ps1-d3-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:13:56] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'changeprop-jobqueue' for release 'production' . [13:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:32] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters [13:14:32] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) [13:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'staging' . [13:16:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'echostore' for release 'production' . [13:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:22] 10Operations, 10Developer-Advocacy, 10Discourse, 10Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) [13:16:25] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'production' . 
[13:16:25] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics-external' for release 'canary' . [13:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:43] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'production' . [13:16:43] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-analytics' for release 'canary' . [13:16:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:49] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message produce rate in last 30m on icinga1001 is OK: (C)0 le (W)25 le 30.28 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:16:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:07] RECOVERY - Juniper alarms on asw2-d-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:17:10] 04Critical Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Juniper alarm active [13:17:25] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'production' . [13:17:25] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mathoid' for release 'staging' . 
[13:17:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:17:35] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw average message consume rate in last 30m on icinga1001 is OK: (C)0 le (W)25 le 31.98 https://wikitech.wikimedia.org/wiki/Kafka/Administration%23MirrorMaker https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [13:17:48] (03PS1) 10Elukey: sre.hadoop.roll-restart-masters.py: fix cumin aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/625905 [13:18:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'production' . [13:18:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'staging' . [13:18:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'mobileapps' for release 'nontls' . [13:18:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:04] XioNoX: those 2 seems to be saying contradictory things. Juniper alarms on asw2-d-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms vs Critical Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Juniper alarm active [13:18:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:20] which made me ask "Is there an alarm or not after all?" 
[13:18:33] jclark-ctr, cmjohnson1 please !log when starting a maintenance :) [13:18:36] !log swapping pdu's in eqiad, mgmt for racks d3 and d4 will go down [13:18:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:45] just doing that now [13:18:47] cmjohnson1: they're already down :) [13:19:00] (03CR) 10Elukey: [C: 03+2] sre.hadoop.roll-restart-masters.py: fix cumin aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/625905 (owner: 10Elukey) [13:19:03] 10Operations, 10Wikimedia-Mailing-lists: Integrate mailing lists in Wikimedia Space - https://phabricator.wikimedia.org/T226727 (10Aklapper) [13:19:51] akosiaris: Icinga and LibreNMS have their own latency [13:20:00] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'proton' for release 'production' . [13:20:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:24] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'staging' . [13:20:24] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'sessionstore' for release 'production' . [13:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:32] akosiaris: but yeah, I think it's enough in normal time to double check, and those are uncommon enough [13:20:36] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:20:36] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' . [13:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:43] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'staging' . 
[13:20:43] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'test' . [13:20:43] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'termbox' for release 'production' . [13:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:56] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters [13:20:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:06] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'production' . [13:21:06] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'wikifeeds' for release 'staging' . [13:21:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:32] (03PS1) 10Hashar: Add CI entry point to run tox [deployment-charts] - 10https://gerrit.wikimedia.org/r/625909 [13:21:34] (03PS1) 10Hashar: update_version: improve tox.ini [deployment-charts] - 10https://gerrit.wikimedia.org/r/625910 [13:21:40] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'staging' . [13:21:40] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'zotero' for release 'production' . 
[13:21:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:10] Critical Device asw2-d-eqiad.mgmt.eqiad.wmnet recovered from Juniper alarm active [13:22:48] (03CR) 10jerkins-bot: [V: 04-1] Add CI entry point to run tox [deployment-charts] - 10https://gerrit.wikimedia.org/r/625909 (owner: 10Hashar) [13:22:53] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) @Jclark-ctr I understand that is hpe's response. What is //your// advice regarding followup steps, close this due to "no actionable"? [13:23:47] (03PS3) 10Muehlenhoff: Switch debmonitor to Envoy (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/625890 [13:24:48] (03CR) 10Hashar: [C: 03+2] "We would need CI to be setup for that update_version script. I have proposed the boilerplate config in two follow up changes:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/624963 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [13:24:49] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'production' . [13:24:49] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-logging-external' for release 'canary' .
[13:24:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:49] !log Restarted puppetdb on deployment-puppetdb03 (T248041) [13:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:55] T248041: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 [13:26:05] (03Merged) 10jenkins-bot: Make update_version.py work with python 3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/624963 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [13:26:22] (03CR) 10Hashar: "recheck unrelated issue with eventgate-analytics" [deployment-charts] - 10https://gerrit.wikimedia.org/r/625909 (owner: 10Hashar) [13:26:46] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'production' . [13:26:46] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventgate-main' for release 'canary' . [13:26:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:04] (03CR) 10Hashar: "Locally one would have basepython=python3:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/625910 (owner: 10Hashar) [13:28:25] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'canary' . [13:28:25] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'eventstreams' for release 'production' . 
[13:28:27] 10Operations, 10Developer-Advocacy, 10Discourse, 10Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) [13:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:34] 10Operations, 10Developer-Advocacy, 10Discourse: Migration of discourse-mediawiki.wmflabs.org from wmflabs to production - https://phabricator.wikimedia.org/T184461 (10Aklapper) 05Stalled→03Declined There are no active Discourse instances in Wikimedia currently (discourse-mediawiki.wmflabs.org and #Spac... [13:30:09] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:30:33] (03CR) 10Andrew Bogott: [C: 03+2] OpenStack Neutron config: remove the 'tld' variable [puppet] - 10https://gerrit.wikimedia.org/r/624763 (owner: 10Andrew Bogott) [13:30:36] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [13:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:58] (03CR) 10Andrew Bogott: [C: 03+2] Nova/Neutron: set dhcp_domain to eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/620936 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [13:32:23] 10Operations, 10Developer-Advocacy, 10Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) 05Stalled→03Declined Declining for the time being, as there are no active Discourse instances in Wikimedia currently (discourse-mediawiki.w... 
[13:32:25] 10Operations, 10Developer-Advocacy, 10Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) [13:33:15] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:33:30] 10Operations, 10MediaWiki-General, 10serviceops, 10Patch-For-Review, 10Service-Architecture: Create a service-to-service proxy for handling HTTP calls from services to other entities - https://phabricator.wikimedia.org/T244843 (10Joe) [13:33:57] PROBLEM - IPMI Sensor Status on pc1010 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [13:34:02] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) [13:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:12] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:34:29] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Papaul) Dell Tech Support via r1hhmgz5xjn6.0b-gampeak.na98.bnc.salesforce.com 8:30 AM (3 minutes ago) to me ** Please Do Not Change Subj... [13:34:55] PROBLEM - Host kubernetes1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:01] PROBLEM - Host wdqs1005 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:03] PROBLEM - Host stat1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:13] PROBLEM - Host kubernetes1013 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:19] PROBLEM - Host wtp1045 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:55] !log the power cable was not properly seated and lost power to asw2-d3-eqiad [13:35:56] ehm what? 
[13:35:57] PROBLEM - Host elastic1063 is DOWN: PING CRITICAL - Packet loss = 100% [13:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:03] ah that explains it [13:36:09] PROBLEM - Host dbproxy1017 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:13] PROBLEM - Host maps1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:32] PROBLEM - Host elastic1062 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:43] XioNox it powering up now, I double checked it but was still wrong [13:36:45] PROBLEM - Host mw1365 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:45] PROBLEM - Host wtp1043 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:45] PROBLEM - Host wtp1044 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:47] PROBLEM - Host mw1364 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:47] PROBLEM - Host aqs1009 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:47] PROBLEM - Host restbase1018 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:49] PROBLEM - Host restbase1025 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:53] PROBLEM - Host eventlog1002 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:55] PROBLEM - Host ganeti1019 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:59] (03PS1) 10Elukey: sre.hadoop.roll-restart-masters.py: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/625913 [13:36:59] PROBLEM - Host ores1007 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:59] PROBLEM - Host scb1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:36:59] PROBLEM - Host rdb1006 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:01] PROBLEM - Host sessionstore1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:01] PROBLEM - Host mw1363 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:09] PROBLEM - Host pc1010 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:09] PROBLEM - Host schema1004 is DOWN: PING CRITICAL - Packet loss = 100% [13:37:09] PROBLEM - Host releases1002 is DOWN: PING CRITICAL - 
Packet loss = 100% [13:37:11] <_joe_> wow that many systems [13:37:24] <_joe_> sessionstore1003 is cassandra, not sure if something will be needed [13:37:25] full rack [13:37:27] PROBLEM - Host thorium is DOWN: PING CRITICAL - Packet loss = 100% [13:37:48] not worried about eqiad nodes, but should we be worried about restbase on codfw? [13:37:50] list of hosts: https://netbox.wikimedia.org/dcim/racks/37/ [13:37:52] cmjohnson1: is this just networking? or did all those hosts also lost power? [13:38:06] I am gather the former, just double checking [13:38:07] PROBLEM - Juniper alarms on asw2-d-eqiad is CRITICAL: JNX_ALARMS CRITICAL - 1 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [13:38:17] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:17] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:17] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:17] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:38:23] I guess it is just the switch because of the PDU maintenance akosiaris ? 
[13:38:29] PROBLEM - Host cp1088 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:29] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:29] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:29] is the rack switch booting up though?
[13:38:33] PROBLEM - Host logstash1012 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:33] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:37] PROBLEM - Host mw1357 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:37] PROBLEM - Host mw1356 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:37] marostegui: my guess as well
[13:38:39] PROBLEM - Host kafka-jumbo1009 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:39] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:41] PROBLEM - Host kafka-jumbo1006 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:41] PROBLEM - Host ms-be1039 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:41] PROBLEM - Host mw1350 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:43] PROBLEM - Host dumpsdata1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:43] <_joe_> akosiaris: it seems mobileapps is having issue in codfw
[13:38:45] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:47] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:47] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:47] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:47] PROBLEM - Host mw1354 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:49] PROBLEM - Host labweb1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:50] PROBLEM - Host mc1035 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:50] PROBLEM - Host ms-be1037 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:51] PROBLEM - Host mw1355 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:51] PROBLEM - Host mw1352 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:55] PROBLEM - Host snapshot1009 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:57] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:38:57] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:38:57] PROBLEM - Host mc1034 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:57] PROBLEM - Host wdqs1008 is DOWN: PING CRITICAL - Packet loss = 100%
[13:38:57] PROBLEM - Host wtp1047 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:01] PROBLEM - Host aqs1006 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:01] PROBLEM - Host an-presto1003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:03] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[13:39:03] PROBLEM - Host mw1351 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:03] PROBLEM - Host mw1359 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:03] PROBLEM - Host mw1362 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:05] PROBLEM - Host es1018 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:09] PROBLEM - Host snapshot1007 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:10] this is weird...
[13:39:11] PROBLEM - Host elastic1060 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:17] PROBLEM - Host mw1361 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:18] I did not expect codfw to be impacted by this
[13:39:19] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:21] PROBLEM - Host elastic1061 is DOWN: PING CRITICAL - Packet loss = 100%
[13:39:21] 👋
[13:39:26] <_joe_> akosiaris: probably it's not
[13:39:29] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:29] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:31] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:32] is this for us?
[13:39:33] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:33] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:34] <_joe_> let's focus on mobileapps in codfw
[13:39:37] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[13:39:37] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:41] (I am in a meeting but will drop if needed)
[13:39:43] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:43] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:43] <_joe_> that's what generates the featured endpoint
[13:39:44] _joe_: thinking a monitoring issue?
[13:39:48] <_joe_> no.
[13:39:51] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job={redis_maps,swagger_check_restbase_cluster_codfw,swagger_check_wikifeeds_codfw,swagger_check_wikifeeds_eqiad} site={codfw,eqiad} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[13:39:54] ah, so real then
[13:39:59] <_joe_> oh wikifeeds
[13:39:59] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:39:59] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:40:00] elastic1061 is in D1
[13:40:00] D1: Initial commit - https://phabricator.wikimedia.org/D1
[13:40:01] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:40:04] apergos: rzl power failure in rack switch
[13:40:05] who is affected?
[13:40:07] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:40:08] _joe_: thinking coincidence?
[13:40:09] ah
[13:40:11] RECOVERY - Juniper alarms on asw2-d-eqiad is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm
[13:40:12] <_joe_> akosiaris: not sure
[13:40:15] I think it's related to the lost restbase nodes
[13:40:17] jayme: ack, saw
[13:40:19] PROBLEM - Host mw1358 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:20] host list and rack don't match for me
[13:40:21] PROBLEM - Host centrallog1001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:21] PROBLEM - Host mc1036 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:23] <_joe_> akosiaris: oh possibly
[13:40:25] <_joe_> yes
[13:40:26] wikifeeds AND mobileapps talk to restbase
[13:40:29] PROBLEM - Host wtp1046 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:33] and both have issues
[13:40:35] PROBLEM - Host wtp1048 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:36] <_joe_> yes this is way more than we expected
[13:40:37] PROBLEM - Host puppetmaster1002 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:41] rack d1 affected too?
[13:40:43] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[13:40:47] PROBLEM - Host ores1008 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:49] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:52] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:53] cmjohnson1: please focus on communication first - we'd like to understand what went down
[13:40:55] PROBLEM - Host mc1033 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:57] PROBLEM - Host druid1003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:40:59] volans: one of the hosts was in D3 too
[13:41:00] D3: test - ignore - https://phabricator.wikimedia.org/D3
[13:41:01] PROBLEM - PHP7 rendering on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:41:05] and d4 too
[13:41:05] PROBLEM - Host restbase-dev1006 is DOWN: PING CRITICAL - Packet loss = 100%
[13:41:25] so the switches are interconnected with the rest of the row, it /can/ have impact on others, although they should route around problems
[13:41:26] ok I am here if I can help
[13:41:34] volans: so d3 and d4 are the PDU work that was going on, so that is "expected"
[13:41:34] maybe let's move to #-sre ?
[13:41:44] elukey: yes please
[13:41:50] too many things at once yeah, indeed moving
[13:42:11] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 5183 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:42:11] PROBLEM - SSH on conf1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:42:13] PROBLEM - Varnish HTTP text-frontend - port 80 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:42:13] PROBLEM - Varnish HTTP text-frontend - port 3126 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:42:13] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:42:13] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:42:19] PROBLEM - nova instance creation test on cloudcontrol1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[13:42:33] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-s
[13:42:33] }/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was receive
[13:42:33] imedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:42:33] i sent chris a text to come talk to us on irc
[13:42:35] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:42:42] XioNoX it's up but not seeing any traffic
[13:42:47] PROBLEM - SSH on ms-be1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:42:51] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:42:52] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:42:59] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:43:00] mark see message above
[13:43:03] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:43:07] cmjohnson1: so
[13:43:07] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:43:13] cmjohnson1: is just the switch down, or the entire rack?
[13:43:20] just the swith
[13:43:22] switch
[13:43:25] we are seeing impact in other racks
[13:43:27] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-s
[13:43:27] }/{start}/{end} (Get pagecounts) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:43:27] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1
[13:43:30] is there any chance they are also impacted by power?
[13:43:54] no
[13:43:55] PROBLEM - SSH on backup1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:43:57] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received: /analytics.wikimedia.org/v1/m
[13:43:57] file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[13:43:59] and is the switch that was affected powered again and booting back up?
[13:44:01] PROBLEM - Auth DNS on dns1002 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[13:44:09] PROBLEM - puppetmaster backend https on puppetmaster1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 403 Forbidden https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[13:44:11] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1287.eqiad.wmnet, mw1386.eqiad.wmnet, mw1284.eqiad.wmnet, mw1412.eqiad.wmnet, mw1327.eqiad.wmnet, mw1380.eqiad.wmnet, mw1371.eqiad.wmnet, mw1321.eqiad.wmnet, mw1401.eqiad.wmnet, mw1274.eqiad.wmnet, mw1405.eqiad.wmnet, mw1395.eqiad.wmnet, mw1320.eqiad.wmnet, mw1399.eqiad.wmnet, mw1290.eqiad.wmnet]) https://wikitech.wikimedi
[13:44:13] PROBLEM - Host an-worker1093 is DOWN: PING CRITICAL - Packet loss = 100%
[13:44:13] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1267.eqiad.wmnet, mw1268.eqiad.wmnet, mw1327.eqiad.wmnet, mw1395.eqiad.wmnet, mw1411.eqiad.wmnet, mw1371.eqiad.wmnet, mw1369.eqiad.wmnet, mw1270.eqiad.wmnet, mw1331.eqiad.wmnet, mw1399.eqiad.wmnet, mw1266.eqiad.wmnet are marked down but pooled: api_80: Servers mw1339.eqiad.wmnet are marked down but pooled: apa
[13:44:13] mw1333.eqiad.wmnet, mw1274.eqiad.wmnet, mw1372.eqiad.wmnet, mw1331.eqiad.wmnet, mw1322.eqiad.wmnet, mw1395.eqiad.wmnet, mw1261.eqiad.wmnet, mw1369.eqiad.wmnet, mw1370.eqiad.wmnet, mw1328.eqiad.wmnet, mw1270.eqiad.wmnet, mw1399.eqiad.wmnet, mw1266.eqiad.wmnet, mw1405.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1400.eqiad.wmnet, mw1276.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/P
[13:44:19] PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:44:34] The switch lost power, there is a chance that there was a power surge when plugging it back in
[13:44:35] PROBLEM - Varnish HTTP text-frontend - port 3125 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:44:38] the switch
[13:44:45] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:44:46] cmjohnson1: does it look like it is powered right now?
[13:44:48] XioNoX: ping
[13:44:51] mark I want to pull power and reboot
[13:44:51] PROBLEM - SSH on ms-be1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:44:57] PROBLEM - Host ms-be1026 is DOWN: PING CRITICAL - Packet loss = 100%
[13:44:57] it is powered on
[13:44:59] PROBLEM - puppetmaster https on puppetmaster1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[13:45:02] PROBLEM - Check systemd state on es1016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:02] PROBLEM - recommendation_api endpoints health on scb1002 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) is CRITICAL: Test Caption addition suggestions returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - normal source and target with seed) is CRITICAL: Test article.creation.translation - normal source and
[13:45:02] returned the unexpected status 504 (expecting: 200): /{domain}/v1/article/creation/translation/{source}{/seed} (article.creation.translation - bad seed) is CRITICAL: Test article.creation.translation - bad seed returned the unexpected status 504 (expecting: 404): /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) is CRITICAL: Test Caption translation suggestions returned the unexpected st
[13:45:02] ng: 200): /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) is CRITICAL: Test Description translation suggestions returned the unexpected stat https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:45:03] no link lights
[13:45:08] cmjohnson1: let's coordinate
[13:45:17] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1319.eqiad.wmnet, mw1395.eqiad.wmnet, mw1371.eqiad.wmnet, mw1329.eqiad.wmnet, mw1367.eqiad.wmnet, mw1274.eqiad.wmnet, mw1322.eqiad.wmnet, mw1331.eqiad.wmnet, mw1321.eqiad.wmnet, mw1327.eqiad.wmnet, mw1413.eqiad.wmnet, mw1369.eqiad.wmnet, mw1272.eqiad.wmnet, mw1399.eqiad.wmnet, mw1266.eqiad.wmnet, mw1326.eqiad.
[13:45:17] down but pooled: api_80: Servers mw1386.eqiad.wmnet, mw1284.eqiad.wmnet, mw1378.eqiad.wmnet, mw1383.eqiad.wmnet, mw1388.eqiad.wmnet, mw1339.eqiad.wmnet are marked down but pooled: apaches_80: Servers mw1401.eqiad.wmnet, mw1331.eqiad.wmnet, mw1267.eqiad.wmnet, mw1322.eqiad.wmnet, mw1321.eqiad.wmnet, mw1269.eqiad.wmnet, mw1403.eqiad.wmnet, mw1325.eqiad.wmnet, mw1274.eqiad.wmnet, mw1261.eqiad.wmnet, mw1413.eqiad.wmnet, mw1369.eqiad.
[13:45:17] ad.wmnet, mw1320.eqiad.wmnet, mw1399.eqiad.wmnet, mw1326.eqiad.wmnet are marked down but pooled: api-https_443: Servers mw1276.eqiad.wmnet, mw1290.eqiad.wmnet, mw1383.eqiad.wmnet are ma https://wikitech.wikimedia.org/wiki/PyBal
[13:45:31] PROBLEM - Apache HTTP on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:45:35] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[13:45:35] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:45:39] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.02264 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:45:45] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:45:47] PROBLEM - Host ms-be1056 is DOWN: PING CRITICAL - Packet loss = 100%
[13:45:47] PROBLEM - SSH on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:45:47] RECOVERY - SSH on backup1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:45:49] PROBLEM - Check systemd state on db1103 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:49] PROBLEM - PHP7 rendering on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:46:03] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:46:07] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:46:13] PROBLEM - SSH on analytics1076 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:46:17] PROBLEM - Check systemd state on db1143 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:46:23] PROBLEM - Host mc-gp1003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:46:57] PROBLEM - Host 2620:0:861:4:208:80:155:108 is DOWN: PING CRITICAL - Packet loss = 100%
[13:46:57] <_joe_> wat
[13:47:01] PROBLEM - rsyslog in eqiad is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[13:47:02] PROBLEM - Widespread puppet agent failures on icinga1001 is CRITICAL: 0.01429 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[13:47:11] PROBLEM - Check systemd state on restbase1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:12] PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:12] PROBLEM - PHP7 rendering on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:47:15] PROBLEM - SSH on kafka-jumbo1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:47:23] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:47:25] PROBLEM - rsyslog in codfw is failing to deliver messages on icinga1001 is CRITICAL: action=fwd_centrallog1001.eqiad.wmnet:6514 https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops
[13:47:34] _joe_: a switch is down
[13:47:35] PROBLEM - Apache HTTP on mw1400 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:47:39] PROBLEM - Check systemd state on analytics1058 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:39] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:47:39] RECOVERY - SSH on cp1089 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:47:43] PROBLEM - Check systemd state on mw1320 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:43] RECOVERY - PHP7 rendering on mw1274 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:47:55] XioNoX, mark: next step?
[13:47:56] PROBLEM - Auth DNS #page on nsa-v4 is CRITICAL: DNS_QUERY CRITICAL - no socket TCP[198.35.27.27] Connection timed out https://wikitech.wikimedia.org/wiki/DNS
[13:47:57] PROBLEM - PHP7 rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:48:01] cmjohnson1: please wait
[13:48:02] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={mc2034,mc2035} site=codfw tunnel={mc1034_v4,mc1035_v4} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[13:48:05] PROBLEM - Apache HTTP on mw1401 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:48:05] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:48:09] PROBLEM - Check systemd state on ores1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:11] i am looking at the switches, XioNoX not around
[13:48:15] PROBLEM - PHP7 rendering on mw1395 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:48:16] <_joe_> we're losing dns too
[13:48:23] PROBLEM - PHP7 rendering on mw1287 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:48:27] RECOVERY - Varnish HTTP text-frontend - port 3125 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:48:35] PROBLEM - SSH on dns1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:48:35] PROBLEM - Apache HTTP on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:48:42] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 3 https://wikitech.wikimedia.org/wiki/HAProxy
[13:48:45] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance=mc1035 site=eqiad tunnel=mc2035_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[13:48:47] RECOVERY - SSH on ms-be1043 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:48:48] (PS1) Andrew Bogott: Revert "Nova/Neutron: set dhcp_domain to eqiad1.wikimedia.cloud" [puppet] - https://gerrit.wikimedia.org/r/625757
[13:48:49] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:48:55] RECOVERY - puppetmaster https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 415 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[13:48:57] PROBLEM - SSH on cloudelastic1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:48:57] PROBLEM - SSH on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:48:59] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:49:03] RECOVERY - recommendation_api endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[13:49:05] PROBLEM - Check systemd state on mc1029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:09] PROBLEM - SSH on rdb1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:49:11] RECOVERY - PHP7 rendering on mw1400 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 6.083 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:49:11] PROBLEM - SSH on analytics1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:49:12] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[13:49:14] (CR) jerkins-bot: [V: -1] Revert "Nova/Neutron: set dhcp_domain to eqiad1.wikimedia.cloud" [puppet] - https://gerrit.wikimedia.org/r/625757 (owner: Andrew Bogott)
[13:49:17] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:23] PROBLEM - Check systemd state on kubernetes1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:27] PROBLEM - Check systemd state on mw1321 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:49:27] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:49:29] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:49:35] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:49:39] PROBLEM - Varnish HTTP text-frontend - port 3123 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:49:41] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.1 200 Ok - 32228 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:49:51] PROBLEM - SSH on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:49:55] PROBLEM - SSH on lvs1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:49:57] PROBLEM - Apache HTTP on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:49:59] PROBLEM - Apache HTTP on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[13:50:01] RECOVERY - Apache HTTP on mw1401 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 3.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:50:07] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:50:17] RECOVERY - PHP7 rendering on mw1395 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 9.093 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:50:17] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[13:50:19] PROBLEM - PHP7 rendering on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:50:21] PROBLEM - PyBal backends health check on lvs1016 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1321.eqiad.wmnet, mw1401.eqiad.wmnet, mw1371.eqiad.wmnet, mw1372.eqiad.wmnet, mw1267.eqiad.wmnet, mw1322.eqiad.wmnet, mw1331.eqiad.wmnet, mw1272.eqiad.wmnet, mw1274.eqiad.wmnet, mw1261.eqiad.wmnet, mw1413.eqiad.wmnet, mw1328.eqiad.wmnet, mw1405.eqiad.wmnet, mw1320.eqiad.wmnet, mw1399.eqiad.wmnet are marked dow
[13:50:22] ches_80: Servers mw1321.eqiad.wmnet, mw1328.eqiad.wmnet, mw1265.eqiad.wmnet, mw1395.eqiad.wmnet, mw1371.eqiad.wmnet, mw1372.eqiad.wmnet, mw1367.eqiad.wmnet, mw1331.eqiad.wmnet, mw1322.eqiad.wmnet, mw1268.eqiad.wmnet, mw1403.eqiad.wmnet, mw1413.eqiad.wmnet, mw1411.eqiad.wmnet,
mw1261.eqiad.wmnet, mw1270.eqiad.wmnet, mw1399.eqiad.wmnet, mw1326.eqiad.wmnet are marked down but pooled: api_80: Servers mw1290.eqiad.wmnet, mw1284.eqiad. [13:50:22] ad.wmnet, mw1276.eqiad.wmnet, mw1412.eqiad.wmnet, mw1396.eqiad.wmnet, mw1404.eqiad.wmnet, mw1388.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:50:23] PROBLEM - Check systemd state on mw1269 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:25] PROBLEM - graphoid endpoints health on scb1002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [13:50:27] RECOVERY - SSH on dns1002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:50:27] RECOVERY - Apache HTTP on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.097 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:50:35] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [13:50:37] PROBLEM - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by ReadTimeoutError(HTTPSConnectionPool(host=text-lb.eqiad.wikimedia.org, port=443): Read timed out. 
(read timeout=15),): /api/rest_v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [13:50:41] PROBLEM - Varnish HTTP upload-frontend - port 3125 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [13:50:49] RECOVERY - SSH on cloudelastic1004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:50:51] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:01] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:02] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:07] PROBLEM - PHP7 rendering on mw1388 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:51:11] PROBLEM - PHP7 rendering on mw1372 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:51:11] RECOVERY - SSH on kafka-jumbo1008 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:51:17] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:51:25] PROBLEM - SSH on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:51:27] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 6.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:51:31] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1087 
is OK: HTTP OK: HTTP/1.0 200 OK - 23583 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:51:45] PROBLEM - Check systemd state on mw1333 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:45] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:51] RECOVERY - Apache HTTP on mw1327 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.100 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:51:51] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:51:55] PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:55] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:01] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:03] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:52:09] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 6.117 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:52:17] PROBLEM - AuthDNS-over-TLS Works on authdns1001 is CRITICAL: CRITICAL: ns[012] kdig DoTLS check failure https://wikitech.wikimedia.org/wiki/DNS [13:52:19] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:21] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:21] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:27] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:31] PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:33] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:33] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:33] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:37] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:41] PROBLEM - Check systemd state on elastic1053 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:43] PROBLEM - Apache HTTP on mw1371 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:52:43] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:52:45] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:45] PROBLEM - PHP7 rendering on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:52:49] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1087 is OK: HTTP OK: HTTP/1.0 200 OK - 26031 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:52:49] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:52:49] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:49] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:52:51] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:52:55] RECOVERY - SSH on ms-be1056 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:52:57] PROBLEM - PHP7 rendering on mw1267 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:53:01] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:03] RECOVERY - PHP7 rendering on mw1388 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 2.066 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:53:05] PROBLEM - Apache HTTP on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:53:06] <_joe_> we seem to be losing more servers [13:53:09] RECOVERY - SSH on analytics1077 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:53:11] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [13:53:12] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:29] PROBLEM - Apache HTTP on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:53:35] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [13:53:36] (03PS1) 10Ppchelko: Enable OAuthRateLimiter in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/625914 (https://phabricator.wikimedia.org/T258423) [13:53:39] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:53:41] PROBLEM - Recursive DNS on 208.80.155.108 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [13:53:45] RECOVERY - Apache HTTP on mw1267 is OK: HTTP OK: HTTP/1.1 302 Found 
- 631 bytes in 8.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:53:45] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:45] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:45] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [13:53:49] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:51] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:53:53] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:53:57] RECOVERY - SSH on lvs1016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:53:57] (03CR) 10Ppchelko: [C: 04-2] "Depends on the train." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/625914 (https://phabricator.wikimedia.org/T258423) (owner: 10Ppchelko) [13:54:02] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:54:09] RECOVERY - Apache HTTP on mw1321 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 9.377 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:54:11] RECOVERY - PHP7 rendering on mw1281 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 9.106 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:54:11] RECOVERY - Varnish HTTP text-frontend - port 80 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:54:13] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:54:19] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:54:19] RECOVERY - Auth DNS on dns1002 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [13:54:21] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:54:21] PROBLEM - PHP7 rendering on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:54:21] RECOVERY - SSH on analytics1076 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:54:22] RECOVERY - PHP7 rendering on mw1331 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.083 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:54:25] PROBLEM - Varnish HTTP upload-frontend - port 3126 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [13:54:35] RECOVERY - Varnish HTTP upload-frontend - port 3125 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [13:54:35] RECOVERY - PHP7 rendering on mw1287 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 9.102 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:54:39] RECOVERY - Apache HTTP on mw1371 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:54:39] RECOVERY - PHP7 rendering on mw1403 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:54:41] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [13:54:45] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1090 is OK: HTTP OK: HTTP/1.0 200 OK - 25835 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:54:45] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.0 200 OK - 26020 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:54:55] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 0 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [13:54:59] RECOVERY - PHP7 rendering on mw1267 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 8.079 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:55:05] RECOVERY - Apache HTTP on mw1274 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 3.110 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:55:09] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:55:09] PROBLEM - SSH on ms-be1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:55:13] RECOVERY - SSH on rdb1010 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:55:15] ag [13:55:17] RECOVERY - PHP7 rendering on mw1372 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 6.086 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:55:18] ic [13:55:25] PROBLEM - SSH on an-worker1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:55:25] RECOVERY - SSH on mw1353 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:55:25] RECOVERY - Apache HTTP on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:55:35] PROBLEM - Varnish HTTP upload-frontend - port 3122 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [13:55:37] PROBLEM - PHP7 rendering on mw1369 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:55:43] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:55:52] RECOVERY - Apache HTTP on mw1400 is OK: HTTP OK: 
HTTP/1.1 302 Found - 631 bytes in 6.094 second response time https://wikitech.wikimedia.org/wiki/Application_servers [13:55:59] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:55:59] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 37 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [13:56:05] PROBLEM - SSH on an-worker1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:56:14] 08Warning Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Processor usage over 85% [13:56:15] RECOVERY - PHP7 rendering on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:56:21] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1087 is OK: HTTP OK: HTTP/1.1 200 Ok - 32219 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:56:25] PROBLEM - Kafka Broker Under Replicated Partitions on logstash1011 is CRITICAL: 182 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [13:56:25] RECOVERY - AuthDNS-over-TLS Works on authdns1001 is OK: OK: ns[012] kdig DoTLS check success https://wikitech.wikimedia.org/wiki/DNS [13:56:39] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:56:41] 
PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 41 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [13:56:41] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 27 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [13:56:57] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:57:02] PROBLEM - Kafka Broker Under Replicated Partitions on logstash1010 is CRITICAL: 198 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010 [13:57:09] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:57:21] RECOVERY - SSH on an-worker1094 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:57:22] PROBLEM - Too many messages in kafka logging-eqiad #o11y on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group=logstash instance=kafkamon1002 job=burrow partition={0,1,2} prometheus=ops site=eqiad topic={udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-dataso [13:57:22] 
luster=logging-eqiad&var-topic=All&var-consumer_group=All [13:57:25] PROBLEM - Check systemd state on mw1372 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:31] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 67 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [13:57:35] RECOVERY - PHP7 rendering on mw1369 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [13:57:39] PROBLEM - SSH on kafka-jumbo1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:57:41] PROBLEM - Apache HTTP on mw1405 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [13:57:47] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 32 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [13:57:53] PROBLEM - Varnish HTTP text-frontend - port 3127 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [13:57:55] PROBLEM - Check systemd state on elastic1054 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:57] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:58:05] RECOVERY - SSH on cp1090 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:58:13] PROBLEM - SSH on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:58:32] RECOVERY - Host wdqs1008 is UP: PING WARNING - Packet loss = 60%, RTA = 0.24 ms
[13:58:32] RECOVERY - Host mw1355 is UP: PING WARNING - Packet loss = 77%, RTA = 0.22 ms
[13:58:33] RECOVERY - Host logstash1012 is UP: PING WARNING - Packet loss = 77%, RTA = 0.23 ms
[13:58:33] RECOVERY - Host mw1361 is UP: PING WARNING - Packet loss = 71%, RTA = 0.22 ms
[13:58:33] RECOVERY - Host kafka-jumbo1009 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:58:33] RECOVERY - Host an-worker1093 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[13:58:33] RECOVERY - Host dumpsdata1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[13:58:34] RECOVERY - Host mw1356 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[13:58:34] RECOVERY - Host wtp1047 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:58:35] RECOVERY - Host ores1008 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:58:35] RECOVERY - Host labweb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[13:58:36] RECOVERY - Host snapshot1007 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[13:58:36] RECOVERY - Host cp1088 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms
[13:58:37] Operations, Analytics-Radar, Traffic, Patch-For-Review: Package varnish 6.0.x - https://phabricator.wikimedia.org/T261632 (Vgutierrez)
[13:58:37] RECOVERY - Host aqs1006 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[13:58:37] RECOVERY - Host kafka-jumbo1006 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms
[13:58:38] RECOVERY - Host mw1362 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[13:58:38] RECOVERY - Host mw1357 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[13:58:39] RECOVERY - Host mw1350 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[13:58:39] RECOVERY - Host mw1351 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[13:58:40] RECOVERY - Host druid1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:58:41] RECOVERY - Host snapshot1009 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[13:58:41] RECOVERY - Host mw1354 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms
[13:58:42] RECOVERY - Host wtp1046 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[13:58:42] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[13:58:42] RECOVERY - Host mw1359 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[13:58:43] RECOVERY - puppetmaster backend https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 0.025 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[13:58:43] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:58:44] RECOVERY - Host an-presto1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms
[13:58:44] RECOVERY - Host ms-be1039 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[13:58:45] RECOVERY - Host ms-be1037 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[13:58:45] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:58:46] RECOVERY - SSH on conf1006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:58:46] RECOVERY - Host elastic1061 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[13:58:47] RECOVERY - Host mc1035 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms
[13:58:48] RECOVERY - Host mw1352 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms
[13:58:49] RECOVERY - Host wtp1048 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:58:49] RECOVERY - Host ms-be1056 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[13:58:50] RECOVERY - Host mc-gp1003 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[13:58:50] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 3.212 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:58:50] RECOVERY - PyBal backends health check on lvs1016 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:58:51] RECOVERY - Host ms-be1026 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms
[13:58:51] RECOVERY - Host es1018 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[13:58:52] RECOVERY - graphoid endpoints health on scb1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[13:58:52] RECOVERY - Host puppetmaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[13:58:53] RECOVERY - Host mw1358 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms
[13:58:53] RECOVERY - Host centrallog1001 is UP: PING OK - Packet loss = 0%, RTA = 6.12 ms
[13:58:55] RECOVERY - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 505 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[13:58:55] RECOVERY - Host mc1034 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[13:58:59] RECOVERY - Host mc1036 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[13:59:03] RECOVERY - Host elastic1060 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms
[13:59:07] RECOVERY - Host mc1033 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[13:59:09] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.0 200 OK - 23583 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:59:09] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp108[789].eqiad.wmnet
[13:59:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:15] RECOVERY - SSH on ms-be1043 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:59:17] RECOVERY - SSH on cp1087 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:59:18] !log depooling cp1087-1090
[13:59:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:25] RECOVERY - Host 2620:0:861:4:208:80:155:108 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:59:25] RECOVERY - Host restbase-dev1006 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms
[13:59:25] !log bblack@cumin1001 conftool action : set/pooled=no; selector: name=cp1090.eqiad.wmnet
[13:59:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:29] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:59:31] RECOVERY - Host stat1005 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[13:59:31] RECOVERY - Too many messages in kafka logging-eqiad #o11y on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[13:59:33] RECOVERY - PHP7 rendering on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.098 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[13:59:35] RECOVERY - Varnish HTTP upload-frontend - port 3122 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[13:59:37] RECOVERY - SSH on kafka-jumbo1008 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:59:42] RECOVERY - Apache HTTP on mw1405 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.092 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[13:59:42] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[13:59:45] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 56, down: 4, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:59:53] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[13:59:59] RECOVERY - Recursive DNS on 208.80.155.108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[14:00:05] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1090 is OK: HTTP OK: HTTP/1.0 200 OK - 23423 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:00:12] RECOVERY - SSH on an-worker1095 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:00:13] RECOVERY - SSH on cp1089 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:00:19] RECOVERY - Restbase edge eqiad on text-lb.eqiad.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[14:00:30] RECOVERY - Auth DNS #page on nsa-v4 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[14:00:32] PROBLEM - Check systemd state on kafka-jumbo1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:00:39] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[14:00:43] RECOVERY - Kafka Broker Under Replicated Partitions on logstash1011 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011
[14:01:13] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[14:01:23] RECOVERY - Kafka Broker Under Replicated Partitions on logstash1010 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010
[14:01:25] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status
[14:01:32] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET
[14:01:39] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[14:01:45] PROBLEM - Apache HTTP on mw1361 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:01:49] RECOVERY - Varnish HTTP text-frontend - port 3123 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[14:01:59] PROBLEM - Host cp1088 is DOWN: PING CRITICAL - Packet loss = 100%
[14:01:59] PROBLEM - Host mw1362 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:01] PROBLEM - Host elastic1060 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:01] PROBLEM - Host es1018 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:05] PROBLEM - Host mw1352 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:07] PROBLEM - Host dumpsdata1002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:07] PROBLEM - Check systemd state on ms-be1023 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:02:07] PROBLEM - Host mc1035 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:07] PROBLEM - Host mw1359 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:09] PROBLEM - Host centrallog1001 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:11] PROBLEM - Host wtp1047 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:11] PROBLEM - Host mw1355 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:13] PROBLEM - Host wdqs1008 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:15] Warning Device asw2-d-eqiad.mgmt.eqiad.wmnet recovered from Processor usage over 85%
[14:02:19] PROBLEM - Host ms-be1037 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:23] PROBLEM - Host snapshot1007 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:27] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[14:02:27] PROBLEM - Host mw1354 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:31] RECOVERY - Varnish HTTP upload-frontend - port 3126 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[14:02:37] PROBLEM - Host aqs1006 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:37] PROBLEM - Host mc1034 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:39] PROBLEM - Host puppetmaster1002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:43] PROBLEM - Host mw1350 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:45] PROBLEM - Host elastic1061 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:51] PROBLEM - Host mw1357 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:51] PROBLEM - Host logstash1012 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:52] <_joe_> there we go again
[14:02:57] PROBLEM - Host mw1361 is DOWN: PING CRITICAL - Packet loss = 100%
[14:02:59] PROBLEM - Host an-presto1003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:03] PROBLEM - Host mw1356 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:05] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 876.3 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[14:03:07] PROBLEM - Host ms-be1039 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:27] PROBLEM - Host restbase-dev1006 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:27] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 3 https://wikitech.wikimedia.org/wiki/HAProxy
[14:03:28] <_joe_> oh sigh
[14:03:37] PROBLEM - Host kafka-jumbo1009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:45] <_joe_> elukey: we might need to disable replication in mcrouter
[14:03:48] PROBLEM - Host labweb1002 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:53] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime
[14:03:53] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[14:03:57] PROBLEM - Host wtp1048 is DOWN: PING CRITICAL - Packet loss = 100%
[14:03:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:01] PROBLEM - Host druid1003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:01] PROBLEM - Host mc1036 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:01] PROBLEM - Host mw1358 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:01] PROBLEM - Host kafka-jumbo1006 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:07] PROBLEM - Host snapshot1009 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:07] PROBLEM - Host mw1349 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:07] PROBLEM - Host wtp1046 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:09] PROBLEM - Host mw1351 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:15] PROBLEM - Host ores1008 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:15] PROBLEM - Host mc1033 is DOWN: PING CRITICAL - Packet loss = 100%
[14:04:33] PROBLEM - Host stat1005 is DOWN: PING CRITICAL - Packet loss = 100%
[14:05:01] PROBLEM - PHP7 rendering on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:05:41] PROBLEM - aqs endpoints health on aqs1005 is CRITICAL: /analytics.wikimedia.org/v1/legacy/pagecounts/aggregate/{project}/{access-site}/{granularity}/{start}/{end} (Get pagecounts) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/unique-devices/{proj
[14:05:41] }/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) timed out before a response was received: /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) timed out before a response was received: /analytics.w
[14:05:41] dits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) is CRITICAL: Test Get daily edits for english wikipe https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:05:47] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m
[14:05:55] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:05:59] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) is CRITICAL: Test retrieve the most read articles for January 1, 2016 returned the unexpected status 429 (expecting: 200): /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) is CRITICAL:
[14:05:59] most-read articles for January 1, 2016 (with aggregated=true) returned the unexpected status 429 (expecting: 200) https://wikitech.wikimedia.org/wiki/Wikifeeds
[14:06:02] PROBLEM - aqs endpoints health on aqs1007 is CRITICAL: /analytics.wikimedia.org/v1/edits/per-page/{project}/{page-title}/{editor-type}/{granularity}/{start}/{end} (Get daily edits for english wikipedia page 0) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:06:07] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:06:09] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:06:19] PROBLEM - Apache HTTP on mw1353 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:06:33] RECOVERY - Varnish HTTP text-frontend - port 3126 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish
[14:06:33] PROBLEM - aqs endpoints health on aqs1008 is CRITICAL: /analytics.wikimedia.org/v1/unique-devices/{project}/{access-site}/{granularity}/{start}/{end} (Get unique devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:06:37] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:06:51] PROBLEM - SSH on dbproxy1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:06:55] PROBLEM - Check systemd state on kafka-jumbo1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:58] PROBLEM - Auth DNS #page on nsa-v4 is CRITICAL: DNS_QUERY CRITICAL - no socket TCP[198.35.27.27] Connection timed out https://wikitech.wikimedia.org/wiki/DNS
[14:06:59] PROBLEM - Check systemd state on thanos-be1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:06:59] PROBLEM - SSH on backup1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:07:03] PROBLEM - aqs endpoints health on aqs1004 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) timed out before a response was received: /analytics.wikimedia.org/v1/pageviews/pe
[14:07:03] t}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:07:15] PROBLEM - puppetmaster backend https on puppetmaster1001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 403 Forbidden https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[14:07:19] PROBLEM - Varnish HTTP text-frontend - port 3121 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[14:07:25] PROBLEM - SSH on conf1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:07:29] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:07:35] PROBLEM - SSH on elastic1064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:07:35] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:07:57] PROBLEM - Host mc-gp1003 is DOWN: PING CRITICAL - Packet loss = 100%
[14:07:59] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:07:59] RECOVERY - aqs endpoints health on aqs1007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:08:05] PROBLEM - Check systemd state on restbase1020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:05] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:08:07] PROBLEM - Check systemd state on an-druid1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:13] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:08:19] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[14:08:22] PROBLEM - PyBal backends health check on lvs1015 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw1319.eqiad.wmnet, mw1395.eqiad.wmnet, mw1324.eqiad.wmnet, mw1367.eqiad.wmnet, mw1322.eqiad.wmnet, mw1333.eqiad.wmnet, mw1401.eqiad.wmnet, mw1403.eqiad.wmnet, mw1327.eqiad.wmnet, mw1328.eqiad.wmnet, mw1413.eqiad.wmnet, mw1369.eqiad.wmnet, mw1261.eqiad.wmnet, mw1405.eqiad.wmnet, mw1265.eqiad.wmnet are marked dow
[14:08:22] _80: Servers mw1342.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:08:35] PROBLEM - Check systemd state on es1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:08:47] RECOVERY - SSH on dbproxy1016 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:08:51] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[14:08:51] PROBLEM - Host an-worker1093 is DOWN: PING CRITICAL - Packet loss = 100%
[14:08:53] PROBLEM - PHP7 rendering on mw1274 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:09:01] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:17] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:09:19] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:19] PROBLEM - Check systemd state on kafka-main1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:27] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:28] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) @Marostegui please see below Hello Papaul, After looking over the TSR and the link you'd sent me regarding the troubleshooting for t...
[14:09:31] RECOVERY - SSH on elastic1064 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:09:35] PROBLEM - Host ms-be1056 is DOWN: PING CRITICAL - Packet loss = 100%
[14:09:39] PROBLEM - PHP7 rendering on mw1396 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:09:41] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:42] PROBLEM - Check systemd state on mw1287 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:42] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:49] PROBLEM - Check systemd state on mw1274 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:09:52] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:09:55] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[14:09:57] PROBLEM - Check systemd state on mw1371 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:03] PROBLEM - SSH on cloudelastic1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:10:11] PROBLEM - Check systemd state on elastic1040 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:11] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:11] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:12] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:15] PROBLEM - SSH on rdb1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:10:25] PROBLEM - Apache HTTP on mw1403 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:27] PROBLEM - Apache HTTP on mw1328 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:10:31] PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:31] PROBLEM - Check systemd state on analytics1074 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:35] PROBLEM - Check systemd state on etcd1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:41] PROBLEM - PHP7 rendering on mw1321 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:43] RECOVERY - aqs endpoints health on aqs1008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs
[14:10:45] PROBLEM - Check systemd state on dbproxy1014 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:45] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:51] PROBLEM - Check systemd state on db1137 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:10:51] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:51] RECOVERY - PHP7 rendering on mw1274 is OK: HTTP OK: HTTP/1.1 302 Found - 643 bytes in 0.085 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:10:57] PROBLEM - rsyslog TLS listener on port 6514 on centrallog2001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[14:10:57] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:59] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:59] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:10:59] PROBLEM - SSH on an-worker1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[14:11:01] PROBLEM - PHP7 rendering on mw1327 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:11:11] PROBLEM - Check systemd state on gerrit1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:13] PROBLEM - Apache HTTP on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:11:25] PROBLEM - PHP7 rendering on mw1331 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering
[14:11:25] PROBLEM - Check systemd state on cloudelastic1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:11:25] PROBLEM - Varnish HTTP text-frontend - port 3124 on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[14:11:27] RECOVERY - puppetmaster backend https on puppetmaster1001 is OK: HTTP OK: Status line output matched 400 - 414 bytes in 0.021 second response time https://wikitech.wikimedia.org/wiki/Puppet%23Debugging
[14:11:31] PROBLEM - Host ms-be1026 is DOWN: PING CRITICAL - Packet loss = 100%
[14:11:31] PROBLEM - Apache HTTP on mw1399 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[14:11:32] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:32] PROBLEM - restbase endpoints health on restbase-dev1004 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:11:35] PROBLEM - Varnish HTTP upload-frontend - port 3121 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish
[14:11:37] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is CRITICAL: 30 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002
[14:11:39] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is CRITICAL: 19 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005
[14:11:39] PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:41] PROBLEM - PHP7 rendering on mw1348 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:11:45] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:47] RECOVERY - PHP7 rendering on mw1396 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 9.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:11:51] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [14:11:52] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:53] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:11:57] PROBLEM - Check systemd state on ms-be1022 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:57] PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1 https://wikitech.wikimedia.org/wiki/HAProxy [14:11:59] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:12:01] PROBLEM - Aggregate IPsec Tunnel Status eqiad on icinga1001 is CRITICAL: instance=mc1034 site=eqiad tunnel=mc2034_v4 https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:12:03] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:12:09] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:12:17] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:12:17] RECOVERY - SSH on rdb1010 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:12:22] PROBLEM - Check systemd state on mw1369 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:23] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:12:29] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is CRITICAL: 67 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [14:12:29] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:12:31] PROBLEM - Apache HTTP on mw1380 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:12:33] RECOVERY - Apache HTTP on mw1328 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 3.107 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:12:33] RECOVERY - Apache HTTP on mw1403 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 6.095 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:12:33] PROBLEM - Check systemd state on mw1400 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:36] PROBLEM - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:12:37] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [14:12:45] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is CRITICAL: 16 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [14:12:49] RECOVERY - PHP7 rendering on mw1321 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 3.085 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:12:51] PROBLEM - PyBal IPVS diff check on lvs1015 is CRITICAL: CRITICAL: Hosts in IPVS but unknown to PyBal: set([mw1395.eqiad.wmnet, mw1401.eqiad.wmnet, mw1411.eqiad.wmnet, mw1274.eqiad.wmnet, mw1388.eqiad.wmnet, mw1372.eqiad.wmnet, mw1328.eqiad.wmnet, mw1267.eqiad.wmnet, mw1399.eqiad.wmnet, mw1326.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal [14:12:51] PROBLEM - Recursive DNS on 208.80.155.108 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [14:12:57] PROBLEM - PHP7 rendering on mw1267 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:01] RECOVERY - SSH on an-worker1095 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:13:03] PROBLEM 
- Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is CRITICAL: 13 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [14:13:05] RECOVERY - PHP7 rendering on mw1327 is OK: HTTP OK: HTTP/1.1 302 Found - 645 bytes in 3.593 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:09] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [14:13:09] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:13:11] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:13:15] RECOVERY - SSH on backup1001 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:13:19] PROBLEM - restbase endpoints health on restbase2021 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:13:20] RECOVERY - Auth DNS #page on nsa-v4 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [14:13:23] RECOVERY - Apache HTTP on mw1281 is OK: HTTP OK: HTTP/1.1 302 Found - 631 bytes in 9.104 second response time 
https://wikitech.wikimedia.org/wiki/Application_servers [14:13:25] PROBLEM - Aggregate IPsec Tunnel Status codfw on icinga1001 is CRITICAL: instance={mc2034,mc2035} site=codfw tunnel={mc1034_v4,mc1035_v4} https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:13:29] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1087 is OK: HTTP OK: HTTP/1.1 200 Ok - 32220 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:13:29] PROBLEM - Varnish HTTP text-frontend - port 3120 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [14:13:32] PROBLEM - SSH on ms-be1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:13:33] RECOVERY - Apache HTTP on mw1399 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:13:38] !log dns1002 - disable puppet + bird service (stop advertising recdns from row D) [14:13:38] !log drain kubernetes1013, kubernetes1004. 
They are on row D [14:13:39] PROBLEM - Apache HTTP on mw1413 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [14:13:39] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:13:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:45] RECOVERY - PHP7 rendering on mw1348 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.584 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:13:45] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 6145 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:52] RECOVERY - debmonitor.wikimedia.org:7443 CDN on debmonitor1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 505 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [14:13:53] PROBLEM - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:13:57] PROBLEM - Varnish HTTP upload-frontend - port 3120 on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [14:14:07] PROBLEM - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:14:07] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1087 is OK: 
HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:14:19] PROBLEM - Host 2620:0:861:4:208:80:155:108 is DOWN: PING CRITICAL - Packet loss = 100% [14:14:27] PROBLEM - Varnish HTTP text-frontend - port 3122 on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Varnish [14:14:35] RECOVERY - Apache HTTP on mw1380 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:14:35] PROBLEM - Check systemd state on kafkamon1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:38] RECOVERY - LVS wdqs-internal eqiad port 80/tcp - Wikidata Query Service - internal IPv4 #page on wdqs-internal.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:14:41] PROBLEM - Check systemd state on stat1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:14:59] RECOVERY - PHP7 rendering on mw1267 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.115 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:15:05] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:15:15] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1087 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:15:19] RECOVERY - Host logstash1012 is UP: PING WARNING - Packet loss = 33%, RTA = 0.19 ms [14:15:19] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.1 200 Ok - 32228 bytes in 7.420 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:15:21] RECOVERY - Host es1018 is UP: PING WARNING - Packet loss = 71%, RTA = 0.22 ms [14:15:21] RECOVERY - Host mc1035 is UP: PING WARNING - Packet loss = 33%, RTA = 0.28 ms [14:15:21] RECOVERY - Host cp1088 is UP: PING WARNING - Packet loss = 60%, RTA = 0.19 ms [14:15:21] RECOVERY - Host centrallog1001 is UP: PING WARNING - Packet loss = 33%, RTA = 4.19 ms [14:15:21] RECOVERY - Host kafka-jumbo1009 is UP: PING WARNING - Packet loss = 71%, RTA = 0.28 ms [14:15:22] RECOVERY - Host an-worker1093 is UP: PING WARNING - Packet loss = 90%, RTA = 0.23 ms [14:15:22] RECOVERY - Host mw1361 is UP: PING WARNING - Packet loss = 90%, RTA = 0.19 ms [14:15:23] RECOVERY - Host mw1355 is UP: PING WARNING - Packet loss = 33%, RTA = 0.23 ms [14:15:23] RECOVERY - Host stat1005 is UP: PING WARNING - Packet loss = 33%, RTA = 0.25 ms [14:15:24] RECOVERY - Host mw1359 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:15:25] RECOVERY - Host mw1356 is UP: PING OK - Packet loss = 
0%, RTA = 0.21 ms [14:15:25] RECOVERY - Host aqs1006 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:15:25] RECOVERY - Host mw1352 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [14:15:26] RECOVERY - Host ms-be1056 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:15:27] RECOVERY - Host elastic1061 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:15:27] RECOVERY - Host wtp1046 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:15:27] RECOVERY - Host mc1033 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:15:28] RECOVERY - Host mw1362 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [14:15:29] RECOVERY - Host mw1358 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [14:15:29] RECOVERY - Host mw1349 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [14:15:29] RECOVERY - Host wtp1048 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:15:30] RECOVERY - Host mw1350 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [14:15:30] RECOVERY - Host ms-be1026 is UP: PING OK - Packet loss = 0%, RTA = 0.16 ms [14:15:31] RECOVERY - Host snapshot1007 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [14:15:31] RECOVERY - Host mw1354 is UP: PING OK - Packet loss = 0%, RTA = 0.21 ms [14:15:32] RECOVERY - Varnish HTTP text-frontend - port 3124 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:15:33] RECOVERY - Varnish HTTP text-frontend - port 3121 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:15:33] RECOVERY - Host mc1034 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [14:15:33] RECOVERY - Host druid1003 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:15:34] RECOVERY - Host elastic1060 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:15:35] RECOVERY - Host restbase-dev1006 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:15:35] <_joe_> here we go XioNoX [14:15:35] RECOVERY - Host mw1357 
is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:15:36] PROBLEM - Check systemd state on kafka-jumbo1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:36] RECOVERY - SSH on ms-be1059 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:15:37] RECOVERY - Host mc-gp1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [14:15:37] RECOVERY - Host puppetmaster1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:15:38] RECOVERY - Varnish HTTP upload-frontend - port 3121 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:15:39] RECOVERY - PHP7 rendering on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:15:39] RECOVERY - PHP7 rendering on mw1331 is OK: HTTP OK: HTTP/1.1 302 Found - 644 bytes in 0.109 second response time https://wikitech.wikimedia.org/wiki/Application_servers/Runbook%23PHP7_rendering [14:15:39] RECOVERY - Host mw1351 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:15:40] RECOVERY - Host kafka-jumbo1006 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [14:15:40] RECOVERY - Host labweb1002 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [14:15:41] RECOVERY - Host ores1008 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:15:42] RECOVERY - Apache HTTP on mw1413 is OK: HTTP OK: HTTP/1.1 302 Found - 630 bytes in 0.103 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:15:42] RECOVERY - Host wdqs1008 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:15:45] RECOVERY - Host dumpsdata1002 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:15:47] PROBLEM - Check systemd state on cloudelastic1004 is CRITICAL: CRITICAL - degraded: The 
system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:15:47] RECOVERY - Host wtp1047 is UP: PING OK - Packet loss = 0%, RTA = 2.90 ms [14:15:47] RECOVERY - Host ms-be1037 is UP: PING OK - Packet loss = 0%, RTA = 0.14 ms [14:15:47] RECOVERY - Host ms-be1039 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [14:15:47] RECOVERY - Host an-presto1003 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:15:51] RECOVERY - SSH on conf1006 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:15:55] RECOVERY - Ensure traffic_manager binds on 443 and responds to HTTP requests on cp1090 is OK: HTTP OK: HTTP/1.1 200 Ok - 32004 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:15:57] RECOVERY - Host snapshot1009 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [14:15:59] PROBLEM - Check systemd state on ms-be1038 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:09] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1090 is OK: HTTP OK: HTTP/1.0 200 OK - 25825 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:16:09] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.0 200 OK - 26012 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:16:11] PROBLEM - Check systemd state on flerovium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:11] RECOVERY - Host mc1036 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:16:19] RECOVERY - Ensure traffic_exporter binds on port 9122 and responds to HTTP requests on cp1087 is OK: HTTP OK: HTTP/1.0 200 OK - 26029 bytes in 0.014 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:16:19] RECOVERY - Aggregate IPsec Tunnel Status eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:16:25] RECOVERY - SSH on cloudelastic1004 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:16:35] RECOVERY - Apache HTTP on mw1361 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.091 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:53] RECOVERY - Apache HTTP on mw1353 is OK: HTTP OK: HTTP/1.1 302 Found - 629 bytes in 0.090 second response time https://wikitech.wikimedia.org/wiki/Application_servers [14:16:55] RECOVERY - Host 2620:0:861:4:208:80:155:108 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms [14:16:57] RECOVERY - PyBal backends health check on lvs1015 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:16:57] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:17:01] PROBLEM - Check systemd state on elastic1064 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:17:01] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1089 is OK: HTTP OK: HTTP/1.1 200 Ok - 35308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:17:05] RECOVERY - Recursive DNS on 208.80.155.108 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [14:17:13] RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp1087 is OK: HTTP OK: HTTP/1.1 200 Ok - 35323 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:17:13] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:13] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:19] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:19] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:19] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:19] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [14:17:19] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:19] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1003 [14:17:25] RECOVERY - Varnish HTTP text-frontend - port 3120 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:17:29] RECOVERY - restbase endpoints health on restbase2021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:31] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:41] RECOVERY - Aggregate IPsec Tunnel Status codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/strongswan https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status [14:17:41] RECOVERY - aqs endpoints health on aqs1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:17:51] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:55] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:55] RECOVERY - Varnish HTTP upload-frontend - port 3120 on cp1090 is OK: HTTP OK: HTTP/1.1 200 OK - 414 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:17:55] RECOVERY - restbase endpoints health on restbase-dev1004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:17:57] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:05] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: 
Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:18:07] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:07] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1005 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1005 [14:18:09] RECOVERY - nova instance creation test on cloudcontrol1003 is OK: PROCS OK: 1 process with command name python, args nova-fullstack https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [14:18:15] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:15] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:15] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:15] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:25] RECOVERY - Varnish HTTP text-frontend - port 3122 on cp1089 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.000 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:18:25] RECOVERY - aqs endpoints health on aqs1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:18:25] RECOVERY - Varnish HTTP text-frontend - port 3127 on cp1087 is OK: HTTP OK: HTTP/1.1 200 OK - 413 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Varnish [14:18:27] 
RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:27] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:31] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:37] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [14:18:39] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:45] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:47] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:47] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:47] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:49] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [14:18:52] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:18:59] RECOVERY - PyBal IPVS diff check on lvs1015 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:19:11] PROBLEM - Bird Internet Routing Daemon on dns1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [14:19:17] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:20:03] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 593 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:20:07] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1002 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1002 [14:20:11] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [14:20:25] godog: --^ [14:20:26] PROBLEM - Kafka Broker Under Replicated Partitions on logstash1010 is CRITICAL: 219 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010 [14:20:27] PROBLEM - Druid middlemanager on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:20:43] PROBLEM - 
Check systemd state on wdqs1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:20:45] PROBLEM - Druid overlord on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:20:47] PROBLEM - Too many messages in kafka logging-eqiad #o11y on icinga1001 is CRITICAL: cluster=misc exported_cluster=logging-eqiad group={logstash,logstash-codfw,logstash7-codfw,logstash7-eqiad} instance=kafkamon1002 job=burrow partition={0,1,2,3,4,5} prometheus=ops site=eqiad topic={logback-info,rsyslog-err,rsyslog-info,rsyslog-notice,rsyslog-warning,udp_localhost-err,udp_localhost-info,udp_localhost-warning} https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All [14:20:49] PROBLEM - Druid broker on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:20:56] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1004 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1004 [14:20:59] PROBLEM - Check systemd state on druid1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:21:08] PROBLEM - Kafka Broker Server #page on logstash1012 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:21:11] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-jumbo1001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=jumbo-eqiad&var-kafka_broker=kafka-jumbo1001 [14:21:17] PROBLEM - Druid coordinator on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:21:29] elukey: sadly known :( centrallog1001 is one of the hosts affected [14:21:39] PROBLEM - Druid historical on druid1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:21:41] ah snap sorry for the ping [14:21:47] PROBLEM - Check systemd state on logstash1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:22:19] <_joe_> ferm seems to be failing almost everywhere [14:22:45] RECOVERY - Check systemd state on wdqs1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:23:13] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1386 days) https://wikitech.wikimedia.org/wiki/Logs [14:24:41] RECOVERY - Check systemd state on mw1371 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:25:14] I'm bouncing ferm on affected hosts, on e.g. mw1371 it was caused by prom1004 failing to resolve [14:25:16] RECOVERY - Kafka Broker Server #page on logstash1012 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:25:53] RECOVERY - Check systemd state on logstash1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:26:35] RECOVERY - Kafka Broker Under Replicated Partitions on logstash1010 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010 [14:26:57] RECOVERY - Check systemd state on es1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:25] RECOVERY - Check systemd state on es1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:27:31] PROBLEM - Host mc1033 is DOWN: PING CRITICAL - Packet loss = 100% [14:27:47] RECOVERY - Check systemd 
state on gerrit1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:15] <_joe_> uh [14:28:22] <_joe_> what happened to mc1033? [14:28:22] RECOVERY - Check systemd state on elastic1053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:52] RECOVERY - Druid overlord on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server overlord https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:28:55] RECOVERY - Check systemd state on restbase1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:28:55] RECOVERY - Druid broker on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server broker https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:28:57] RECOVERY - Check systemd state on elastic1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:03] RECOVERY - Check systemd state on stat1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:03] RECOVERY - Check systemd state on druid1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:15] RECOVERY - Check systemd state on elastic1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:19] RECOVERY - Check systemd state on elastic1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:29:21] RECOVERY - Druid coordinator on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server 
coordinator https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:29:27] can't even connect to mc1033's mgmt [14:29:39] RECOVERY - Host mc1033 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [14:29:41] RECOVERY - Druid historical on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server historical https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:29:54] moritzm: well, it's up now [14:30:01] yeah, it rebooted apparently [14:30:11] 0 uptime, checking logs [14:30:15] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review, 10User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (10Marostegui) @Papaul the host is depooled, we can power it off for you whenever you like [14:30:31] RECOVERY - Druid middlemanager on druid1003 is OK: PROCS OK: 1 process with command name java, args org.apache.druid.cli.Main server middleManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid [14:30:51] RECOVERY - Check systemd state on mw1372 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:30:51] RECOVERY - Check systemd state on mw1369 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:03] RECOVERY - Check systemd state on mw1400 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:11] RECOVERY - Check systemd state on mw1321 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:13] !log restarted ssh on mc1033 from console [14:31:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:17] RECOVERY - rsyslog TLS listener on port 6514 on centrallog2001 is OK: SSL OK - Certificate centrallog2001.codfw.wmnet valid until 2024-11-16 16:04:24 
+0000 (expires in 1530 days) https://wikitech.wikimedia.org/wiki/Logs [14:31:23] RECOVERY - Check systemd state on mw1333 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:29] RECOVERY - Check systemd state on mw1320 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:05] RECOVERY - Check systemd state on mw1269 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:13] per SEL mc1033 went down with "power supply unplugged" for PS1 [14:32:23] RECOVERY - Check systemd state on mw1287 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:31] RECOVERY - Check systemd state on mw1274 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:32:36] but those should be redundant I'd think? 
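The icinga-wm bot lines throughout this log share one fixed shape: `[HH:MM:SS] PROBLEM|RECOVERY - <check name> on <host> is <STATE>: <details> <runbook URL>`. A minimal parsing sketch of that shape (the regex is inferred from the lines above, not from icinga-wm's actual source; `Host X is UP/DOWN` lines use a different layout and are deliberately not matched here):

```python
import re

# Grammar inferred from the alert lines in this log:
#   [HH:MM:SS] PROBLEM|RECOVERY - <check> on <host> is <STATE>: <details>
# The non-greedy <check> plus the "is <STATE>" anchor lets check names that
# themselves contain "on" (e.g. "rsyslog TLS listener on port 6514") parse
# correctly, because backtracking keeps extending the check name until a
# "<host> is <STATE>" tail fits.
ALERT_RE = re.compile(
    r"\[(?P<time>\d{2}:\d{2}:\d{2})\]\s+"
    r"(?P<kind>PROBLEM|RECOVERY)\s+-\s+"
    r"(?P<check>.+?)\s+on\s+(?P<host>\S+)\s+is\s+"
    r"(?P<state>OK|WARNING|CRITICAL|UNKNOWN):\s*"
    r"(?P<details>.*)"
)

def parse_alert(line: str):
    """Return a dict of alert fields, or None for non-alert chatter."""
    m = ALERT_RE.search(line)
    return m.groupdict() if m else None
```

For example, feeding it the druid1003 middlemanager line from above yields `kind="PROBLEM"`, `check="Druid middlemanager"`, `host="druid1003"`, `state="CRITICAL"`.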
[14:32:49] moritzm: yes got rebooted [14:32:49] a reminder that conversation and work about the incident is happening in the wikimedia-sre channel [14:33:56] !log bouncing ferm on hosts where ferm.service failed due to DNS resolution issues for prometheus hosts [14:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:15] <_joe_> moritzm: I did it already for the mw* [14:34:47] RECOVERY - Check systemd state on an-druid1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:01] RECOVERY - Check systemd state on ms-be1038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:06] ack, there's a few more for swift and more I'm currently doing [14:36:15] RECOVERY - Check systemd state on ms-be1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:15] I am doing some analytics ones [14:36:57] RECOVERY - Check systemd state on ms-be1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:57] RECOVERY - Check systemd state on analytics1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:01] RECOVERY - Check systemd state on etcd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:07] RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:11] RECOVERY - Check systemd state on analytics1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:37:45] RECOVERY - Check systemd state on kafka-main1002 is OK:
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:07] RECOVERY - Check systemd state on flerovium is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:31] RECOVERY - Widespread puppet agent failures on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.00445 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:38:33] RECOVERY - Check systemd state on mc1029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:37] RECOVERY - Check systemd state on kafkamon1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:15] RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:19] RECOVERY - Check systemd state on db1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:27] RECOVERY - Check systemd state on kafka-jumbo1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:35] RECOVERY - Check systemd state on ores1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:39:45] RECOVERY - Check systemd state on db1143 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:40:45] RECOVERY - Check systemd state on kubernetes1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:01] RECOVERY - Check systemd state on db1137 is OK: OK - running: The system 
is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:06] (03Abandoned) 10Andrew Bogott: Revert "Nova/Neutron: set dhcp_domain to eqiad1.wikimedia.cloud" [puppet] - 10https://gerrit.wikimedia.org/r/625757 (owner: 10Andrew Bogott) [14:41:23] RECOVERY - Check systemd state on thanos-be1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:45] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [14:41:49] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:29] RECOVERY - Check systemd state on restbase1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:29] RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:42:51] RECOVERY - Check systemd state on dbproxy1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:25] RECOVERY - Check systemd state on cloudelastic1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:00] !log reboot asw2-d3-eqiad [14:44:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:01] RECOVERY - rsyslog in codfw is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=codfw+prometheus/ops [14:45:28] (03PS1) 10Jbond: icinga: add bblack alternate case for correct authorisation [puppet] - 10https://gerrit.wikimedia.org/r/625925 [14:45:44] !log restarting bacula-dir @ backup1001 [14:45:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:24] yeah, some ongoing full backups got errors, no big deal [14:46:36] (03CR) 10Jbond: [C: 03+2] icinga: add bblack alternate case for correct authorisation [puppet] - 10https://gerrit.wikimedia.org/r/625925 (owner: 10Jbond) [14:46:37] I will check all daemons look healthy, it needed a restart [14:50:06] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge (W)0.006 ge 0.0006357 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:53:33] (03CR) 10Elukey: [C: 03+2] sre.hadoop.roll-restart-masters.py: improve logging [cookbooks] - 10https://gerrit.wikimedia.org/r/625913 (owner: 10Elukey) [14:53:37] !log Reload dbproxy1016 to recover the alert [14:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:45] <_joe_> jouncebot: next [14:57:45] In 1 hour(s) and 2 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T1600) [15:02:56] !log request virtual-chassis vc-port set pic-slot 1 member 1 port 1 [15:02:56] !log request virtual-chassis vc-port set pic-slot 0 member 2 port 50 [15:02:56] !log request virtual-chassis vc-port set pic-slot 1 member 4 port 0 [15:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:00] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: name=kubernetes1004.* [15:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:06] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:42] !log jayme@cumin1001 conftool action : set/pooled=inactive; selector: service=kubesvc,name=kubernetes1013.* [15:03:44] RECOVERY - Host dbproxy1017 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [15:03:44] RECOVERY - Host pc1010 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [15:03:44] RECOVERY - Host scb1004 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:03:44] RECOVERY - Host wdqs1005 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:03:44] RECOVERY - Host kubernetes1004 is UP: PING WARNING - Packet loss = 75%, RTA = 0.25 ms [15:03:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:46] RECOVERY - Host sessionstore1003 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:03:46] RECOVERY - Host wtp1044 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms [15:03:46] RECOVERY - Host restbase1018 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [15:03:46] RECOVERY - Host rdb1006 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:03:47] RECOVERY - Host thorium is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [15:03:47] RECOVERY - Host stat1006 is UP: PING WARNING - Packet loss = 50%, RTA = 0.20 ms [15:03:48] RECOVERY - Host mw1363 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:03:48] RECOVERY - Host releases1002 is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms [15:03:48] RECOVERY - Host ores1007 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:03:49] RECOVERY - Host wtp1043 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:03:50] RECOVERY - Host mw1364 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:03:50] RECOVERY - Host mw1365 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [15:03:50] RECOVERY - Host wtp1045 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [15:03:51] RECOVERY - Host restbase1025 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [15:03:52] 
RECOVERY - Host kubernetes1013 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [15:03:58] RECOVERY - Host elastic1062 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [15:04:00] RECOVERY - Host maps1004 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [15:04:00] RECOVERY - Host elastic1063 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [15:04:00] RECOVERY - Host ganeti1019 is UP: PING OK - Packet loss = 0%, RTA = 0.63 ms [15:04:08] RECOVERY - Host aqs1009 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [15:04:12] RECOVERY - Host eventlog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.17 ms [15:04:12] RECOVERY - Host schema1004 is UP: PING OK - Packet loss = 0%, RTA = 1.43 ms [15:05:46] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:05:54] RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0 https://wikitech.wikimedia.org/wiki/HAProxy [15:06:20] RECOVERY - IPMI Sensor Status on pc1010 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:07:32] 04Critical Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Emergency syslog message [15:07:48] PROBLEM - Check systemd state on wdqs1005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:07:56] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.02056 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:08:24] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:08:44] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:08:44] RECOVERY - BGP status on cr1-eqiad is OK: BGP OK - up: 60, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:08:50] RECOVERY - Bird Internet Routing Daemon on dns1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running [15:08:54] RECOVERY - BGP status on cr2-eqiad is OK: BGP OK - up: 250, down: 5, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:08:56] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:10:22] !log Start mysql on db1106 after PDU maintenance is done [15:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:32] Critical (cleared): Device asw2-d-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message [15:13:10] !log rolling restart of elk5 logstashes [15:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:36] !log repool cp1087-90 (eqiad row D) [15:14:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:52] nice :) [15:14:58] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp108[789].eqiad.wmnet [15:14:59] !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp1090.eqiad.wmnet [15:15:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:20] <_joe_> !log starting wdqs-updater on wdqs1005 [15:16:24] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:40] (03PS1) 10Jbond: bird: ensure bird service is running [puppet] - 10https://gerrit.wikimedia.org/r/625926 [15:17:34] RECOVERY - Check systemd state on wdqs1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:16] PROBLEM - ores_workers_running on ores1007 is CRITICAL: PROCS CRITICAL: 0 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [15:18:32] PROBLEM - Check systemd state on ores1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:18:34] PROBLEM - MariaDB Replica Lag: pc1 on pc1010 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 4819.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:18:49] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-masters [15:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:55] checking pc1010, I am sure it is expected [15:19:17] (03CR) 10Jbond: "ready for review - PCC: https://puppet-compiler.wmflabs.org/compiler1002/24994/" [puppet] - 10https://gerrit.wikimedia.org/r/625926 (owner: 10Jbond) [15:19:31] <_joe_> !log restarted ferm on wdqs1011 [15:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:49] pc1010 recovering nicely, catching up from pc1007 [15:19:59] I will ack it [15:20:48] <_joe_> !log restarted celery-ores-worker.service on ores1007 [15:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:26] trying to remove noise right now to focus on possible leftovers [15:22:12] RECOVERY - Check systemd state on ores1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:48] RECOVERY - ores_workers_running on 
ores1007 is OK: PROCS OK: 91 processes with command name celery https://wikitech.wikimedia.org/wiki/ORES [15:26:38] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99) [15:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:53] (03PS5) 10Ottomata: Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) [15:30:03] 10Operations, 10Advanced-Search, 10Discovery-Search, 10Traffic, and 2 others: Strange URL pattern after search https://en.wikipedia.org/w/index.php?sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance ... - https://phabricator.wikimedia.org/T243884 (10jcrespo) I am CCin... [15:30:03] !log roll restart of hadoop master daemons on an-master100[1,2] after the cookbook failed [15:30:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:32] (03CR) 10jerkins-bot: [V: 04-1] Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:31:44] (03CR) 10Muehlenhoff: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/24995/" [puppet] - 10https://gerrit.wikimedia.org/r/625890 (owner: 10Muehlenhoff) [15:32:30] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: service=kubesvc,name=kubernetes1013.* [15:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:08] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (10Jclark-ctr) [15:33:31] 10Operations, 10ops-eqiad, 10DBA, 10DC-Ops: Tue, Sept 8 PDU Upgrade 12pm-4pm UTC- Racks D3 and D4 - https://phabricator.wikimedia.org/T261452 (10Jclark-ctr) [15:34:34] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: (C)0.01 ge 
(W)0.006 ge 0.002491 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [15:34:43] !log jayme@cumin1001 conftool action : set/pooled=yes; selector: name=kubernetes1004.* [15:34:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:03] (03PS6) 10Hnowlan: api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) [15:39:46] (03CR) 10Giuseppe Lavagetto: [C: 04-1] Add $site parameter to wmflib::service:get_url (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:43:44] 10Operations, 10netops: Audit Juniper EX snapshots version - https://phabricator.wikimedia.org/T262290 (10ayounsi) p:05Triage→03High [15:45:47] (03CR) 10Ebernhardson: [C: 03+1] Multiple instances of msearch_daemon [puppet] - 10https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: 10ZPapierski) [15:47:58] (03PS6) 10Ottomata: Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) [15:48:34] (03CR) 10jerkins-bot: [V: 04-1] Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:49:23] (03PS1) 10Jbond: pki: drop ocsp vhost and serve over http [puppet] - 10https://gerrit.wikimedia.org/r/625929 (https://phabricator.wikimedia.org/T259117) [15:49:45] (03PS7) 10Ottomata: Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) [15:51:59] (03CR) 10JMeybohm: [C: 03+1] Allow restbase https from the default policy too [deployment-charts] - 10https://gerrit.wikimedia.org/r/625900 (owner: 10Giuseppe Lavagetto) [15:52:38] (03CR) 10Ayounsi: "From 
https://wikitech.wikimedia.org/wiki/Anycast" [puppet] - 10https://gerrit.wikimedia.org/r/625926 (owner: 10Jbond) [15:53:42] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/624732 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:54:07] (03PS1) 10ArielGlenn: add a sample job for illustration purposes [dumps] - 10https://gerrit.wikimedia.org/r/625930 [15:55:08] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/24997/" [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:56:00] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM but check the puppet compiler ofc." [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:58:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:58:08] (03CR) 10Jbond: [C: 04-1] "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/625926 (owner: 10Jbond) [15:58:41] (03CR) 10Ottomata: [C: 03+2] Add $site parameter to wmflib::service:get_url [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:58:53] (03CR) 10Ottomata: [C: 03+2] Add $site parameter to wmflib::service:get_url (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/624092 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:59:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] Allow restbase https from the default policy too [deployment-charts] - 10https://gerrit.wikimedia.org/r/625900 (owner: 10Giuseppe Lavagetto) [16:00:05] jbond42 and cdanis: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T1600). 
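The PROBLEM/RECOVERY pairs scattered through this log (the purged queue alerts at the top, the wdqs1005 and ores1007 systemd alerts above) can be matched up to measure time-to-recovery. A small sketch, assuming events have already been parsed into `(time, kind, key)` tuples, arrive in timestamp order, and that at most one alert is open per key at a time (in practice `key` would be a `(check, host)` pair):

```python
from datetime import datetime, timedelta

def outage_durations(events):
    """Pair each PROBLEM with the next RECOVERY for the same key.

    events: iterable of (time 'HH:MM:SS', kind 'PROBLEM'|'RECOVERY', key).
    Returns {key: seconds}. Timestamps carry no date, so a negative delta
    is assumed to mean the alert crossed midnight once.
    """
    open_at = {}      # key -> datetime of the first unresolved PROBLEM
    durations = {}    # key -> seconds from PROBLEM to RECOVERY
    for t, kind, key in events:
        ts = datetime.strptime(t, "%H:%M:%S")
        if kind == "PROBLEM":
            open_at.setdefault(key, ts)   # keep the earliest open PROBLEM
        elif kind == "RECOVERY" and key in open_at:
            delta = ts - open_at.pop(key)
            if delta < timedelta(0):      # crossed midnight
                delta += timedelta(days=1)
            durations[key] = int(delta.total_seconds())
    return durations
```

Applied to the purged alerts at the top of this log, cp5009 (PROBLEM 00:10:25, RECOVERY 00:14:13) comes out at 228 seconds and cp3050 (00:11:31 to 00:15:21) at 230.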
[16:00:17] (Merged) jenkins-bot: Allow restbase https from the default policy too [deployment-charts] - https://gerrit.wikimedia.org/r/625900 (owner: Giuseppe Lavagetto)
[16:02:19] !log akosiaris@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[16:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:02:51] (PS1) Elukey: sre.hadoop.roll-restart-masters.py: improve logging and sleep times [cookbooks] - https://gerrit.wikimedia.org/r/625932
[16:03:08] (CR) Jbond: [C: +2] pki: drop ocsp vhost and serve over http [puppet] - https://gerrit.wikimedia.org/r/625929 (https://phabricator.wikimedia.org/T259117) (owner: Jbond)
[16:03:13] Operations, Advanced-Search, Discovery-Search, Traffic, and 2 others: Strange URL pattern after search https://en.wikipedia.org/w/index.php?sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance ... - https://phabricator.wikimedia.org/T243884 (Jdlrobson) It only...
[16:03:13] !log akosiaris@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[16:03:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:03:46] !log akosiaris@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'kube-system' for release 'calico-policy-controller' .
[16:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:04] Operations, Wikidata, Wikidata-Query-Service, User-Smalyshev, cloud-services-team (Kanban): Provide a way to have test servers on real hardware, isolated from production for Wikidata Query Service - https://phabricator.wikimedia.org/T206636 (dcausse) t206636 and wcqs-beta-01 are behind the se...
[16:04:12] Operations, fundraising-tech-ops, netops, observability: update nagios_nsca configuration in frack for new nsca servers - https://phabricator.wikimedia.org/T262291 (Jgreen)
[16:08:17] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Papaul) Dell mentioned that it is something to do with the OS and requested the sosreport. since we can not share that information with them i...
[16:10:52] (PS1) Cwhite: parse_service_problem doesn't need instance-local data move parse_service problem to global function and have am import it clean up imports [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625934
[16:11:42] (CR) Elukey: [C: +2] sre.hadoop.roll-restart-masters.py: improve logging and sleep times [cookbooks] - https://gerrit.wikimedia.org/r/625932 (owner: Elukey)
[16:11:49] (CR) Cwhite: Add Icinga AM client (1 comment) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625661 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[16:12:03] !log 1.36.0-wmf.8 was branched at e81e81e91473cc8259c473165863aca8ecea2784 for T257976
[16:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:11] T257976: 1.36.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T257976
[16:12:48] (CR) Jeena Huneidi: [C: +2] Branch commit for wmf/1.36.0-wmf.8 [core] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/625745 (https://phabricator.wikimedia.org/T257976) (owner: TrainBranchBot)
[16:15:42] (PS1) Giuseppe Lavagetto: cxserver: enable the service proxy in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625935 (https://phabricator.wikimedia.org/T255879)
[16:15:44] (PS1) Giuseppe Lavagetto: mobileapps: make template for the restbase uri configurable [deployment-charts] - https://gerrit.wikimedia.org/r/625936 (https://phabricator.wikimedia.org/T255876)
[16:15:46] (PS1) Giuseppe Lavagetto: mobileapps: use the service proxy for all calls in staging [deployment-charts] - https://gerrit.wikimedia.org/r/625937 (https://phabricator.wikimedia.org/T255876)
[16:16:37] Operations, ops-codfw, DBA, Patch-For-Review, User-Kormat: db2125 crashed - mgmt iface also not available - https://phabricator.wikimedia.org/T260670 (Marostegui) @Papaul there is nothing really on the OS that we've seen that could cause these crashes. What we did on both crashes is the same:...
[16:22:46] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul)
[16:24:33] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul) @Gehel are you the right person for those servers? If yes i need to know what hardware raid type we are going to use. Thanks
[16:24:42] (PS2) Filippo Giunchedi: Add Icinga AM client [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625661 (https://phabricator.wikimedia.org/T258948)
[16:25:03] (CR) Filippo Giunchedi: Add Icinga AM client (1 comment) [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625661 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi)
[16:25:51] (PS19) Ottomata: Canary events refinery job [puppet] - https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609)
[16:25:53] (PS2) Ottomata: wgEventStreams - Set canary_events_enabled: true for eventgate test streams [mediawiki-config] - https://gerrit.wikimedia.org/r/622876 (https://phabricator.wikimedia.org/T251609)
[16:26:05] (CR) Filippo Giunchedi: [C: +1] "LGTM, eventually sending out problems will live in am.py exclusively (i.e. parse_service_problem)" [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625934 (owner: Cwhite)
[16:26:53] (CR) jerkins-bot: [V: -1] Canary events refinery job [puppet] - https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[16:28:02] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 2.861e+04 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:28:13] (PS20) Ottomata: Canary events refinery job [puppet] - https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609)
[16:28:37] (PS3) Ottomata: wgEventStreams - Set canary_events_enabled: true for eventgate test streams [mediawiki-config] - https://gerrit.wikimedia.org/r/622876 (https://phabricator.wikimedia.org/T251609)
[16:31:54] (Merged) jenkins-bot: Branch commit for wmf/1.36.0-wmf.8 [core] (wmf/1.36.0-wmf.8) - https://gerrit.wikimedia.org/r/625745 (https://phabricator.wikimedia.org/T257976) (owner: TrainBranchBot)
[16:32:23] 40K memcache errors/minute on eqiad
[16:32:40] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?orgId=1&var-datasource=eqiad%20prometheus%2Fops&viewPanel=1
[16:33:08] (CR) Ottomata: [C: +2] wgEventStreams - Set canary_events_enabled: true for eventgate test streams [mediawiki-config] - https://gerrit.wikimedia.org/r/622876 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[16:33:28] (PS21) Ottomata: Canary events refinery job [puppet] - https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609)
[16:34:13] !log increased elk5 logstash JVM heaps to 2g (to help decrease kafka-logging consumer lag)
[16:34:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:34:38] maybe just backlog being counted now?
[16:35:03] logstash seems to have 2 hours of delay
[16:35:08] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:35:31] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: wgEventStreams: Set canary_events_enabled: true for eventgate test streams and eventlogging_Test - T251609 (duration: 00m 58s)
[16:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:35:41] T251609: Automate ingestion and refinement into Hive of event data from Kafka using stream configs and canary/heartbeat events - https://phabricator.wikimedia.org/T251609
[16:35:59] jynus: yeah logstash is lagging and catching up now
[16:36:13] so some rate-based alerts are only alerting now
[16:39:26] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 815.7 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:39:42] at least it spits out the recoveries pretty fast too
[16:40:25] I wonder if there is any theoretical way to avoid or mitigate this when a network partition happens?
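(Editorial note: the elk5 heap bump and the "Too many messages in kafka logging-eqiad" alert later in this log both revolve around consumer lag: how far the logstash consumer group's committed offsets trail the partitions' log-end offsets. A toy computation of that quantity, with partition numbers and offsets invented for illustration; in production the same difference is exported by a Kafka lag exporter and graphed on the kafka-consumer-lag dashboard, not computed client-side like this.)

```python
def consumer_lag(end_offsets, committed):
    """Per-partition lag: log-end offset minus the consumer
    group's committed offset (0 if nothing committed yet).
    A backlog like the 2-hour logstash delay above shows up
    as a large, slowly draining value here."""
    return {
        partition: end_offsets[partition] - committed.get(partition, 0)
        for partition in end_offsets
    }
```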
[16:46:17] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (Papaul)
[16:46:34] jynus: yeah it is a WIP, "mitigation" in this case will be to calculate the metrics from elasticsearch queries, as opposed to logstash so the metrics will reflect what's been indexed as opposed to what's been ingested ATM
[16:48:13] T256418 that is
[16:48:14] T256418: Evaluate alternative to Logstash StatsD outputs - https://phabricator.wikimedia.org/T256418
[16:48:34] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul)
[16:50:01] (CR) Cwhite: "> Patch Set 1: Code-Review+1" [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625934 (owner: Cwhite)
[16:52:14] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[16:55:21] Operations, ops-eqiad, DC-Ops, netops: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (RobH) > A-Side Information > Customer > WIKIMEDIA FOUNDATION INC. > IBX > DC6 > Cage > DC6:01:061130 > Cabinet > 0000 > Space ID > DC6:01:061130 > System Name > DC6:1:61130:WIKIMEDIA FOUNDA...
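(Editorial note: the T256418 mitigation cwhite describes above, counting what Elasticsearch has actually indexed instead of incrementing StatsD counters at ingest time, can be sketched roughly as below. This is a hypothetical illustration, not the actual exporter: the `@timestamp`/`level` field names are assumptions, and the real thing would issue a date_histogram aggregation against Elasticsearch rather than iterate in Python.)

```python
def errors_per_minute(events):
    """Count level=ERROR log events per minute bucket.

    Because the count is taken over *indexed* documents, a
    two-hour Logstash ingest backlog lands in old buckets
    instead of looking like a sudden spike in the current
    minute -- the false-alert failure mode discussed above.
    """
    buckets = {}
    for event in events:
        if event.get("level") != "ERROR":
            continue
        minute = event["@timestamp"][:16]  # "YYYY-MM-DDTHH:MM"
        buckets[minute] = buckets.get(minute, 0) + 1
    return buckets
```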
[16:56:42] RECOVERY - MariaDB Replica Lag: pc1 on pc1010 is OK: OK slave_sql_lag Replication lag: 22.71 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[16:58:49] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (Papaul)
[16:59:31] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul)
[17:00:04] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T1700).
[17:01:35] (PS1) Hnowlan: api-proxy: Set password for ratelimit [deployment-charts] - https://gerrit.wikimedia.org/r/625942 (https://phabricator.wikimedia.org/T235277)
[17:02:29] (CR) jerkins-bot: [V: -1] api-proxy: Set password for ratelimit [deployment-charts] - https://gerrit.wikimedia.org/r/625942 (https://phabricator.wikimedia.org/T235277) (owner: Hnowlan)
[17:03:28] !log attempted to add rock-dkms_3.3-19_all.deb to thirdparty/amd-rocm33 for use on analytics servers with GPUs
[17:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:13] (PS2) Hnowlan: api-proxy: Set password for ratelimit [deployment-charts] - https://gerrit.wikimedia.org/r/625942 (https://phabricator.wikimedia.org/T235277)
[17:09:28] Operations, fundraising-tech-ops, netops, observability: update nagios_nsca configuration in frack for new nsca servers - https://phabricator.wikimedia.org/T262291 (Jgreen)
[17:18:42] PROBLEM - ensure kvm processes are running on cloudvirt1033 is CRITICAL: PROCS CRITICAL: 0 processes with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:22:47] RECOVERY - Too many messages in kafka logging-eqiad #o11y on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Logstash%23Kafka_consumer_lag https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?from=now-3h&to=now&orgId=1&var-datasource=thanos&var-cluster=logging-eqiad&var-topic=All&var-consumer_group=All
[17:23:00] Operations, serviceops, User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (jijiki)
[17:23:02] !log rebooting cloudvirt1033
[17:23:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:28:07] RECOVERY - ensure kvm processes are running on cloudvirt1033 is OK: PROCS OK: 1 process with regex args qemu-system-x86_64 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[17:28:33] Operations, ops-eqiad, DC-Ops, netops: patch new cross-connect - https://phabricator.wikimedia.org/T261791 (RobH)
[17:36:32] Operations, serviceops, User-jijiki: Reproduce opcache corruptions in production - https://phabricator.wikimedia.org/T261009 (jijiki)
[17:37:32] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul)
[17:45:40] (PS1) Bstorm: cloudceph: Add cpufreq tools to set cpu governor [puppet] - https://gerrit.wikimedia.org/r/625947
[17:47:09] Stealing the services window, deploying a security fix
[17:50:32] (PS4) Effie Mouzeli: php::admin: export additional opcache metrics [puppet] - https://gerrit.wikimedia.org/r/625224 (https://phabricator.wikimedia.org/T261009)
[17:50:39] (PS5) Effie Mouzeli: php::admin: export additional opcache metrics [puppet] - https://gerrit.wikimedia.org/r/625224 (https://phabricator.wikimedia.org/T261009)
[17:53:19] Operations, Advanced-Search, Discovery-Search, Traffic, and 2 others: Strange URL pattern after search https://en.wikipedia.org/w/index.php?sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance&sort=relevance ... - https://phabricator.wikimedia.org/T243884 (Amorymeltzer) That...
[17:53:37] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[17:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:54:25] !log Deployed patch for T262240
[17:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:08] (PS1) Jeena Huneidi: Increase the DPL cache time from 1 day to 7 days [mediawiki-config] - https://gerrit.wikimedia.org/r/625950 (https://phabricator.wikimedia.org/T262240)
[17:57:10] (CR) Jeena Huneidi: [C: +2] Increase the DPL cache time from 1 day to 7 days [mediawiki-config] - https://gerrit.wikimedia.org/r/625950 (https://phabricator.wikimedia.org/T262240) (owner: Jeena Huneidi)
[17:57:12] (PS1) Jeena Huneidi: testwikis wikis to 1.36.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/625951
[17:57:14] (CR) Jeena Huneidi: [C: +2] testwikis wikis to 1.36.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/625951 (owner: Jeena Huneidi)
[17:57:45] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul)
[17:57:52] !log ryankemper@cumin1001 END (ERROR) - Cookbook sre.wdqs.data-reload (exit_code=97)
[17:57:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:57:58] (Merged) jenkins-bot: Increase the DPL cache time from 1 day to 7 days [mediawiki-config] - https://gerrit.wikimedia.org/r/625950 (https://phabricator.wikimedia.org/T262240) (owner: Jeena Huneidi)
[17:58:02] (Merged) jenkins-bot: testwikis wikis to 1.36.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/625951 (owner: Jeena Huneidi)
[17:58:24] ^ manual ctrl+c'd that cookbook since I'd forgotten to run it inside a tmux session
[17:58:30] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.data-reload
[17:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:59:03] ryankemper: you can add a call to https://doc.wikimedia.org/spicerack/master/api/spicerack.interactive.html#spicerack.interactive.ensure_shell_is_durable
[17:59:12] to the cookbook if that's needed
[18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T1800)
[18:00:04] needed as in it's a long operation that should be run in a durable session (tmux/screen)
[18:00:20] volans: neat! yup this is a long-running operation so I will definitely add that, thanks for the tip
[18:00:29] !log jhuneidi@deploy1001 Started scap: testwikis wikis to 1.36.0-wmf.8
[18:00:30] np :)
[18:00:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:04:42] (CR) Elukey: [C: +2] Multiple instances of msearch_daemon [puppet] - https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: ZPapierski)
[18:15:22] PROBLEM - Widespread puppet agent failures- no resources reported on icinga1001 is CRITICAL: 0.0137 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[18:17:07] (CR) Ppchelko: [C: +1] "gosh this thing is annoying. Trying to survive armageddon" [deployment-charts] - https://gerrit.wikimedia.org/r/625942 (https://phabricator.wikimedia.org/T235277) (owner: Hnowlan)
[18:19:19] (CR) Andrew Bogott: [C: +2] Convert wmcs-novastats scripts to python3 [puppet] - https://gerrit.wikimedia.org/r/624805 (https://phabricator.wikimedia.org/T218426) (owner: Nskaggs)
[18:21:10] (CR) Andrew Bogott: [C: +2] OpenStack nova: increase live_migration_completion_timeout [puppet] - https://gerrit.wikimedia.org/r/625943 (owner: Andrew Bogott)
[18:22:35] !log rm /srv/prometheus/ops/targets/mjolnir_msearch_eqiad.yaml on prometheus100[3,4] as cleanup after https://gerrit.wikimedia.org/r/621988 - T260305
[18:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:22:42] T260305: mjolnir-kafka-msearch-daemon dropping produced messages after move to search-loader[12]001 - https://phabricator.wikimedia.org/T260305
[18:24:37] (PS2) Andrew Bogott: OpenStack nova: increase live_migration_completion_timeout [puppet] - https://gerrit.wikimedia.org/r/625943
[18:24:39] (PS5) Andrew Bogott: Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081)
[18:26:04] (PS6) Andrew Bogott: Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081)
[18:27:57] (CR) Andrew Bogott: [C: +2] Update wmcs-novastats-capacity.py [puppet] - https://gerrit.wikimedia.org/r/625772 (https://phabricator.wikimedia.org/T262081) (owner: Andrew Bogott)
[18:28:01] (PS1) Elukey: mjolnir: fix syslog identifier in the msearch systemd unit template [puppet] - https://gerrit.wikimedia.org/r/625952 (https://phabricator.wikimedia.org/T260305)
[18:29:44] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (RKemper)
[18:30:26] (CR) Alex Paskulin: [C: +1] "Reviewed from a requirements perspective and looks good! Thanks, Hugh!" [mediawiki-config] - https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) (owner: Hnowlan)
[18:32:07] (CR) Elukey: [C: +2] mjolnir: fix syslog identifier in the msearch systemd unit template [puppet] - https://gerrit.wikimedia.org/r/625952 (https://phabricator.wikimedia.org/T260305) (owner: Elukey)
[18:35:45] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (RKemper) @Papaul We'd like to use RAID10 as our hardware RAID. (There's a bit of context [[ https://phabricator.wikimedia.org/T257950#6338944 | here ]] wh...
[18:48:56] (PS1) Elukey: mjolnir: fix syslog identifier of the msearch instances [puppet] - https://gerrit.wikimedia.org/r/625954 (https://phabricator.wikimedia.org/T260305)
[18:50:14] (CR) Elukey: [C: +2] mjolnir: fix syslog identifier of the msearch instances [puppet] - https://gerrit.wikimedia.org/r/625954 (https://phabricator.wikimedia.org/T260305) (owner: Elukey)
[18:57:34] (CR) Bstorm: "When this restarts via exec in puppet, it should set the governor. You can validate that with the command it uses to check (cpufreq-info -" [puppet] - https://gerrit.wikimedia.org/r/625947 (owner: Bstorm)
[19:00:05] longma and liw: That opportune time is upon us again. Time for a Mediawiki train - American+European Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T1900).
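(Editorial note: the spicerack helper volans points ryankemper to above guards long-running cookbooks like the wdqs data-reload against exactly the mistake logged at 17:58:24. A minimal sketch of the idea, taking the environment mapping as a parameter for testability; the real `ensure_shell_is_durable()` raises an exception rather than returning a bool and may perform additional checks, so treat this as an approximation, not its implementation.)

```python
def shell_is_durable(env) -> bool:
    """Heuristic: is this shell inside a tmux/screen session?

    screen sets STY, tmux sets TMUX, and TERM is usually a
    screen* variant inside either. A cookbook would call the
    real spicerack helper at the top of run() so a plain-SSH
    invocation fails immediately instead of dying mid-reload
    when the connection drops."""
    return bool(
        env.get("STY")
        or env.get("TMUX")
        or "screen" in env.get("TERM", "")
    )
```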
[19:00:24] Still deploying to testwikis
[19:09:47] Operations, Discovery, Discovery-Search, Elasticsearch, good first task: Improve Elasticsearch icinga alerting - https://phabricator.wikimedia.org/T133844 (Gehel) Open→Declined
[19:12:15] !log jhuneidi@deploy1001 Finished scap: testwikis wikis to 1.36.0-wmf.8 (duration: 71m 45s)
[19:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:16:01] Deploying 1.36.0-wmf.8 to group0
[19:17:31] (PS1) Jeena Huneidi: group0 wikis to 1.36.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/625960
[19:17:33] (CR) Jeena Huneidi: [C: +2] group0 wikis to 1.36.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/625960 (owner: Jeena Huneidi)
[19:18:14] (Merged) jenkins-bot: group0 wikis to 1.36.0-wmf.8 [mediawiki-config] - https://gerrit.wikimedia.org/r/625960 (owner: Jeena Huneidi)
[19:19:40] !log jhuneidi@deploy1001 rebuilt and synchronized wikiversions files: group0 wikis to 1.36.0-wmf.8
[19:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:30:30] Operations, MediaWiki-Uploading, SRE-swift-storage, Structured Data Engineering, and 2 others: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (Krinkle)
[19:30:42] Operations, MediaWiki-Uploading, SRE-swift-storage, Structured Data Engineering, and 2 others: API uploads fatal with UploadChunkFileException: Error storing file in '/tmp' backend-fail-internal - https://phabricator.wikimedia.org/T228292 (Krinkle) This is a 1y+ production error still waiting to...
[19:37:12] (PS1) Catrope: Enable and configure GrowthExperiments on plwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/625963 (https://phabricator.wikimedia.org/T254239)
[19:39:29] Operations, LDAP-Access-Requests: LDAP access to the nda group for Michael Raish - https://phabricator.wikimedia.org/T262316 (Mraish)
[19:41:01] PROBLEM - SSH on wtp1047.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:46:58] (CR) Ottomata: "Looks good:" [puppet] - https://gerrit.wikimedia.org/r/624168 (https://phabricator.wikimedia.org/T251609) (owner: Ottomata)
[19:55:47] (PS1) Ottomata: eventgate-logging-external - set cors: '*' [deployment-charts] - https://gerrit.wikimedia.org/r/625965 (https://phabricator.wikimedia.org/T262087)
[19:58:00] (PS2) Ottomata: eventgate-logging-external - set cors: '*' [deployment-charts] - https://gerrit.wikimedia.org/r/625965 (https://phabricator.wikimedia.org/T262087)
[20:10:09] (PS2) Cwhite: parse_service_problem doesn't need instance-local data move parse_service problem to global function and have am import it clean up imports [debs/prometheus-icinga-exporter] - https://gerrit.wikimedia.org/r/625934
[20:14:21] (Abandoned) Cwhite: prometheus: add apache2 es-exporter config [puppet] - https://gerrit.wikimedia.org/r/621597 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite)
[20:26:40] (PS3) Ottomata: eventgate-logging-external - set cors: '*' [deployment-charts] - https://gerrit.wikimedia.org/r/625965 (https://phabricator.wikimedia.org/T262087)
[20:33:19] (CR) CRusnov: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/624733 (https://phabricator.wikimedia.org/T247364) (owner: CRusnov)
[20:34:30] (PS1) Andrew Bogott: labspuppetbackend: support requests for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625969 (https://phabricator.wikimedia.org/T260614)
[20:34:54] (CR) jerkins-bot: [V: -1] labspuppetbackend: support requests for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625969 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[20:35:52] (PS2) Andrew Bogott: labspuppetbackend: support requests for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625969 (https://phabricator.wikimedia.org/T260614)
[20:41:52] RECOVERY - SSH on wtp1047.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[20:45:38] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-09-15) rack/setup/install db2141 (or next in sequence) - https://phabricator.wikimedia.org/T260819 (Papaul)
[20:49:41] (CR) BryanDavis: labspuppetbackend: support requests for VMs/prefixes under .wmflabs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/625969 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[21:12:06] (PS3) Andrew Bogott: labspuppetbackend: support requests for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625969 (https://phabricator.wikimedia.org/T260614)
[21:20:42] (PS1) Cwhite: profile: remove usage of logstash statsd outputs [puppet] - https://gerrit.wikimedia.org/r/625975 (https://phabricator.wikimedia.org/T256418)
[21:37:45] (CR) Andrew Bogott: [C: +2] labspuppetbackend: support requests for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625969 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[21:40:37] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul)
[21:47:04] (PS1) Bstorm: tools-grid: Install correct version of php-igbinary [puppet] - https://gerrit.wikimedia.org/r/625979 (https://phabricator.wikimedia.org/T262186)
[21:49:39] (PS1) Andrew Bogott: labspuppetbackend: fix support for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625980 (https://phabricator.wikimedia.org/T260614)
[21:50:22] (CR) Andrew Bogott: [C: +2] labspuppetbackend: fix support for VMs/prefixes under .wmflabs [puppet] - https://gerrit.wikimedia.org/r/625980 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[21:55:09] (PS1) Cwhite: profile: update alerts on mediawiki logs [puppet] - https://gerrit.wikimedia.org/r/625982 (https://phabricator.wikimedia.org/T256418)
[21:55:22] (CR) BryanDavis: [C: +1] "This is a flashback to https://phabricator.wikimedia.org/T213666#4893465." [puppet] - https://gerrit.wikimedia.org/r/625979 (https://phabricator.wikimedia.org/T262186) (owner: Bstorm)
[21:56:30] (CR) Bstorm: "> Patch Set 1: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/625979 (https://phabricator.wikimedia.org/T262186) (owner: Bstorm)
[21:56:36] (CR) Bstorm: [C: +2] tools-grid: Install correct version of php-igbinary [puppet] - https://gerrit.wikimedia.org/r/625979 (https://phabricator.wikimedia.org/T262186) (owner: Bstorm)
[21:57:22] !log andrew@deploy1001 Started deploy [horizon/deploy@7a3221d]: refreshing to clobber local hacks
[21:57:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:34] !log andrew@deploy1001 Finished deploy [horizon/deploy@7a3221d]: refreshing to clobber local hacks (duration: 00m 13s)
[21:57:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:02:30] (PS1) Cwhite: prometheus: update mediawiki query timestamp filter [puppet] - https://gerrit.wikimedia.org/r/625984 (https://phabricator.wikimedia.org/T256418)
[22:02:39] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[22:02:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:38] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul) @RKemper thank you for the info. What stripe size for the RAID 10?
[22:08:31] !log andrew@deploy1001 Started deploy [horizon/deploy@7d727eb]: very minor wmf-puppet-dashboard update
[22:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:12:06] !log andrew@deploy1001 Finished deploy [horizon/deploy@7d727eb]: very minor wmf-puppet-dashboard update (duration: 03m 35s)
[22:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:14:34] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:14:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:21:32] Operations, ops-codfw, DC-Ops, Maps: (Need By: TBD) rack/setup/install maps20[05-10].codfw.wmnet - https://phabricator.wikimedia.org/T260271 (Papaul)
[22:34:19] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul)
[22:39:48] (PS1) Andrew Bogott: labspuppetbackend: rearrange args to re.sub [puppet] - https://gerrit.wikimedia.org/r/625992 (https://phabricator.wikimedia.org/T260614)
[22:40:44] (CR) BryanDavis: [C: +1] labspuppetbackend: rearrange args to re.sub [puppet] - https://gerrit.wikimedia.org/r/625992 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[22:40:47] (CR) Andrew Bogott: [C: +2] labspuppetbackend: rearrange args to re.sub [puppet] - https://gerrit.wikimedia.org/r/625992 (https://phabricator.wikimedia.org/T260614) (owner: Andrew Bogott)
[22:59:37] (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/621967 (https://phabricator.wikimedia.org/T252990) (owner: Southparkfan)
[23:00:04] RoanKattouw, Niharika, and Urbanecm: My dear minions, it's time we take the moon! Just kidding. Time for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200908T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:05:36] PROBLEM - SSH on wdqs1005.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:08:30] (PS1) Cwhite: update debian files to handle new prometheus-icinga-am service [debs/prometheus-icinga-exporter] (debian/sid) - https://gerrit.wikimedia.org/r/626001
[23:11:50] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:15:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:44:32] PROBLEM - SSH on wtp1047.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:46:51] jouncebot: refresh
[23:46:53] I refreshed my knowledge about deployments.
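(Editorial note: the "rearrange args to re.sub" patch above points at a classic Python pitfall: `re.sub` takes `(pattern, repl, string)`, and because `repl` and `string` are both strings, swapping them is accepted silently and simply produces the wrong result. A hypothetical illustration; the hostname and substitution are invented, not taken from the actual labspuppetbackend patch.)

```python
import re

host = "proxy-01.project.wmflabs"

# Intended: strip the ".wmflabs" suffix from the hostname.
right = re.sub(r"\.wmflabs$", "", host)

# repl/string swapped: the pattern is applied to "" instead of
# host, so this silently returns "" with no error raised --
# exactly the kind of bug a jerkins-bot run won't flag unless
# a test exercises the path.
wrong = re.sub(r"\.wmflabs$", host, "")
```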
[23:48:02] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[23:57:34] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops