[00:09:23] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[00:10:13] (CR) Krinkle: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[00:10:19] night night, James_F
[00:12:35] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[01:18:19] !log ayounsi@deploy1001 Started deploy [netbox/deploy@367ca84]: test
[01:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:21] !log ayounsi@deploy1001 Finished deploy [netbox/deploy@367ca84]: test (duration: 00m 02s)
[01:18:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:28] Operations, Wikimedia-Mailing-lists: Please create engprod@lists.wikimedia.org - https://phabricator.wikimedia.org/T232177 (greg)
[01:24:39] Operations, Wikimedia-Mailing-lists: Please create new team mailing list - https://phabricator.wikimedia.org/T232178 (Jrbranaa)
[02:04:21] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:05:57] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0]
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:18:33] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:20:07] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:48:51] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 29194632 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:51:59] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17429104 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:51:59] PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17468672 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:53:33] RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5488 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:53:33] RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 75360 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:07:07] !log restarting keyholder on deploy1001
[03:07:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:25] !log crusnov@deploy1001 Started
deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 (testing)
[03:16:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:28] T223291: Netbox: move it to dedicated Ganeti VMs - https://phabricator.wikimedia.org/T223291
[03:16:46] !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 (testing) (duration: 00m 20s)
[03:16:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:16] !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: deploy for netbox split T223291
[03:21:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:31] !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 (duration: 00m 14s)
[03:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:46] T223291: Netbox: move it to dedicated Ganeti VMs - https://phabricator.wikimedia.org/T223291
[03:32:59] (PS4) Andrew Bogott: openstack scheduler: update comments for cloudvirts [puppet] - https://gerrit.wikimedia.org/r/534681 (https://phabricator.wikimedia.org/T229873)
[03:33:23] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[03:33:25] Operations, ops-eqiad, DC-Ops, Epic, and 2 others: relocate/reimage cloudvirt1021 with 10G interfaces - https://phabricator.wikimedia.org/T229873 (Andrew) Open→Resolved
[03:33:38] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[03:33:40] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1022 with 10G interfaces - https://phabricator.wikimedia.org/T229872 (Andrew) Open→Resolved
[03:33:54] Operations, ops-eqiad, DC-Ops,
Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[03:33:56] (CR) Andrew Bogott: [C: +2] openstack scheduler: update comments for cloudvirts [puppet] - https://gerrit.wikimedia.org/r/534681 (https://phabricator.wikimedia.org/T229873) (owner: Andrew Bogott)
[03:33:58] Operations, ops-eqiad, DC-Ops, Epic, and 2 others: relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (Andrew) Open→Resolved
[03:34:56] Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[03:36:01] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Andrew)
[03:39:35] Operations, cloud-services-team: Migrate remaining cloudvirt hosts to Stretch/Mitaka - https://phabricator.wikimedia.org/T224561 (Andrew)
[04:54:33] <_joe_> !log run systemctl reset-failed on kafka1001 to clear a 13 hours icinga alert
[04:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:54] Operations, DBA, Patch-For-Review, User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (Marostegui) I have reserved the window on the Deployments page.
[05:05:02] Operations, DBA, Patch-For-Review: Switchover s8 (wikidata) primary database master db1104 -> db1109 - 10th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230762 (Marostegui) I have reserved the window on the Deployments page.
[05:09:09] Operations, DBA: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (Marostegui)
[05:11:32] !log Remove db2046 from tendril and zarcillo - T231767
[05:11:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:35] T231767: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767
[05:11:36] (PS1) Marostegui: mariadb: Decommission db2046 [puppet] - https://gerrit.wikimedia.org/r/534725 (https://phabricator.wikimedia.org/T231767)
[05:13:25] PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:24] (CR) Marostegui: [C: +2] mariadb: Decommission db2046 [puppet] - https://gerrit.wikimedia.org/r/534725 (https://phabricator.wikimedia.org/T231767) (owner: Marostegui)
[05:18:05] Operations, DBA, Patch-For-Review: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (Marostegui)
[05:21:01] PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[05:22:37] RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:24:39] Operations, Mail, OTRS: check OTRS wiki for email addresses no longer used - https://phabricator.wikimedia.org/T230243 (Krd) accountsecurity@wikimedia.org contrib@wikimedia.org donate-fwd@wikimedia.org educacao@wikimedia.org foundation@wikimedia.org helpdesk-l@wikimedia.org pers@wikimedia.org orange-...
[05:31:20] !log Stop MySQL on db2046 - T231767
[05:31:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:24] T231767: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767
[05:32:11] Operations, ops-codfw, DC-Ops, decommission: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (Marostegui) a:Marostegui→RobH
[05:32:26] Operations, ops-codfw, DC-Ops, decommission: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (Marostegui) This host is ready for #dc-ops to decommission
[05:36:30] Operations, Mail, OTRS: check OTRS wiki for email addresses no longer used - https://phabricator.wikimedia.org/T230243 (Dzahn) I checked the email addresses provided and they are all routed to OTRS except these: helpdesk-l@lists.wikiemdia.org - This is a mailman list. pers@wikimedia.org is undelive...
[05:37:10] Operations, DBA, Patch-For-Review, User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (Marostegui)
[05:41:27] PROBLEM - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 4 days ago: Most recent backup 2019-09-02 05:29:42 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[05:58:17] (CR) Giuseppe Lavagetto: [C: +1] lvs: add restbase-ssl [puppet] - https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: Ema)
[06:00:10] (PS3) Giuseppe Lavagetto: scap: restart php-fpm if needed when doing a full deploy [puppet] - https://gerrit.wikimedia.org/r/534584 (https://phabricator.wikimedia.org/T224857)
[06:02:16] (CR) Giuseppe Lavagetto: [C: +2] scap: restart php-fpm if needed when doing a full deploy [puppet] - https://gerrit.wikimedia.org/r/534584 (https://phabricator.wikimedia.org/T224857) (owner: Giuseppe Lavagetto)
[06:03:01] !log puppetmaster1001 - copying cassandra-ca-manager to /usr/local/bin - deleting expired restbase-dev1004 certs - running cassandra-ca-manager services-dev.yaml T224554
[06:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:04] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554
[06:09:47] !log puppetmaster1001 - same for restbase-dev1005 and restbase-dev1006 (T224554)
[06:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:50] T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554
[06:14:20] Operations, Cassandra, RESTBase, Core Platform Team Workboards (Clinic Duty Team), User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (Dzahn) @Eevans I recreated the certs for restbase-dev1004 through restbase-dev1006 and committed in the
private...
[06:17:33] Operations, Cassandra, RESTBase, Core Platform Team Workboards (Clinic Duty Team), User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (Dzahn) ` @restbase-dev1004 : keytool -list -v -keystore /etc/cassandra-a/tls/server.key 2>/dev/null | grep "Va...
[06:25:54] Operations, Mail, OTRS: check OTRS wiki for email addresses no longer used - https://phabricator.wikimedia.org/T230243 (Krd) Done.
[06:26:11] Operations, Mail, OTRS: check OTRS wiki for email addresses no longer used - https://phabricator.wikimedia.org/T230243 (Krd) Open→Resolved a:Krd
[06:26:16] <_joe_> I'm going to do a null deployment to check scap on the deployment servers
[06:29:41] !log oblivian@deploy1001 Synchronized README: testing php conditional restarts (duration: 00m 55s)
[06:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:58] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[06:37:11] <_joe_> wut?
[06:37:17] <_joe_> I just synced the readme :P
[06:38:32] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[06:48:53] (PS2) Muehlenhoff: Add partman config for ldap-corp* [puppet] - https://gerrit.wikimedia.org/r/534609
[06:52:06] (Abandoned) Dzahn: tlsproxy/envoy: limit connections on 443 to cache servers [puppet] - https://gerrit.wikimedia.org/r/534421 (owner: Dzahn)
[06:52:26] (CR) Muehlenhoff: [C: +2] Add partman config for ldap-corp* [puppet] - https://gerrit.wikimedia.org/r/534609 (owner: Muehlenhoff)
[06:52:33] (CR) Dzahn: [C: +2] remove parsoid-vd/parsoid-rt.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/534554 (https://phabricator.wikimedia.org/T229356) (owner: Dzahn)
[06:52:41] (PS2) Dzahn: remove parsoid-vd/parsoid-rt.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/534554 (https://phabricator.wikimedia.org/T229356)
[07:02:51] Operations, Wikimedia-Mailing-lists: Please create private "testeng" team mailing list - https://phabricator.wikimedia.org/T232178 (Aklapper)
[07:09:09] (PS3) Dzahn: releases: add envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/534594 (https://phabricator.wikimedia.org/T210411)
[07:13:51] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/18194/" [puppet] - https://gerrit.wikimedia.org/r/534594 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[07:14:00] (CR) Filippo Giunchedi: [C: +1] "LGTM!"
[puppet] - https://gerrit.wikimedia.org/r/533563 (https://phabricator.wikimedia.org/T230236) (owner: Herron)
[07:14:54] (CR) Filippo Giunchedi: "Should work ok, best to wait on Ibd3e53b7fd58 first IMHO" [puppet] - https://gerrit.wikimedia.org/r/530442 (https://phabricator.wikimedia.org/T230570) (owner: Herron)
[07:16:02] (CR) Filippo Giunchedi: [C: +1] "LGTM, although not a service-checker/swagger expert" [software/service-checker] - https://gerrit.wikimedia.org/r/532807 (owner: Cwhite)
[07:19:22] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[07:24:04] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[07:28:22] (PS23) Mathew.onipe: Add maps reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072)
[07:28:43] (CR) Mathew.onipe: Add maps reboot cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: Mathew.onipe)
[07:36:10] Operations, Product-Analytics, Wikidata, Wikidata-Query-Service, and 3 others: MIgrate WDQS to new logging pipeline - https://phabricator.wikimedia.org/T232184 (Mathew.onipe)
[07:36:22] Operations, Product-Analytics, Wikidata, Wikidata-Query-Service, and 3 others: MIgrate WDQS to new logging pipeline - https://phabricator.wikimedia.org/T232184 (Mathew.onipe) p:Triage→Normal
[07:37:23] ema: yeah leftover from https://phabricator.wikimedia.org/T232007.
I've re-enabled it
[07:40:18] (PS1) Dzahn: ATS/varnish: switch backend for releases.wm.org to use TLS [puppet] - https://gerrit.wikimedia.org/r/534759 (https://phabricator.wikimedia.org/T210411)
[07:44:06] (CR) Dzahn: [C: +2] ATS/varnish: switch backend for releases.wm.org to use TLS [puppet] - https://gerrit.wikimedia.org/r/534759 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[07:48:16] !log running puppet on cp-text_eqiad / cp1075 - switching releases.wikimedia.org to TLS to backend
[07:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:19] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[07:50:22] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[07:50:51] Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[07:51:25] Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) - releases.wikimedia.org switched to TLS - releases-jenkins remains todo - parsoid-vd / parsoid-rt tests on ruthenium - directors and DNS records removed - users wi...
[08:05:19] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/18196/webperf1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/534597 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[08:05:31] (PS2) Dzahn: webperf: add envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/534597 (https://phabricator.wikimedia.org/T210411)
[08:08:12] chaomodus: netbox1001 - internal server error | netbox2001/netboxdb2001 - systemdstate
[08:21:17] (PS1) Dzahn: ssl/webperf: fix certificate file extension [puppet] - https://gerrit.wikimedia.org/r/534764
[08:26:51] (CR) Dzahn: [C: +2] ssl/webperf: fix certificate file extension [puppet] - https://gerrit.wikimedia.org/r/534764 (owner: Dzahn)
[08:27:02] (PS2) Dzahn: ssl/webperf: fix certificate file extension [puppet] - https://gerrit.wikimedia.org/r/534764
[08:27:02] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[08:27:19] hrmm
[08:27:25] dashboard not found?
[08:28:36] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[08:31:49] (CR) Petar.petkovic: Add Draft and Draft_talk aliases for wikis that define draft namespace (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/510780 (https://phabricator.wikimedia.org/T223472) (owner: Petar.petkovic)
[08:35:08] Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[08:39:25] (PS5) Gehel: elasticsearch: switch relforge to new logging pipeline [puppet] - https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) (owner: Mathew.onipe)
[08:40:27] (CR) Gehel: [C: +2] elasticsearch: switch relforge to new logging pipeline [puppet] - https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) (owner: Mathew.onipe)
[08:41:53] (CR) Gehel: [C: +2] Add maps reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: Mathew.onipe)
[08:42:53] !log webperf* - /usr/local/sbin/build-envoy-config -c /etc/envoy | rm /etc/envoy/listeners.d/00-tls_terminator_443.yaml | run puppet - envoy now listening on 443 (T210411)
[08:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:08] T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411
[08:56:20] PROBLEM - snapshot of s3 in eqiad on db1115 is CRITICAL: snapshot for s3 at eqiad taken more than 4 days ago: Most recent backup 2019-09-02 08:38:15 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[09:30:56] Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate >= 30% of producers off statsd -
https://phabricator.wikimedia.org/T205870 (fgiunchedi)
[10:08:20] PROBLEM - ElasticSearch health check for shards on 9400 on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: number_of_nodes: 1, relocating_shards: 0, active_shards: 6, active_primary_shards: 6, initializing_shards: 0, unassigned_shards: 5, number_of_in_flight_fetch: 0, number_of_data_nodes: 1, timed_out: False, active_shards_percent_as_number: 54.54545454545454, task_max_waiting_in_queu
[10:08:20] ter_name: relforge-eqiad-small-alpha, status: yellow, delayed_unassigned_shards: 0, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:08:57] ^ oops
[10:08:59] that's me
[10:09:00] sorry
[10:11:28] RECOVERY - ElasticSearch health check for shards on 9400 on relforge1002 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, relocating_shards: 0, active_shards_percent_as_number: 100.0, unassigned_shards: 0, number_of_in_flight_fetch: 0, number_of_nodes: 2, number_of_pending_tasks: 0, timed_out: False, active_shards: 12, number_of_data_nodes: 2, active_p
[10:11:28] status: green, cluster_name: relforge-eqiad-small-alpha, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:14:55] Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (fgiunchedi)
[10:17:33] !log installing exim4 security updates
[10:17:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:54] Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (fgiunchedi)
[10:28:02] RECOVERY - snapshot of s3 in eqiad on db1115 is OK: snapshot for s3 at eqiad taken less than 4 days ago and larger than 90 GB:
Last one 2019-09-06 08:32:24 from db1095.eqiad.wmnet:3313 (830 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[10:34:09] (CR) Giuseppe Lavagetto: [C: -1] "I like the idea a lot, but I would go a different way with the implementation. See my comments inline for my suggestions." (5 comments) [software/service-checker] - https://gerrit.wikimedia.org/r/532807 (owner: Cwhite)
[10:40:57] Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (fgiunchedi)
[10:45:56] Operations, Graphite, Performance-Team (Radar): Improve graphite failover - https://phabricator.wikimedia.org/T88997 (fgiunchedi)
[10:49:13] Operations, Traffic, Wikidata, Wikidata-Query-Service: LDF service does not Vary responses by Content-Type, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (Lucas_Werkmeister_WMDE)
[10:50:18] Operations, observability, Availability, Performance-Team (Radar): Perform a statsd and Graphite switch - https://phabricator.wikimedia.org/T206963 (fgiunchedi) Open→Invalid Resolving in favor of {T88997} though please reopen if needed!
[10:55:28] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - dewiki_content_1566659363[4](2019-09-02T23:06:21.576Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:56:17] looking
[10:59:01] !log ladsgroup@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=testwikidatawiki (T225056)
[10:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:04] T225056: Run Item Terms Rebuild script - https://phabricator.wikimedia.org/T225056
[10:59:36] !log force shard allocation - chi eqiad
[10:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:14] Operations, Traffic: ats-tls is performing 3k DNS queries per second on cp5001 - https://phabricator.wikimedia.org/T232209 (Vgutierrez)
[11:06:53] (PS1) Vgutierrez: ATS: Disable DNS resolution for TLS instance [puppet] - https://gerrit.wikimedia.org/r/534783 (https://phabricator.wikimedia.org/T232209)
[11:09:07] (PS2) Vgutierrez: ATS: Disable DNS resolution for TLS instance [puppet] - https://gerrit.wikimedia.org/r/534783 (https://phabricator.wikimedia.org/T232209)
[11:10:55] (CR) Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/18198/" [puppet] - https://gerrit.wikimedia.org/r/534783 (https://phabricator.wikimedia.org/T232209) (owner: Vgutierrez)
[11:13:04] (PS1) Gehel: Revert "elasticsearch: switch relforge to new logging pipeline" [puppet] - https://gerrit.wikimedia.org/r/534786
[11:14:14] (CR) Gehel: [C: +2] Revert "elasticsearch: switch relforge to new logging pipeline" [puppet] - https://gerrit.wikimedia.org/r/534786 (owner: Gehel)
[11:19:42] Operations, Elasticsearch, Wikimedia-Logstash, observability, Discovery-Search (Current work): Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125
(Mathew.onipe) JsonLayout requires other dependencies for log4...
[11:39:58] PROBLEM - Disk space on phab1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1001&var-datasource=eqiad+prometheus/ops
[11:50:39] ^ fixing, that's some left over of the exim-heavy Puppet class phab1001 used to have
[11:59:37] (PS1) Kosta Harlan: WIP: Enable GrowthExperiments for euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/534789 (https://phabricator.wikimedia.org/T232060)
[12:02:00] RECOVERY - Disk space on phab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1001&var-datasource=eqiad+prometheus/ops
[12:04:00] (PS1) Ladsgroup: mediawiki: Add rebuildItemTerms for Wikidata [puppet] - https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056)
[12:07:01] (CR) Ladsgroup: "It needs one of SREs to start "/var/log/wikidata/wikidata-rebuildItemTerms.log" file with:" [puppet] - https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056) (owner: Ladsgroup)
[12:08:37] (CR) Marostegui: "Should we maybe start it after the s8 failover on Tuesday?"
[puppet] - https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056) (owner: Ladsgroup)
[12:11:45] (CR) Ladsgroup: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056) (owner: Ladsgroup)
[12:13:01] (CR) Marostegui: "> > Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056) (owner: Ladsgroup)
[12:28:42] PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:30:06] RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:35:08] (CR) Ema: [C: +1] ATS: Disable DNS resolution for TLS instance [puppet] - https://gerrit.wikimedia.org/r/534783 (https://phabricator.wikimedia.org/T232209) (owner: Vgutierrez)
[12:36:31] !log fix permissions on /var/spool/exim on krypton (hosts used to run the exim heavy role which uses different permissions than the light role)
[12:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:51] !log cp5001: restart trafficserver-tls.service to clear icinga alert after segfault
[12:37:54] Operations, media-storage: Have swift metrics available in Prometheus - https://phabricator.wikimedia.org/T187991 (fgiunchedi) Open→Resolved a:fgiunchedi Done, followup in {T205870}
[12:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:44] RECOVERY - traffic_server tls process restarted on cp5001 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5001&var-layer=tls
[12:44:12] (PS1) Muehlenhoff: Add initial site.pp entry for
ldap-corp* [puppet] - https://gerrit.wikimedia.org/r/534797 (https://phabricator.wikimedia.org/T231015)
[12:46:52] (CR) Giuseppe Lavagetto: [C: +2] restart-appservers: fix to the cli args, some other cosmetic changes [cookbooks] - https://gerrit.wikimedia.org/r/534445 (owner: Giuseppe Lavagetto)
[12:46:56] (PS3) Giuseppe Lavagetto: restart-appservers: fix to the cli args, some other cosmetic changes [cookbooks] - https://gerrit.wikimedia.org/r/534445
[12:46:57] Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (fgiunchedi)
[12:49:11] (PS2) Muehlenhoff: Add initial site.pp entry for ldap-corp* [puppet] - https://gerrit.wikimedia.org/r/534797 (https://phabricator.wikimedia.org/T231015)
[12:50:09] (CR) Muehlenhoff: [C: +2] Add initial site.pp entry for ldap-corp* [puppet] - https://gerrit.wikimedia.org/r/534797 (https://phabricator.wikimedia.org/T231015) (owner: Muehlenhoff)
[13:00:32] (CR) Giuseppe Lavagetto: [C: +2] kvobject: fix some class property ordering [software/conftool] - https://gerrit.wikimedia.org/r/527565 (owner: Giuseppe Lavagetto)
[13:04:00] (Merged) jenkins-bot: kvobject: fix some class property ordering [software/conftool] - https://gerrit.wikimedia.org/r/527565 (owner: Giuseppe Lavagetto)
[13:07:17] Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (fgiunchedi)
[13:17:27] (CR) Vgutierrez: [C: +2] ATS: Disable DNS resolution for TLS instance [puppet] - https://gerrit.wikimedia.org/r/534783 (https://phabricator.wikimedia.org/T232209) (owner: Vgutierrez)
[13:17:35] (PS3) Vgutierrez: ATS: Disable DNS resolution for TLS instance [puppet] - https://gerrit.wikimedia.org/r/534783 (https://phabricator.wikimedia.org/T232209)
[13:20:49] (PS1) Giuseppe
Lavagetto: Fix configuration file lookup when running with sudo [software/conftool] - 10https://gerrit.wikimedia.org/r/534803 [13:22:28] (03PS4) 10Jcrespo: WMFReplication: Parallelize slaves() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521232 [13:22:30] (03PS1) 10Jcrespo: testing stuff, not to be deployed [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/534804 [13:22:32] (03PS1) 10Jcrespo: [WIP] Add optional sanity checks to check mediawiki configuration [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/534805 [13:22:56] (03CR) 10jerkins-bot: [V: 04-1] testing stuff, not to be deployed [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/534804 (owner: 10Jcrespo) [13:23:00] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add optional sanity checks to check mediawiki configuration [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/534805 (owner: 10Jcrespo) [13:23:24] (03CR) 10jerkins-bot: [V: 04-1] Fix configuration file lookup when running with sudo [software/conftool] - 10https://gerrit.wikimedia.org/r/534803 (owner: 10Giuseppe Lavagetto) [13:23:42] (03Abandoned) 10Jcrespo: testing stuff, not to be deployed [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/534804 (owner: 10Jcrespo) [13:25:49] (03CR) 10CDanis: [C: 03+1] Fix configuration file lookup when running with sudo (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/534803 (owner: 10Giuseppe Lavagetto) [13:27:10] (03CR) 10Jcrespo: "This is not a blocker or a dependency for switchover.py ,but it helps make it faster, specially for things like replication_tree.py and re" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521232 (owner: 10Jcrespo) [13:27:45] 10Operations, 10Traffic, 10Patch-For-Review: ats-tls is performing 3k DNS queries per second on cp5001 - https://phabricator.wikimedia.org/T232209 (10Vgutierrez) 05Open→03Resolved p:05Triage→03Normal [13:27:48] 10Operations, 10Traffic: Move cache upload cluster from nginx to ats-tls - 
https://phabricator.wikimedia.org/T231433 (10Vgutierrez) [13:28:20] (03CR) 10Jcrespo: "This change is ready for review." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/534805 (owner: 10Jcrespo) [13:37:31] (03PS2) 10Giuseppe Lavagetto: Fix configuration file lookup when running with sudo [software/conftool] - 10https://gerrit.wikimedia.org/r/534803 [13:38:27] (03CR) 10Volans: "Shouldn't we get this info from Netbox instead of by trial and error on the Ganeti side?" [software/spicerack] - 10https://gerrit.wikimedia.org/r/533984 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [13:41:30] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fix configuration file lookup when running with sudo [software/conftool] - 10https://gerrit.wikimedia.org/r/534803 (owner: 10Giuseppe Lavagetto) [13:44:06] (03Merged) 10jenkins-bot: Fix configuration file lookup when running with sudo [software/conftool] - 10https://gerrit.wikimedia.org/r/534803 (owner: 10Giuseppe Lavagetto) [13:44:14] (03CR) 10Volans: "Some question inline" (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov) [13:46:24] (03CR) 10Volans: [C: 03+1] "LGTM, AFAIK they are already not in prod anymore since a couple of weeks." [puppet] - 10https://gerrit.wikimedia.org/r/534017 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff) [13:47:15] (03CR) 10Volans: [C: 03+1] "LGTM, AFAIK all IP references in puppet/dns repos have already been replaced by the new hosts." 
[dns] - 10https://gerrit.wikimedia.org/r/534019 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff) [13:50:28] RECOVERY - snapshot of s3 in codfw on db1115 is OK: snapshot for s3 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-09-06 10:27:35 from db2098.codfw.wmnet:3313 (774 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [14:02:39] (03CR) 10Volans: [C: 04-1] "Two things that I think need to be fixed, the rest are all optional/nits" (038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey) [14:06:04] 10Operations, 10Pybal, 10Traffic: Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 (10MoritzMuehlenhoff) More generally speaking: Are the pybal-test* servers still used for testing/developing? Is there a specific reason they are in production and not in something like a "py... [14:43:34] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [14:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:22] 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff) [14:48:52] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [14:48:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:59] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/534680 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [14:50:12] (03PS1) 10CDanis: dbctl: use explicit keyword arguments for the callback [software/conftool] - 10https://gerrit.wikimedia.org/r/534818 [14:50:14] (03PS1) 10CDanis: dbctl: add set-candidate-master subcommand on instance [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 [14:50:27] (03CR) 10jerkins-bot: [V: 04-1] dbctl: add set-candidate-master subcommand on instance [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (owner: 10CDanis) [14:51:07] (03CR) 10CDanis: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (owner: 10CDanis) [14:51:11] !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm [14:51:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:33] (03PS2) 10Jhedden: openstack: Add codfw1dev glance API to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/534680 (https://phabricator.wikimedia.org/T223907) [14:56:15] !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [14:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:46] (03CR) 10Jhedden: [C: 03+2] openstack: Add codfw1dev glance API to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/534680 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [14:58:26] 10Operations, 10vm-requests: eqiad/codfw: 2 VMs for corp LDAP replicas - https://phabricator.wikimedia.org/T231015 (10MoritzMuehlenhoff) VMs have been created (but not yet installed) [15:01:20] (03CR) 10Muehlenhoff: "Good catch! 
Amended the patch" [puppet] - 10https://gerrit.wikimedia.org/r/531808 (owner: 10Muehlenhoff) [15:01:27] (03PS2) 10Muehlenhoff: Restrict NTP servers to production networks (including frack and network gear) [puppet] - 10https://gerrit.wikimedia.org/r/531808 [15:03:58] (03PS2) 10CDanis: dbctl: add set-candidate-master subcommand on instance [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) [15:05:13] (03CR) 10Ayounsi: [C: 03+1] "LGTM, please ping me after merging it so I can check network devices are still happy." [puppet] - 10https://gerrit.wikimedia.org/r/531808 (owner: 10Muehlenhoff) [15:09:09] (03CR) 10Muehlenhoff: "Thanks, I'll ping you next week for this" [puppet] - 10https://gerrit.wikimedia.org/r/531808 (owner: 10Muehlenhoff) [15:10:20] (03Abandoned) 10Ayounsi: Fix dependencies [debs/pynetbox] - 10https://gerrit.wikimedia.org/r/534263 (owner: 10Ayounsi) [15:11:35] 10Operations, 10ops-eqiad, 10netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (10ayounsi) Postponed to Thursday Sept 12th, 8am PST, 11am local time, 15:00 UTC. 
3h [15:16:46] (03CR) 10Krinkle: Variant configuration: Read from JSON, not serialised PHP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester) [15:18:54] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:22:02] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:22:47] (03PS1) 10Vgutierrez: ATS: Disable keep-alive on outgoing connections using native config options [puppet] - 10https://gerrit.wikimedia.org/r/534828 [15:24:29] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/18202/" [puppet] - 10https://gerrit.wikimedia.org/r/534828 (owner: 10Vgutierrez) [15:29:12] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:32:22] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [15:35:31] 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10RobH) a:05RobH→03None [15:36:48] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:38:24] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:42:30] (03PS1) 10Vgutierrez: ATS: Disable server session sharing across clients [puppet] - 10https://gerrit.wikimedia.org/r/534831 [15:42:58] (03PS1) 10Jhedden: openstack: add haproxy health check path support [puppet] - 10https://gerrit.wikimedia.org/r/534832 (https://phabricator.wikimedia.org/T223907) [15:44:35] (03PS1) 10CRusnov: python-build: add buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/534833 [15:45:49] (03PS2) 10Vgutierrez: ATS: Disable keep-alive on outgoing connections on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534828 [15:45:51] (03PS2) 10Vgutierrez: ATS: Disable server session sharing across clients on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534831 [15:55:50] (03CR) 10Phamhi: [C: 03+2] openstack: add haproxy health check path support [puppet] - 10https://gerrit.wikimedia.org/r/534832 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [15:56:38] (03PS2) 10Jhedden: openstack: add haproxy health check path support [puppet] - 10https://gerrit.wikimedia.org/r/534832 (https://phabricator.wikimedia.org/T223907) [15:58:30] PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers 
[15:59:52] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [15:59:56] RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers [16:01:26] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [16:02:21] (03PS1) 10Jhedden: Revert "openstack: add haproxy health check path support" [puppet] - 10https://gerrit.wikimedia.org/r/534837 [16:03:24] (03CR) 10Jhedden: [C: 03+2] Revert "openstack: add haproxy health check path support" [puppet] - 10https://gerrit.wikimedia.org/r/534837 (owner: 10Jhedden) [16:03:29] (03CR) 10Jhedden: [V: 03+2 C: 03+2] Revert "openstack: add haproxy health check path support" [puppet] - 10https://gerrit.wikimedia.org/r/534837 (owner: 10Jhedden) [16:10:39] (03CR) 10Ayounsi: [C: 03+1] python-build: add buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/534833 (owner: 10CRusnov) [16:11:03] (03PS1) 10Jhedden: openstack: add haproxy health check path support [puppet] - 10https://gerrit.wikimedia.org/r/534839 (https://phabricator.wikimedia.org/T223907) [16:12:41] (03CR) 10CRusnov: [V: 03+2 C: 03+2] python-build: add buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/534833 (owner: 10CRusnov) [16:14:52] 10Operations, 10ops-eqiad, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10akosiaris) @Cmjohnson @wiki_willy What can we do to help get this unstuck? 
I am not at all sure why something like this would happen. Output of ` sudo megacli -AdpAllInfo -a... [16:16:25] paladox: Is there a Phabricator project for "blockers to Buster migration"? [16:16:48] I don't think so, I just experienced this issue when upgrading a server that ran Debian 9 [16:16:52] (upgraded to 10) [16:16:53] Right. [16:17:08] my workaround is to install the deb from puppet :) [16:22:10] (03CR) 10Jhedden: [C: 03+2] openstack: add haproxy health check path support [puppet] - 10https://gerrit.wikimedia.org/r/534839 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden) [16:22:24] (03PS2) 10Jhedden: openstack: add haproxy health check path support [puppet] - 10https://gerrit.wikimedia.org/r/534839 (https://phabricator.wikimedia.org/T223907) [16:27:51] (03PS1) 10CRusnov: Update to buster & upstream 2.6.3 (via v2.6.3-wmf1) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/534841 [16:35:02] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:37:04] (03CR) 10Ayounsi: [C: 03+1] "LGTM, but I'm concerned about doing a software upgrade at the same time of the migration (eg. DB migration complications)." 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/534841 (owner: 10CRusnov) [16:40:44] (03CR) 10Volans: "> Patch Set 1:" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/534507 (owner: 10Ayounsi) [16:43:08] (03CR) 10Volans: [C: 03+1] "post-merge LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/534153 (owner: 10CDanis) [16:47:12] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [16:47:21] ah of course [16:47:29] still working on it :) [16:49:26] also i wish the icinga ui was a bit less slow [17:04:19] (03PS1) 10Bstorm: tagging: Add the tag to the templates [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/534846 (https://phabricator.wikimedia.org/T229058) [17:07:05] (03CR) 10Bstorm: "When talking yesterday, I realized why we weren't able to get the latest built version with tagging. It's because the ancestor images wer" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/534846 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm) [17:14:54] (03CR) 10CRusnov: [V: 03+2 C: 03+2] "your concerns are noted!" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/534841 (owner: 10CRusnov) [17:18:08] (03CR) 10Volans: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/534818 (owner: 10CDanis) [17:24:23] !log crusnov@deploy1001 Started deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux [17:24:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:26] T223291: Netbox: move it to dedicated Ganeti VMs - https://phabricator.wikimedia.org/T223291 [17:25:33] (03CR) 10Volans: "LGTM in general, I've although a couple of questions and a nit inline." 
(033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis) [17:25:52] !log crusnov@deploy1001 Finished deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux (duration: 01m 29s) [17:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:58] !log crusnov@deploy1001 Started deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux 2 [17:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:35] !log crusnov@deploy1001 Finished deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux 2 (duration: 00m 37s) [17:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:27:11] (03PS1) 10Phamhi: toollabs: update maintain-kubeusers timer command to use timeout [puppet] - 10https://gerrit.wikimedia.org/r/534848 [17:30:39] (03CR) 10Phamhi: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/18204/console" [puppet] - 10https://gerrit.wikimedia.org/r/534848 (owner: 10Phamhi) [17:32:01] (03PS1) 10Andrew Bogott: codfw1dev: disable the mwyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/534850 [17:33:07] (03CR) 10jerkins-bot: [V: 04-1] codfw1dev: disable the mwyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/534850 (owner: 10Andrew Bogott) [17:34:24] (03PS2) 10Andrew Bogott: codfw1dev: disable the mwyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/534850 (https://phabricator.wikimedia.org/T229441) [17:35:36] (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: disable the mwyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/534850 (https://phabricator.wikimedia.org/T229441) (owner: 10Andrew Bogott) [17:35:43] (03PS3) 10Andrew Bogott: codfw1dev: disable the mwyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/534850 (https://phabricator.wikimedia.org/T229441) [17:37:54] PROBLEM - MediaWiki 
exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [17:38:58] !log crusnov@deploy1001 Started deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux 3 [17:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:01] T223291: Netbox: move it to dedicated Ganeti VMs - https://phabricator.wikimedia.org/T223291 [17:39:19] !log crusnov@deploy1001 Finished deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux 3 (duration: 00m 21s) [17:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:40:50] !log crusnov@deploy1001 Started deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux [17:40:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [17:43:45] !log crusnov@deploy1001 Finished deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux (duration: 02m 55s) [17:43:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:42] (03CR) 10Bstorm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/534848 (owner: 10Phamhi) [17:48:07] (03CR) 10Phamhi: [C: 03+2] toollabs: update maintain-kubeusers timer command to use timeout [puppet] - 10https://gerrit.wikimedia.org/r/534848 (owner: 10Phamhi) [17:48:31] (03PS2) 10Phamhi: toollabs: update maintain-kubeusers timer command to use timeout [puppet] - 10https://gerrit.wikimedia.org/r/534848 [17:48:36] 
10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Cmjohnson) [17:49:34] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:50:16] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: 7415 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops [17:50:50] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 55.7 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:50:52] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:51:19] (03PS1) 10Andrew Bogott: codfw1dev: update labtest.hiera.yaml to use codfw1dev resources [puppet] - 10https://gerrit.wikimedia.org/r/534851 (https://phabricator.wikimedia.org/T229441) [17:51:38] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:52:14] hmm, https://en.wikipedia.org is not loading [17:52:22] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:52:33] I get "can't establish a secure connection" [17:53:06] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:53:30] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:53:38] PROBLEM - LVS HTTPS IPv4 
#page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:53:41] esams link again? [17:53:56] ... [17:53:58] it doesn't work on my mobile either [17:54:10] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:54:11] blargh [17:54:17] and ping is failing. [17:54:19] paladox: it should start working shortly, if link traffic fails over, etc [17:54:23] everything ok? [17:54:24] wikipedia.org not working for me too [17:54:44] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:54:46] there was an excess RX alert too, could be dos [17:54:50] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:54:51] yeah page [17:55:12] I'm around if needed [17:55:39] pushing up a depool dns patch to have ready, not sure if that's the right move yet [17:55:44] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:55:47] bblack: let's take to _security [17:55:52] checking the link [17:56:16] (03PS1) 10BBlack: depool esams in geodns [dns] - 10https://gerrit.wikimedia.org/r/534852 [17:56:39] the packet loss seems to be *after* cr2-esams [17:57:00] no issues on eqiad-esams link [17:57:20] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:57:24] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:57:36] see: [17:57:36] HOST: icinga1001 Loss% Snt Last Avg Best Wrst StDev 
[17:57:36] 1. AS14907 ae3-1003.cr2-eqiad.wikimedia.org 0.0% 10 0.4 0.5 0.4 0.6 0.0 [17:57:36] 2. AS14907 xe-0-1-3.cr2-esams.wikimedia.org 0.0% 10 83.5 83.6 83.4 84.3 0.0 [17:57:36] 3. AS14907 text-lb.esams.wikimedia.org 80.0% 10 84.5 84.4 84.4 84.5 0.0 [17:58:14] 04Critical Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80% [17:58:20] (03CR) 10BBlack: [C: 03+2] depool esams in geodns [dns] - 10https://gerrit.wikimedia.org/r/534852 (owner: 10BBlack) [17:58:21] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 2.105 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:58:47] RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.460 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:00:12] PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [18:00:18] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [18:00:48] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:00:50] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1753 days) https://wikitech.wikimedia.org/wiki/Logs [18:01:10] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:01:12] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 
+0000 (expires in 715 days) https://wikitech.wikimedia.org/wiki/Logs [18:01:24] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:01:25] !log silence esams pages for 30m [18:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:34] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:01:39] godog: Is Esams down? [18:01:47] I'm unable to reach any site [18:01:58] getting timeouts from time to time [18:02:14] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:02:26] multichill: yeah it is unhappy, being depooled now [18:02:46] Those OSPF/BGP/BFD warnings don't look good [18:03:23] seems to be back [18:03:40] 04Critical Alert for device cr2-knams.wikimedia.org - Primary outbound port utilisation over 80% [18:03:40] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:03:47] 10Operations: ERR_CONNECTION_TIMED_OUT on multiple WikiMedia sites - https://phabricator.wikimedia.org/T232224 (10RhinosF1) [18:04:09] Steinsplitter, esams specifically or the site in general? [18:04:11] 04Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80% [18:04:21] 10Operations: ERR_CONNECTION_TIMED_OUT on multiple WikiMedia sites - https://phabricator.wikimedia.org/T232224 (10RhinosF1) p:05Triage→03Unbreak! 
Oh and accessing from the UK [18:04:22] PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53_udp: Servers nescio.wikimedia.org are marked down but pooled: dns_rec6_53_udp: Servers maerlant.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:04:35] 04Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80% [18:04:54] Krenair: the site [18:04:59] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [18:05:26] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:05:56] RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:07:10] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:07:14] godog: The depooling fixed it for me, dyna.wikimedia.org switched [18:07:32] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:07:37] 10Operations: ERR_CONNECTION_TIMED_OUT on multiple WikiMedia sites - https://phabricator.wikimedia.org/T232224 (10RhinosF1) Also confirmed by @ShakespeareFan00 so not just me [18:07:49] multichill: sweet! 
same here now [18:07:54] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: 4882 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops [18:07:54] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [18:07:54] formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v [18:07:55] title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/ [18:07:55] ons/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/RESTBase [18:08:52] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:09:02] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [18:09:10] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 7.641 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:09:11] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:09:14] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&va [18:09:14] server&var-method=GET [18:09:28] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [18:09:52] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [18:09:54] PROBLEM - restbase endpoints 
health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:09:56] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:10:18] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:10:20] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3043.esams.wmnet, cp3042.esams.wmnet, cp3030.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:10:20] PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [18:10:23] PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: 
/{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor [18:10:23] received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api [18:10:42] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:10:43] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:10:43] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/pag [18:10:43] et structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [18:21:30] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:38:54] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:39:20] PROBLEM - LVS HTTPS IPv6 #page on 
text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:39:22] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:39:40] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:39:44] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 5 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [18:40:30] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:40:34] bblack: btw, I'm a deployer with access, so i see logs anyway, at least which are in logstash [18:40:39] *nda [18:40:48] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 0.506 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:40:58] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:41:02] RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops [18:41:16] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:42:15] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% [18:42:39] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% [18:44:41] So if there is anything i can do to help, would be happy to do [18:46:34] 04Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80% [18:47:50] Is there a way to see where the traffic is coming from? [18:48:26] ShakespeareFan00, I expect the network engineers are able to do that [18:48:54] am not aware of any public graphs etc. about it [18:48:59] no help needed, we have all the info we need right now, thanks [18:49:00] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:49:14] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:49:16] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:49:28] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:49:52] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:50:08] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:50:25] ugh [18:50:32] ShakespeareFan00: you can explore 
grafana.wikimedia.org, if you want, but the ability to help without sysadmin power is limited [18:50:37] Lofhi: major outage, engineers work on that [18:50:45] I know [18:50:54] + you can't access Grafana [18:51:03] There are no data points [18:51:08] I can, I'm in `nda` ;) [18:51:14] Lucky [18:51:16] grafana is not a restricted site [18:51:18] ^ [18:51:25] No one said that [18:51:26] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [18:51:26] formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v [18:51:26] title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/ [18:51:26] ons/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve 
announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [18:51:28] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [18:51:31] yeah but it is also affected by network problems [18:51:34] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw-esams.mgmt.esams.wmnet recovered from Primary inbound port utilisation over 80% [18:51:34] ^ [18:51:36] yes [18:53:04] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:54:40] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:54:54] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:55:24] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:56:04] PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:56:10] RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:56:12] Nemo_bis: https://twitter.com/WikimediaItalia/status/1170042749166542849 <- ahum? 
[18:57:34] RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 564 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:58:00] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:58:39] 04Critical Alert for device cr2-knams.wikimedia.org - Primary outbound port utilisation over 80% [18:58:42] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [18:58:42] formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v [18:58:43] title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/ [18:58:43] ons/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt 
article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [18:58:44] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:58:51] 04Critical Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80% [18:58:56] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:59:00] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:59:08] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 4.118 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:59:15] 04Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80% [18:59:22] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 7.114 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [18:59:30] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [18:59:39] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [18:59:50] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:59:52] RECOVERY - Check if active EventStreams endpoint is delivering messages. 
on icinga1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration [19:00:06] RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [19:03:04] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3030.esams.wmnet are marked down but pooled: textlb_443: Servers cp3032.esams.wmnet, cp3033.esams.wmnet, cp3041.esams.wmnet, cp3040.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3043.esams.wmnet, cp3042.esams.wmnet, cp3032.esams.wmnet, cp3041.esams.wmnet, cp3030.esams.wmnet are marke [19:03:04] : textlb_80: Servers cp3043.esams.wmnet, cp3032.esams.wmnet, cp3040.esams.wmnet, cp3030.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:03:34] 04Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80% [19:03:58] PROBLEM - HHVM rendering on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1928 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:04:02] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:04:22] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:04:24] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:04:38] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 
UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:05:06] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers [19:05:12] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:05:28] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: No response from remote host 91.198.174.246 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:05:42] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:05:50] RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 551 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:05:54] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:05:58] PROBLEM - pybal on lvs3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:06:14] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:06:16] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:06:16] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal [19:06:20] PROBLEM - Nginx local proxy to apache on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 
500 Internal Server Error - 1928 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:06:34] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 0.500 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:06:44] PROBLEM - Apache HTTP on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1928 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:07:00] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:07:14] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 4.781 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:09:24] RECOVERY - pybal on lvs3001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal [19:10:20] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:10:38] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 271.9 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:10:56] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:11:04] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:11:06] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:11:36] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:11:56] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:12:32] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:13:04] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [19:13:04] formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v [19:13:04] title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: 
/api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response w [19:13:04] /rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [19:13:52] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 1.488 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:13:56] PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:14:06] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:15:38] RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 1.415 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:17:08] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:17:40] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:18:28] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:18:53] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw-esams.mgmt.esams.wmnet recovered from Primary inbound port utilisation over 80% [19:19:06] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary 
from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [19:19:06] formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v [19:19:06] title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/ [19:19:06] ons/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [19:19:54] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: 3596 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops [19:20:42] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:21:42] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:21:56] RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 562 bytes in 0.787 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:22:00] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 107.6 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [19:22:50] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [19:22:50] formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v [19:22:50] title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from 
storage) timed out before a response was received: /api/rest_v1/ [19:22:50] ons/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [19:23:39] 04Critical Alert for device cr2-knams.wikimedia.org - Primary outbound port utilisation over 80% [19:24:03] 04Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80% [19:24:53] 04Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80% [19:25:02] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [19:25:24] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:26:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:26:32] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:26:36] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:26:41] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [19:26:42] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15795 bytes in 1.441 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:27:03] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:27:14] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [19:27:22] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:27:32] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:27:34] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [19:27:50] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:28:03] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:28:04] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:28:04] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [19:28:10] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:28:10] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [19:28:12] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:28:14] 
RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:28:14] Critical Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80% [19:28:42] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [19:28:44] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:28:52] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:28:58] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:29:02] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [19:29:02] formula) 
timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/metadata/{title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page [19:29:02] et rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received h [19:29:02] ikimedia.org/wiki/RESTBase [19:29:02] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:29:08] PROBLEM - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:29:08] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:29:10] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from 
Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedi [19:29:10] se [19:29:12] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [19:29:12] PROBLEM - SSH on lvs2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:29:14] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:29:24] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:29:27] Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [19:29:28] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:29:28] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: 
/{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [19:29:33] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:30:22] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:30:26] RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [19:30:31] Any clues about when this might stabilize? [19:30:33] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:30:34] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:30:36] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:30:38] RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15798 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:30:39] abian: we are on it [19:30:50] Okay, thanks :) [19:30:52] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:31:26] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:31:28] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} 
(Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:31:30] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:31:48] PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [19:31:54] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:31:54] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:31:54] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:31:56] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed 
content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:31:56] PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [19:31:56] PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [19:31:56] PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [19:32:05] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:32:08] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:32:18] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:32:32] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:32:40] PROBLEM - LVS HTTPS IPv4 #page on ncredir-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:32:42] 
RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:32:44] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [19:32:46] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 6.193 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:33:02] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [19:33:04] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:06] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:33:06] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [19:33:08] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:08] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:33:26] RECOVERY - graphoid endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [19:33:30] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [19:33:30] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:33:30] RECOVERY - restbase endpoints health 
on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:32] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:32] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:33:33] RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [19:33:34] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 117 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [19:33:34] RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [19:33:38] RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.479 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:33:40] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [19:33:42] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:34:00] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:34:04] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:34:04] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) 
is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:34:06] RECOVERY - SSH on lvs2001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [19:34:09] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:34:09] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [19:34:12] RECOVERY - LVS HTTPS IPv4 #page on ncredir-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:34:12] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: 7235 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops [19:34:26] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:34:40] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:35:38] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 191 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [19:35:44] RECOVERY - PyBal backends health check on lvs3001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:36:05] abian, as a general rule it's best not to ask for these sorts of estimates, especially while people are under pressure trying to deal with 
incidents like this [19:36:21] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80% [19:36:39] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80% [19:36:56] if we could estimate we would, but usually if a problem is so tractable that it can be estimated, it would've been fixed long ago :) [19:37:22] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is OK: (C)3200 ge (W)1600 ge 1032 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops [19:37:39] Device cr2-knams.wikimedia.org recovered from Primary outbound port utilisation over 80% [19:38:29] Device asw-esams.mgmt.esams.wmnet recovered from Primary inbound port utilisation over 80% [19:38:53] Device cr2-knams.wikimedia.org recovered from Primary inbound port utilisation over 80% [19:39:08] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 24 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [19:39:18] Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% [19:39:29] Critical Alert for device cr1-codfw.wikimedia.org - Primary inbound port utilisation over 80% [19:39:40] Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80% [19:39:52] Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% [19:41:12] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [19:41:13] Maybe you had finished the work 
and now it was up to the system to stabilize, we cannot know [19:41:39] A "we don't know" or "can't tell" is also a valid answer :) [19:41:55] bblack: Thank you for all your hard work. I trust you're doing your best. [19:42:09] +1 [19:42:21] Device cr2-eqdfw.wikimedia.org recovered from Primary outbound port utilisation over 80% [19:42:39] Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [19:43:14] Device cr1-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [19:43:15] maybe we should avoid pinging them while this is ongoing [19:43:26] Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [19:48:22] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [19:48:41] <3 [20:00:30] Good news :) [20:00:56] it's still down for me [20:01:25] I also still face problems in Europe [20:01:48] PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100% [20:01:51] marostegui, [20:02:01] Ugh [20:02:02] I'm also having problems getting to a prod wiki [20:02:09] cdanis, bblack [20:02:26] well lvs3001 going down would explain it i guess? 
[20:02:32] RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 84.31 ms [20:03:11] I can't access the projects yet either [20:03:39] PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:03:41] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:03:53] PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [20:03:53] PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:03:54] PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:03:56] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:03:56] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was 
received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:03:56] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:03:56] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:03:58] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:04:00] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [20:04:06] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:04:12] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:04:18] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for 
April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:04:22] PROBLEM - SSH on lvs2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:04:26] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:04:28] PROBLEM - LVS HTTPS IPv4 #page on ncredir-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:04:28] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:04:30] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:04:36] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:04:46] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:04:48] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:04:48] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [20:04:48] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:04:50] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [20:04:50] formula) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections [20:04:50] ile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) 
timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:04:52] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:04:56] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:04:58] we are on it [20:05:04] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:05:08] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:05:08] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:05:08] PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [20:05:12] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: 
/{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:05:14] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:05:16] PROBLEM - LVS HTTPS IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:05:20] RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 5.561 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:05:24] PROBLEM - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:05:24] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:05:30] PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:05:44] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:05:44] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: 
/en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:06:04] PROBLEM - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:06:04] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:06:22] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 5.207 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:06:30] PROBLEM - PyBal connections to etcd on lvs2001 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [20:06:48] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:06:52] RECOVERY - LVS HTTPS IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 2.137 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:06:56] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:07:00] PROBLEM - IPv4 ping to codfw on 
ripe-atlas-codfw is CRITICAL: CRITICAL - failed 296 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [20:07:12] PROBLEM - LVS HTTP IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:07:38] RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 1.257 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:07:52] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:08:10] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [20:08:10] formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v [20:08:10] title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out 
before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/ [20:08:10] ons/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:08:22] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 9 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:08:32] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:08:38] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 6.031 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:08:50] RECOVERY - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 7.187 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:08:50] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:09:10] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: 
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:09:22] PROBLEM - LVS HTTP IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:09:34] RECOVERY - LVS HTTPS IPv4 #page on ncredir-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 4.137 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:09:34] RECOVERY - SSH on lvs2001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:09:36] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:10:12] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:10:24] RECOVERY - LVS HTTP IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:10:30] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 108 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [20:10:40] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:10:58] RECOVERY - LVS HTTP IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 4.005 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:10:58] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:11:08] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: 9351 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops [20:11:18] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:11:34] 04Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80% [20:11:38] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:11:56] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:12:21] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80% [20:12:22] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:12:23] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:12:26] RECOVERY - PyBal connections to etcd on lvs2001 is OK: OK: 8 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [20:12:58] 04Critical Alert for device cr2-knams.wikimedia.org - Primary outbound port utilisation over 80% [20:13:15] 04Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80% [20:13:20] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:13:30] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:13:33] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80% [20:13:40] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 0.846 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:13:57] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [20:14:15] 04Critical Alert for device cr1-codfw.wikimedia.org - Primary inbound port utilisation over 80% [20:14:28] PROBLEM - SSH on lvs2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:14:33] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80% [20:14:36] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:14:40] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:15:08] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:15:38] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:15:40] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:15:50] RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:16:14] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:16:32] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:16:33] PROBLEM - PyBal connections to etcd on lvs3001 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [20:17:06] PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:17:11] RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.803 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:17:14] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:17:22] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:17:32] PROBLEM - LVS HTTP IPv4 #page on ncredir-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:17:40] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:17:40] RECOVERY - SSH on lvs2001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:17:44] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:17:50] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:18:30] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:18:32] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:18:47] PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:18:49] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:19:41] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase 
[20:20:11] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:20:18] RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:20:19] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 0.512 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:20:41] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:20:45] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:20:50] RECOVERY - LVS HTTP IPv4 #page on ncredir-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 4.013 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:20:53] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:21:15] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:21:51] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:22:11] 
PROBLEM - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:22:25] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:22:25] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:22:33] RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:22:39] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: 9391 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops [20:22:43] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:22:47] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:23:36] Wikipedia is taking a really long time to load for me. It's happening on my iPhone and MacBook, on Safari. All other websites load fine. 
[20:23:47] RECOVERY - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 8.480 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:23:47] <|404> Cyberpower678, already known [20:23:49] <|404> and working on [20:23:56] In many instances it won't even load. [20:24:01] It just hangs [20:24:09] Operations, I'm all but a normal user but if there's anything I can do please feel free to let me know :) [20:24:20] btw #wikimedia-tech is supposed to be used for such reports [20:24:23] Cyberpower678: SRE is working on it [20:24:37] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:24:49] Zppix: Who's SRE? [20:24:59] Nemo_bis: thanks. [20:25:02] ddos or technical trouble? [20:25:04] Site Reliability Engineers (IIRC) [20:25:05] SREs [20:25:20] TheBanner: It's been confirmed DDoS [20:25:20] TheBanner, DDoS [20:25:37] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:39] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:43] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:43] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received:
/en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:25:45] Zppix: is the source of the DDoS known? [20:26:01] Cyberpower678: I am not sure, I only know as much as has been publicly stated [20:26:08] if it is I imagine they won't be stating it here [20:26:13] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:26:21] Zppix: link? [20:26:22] I'm sure there will be an incident report published in the coming days [20:26:23] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: 3274 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops [20:26:27] That's not on Wikipedia? [20:26:30] Please keep this channel focused on dealing with the incident. Please see the channel topic: "Status: Incident on-going". No estimates etc as people are busy trying to deal with this incident. There will be an incident report later. Thanks. [20:26:31] in the meantime let's leave them to it huh? [20:26:33] Cyberpower678: bblac.k stated it in here [20:26:45] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:26:48] Please move "curiosity talk" to #wikimedia-tech or such. Thanks.
[20:27:03] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:27:05] RECOVERY - PyBal connections to etcd on lvs3001 is OK: OK: 4 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [20:27:19] PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:27:29] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:27:40] PROBLEM - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:27:45] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:27:51] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:27:51] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3043.esams.wmnet, cp3042.esams.wmnet, cp3041.esams.wmnet are marked down but pooled: textlb_443: Servers cp3043.esams.wmnet, cp3032.esams.wmnet, cp3033.esams.wmnet, cp3040.esams.wmnet, cp3042.esams.wmnet, cp3041.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3043.esams.wmnet, cp3040.esams.wmnet, cp3042.e [20:27:51] 0.esams.wmnet are marked down but pooled: textlb_80: Servers cp3042.esams.wmnet, cp3032.esams.wmnet, cp3040.esams.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [20:28:04] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 6.027 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:28:23] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:28:33] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:28:46] PROBLEM - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:29:01] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:29:13] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:29:13] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: 5663 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops [20:30:09] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:30:17] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 4.166 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:30:17] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection 
refused https://wikitech.wikimedia.org/wiki/Logs [20:30:17] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:30:19] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm [20:30:19] aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/metadata/{title} (Get metadata from storage) timed out before a response was received: /api/rest_v1 [20:30:19] e} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response w [20:30:19] /rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:30:37] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a 
response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:30:43] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:30:49] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:30:53] RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 3.302 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:31:27] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:31:56] RECOVERY - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 2.516 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:31:57] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 714 days) https://wikitech.wikimedia.org/wiki/Logs [20:31:59] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:32:47] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:32:51] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: 
/{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:33:07] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:33:25] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:33:40] PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:33:51] RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [20:34:07] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:34:09] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: 9044 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops [20:34:15] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:34:19] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:34:23] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:34:45] 
RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:35:01] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 83.05 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:35:09] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:35:19] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:35:29] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [20:35:29] formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v [20:35:29] le} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out 
before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response [20:35:29] i/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:35:33] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:35:39] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:36:03] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 90.75 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:36:07] PROBLEM - LVS HTTPS IPv4 #page on ncredir-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:36:17] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:36:27] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:36:53] RECOVERY - LVS HTTP IPv6 #page on 
text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 562 bytes in 3.792 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:36:56] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 4.180 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:37:13] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:37:43] RECOVERY - LVS HTTPS IPv4 #page on ncredir-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 2.611 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:38:09] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 9 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:38:27] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:38:27] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:38:51] PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid [20:38:55] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:39:19] PROBLEM - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:39:19] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:39:25] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:39:31] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedi [20:39:31] se [20:39:35] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 0.516 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:39:51] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:40:29] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:40:29] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:40:33] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:40:33] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:40:35] RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:40:35] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:40:35] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:40:37] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:40:45] RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:40:49] RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 0.248 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:40:49] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:40:49] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:40:51] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:41:01] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:41:01] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:41:01] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:41:05] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:41:05] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:41:07] RECOVERY - PyBal backends health check on lvs3001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:41:13] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:41:23] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:41:25] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:41:25] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:41:25] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:41:31] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:41:34] Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80% [20:41:37] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy
https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:41:41] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:41:59] PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:42:01] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:42:20] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80% [20:42:57] Critical Alert for device cr2-knams.wikimedia.org - Primary outbound port utilisation over 80% [20:43:15] Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80% [20:43:32] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80% [20:43:37] PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [20:43:37] formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [utfa] https://wikitech.wikimedia.org/wiki/RESTBase [20:43:57] Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [20:43:57] RECOVERY - LVS HTTP IPv4 #page on
text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 9.712 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:44:37] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: 9175 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops [20:45:01] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:45:21] PROBLEM - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:45:35] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:45:35] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:45:35] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:35] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:37] PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:45:39] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [20:45:39] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:45:45] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:45:53] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:45:56] PROBLEM - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:45:57] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:46:03] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:46:03] PROBLEM - 
mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:46:05] PROBLEM - PyBal connections to etcd on lvs2001 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [20:46:09] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3043.esams.wmnet, cp3041.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb_443: Servers cp3043.esams.wmnet, cp3032.esams.wmnet, cp3040.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3032.esams.wmnet, cp3033.esams.wmnet, cp3040.esams.wmnet, cp3042.esams.wmnet, cp3041.e [20:46:09] 0.esams.wmnet are marked down but pooled: textlb_80: Servers cp3043.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:46:09] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:46:09] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedi 
[20:46:09] se [20:46:17] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:46:23] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:46:29] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:46:29] PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [20:46:29] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:46:33] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:46:45] PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:46:49] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:46:51] RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.459 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:46:57] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:47:17] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:47:17] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:47:19] PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:47:39] RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 7.774 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:47:40] C̶r̶i̶t̶i̶c̶a̶l̶ Device cr2-knams.wikimedia.org recovered from Primary outbound port utilisation over 80% [20:47:59] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:48:11] PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a
response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:48:11] W̶a̶r̶n̶i̶n̶g̶ Device cr1-eqiad.wikimedia.org recovered from Memory over 85% [20:48:59] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:48:59] PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [20:49:27] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:49:39] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 2.466 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:50:01] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:50:09] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:50:51] PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:51:09] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp3043.esams.wmnet,
cp3032.esams.wmnet, cp3033.esams.wmnet, cp3040.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3032.esams.wmnet, cp3033.esams.wmnet, cp3040.esams.wmnet, cp3042.esams.wmnet, cp3041.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb_80: Servers cp3043.es [20:51:09] .esams.wmnet, cp3030.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [20:51:53] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 2.473 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:52:01] RECOVERY - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 4.575 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:52:19] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:52:19] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:21] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80% [20:52:27] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: 5941 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops [20:52:38] Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80% [20:52:43] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 52.05 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:52:53] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{d [20:52:53] egated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/pa [20:52:53] le}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out befo [20:52:53] received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:52:59] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:17] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 35.79 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:53:23] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:39] Critical Alert for device cr2-knams.wikimedia.org - Primary outbound port utilisation over 80% [20:53:53] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:53:57] Critical Alert for device cr1-codfw.wikimedia.org - Primary inbound port utilisation over 80% [20:54:05] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:54:09] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:54:15] Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80% [20:54:15] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:54:23] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:54:35] RECOVERY - restbase endpoints health
on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:54:42] PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:55:01] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:55:05] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:55:21] RECOVERY - graphoid endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:55:27] RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid [20:55:27] RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:55:29] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:55:29] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:55:29] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:55:29] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:55:31] RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:55:33] RECOVERY - 
restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:55:35] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:55:43] RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:55:45] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [20:55:57] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:55:57] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:55:59] RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:56:01] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [20:56:03] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:56:17] PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat [20:56:17] formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a 
response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v [20:56:17] title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response w [20:56:17] /rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [20:56:21] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:56:21] RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:56:21] RECOVERY - PyBal connections to etcd on lvs2001 is OK: OK: 8 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal [20:56:21] RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:56:33] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 114 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:56:34] 04Critical Alert for device asw-esams.mgmt.esams.wmnet - 
Primary inbound port utilisation over 80% [20:56:35] RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:56:37] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:56:39] RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [20:56:39] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [20:56:55] PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:56:55] RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid [20:57:37] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 103.7 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [20:58:23] PROBLEM - PyBal connections to etcd on lvs3001 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [20:58:29] PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:00:03] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:00:11] PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:00:11] 
PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [21:00:33] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is OK: (C)3200 ge (W)1600 ge 441 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops [21:00:47] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:00:55] PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp3032.esams.wmnet, cp3040.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3032.esams.wmnet, cp3033.esams.wmnet, cp3040.esams.wmnet, cp3042.esams.wmnet, cp3041.esams.wmnet, cp3030.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [21:01:06] RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15809 bytes in 3.489 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:01:38] RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.815 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:01:40] RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15796 bytes in 2.726 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:01:47] RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1753 days) https://wikitech.wikimedia.org/wiki/Logs [21:01:59] RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 
Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:02:13] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:02:20] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary outbound port utilisation over 80% [21:02:27] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:02:31] RECOVERY - PyBal backends health check on lvs3001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [21:02:43] PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: 5858 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops [21:02:51] RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [21:02:53] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:03:29] RECOVERY - PyBal connections to etcd on lvs3001 is OK: OK: 4 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal [21:05:53] RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is OK: (C)3200 ge (W)1600 ge 1027 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops [21:06:59] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:07:05] PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:07:39] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:09:13] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:11:39] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-knams.wikimedia.org recovered from Primary outbound port utilisation over 80% [21:12:01] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:12:10] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw-esams.mgmt.esams.wmnet recovered from Primary inbound port utilisation over 80% [21:12:28] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-knams.wikimedia.org recovered from Primary inbound port utilisation over 80% [21:13:59] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:15:11] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 33.57 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:15:11] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:16:47] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 181.8 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:17:27] 04Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80% [21:17:33] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status 
[21:18:27] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [21:18:39] 04Critical Alert for device cr1-codfw.wikimedia.org - Primary inbound port utilisation over 80% [21:20:01] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:22:20] 04Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80% [21:23:03] RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:23:14] 04Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80% [21:25:01] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:25:01] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:25:02] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:26:21] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:26:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:26:55] !log mw1317 seems corrupted (Fatal error: Class undefined: stdClass in /srv/mediawiki/php-1.34.0-wmf.21/includes/libs/rdbms/database/DatabaseMysqli.php); running scap pull [21:27:18] PROBLEM - HTTP availability for Varnish 
at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:27:18] PROBLEM - Host ncredir-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [21:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:27:41] RECOVERY - Host ncredir-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 16%, RTA = 0.32 ms [21:27:45] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:28:03] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:28:09] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:28:33] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:29:15] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 23 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:29:23] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [21:29:33] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 77 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:30:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [21:31:23] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 59.32 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:31:23] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:31:23] PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 78 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:31:45] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 71.68 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:33:14] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [21:33:25] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [21:33:29] !log cdanis@mw1317.eqiad.wmnet ~ 🕠🍺 sudo -i depool [21:33:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:33:37] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80% [21:34:33] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:36:31] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 50.05 le 60 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:36:57] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:37:59] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:38:09] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:39:41] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 70.46 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [21:40:33] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:40:41] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 23 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [21:44:52] Its going on https://twitter.com/UKDrillas/status/1170089580458065920 [21:45:26] I assumed it was fake [21:45:46] We have been asked to move all chatter to #wikimedia-tech please [21:47:38] found this beauty, might be helpful [21:47:39] https://twitter.com/UKDrillas/status/1170089580458065920 [21:47:51] "We have been asked to move all chatter to #wikimedia-tech please" [21:48:27] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-knams.wikimedia.org recovered from Primary inbound port utilisation over 80% [21:48:51] 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation 
over 80% [21:52:29] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:53:17] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [21:53:37] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:53:55] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:55:57] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:58:35] PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53: Servers nescio.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers maerlant.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:00:11] RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:01:35] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:02:26] 04Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80% [22:03:27] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [22:04:57] PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53_udp: Servers maerlant.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:06:33] RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:09:09] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 
UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:09:33] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:09:47] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:14:29] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:15:53] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:16:05] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 46.14 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:17:01] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:17:27] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:22:17] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:22:27] 04Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80% [22:22:31] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 72.7 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:23:27] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [22:23:53] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 2 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:27:09] PROBLEM - PyBal backends health check on lvs3002 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53_udp: Servers nescio.wikimedia.org are marked down but pooled: dns_rec6_53_udp: Servers nescio.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:28:43] RECOVERY - PyBal backends health check on lvs3002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [22:31:49] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:34:33] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:35:01] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:35:11] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:39:17] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:39:45] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:39:55] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:43:01] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 41.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:43:39] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:47:27] 
04Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80% [22:47:45] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 79.4 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:48:05] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:48:11] 08Warning Alert for device cr1-eqiad.wikimedia.org - Memory over 85% [22:48:47] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:49:23] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:52:29] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:53:21] 04Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [22:53:35] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:54:11] PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [22:54:13] PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53: Servers maerlant.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [22:54:29] PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [22:55:49] RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy 
https://wikitech.wikimedia.org/wiki/PyBal [22:56:03] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [22:57:17] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:02:13] PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53: Servers nescio.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers nescio.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:03:39] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:03:49] RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:06:47] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 57.73 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:09:57] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 74.51 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:10:05] PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53: Servers nescio.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [23:11:41] RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [23:13:09] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status 
[23:17:12] Warning Alert for device cr2-esams.wikimedia.org - Processor usage over 85% [23:17:35] Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80% [23:19:40] mark: Shit still hitting fan? :-( [23:21:37] yes, although at this current point in time, the damage is limited to esams (EU / EMEA region users, luckily most of whom are in their dark hours!) [23:21:47] yup [23:22:01] I assume you guys are still working on resolving it [23:22:31] as best we can! [23:22:49] Awesome, keep up the great effort! [23:22:58] +1, thanks [23:22:58] what kind of hardware are the AMS core routers? [23:23:22] IRC-Source_77: https://wikitech.wikimedia.org/wiki/Esams_cluster [23:23:27] Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80% [23:26:21] already read those articles. but no infos and the pics are somewhat... "old" ;) [23:28:30] it's back up for me in the EU [23:30:39] PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:32:57] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:35:23] RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:35:27] RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:35:47] RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:36:25] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:42:27] C̶r̶i̶t̶i̶c̶a̶l̶ Device cr2-knams.wikimedia.org recovered from Primary inbound port utilisation over 80% [23:48:27]
C̶r̶i̶t̶i̶c̶a̶l̶ Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80% [23:57:11] W̶a̶r̶n̶i̶n̶g̶ Device cr2-esams.wikimedia.org recovered from Processor usage over 85%