[00:09:23] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[00:10:13] <wikibugs>	 (CR) Krinkle: Variant configuration: Write to static (JSON) as well as serialised cache for testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/533592 (https://phabricator.wikimedia.org/T223602) (owner: Jforrester)
[00:10:19] <Krinkle>	 night night, James_F 
[00:12:35] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
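The PROBLEM/RECOVERY wording above suggests the check fires when a given share of recent datapoints sits above a threshold. A minimal Python sketch of that rule follows. The function name, the sample window, and the use of the same 70% share for both thresholds are assumptions for illustration; this is not the actual Icinga/Graphite check.

```python
def classify(datapoints, warning=25.0, critical=50.0, share=70.0):
    """Classify a window of per-minute values as CRITICAL, WARNING, or OK.

    Assumption: a state is raised when at least `share` percent of the
    datapoints are above the matching threshold (the alert fired at
    exactly 70.00%, which is consistent with ">=").
    """
    if not datapoints:
        return "UNKNOWN"

    def percent_above(threshold):
        above = sum(1 for value in datapoints if value > threshold)
        return 100.0 * above / len(datapoints)

    if percent_above(critical) >= share:
        return "CRITICAL"
    if percent_above(warning) >= share:
        return "WARNING"
    return "OK"


# Hypothetical window: 7 of 10 samples are above the critical threshold.
window = [80, 75, 60, 55, 90, 52, 51, 10, 5, 3]
status = classify(window)
```

With ten datapoints per window, each sample moves the percentage by 10 points, which would explain why only values such as 70.00% and 80.00% show up in the alerts.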
[01:18:19] <logmsgbot>	 !log ayounsi@deploy1001 Started deploy [netbox/deploy@367ca84]: test
[01:18:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:18:21] <logmsgbot>	 !log ayounsi@deploy1001 Finished deploy [netbox/deploy@367ca84]: test (duration: 00m 02s)
[01:18:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:28] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Please create engprod@lists.wikimedia.org - https://phabricator.wikimedia.org/T232177 (greg)
[01:24:39] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Please create new team mailing list - https://phabricator.wikimedia.org/T232178 (Jrbranaa)
[02:04:21] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:05:57] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:18:33] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:20:07] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[02:48:51] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 29194632 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:51:59] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17429104 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:51:59] <icinga-wm>	 PROBLEM - Postgres Replication Lag on maps2001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 17468672 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:53:33] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 5488 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:53:33] <icinga-wm>	 RECOVERY - Postgres Replication Lag on maps2001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 75360 and 35 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[03:07:07] <chaomodus>	 !log restarting keyholder on deploy1001
[03:07:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:25] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 (testing)
[03:16:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:16:28] <stashbot>	 T223291: Netbox: move it to dedicated Ganeti VMs - https://phabricator.wikimedia.org/T223291
[03:16:46] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 (testing) (duration: 00m 20s)
[03:16:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:16] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@367ca84]: deploy for netbox split T223291
[03:21:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:31] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@367ca84]: deploy for netbox split T223291 (duration: 00m 14s)
[03:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[03:21:46] <stashbot>	 T223291: Netbox: move it to dedicated Ganeti VMs - https://phabricator.wikimedia.org/T223291
[03:32:59] <wikibugs>	 (PS4) Andrew Bogott: openstack scheduler: update comments for cloudvirts [puppet] - https://gerrit.wikimedia.org/r/534681 (https://phabricator.wikimedia.org/T229873)
[03:33:23] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[03:33:25] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Epic, and 2 others: relocate/reimage cloudvirt1021 with 10G interfaces - https://phabricator.wikimedia.org/T229873 (Andrew) Open→Resolved
[03:33:38] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[03:33:40] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): relocate/reimage cloudvirt1022 with 10G interfaces - https://phabricator.wikimedia.org/T229872 (Andrew) Open→Resolved
[03:33:54] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[03:33:56] <wikibugs>	 (CR) Andrew Bogott: [C: +2] openstack scheduler: update comments for cloudvirts [puppet] - https://gerrit.wikimedia.org/r/534681 (https://phabricator.wikimedia.org/T229873) (owner: Andrew Bogott)
[03:33:58] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Epic, and 2 others: relocate/reimage cloudvirt1023 with 10G interfaces - https://phabricator.wikimedia.org/T229871 (Andrew) Open→Resolved
[03:34:56] <wikibugs>	 Operations, ops-eqiad, DC-Ops, Epic, cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (Andrew)
[03:36:01] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (Andrew)
[03:39:35] <wikibugs>	 Operations, cloud-services-team: Migrate remaining cloudvirt hosts to Stretch/Mitaka - https://phabricator.wikimedia.org/T224561 (Andrew)
[04:54:33] <_joe_>	 !log run systemctl reset-failed on kafka1001 to clear a 13 hours icinga alert
[04:54:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:04:54] <wikibugs>	 Operations, DBA, Patch-For-Review, User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (Marostegui) I have reserved the window on the Deployments page.
[05:05:02] <wikibugs>	 Operations, DBA, Patch-For-Review: Switchover s8 (wikidata) primary database master db1104 -> db1109 - 10th Sept @05:00 UTC - https://phabricator.wikimedia.org/T230762 (Marostegui) I have reserved the window on the Deployments page.
[05:09:09] <wikibugs>	 Operations, DBA: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (Marostegui)
[05:11:32] <marostegui>	 !log Remove db2046 from tendril and zarcillo - T231767
[05:11:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:11:35] <stashbot>	 T231767: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767
[05:11:36] <wikibugs>	 (PS1) Marostegui: mariadb: Decommission db2046 [puppet] - https://gerrit.wikimedia.org/r/534725 (https://phabricator.wikimedia.org/T231767)
[05:13:25] <icinga-wm>	 PROBLEM - Check systemd state on netbox2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:24] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Decommission db2046 [puppet] - https://gerrit.wikimedia.org/r/534725 (https://phabricator.wikimedia.org/T231767) (owner: Marostegui)
[05:18:05] <wikibugs>	 Operations, DBA, Patch-For-Review: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (Marostegui)
[05:21:01] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:fcgi://127.0.0.1:9000 method=GET https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver
[05:22:37] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[05:24:39] <wikibugs>	 Operations, Mail, OTRS: check OTRS wiki for email addresses no longer used - https://phabricator.wikimedia.org/T230243 (Krd) accountsecurity@wikimedia.org contrib@wikimedia.org donate-fwd@wikimedia.org educacao@wikimedia.org foundation@wikimedia.org helpdesk-l@wikimedia.org pers@wikimedia.org orange-...
[05:31:20] <marostegui>	 !log Stop MySQL on db2046 - T231767
[05:31:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:31:24] <stashbot>	 T231767: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767
[05:32:11] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (Marostegui) a:Marostegui→RobH
[05:32:26] <wikibugs>	 Operations, ops-codfw, DC-Ops, decommission: Decommission db2046.codfw.wmnet - https://phabricator.wikimedia.org/T231767 (Marostegui) This host is ready for #dc-ops to decommission
[05:36:30] <wikibugs>	 Operations, Mail, OTRS: check OTRS wiki for email addresses no longer used - https://phabricator.wikimedia.org/T230243 (Dzahn) I checked the email addresses provided and they are all routed to OTRS except these:  helpdesk-l@lists.wikiemdia.org - This is a mailman list.  pers@wikimedia.org is undelive...
[05:37:10] <wikibugs>	 Operations, DBA, Patch-For-Review, User-notice: Switchover m1 primary master: db1063 to db1135: Tuesday 10th September at 16:00 UTC - https://phabricator.wikimedia.org/T231403 (Marostegui)
[05:41:27] <icinga-wm>	 PROBLEM - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 4 days ago: Most recent backup 2019-09-02 05:29:42 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[05:58:17] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] lvs: add restbase-ssl [puppet] - https://gerrit.wikimedia.org/r/534462 (https://phabricator.wikimedia.org/T210411) (owner: Ema)
[06:00:10] <wikibugs>	 (PS3) Giuseppe Lavagetto: scap: restart php-fpm if needed when doing a full deploy [puppet] - https://gerrit.wikimedia.org/r/534584 (https://phabricator.wikimedia.org/T224857)
[06:02:16] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] scap: restart php-fpm if needed when doing a full deploy [puppet] - https://gerrit.wikimedia.org/r/534584 (https://phabricator.wikimedia.org/T224857) (owner: Giuseppe Lavagetto)
[06:03:01] <mutante>	 !log puppetmaster1001 - copying cassandra-ca-manager to /usr/local/bin - deleting expired restbase-dev1004 certs - running cassandra-ca-manager services-dev.yaml T224554
[06:03:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:03:04] <stashbot>	 T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554
[06:09:47] <mutante>	 !log puppetmaster1001 - same for restbase-dev1005 and restbase-dev1006 (T224554)
[06:09:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:50] <stashbot>	 T224554: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554
[06:14:20] <wikibugs>	 Operations, Cassandra, RESTBase, Core Platform Team Workboards (Clinic Duty Team), User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (Dzahn) @Eevans I recreated the certs for restbase-dev1004 through restbase-dev1006 and committed in the private...
[06:17:33] <wikibugs>	 Operations, Cassandra, RESTBase, Core Platform Team Workboards (Clinic Duty Team), User-Eevans: Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (Dzahn) `  @restbase-dev1004 :  keytool -list -v -keystore /etc/cassandra-a/tls/server.key 2>/dev/null | grep "Va...
[06:25:54] <wikibugs>	 Operations, Mail, OTRS: check OTRS wiki for email addresses no longer used - https://phabricator.wikimedia.org/T230243 (Krd) Done.
[06:26:11] <wikibugs>	 Operations, Mail, OTRS: check OTRS wiki for email addresses no longer used - https://phabricator.wikimedia.org/T230243 (Krd) Open→Resolved a:Krd
[06:26:16] <_joe_>	 I'm going to do a null deployment to check scap on the deployment servers
[06:29:41] <logmsgbot>	 !log oblivian@deploy1001 Synchronized README: testing php conditional restarts (duration: 00m 55s)
[06:29:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:36:58] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[06:37:11] <_joe_>	 wut?
[06:37:17] <_joe_>	 I just synced the readme :P
[06:38:32] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[06:48:53] <wikibugs>	 (PS2) Muehlenhoff: Add partman config for ldap-corp* [puppet] - https://gerrit.wikimedia.org/r/534609
[06:52:06] <wikibugs>	 (Abandoned) Dzahn: tlsproxy/envoy: limit connections on 443 to cache servers [puppet] - https://gerrit.wikimedia.org/r/534421 (owner: Dzahn)
[06:52:26] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Add partman config for ldap-corp* [puppet] - https://gerrit.wikimedia.org/r/534609 (owner: Muehlenhoff)
[06:52:33] <wikibugs>	 (CR) Dzahn: [C: +2] remove parsoid-vd/parsoid-rt.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/534554 (https://phabricator.wikimedia.org/T229356) (owner: Dzahn)
[06:52:41] <wikibugs>	 (PS2) Dzahn: remove parsoid-vd/parsoid-rt.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/534554 (https://phabricator.wikimedia.org/T229356)
[07:02:51] <wikibugs>	 Operations, Wikimedia-Mailing-lists: Please create private "testeng" team mailing list - https://phabricator.wikimedia.org/T232178 (Aklapper)
[07:09:09] <wikibugs>	 (PS3) Dzahn: releases: add envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/534594 (https://phabricator.wikimedia.org/T210411)
[07:13:51] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/18194/" [puppet] - https://gerrit.wikimedia.org/r/534594 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[07:14:00] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/533563 (https://phabricator.wikimedia.org/T230236) (owner: Herron)
[07:14:54] <wikibugs>	 (CR) Filippo Giunchedi: "Should work ok, best to wait on Ibd3e53b7fd58 first IMHO" [puppet] - https://gerrit.wikimedia.org/r/530442 (https://phabricator.wikimedia.org/T230570) (owner: Herron)
[07:16:02] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM, although not a service-checker/swagger expert" [software/service-checker] - https://gerrit.wikimedia.org/r/532807 (owner: Cwhite)
[07:19:22] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[07:24:04] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[07:28:22] <wikibugs>	 (PS23) Mathew.onipe: Add maps reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072)
[07:28:43] <wikibugs>	 (CR) Mathew.onipe: Add maps reboot cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: Mathew.onipe)
[07:36:10] <wikibugs>	 Operations, Product-Analytics, Wikidata, Wikidata-Query-Service, and 3 others: MIgrate WDQS to new logging pipeline - https://phabricator.wikimedia.org/T232184 (Mathew.onipe)
[07:36:22] <wikibugs>	 Operations, Product-Analytics, Wikidata, Wikidata-Query-Service, and 3 others: MIgrate WDQS to new logging pipeline - https://phabricator.wikimedia.org/T232184 (Mathew.onipe) p:Triage→Normal
[07:37:23] <akosiaris>	 ema: yeah leftover from https://phabricator.wikimedia.org/T232007. I've re-enabled it
[07:40:18] <wikibugs>	 (PS1) Dzahn: ATS/varnish: switch backend for releases.wm.org to use TLS [puppet] - https://gerrit.wikimedia.org/r/534759 (https://phabricator.wikimedia.org/T210411)
[07:44:06] <wikibugs>	 (CR) Dzahn: [C: +2] ATS/varnish: switch backend for releases.wm.org to use TLS [puppet] - https://gerrit.wikimedia.org/r/534759 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[07:48:16] <mutante>	 !log running puppet on cp-text_eqiad / cp1075 - switching releases.wikimedia.org to TLS to backend
[07:48:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:49:19] <wikibugs>	 Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[07:50:22] <wikibugs>	 Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn)
[07:50:51] <wikibugs>	 Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (MoritzMuehlenhoff)
[07:51:25] <wikibugs>	 Operations, Traffic, serviceops, Patch-For-Review: Applayer services without TLS - https://phabricator.wikimedia.org/T210411 (Dzahn) - releases.wikimedia.org switched to TLS - releases-jenkins remains todo - parsoid-vd / parsoid-rt tests on ruthenium - directors and DNS records removed - users wi...
[08:05:19] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/18196/webperf1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/534597 (https://phabricator.wikimedia.org/T210411) (owner: Dzahn)
[08:05:31] <wikibugs>	 (PS2) Dzahn: webperf: add envoy for TLS termination [puppet] - https://gerrit.wikimedia.org/r/534597 (https://phabricator.wikimedia.org/T210411)
[08:08:12] <mutante>	 chaomodus: netbox1001 - internal server error | netbox2001/netboxdb2001 - systemdstate
[08:21:17] <wikibugs>	 (PS1) Dzahn: ssl/webperf: fix certificate file extension [puppet] - https://gerrit.wikimedia.org/r/534764
[08:26:51] <wikibugs>	 (CR) Dzahn: [C: +2] ssl/webperf: fix certificate file extension [puppet] - https://gerrit.wikimedia.org/r/534764 (owner: Dzahn)
[08:27:02] <wikibugs>	 (PS2) Dzahn: ssl/webperf: fix certificate file extension [puppet] - https://gerrit.wikimedia.org/r/534764
[08:27:02] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[08:27:19] <mutante>	 hrmm
[08:27:25] <mutante>	 dashboard not found?
[08:28:36] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[08:31:49] <wikibugs>	 (CR) Petar.petkovic: Add Draft and Draft_talk aliases for wikis that define draft namespace (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/510780 (https://phabricator.wikimedia.org/T223472) (owner: Petar.petkovic)
[08:35:08] <wikibugs>	 Operations, DBA: Decommission db2043-db2069 - https://phabricator.wikimedia.org/T228258 (Marostegui)
[08:39:25] <wikibugs>	 (PS5) Gehel: elasticsearch: switch relforge to new logging pipeline [puppet] - https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) (owner: Mathew.onipe)
[08:40:27] <wikibugs>	 (CR) Gehel: [C: +2] elasticsearch: switch relforge to new logging pipeline [puppet] - https://gerrit.wikimedia.org/r/534399 (https://phabricator.wikimedia.org/T225125) (owner: Mathew.onipe)
[08:41:53] <wikibugs>	 (CR) Gehel: [C: +2] Add maps reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: Mathew.onipe)
[08:42:53] <mutante>	 !log webperf* - /usr/local/sbin/build-envoy-config -c /etc/envoy | rm /etc/envoy/listeners.d/00-tls_terminator_443.yaml | run puppet - envoy now listening on 443 (T210411)
[08:43:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:08] <stashbot>	 T210411: Applayer services without TLS - https://phabricator.wikimedia.org/T210411
[08:56:20] <icinga-wm>	 PROBLEM - snapshot of s3 in eqiad on db1115 is CRITICAL: snapshot for s3 at eqiad taken more than 4 days ago: Most recent backup 2019-09-02 08:38:15 https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[09:30:56] <wikibugs>	 Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (fgiunchedi)
[10:08:20] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on relforge1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: number_of_nodes: 1, relocating_shards: 0, active_shards: 6, active_primary_shards: 6, initializing_shards: 0, unassigned_shards: 5, number_of_in_flight_fetch: 0, number_of_data_nodes: 1, timed_out: False, active_shards_percent_as_number: 54.54545454545454, task_max_waiting_in_queu
[10:08:20] <icinga-wm>	 ter_name: relforge-eqiad-small-alpha, status: yellow, delayed_unassigned_shards: 0, number_of_pending_tasks: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:08:57] <onimisionipe>	 ^ oops
[10:08:59] <onimisionipe>	 that's me
[10:09:00] <onimisionipe>	 sorry
[10:11:28] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on relforge1002 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: task_max_waiting_in_queue_millis: 0, delayed_unassigned_shards: 0, relocating_shards: 0, active_shards_percent_as_number: 100.0, unassigned_shards: 0, number_of_in_flight_fetch: 0, number_of_nodes: 2, number_of_pending_tasks: 0, timed_out: False, active_shards: 12, number_of_data_nodes: 2, active_p
[10:11:28] <icinga-wm>	  status: green, cluster_name: relforge-eqiad-small-alpha, initializing_shards: 0 https://wikitech.wikimedia.org/wiki/Search%23Administration
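The figures in the relforge alert above can be cross-checked by hand: 6 active shards and 5 unassigned give roughly 54.5% active, and 12 active with none unassigned gives 100%. A small Python sketch of that arithmetic follows. Reading "threshold =0.15" as a maximum inactive fraction is an assumption, and the function and parameter names are hypothetical; this is not the actual check code.

```python
def shard_health(active, unassigned, initializing=0, max_inactive_fraction=0.15):
    """Return (active_percent, inactive_count, breach) for one cluster.

    Assumption: the check breaches when the fraction of inactive shards
    (unassigned plus initializing) exceeds `max_inactive_fraction`.
    """
    total = active + unassigned + initializing
    if total == 0:
        raise ValueError("cluster reports no shards")
    inactive = unassigned + initializing
    active_percent = 100.0 * active / total
    breach = inactive / total > max_inactive_fraction
    return active_percent, inactive, breach


# Values taken from the PROBLEM and RECOVERY messages in the log.
problem = shard_health(active=6, unassigned=5)
recovered = shard_health(active=12, unassigned=0)
```

Under that reading, the PROBLEM state has 5 of 11 shards inactive (about 0.45, well over 0.15), and the RECOVERY state has none.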
[10:14:55] <wikibugs>	 Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (fgiunchedi)
[10:17:33] <moritzm>	 !log installing exim4 security updates
[10:17:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:19:54] <wikibugs>	 Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (fgiunchedi)
[10:28:02] <icinga-wm>	 RECOVERY - snapshot of s3 in eqiad on db1115 is OK: snapshot for s3 at eqiad taken less than 4 days ago and larger than 90 GB: Last one 2019-09-06 08:32:24 from db1095.eqiad.wmnet:3313 (830 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
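The snapshot alerts above combine two conditions: the newest snapshot must be less than 4 days old and larger than 90 GB. A minimal Python sketch of such a freshness check follows. The function name is hypothetical, the log date is assumed to be 2019-09-06 (inferred from the timestamps in the messages), and the size passed for the stale snapshot is a placeholder because the log does not state it.

```python
from datetime import datetime, timedelta


def snapshot_status(taken_at, size_gb, now,
                    max_age=timedelta(days=4), min_size_gb=90):
    """Return 'OK' only if the snapshot is both recent enough and large enough."""
    if now - taken_at > max_age:
        return "CRITICAL"  # taken more than 4 days ago
    if size_gb <= min_size_gb:
        return "CRITICAL"  # not "larger than 90 GB"
    return "OK"


# Stale case from the 08:56 PROBLEM message (size is a placeholder).
stale = snapshot_status(
    taken_at=datetime(2019, 9, 2, 8, 38, 15),
    size_gb=830,
    now=datetime(2019, 9, 6, 8, 56, 20),
)

# Fresh case from the 10:28 RECOVERY message (830 GB snapshot).
fresh = snapshot_status(
    taken_at=datetime(2019, 9, 6, 8, 32, 24),
    size_gb=830,
    now=datetime(2019, 9, 6, 10, 28, 2),
)
```

On those assumed dates the 2 September snapshot is about 18 minutes past the 4-day limit when the alert fires, which matches the PROBLEM message appearing shortly after 08:38.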
[10:34:09] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: -1] "I like the idea a lot, but I would go a different way with the implementation. See my comments inline for my suggestions." (5 comments) [software/service-checker] - https://gerrit.wikimedia.org/r/532807 (owner: Cwhite)
[10:40:57] <wikibugs>	 Operations, observability, Patch-For-Review, Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (fgiunchedi)
[10:45:56] <wikibugs>	 Operations, Graphite, Performance-Team (Radar): Improve graphite failover - https://phabricator.wikimedia.org/T88997 (fgiunchedi)
[10:49:13] <wikibugs>	 Operations, Traffic, Wikidata, Wikidata-Query-Service: LDF service does not Vary responses by Content-Type, sending incorrect cached responses to clients - https://phabricator.wikimedia.org/T232006 (Lucas_Werkmeister_WMDE)
[10:50:18] <wikibugs>	 Operations, observability, Availability, Performance-Team (Radar): Perform a statsd and Graphite switch - https://phabricator.wikimedia.org/T206963 (fgiunchedi) Open→Invalid Resolving in favor of {T88997} though please reopen if needed!
[10:55:28] <icinga-wm>	 PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - dewiki_content_1566659363[4](2019-09-02T23:06:21.576Z) https://wikitech.wikimedia.org/wiki/Search%23Administration
[10:56:17] <onimisionipe>	 looking
[10:59:01] <Amir1>	 !log ladsgroup@mwmaint1002:~$ time mwscript extensions/Wikibase/repo/maintenance/rebuildItemTerms.php --wiki=testwikidatawiki (T225056)
[10:59:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:59:04] <stashbot>	 T225056: Run Item Terms Rebuild script - https://phabricator.wikimedia.org/T225056
[10:59:36] <onimisionipe>	 !log force shard allocation - chi eqiad
[10:59:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:14] <wikibugs>	 Operations, Traffic: ats-tls is performing 3k DNS queries per second on cp5001 - https://phabricator.wikimedia.org/T232209 (Vgutierrez)
[11:06:53] <wikibugs>	 (PS1) Vgutierrez: ATS: Disable DNS resolution for TLS instance [puppet] - https://gerrit.wikimedia.org/r/534783 (https://phabricator.wikimedia.org/T232209)
[11:09:07] <wikibugs>	 (PS2) Vgutierrez: ATS: Disable DNS resolution for TLS instance [puppet] - https://gerrit.wikimedia.org/r/534783 (https://phabricator.wikimedia.org/T232209)
[11:10:55] <wikibugs>	 (CR) Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/18198/" [puppet] - https://gerrit.wikimedia.org/r/534783 (https://phabricator.wikimedia.org/T232209) (owner: Vgutierrez)
[11:13:04] <wikibugs>	 (PS1) Gehel: Revert "elasticsearch: switch relforge to new logging pipeline" [puppet] - https://gerrit.wikimedia.org/r/534786
[11:14:14] <wikibugs>	 (CR) Gehel: [C: +2] Revert "elasticsearch: switch relforge to new logging pipeline" [puppet] - https://gerrit.wikimedia.org/r/534786 (owner: Gehel)
[11:19:42] <wikibugs>	 Operations, Elasticsearch, Wikimedia-Logstash, observability, Discovery-Search (Current work): Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (Mathew.onipe) JsonLayout requires other dependencies for log4...
[11:39:58] <icinga-wm>	 PROBLEM - Disk space on phab1001 is CRITICAL: DISK CRITICAL - /var/spool/exim4/scan is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1001&var-datasource=eqiad+prometheus/ops
[11:50:39] <moritzm>	 ^ fixing, that's a leftover from the exim-heavy Puppet class phab1001 used to have
[11:59:37] <wikibugs>	 (PS1) Kosta Harlan: WIP: Enable GrowthExperiments for euwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/534789 (https://phabricator.wikimedia.org/T232060)
[12:02:00] <icinga-wm>	 RECOVERY - Disk space on phab1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=phab1001&var-datasource=eqiad+prometheus/ops
[12:04:00] <wikibugs>	 (PS1) Ladsgroup: mediawiki: Add rebuildItemTerms for Wikidata [puppet] - https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056)
[12:07:01] <wikibugs>	 (CR) Ladsgroup: "It needs one of SREs to start "/var/log/wikidata/wikidata-rebuildItemTerms.log" file with:" [puppet] - https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056) (owner: Ladsgroup)
[12:08:37] <wikibugs>	 (CR) Marostegui: "Should we maybe start it after the s8 failover on Tuesday?" [puppet] - https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056) (owner: Ladsgroup)
[12:11:45] <wikibugs>	 (CR) Ladsgroup: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056) (owner: Ladsgroup)
[12:13:01] <wikibugs>	 (CR) Marostegui: "> > Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/534790 (https://phabricator.wikimedia.org/T225056) (owner: Ladsgroup)
[12:28:42] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1285 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:30:06] <icinga-wm>	 RECOVERY - Nginx local proxy to apache on mw1285 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 591 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:35:08] <wikibugs>	 (03CR) 10Ema: [C: 03+1] ATS: Disable DNS resolution for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534783 (https://phabricator.wikimedia.org/T232209) (owner: 10Vgutierrez)
[12:36:31] <moritzm>	 !log fix permissions on /var/spool/exim on krypton (hosts used to run the exim heavy role which uses different permissions than the light role)
[12:36:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:37:51] <ema>	 !log cp5001: restart trafficserver-tls.service to clear icinga alert after segfault
[12:37:54] <wikibugs>	 10Operations, 10media-storage: Have swift metrics available in Prometheus - https://phabricator.wikimedia.org/T187991 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Done, followup in {T205870}
[12:38:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:38:44] <icinga-wm>	 RECOVERY - traffic_server tls process restarted on cp5001 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=eqsin+prometheus/ops&var-instance=cp5001&var-layer=tls
[12:44:12] <wikibugs>	 (03PS1) 10Muehlenhoff: Add initial site.pp entry for ldap-corp* [puppet] - 10https://gerrit.wikimedia.org/r/534797 (https://phabricator.wikimedia.org/T231015)
[12:46:52] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] restart-appservers: fix to the cli args, some other cosmetic changes [cookbooks] - 10https://gerrit.wikimedia.org/r/534445 (owner: 10Giuseppe Lavagetto)
[12:46:56] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: restart-appservers: fix to the cli args, some other cosmetic changes [cookbooks] - 10https://gerrit.wikimedia.org/r/534445
[12:46:57] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi)
[12:49:11] <wikibugs>	 (03PS2) 10Muehlenhoff: Add initial site.pp entry for ldap-corp* [puppet] - 10https://gerrit.wikimedia.org/r/534797 (https://phabricator.wikimedia.org/T231015)
[12:50:09] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add initial site.pp entry for ldap-corp* [puppet] - 10https://gerrit.wikimedia.org/r/534797 (https://phabricator.wikimedia.org/T231015) (owner: 10Muehlenhoff)
[13:00:32] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] kvobject: fix some class property ordering [software/conftool] - 10https://gerrit.wikimedia.org/r/527565 (owner: 10Giuseppe Lavagetto)
[13:04:00] <wikibugs>	 (03Merged) 10jenkins-bot: kvobject: fix some class property ordering [software/conftool] - 10https://gerrit.wikimedia.org/r/527565 (owner: 10Giuseppe Lavagetto)
[13:07:17] <wikibugs>	 10Operations, 10observability, 10Patch-For-Review, 10Performance-Team (Radar): Fully migrate >= 30% of producers off statsd - https://phabricator.wikimedia.org/T205870 (10fgiunchedi)
[13:17:27] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] ATS: Disable DNS resolution for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534783 (https://phabricator.wikimedia.org/T232209) (owner: 10Vgutierrez)
[13:17:35] <wikibugs>	 (03PS3) 10Vgutierrez: ATS: Disable DNS resolution for TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534783 (https://phabricator.wikimedia.org/T232209)
[13:20:49] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Fix configuration file lookup when running with sudo [software/conftool] - 10https://gerrit.wikimedia.org/r/534803
[13:22:28] <wikibugs>	 (03PS4) 10Jcrespo: WMFReplication: Parallelize slaves() [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521232
[13:22:30] <wikibugs>	 (03PS1) 10Jcrespo: testing stuff, not to be deployed [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/534804
[13:22:32] <wikibugs>	 (03PS1) 10Jcrespo: [WIP] Add optional sanity checks to check mediawiki configuration [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/534805
[13:22:56] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] testing stuff, not to be deployed [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/534804 (owner: 10Jcrespo)
[13:23:00] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add optional sanity checks to check mediawiki configuration [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/534805 (owner: 10Jcrespo)
[13:23:24] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] Fix configuration file lookup when running with sudo [software/conftool] - 10https://gerrit.wikimedia.org/r/534803 (owner: 10Giuseppe Lavagetto)
[13:23:42] <wikibugs>	 (03Abandoned) 10Jcrespo: testing stuff, not to be deployed [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/534804 (owner: 10Jcrespo)
[13:25:49] <wikibugs>	 (03CR) 10CDanis: [C: 03+1] Fix configuration file lookup when running with sudo (031 comment) [software/conftool] - 10https://gerrit.wikimedia.org/r/534803 (owner: 10Giuseppe Lavagetto)
[13:27:10] <wikibugs>	 (03CR) 10Jcrespo: "This is not a blocker or a dependency for switchover.py ,but it helps make it faster, specially for things like replication_tree.py and re" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/521232 (owner: 10Jcrespo)
[13:27:45] <wikibugs>	 10Operations, 10Traffic, 10Patch-For-Review: ats-tls is performing 3k DNS queries per second on cp5001 - https://phabricator.wikimedia.org/T232209 (10Vgutierrez) 05Open→03Resolved p:05Triage→03Normal
[13:27:48] <wikibugs>	 10Operations, 10Traffic: Move cache upload cluster from nginx to ats-tls - https://phabricator.wikimedia.org/T231433 (10Vgutierrez)
[13:28:20] <wikibugs>	 (03CR) 10Jcrespo: "This change is ready for review." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/534805 (owner: 10Jcrespo)
[13:37:31] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Fix configuration file lookup when running with sudo [software/conftool] - 10https://gerrit.wikimedia.org/r/534803
[13:38:27] <wikibugs>	 (03CR) 10Volans: "Shouldn't we get this info from Netbox instead of by trial and error on the Ganeti side?" [software/spicerack] - 10https://gerrit.wikimedia.org/r/533984 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov)
[13:41:30] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fix configuration file lookup when running with sudo [software/conftool] - 10https://gerrit.wikimedia.org/r/534803 (owner: 10Giuseppe Lavagetto)
[13:44:06] <wikibugs>	 (03Merged) 10jenkins-bot: Fix configuration file lookup when running with sudo [software/conftool] - 10https://gerrit.wikimedia.org/r/534803 (owner: 10Giuseppe Lavagetto)
[13:44:14] <wikibugs>	 (03CR) 10Volans: "Some question inline" (035 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/533987 (https://phabricator.wikimedia.org/T231068) (owner: 10CRusnov)
[13:46:24] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, AFAIK they are already not in prod anymore since a couple of weeks." [puppet] - 10https://gerrit.wikimedia.org/r/534017 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff)
[13:47:15] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, AFAIK all IP references in puppet/dns repos have already been replaced by the new hosts." [dns] - 10https://gerrit.wikimedia.org/r/534019 (https://phabricator.wikimedia.org/T224559) (owner: 10Muehlenhoff)
[13:50:28] <icinga-wm>	 RECOVERY - snapshot of s3 in codfw on db1115 is OK: snapshot for s3 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-09-06 10:27:35 from db2098.codfw.wmnet:3313 (774 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[14:02:39] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Two things that I think need to be fixed, the rest are all optional/nits" (038 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/530096 (https://phabricator.wikimedia.org/T225297) (owner: 10Elukey)
[14:06:04] <wikibugs>	 10Operations, 10Pybal, 10Traffic: Migrate pybal-test2001 away from jessie - https://phabricator.wikimedia.org/T224570 (10MoritzMuehlenhoff) More generally speaking: Are the pybal-test* servers still used for testing/developing? Is there a specific reason they are in production and not in something like a "py...
[14:43:34] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm
[14:43:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:22] <wikibugs>	 10Operations: Track remaining jessie systems in production - https://phabricator.wikimedia.org/T224549 (10MoritzMuehlenhoff)
[14:48:52] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[14:48:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:49:59] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/534680 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden)
[14:50:12] <wikibugs>	 (03PS1) 10CDanis: dbctl: use explicit keyword arguments for the callback [software/conftool] - 10https://gerrit.wikimedia.org/r/534818
[14:50:14] <wikibugs>	 (03PS1) 10CDanis: dbctl: add set-candidate-master subcommand on instance [software/conftool] - 10https://gerrit.wikimedia.org/r/534819
[14:50:27] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] dbctl: add set-candidate-master subcommand on instance [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (owner: 10CDanis)
[14:51:07] <wikibugs>	 (03CR) 10CDanis: "recheck" [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (owner: 10CDanis)
[14:51:11] <logmsgbot>	 !log jmm@cumin2001 START - Cookbook sre.ganeti.makevm
[14:51:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:55:33] <wikibugs>	 (03PS2) 10Jhedden: openstack: Add codfw1dev glance API to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/534680 (https://phabricator.wikimedia.org/T223907)
[14:56:15] <logmsgbot>	 !log jmm@cumin2001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0)
[14:56:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:57:46] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] openstack: Add codfw1dev glance API to haproxy [puppet] - 10https://gerrit.wikimedia.org/r/534680 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden)
[14:58:26] <wikibugs>	 10Operations, 10vm-requests: eqiad/codfw: 2 VMs for corp LDAP replicas - https://phabricator.wikimedia.org/T231015 (10MoritzMuehlenhoff) VMs have been created (but not yet installed)
[15:01:20] <wikibugs>	 (03CR) 10Muehlenhoff: "Good catch! Amended the patch" [puppet] - 10https://gerrit.wikimedia.org/r/531808 (owner: 10Muehlenhoff)
[15:01:27] <wikibugs>	 (03PS2) 10Muehlenhoff: Restrict NTP servers to production networks (including frack and network gear) [puppet] - 10https://gerrit.wikimedia.org/r/531808
[15:03:58] <wikibugs>	 (03PS2) 10CDanis: dbctl: add set-candidate-master subcommand on instance [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677)
[15:05:13] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "LGTM, please ping me after merging it so I can check network devices are still happy." [puppet] - 10https://gerrit.wikimedia.org/r/531808 (owner: 10Muehlenhoff)
[15:09:09] <wikibugs>	 (03CR) 10Muehlenhoff: "Thanks, I'll ping you next week for this" [puppet] - 10https://gerrit.wikimedia.org/r/531808 (owner: 10Muehlenhoff)
[15:10:20] <wikibugs>	 (03Abandoned) 10Ayounsi: Fix dependencies [debs/pynetbox] - 10https://gerrit.wikimedia.org/r/534263 (owner: 10Ayounsi)
[15:11:35] <wikibugs>	 10Operations, 10ops-eqiad, 10netops: (Need By: Sept 30) update RE-S-X6-64G-S in cr[12]-eqiad - https://phabricator.wikimedia.org/T226424 (10ayounsi) Postponed to Thursday Sept 12th, 8am PST, 11am local time, 15:00 UTC. 3h
[15:16:46] <wikibugs>	 (03CR) 10Krinkle: Variant configuration: Read from JSON, not serialised PHP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/533593 (https://phabricator.wikimedia.org/T223602) (owner: 10Jforrester)
[15:18:54] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[15:22:02] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[15:22:47] <wikibugs>	 (03PS1) 10Vgutierrez: ATS: Disable keep-alive on outgoing connections using native config options [puppet] - 10https://gerrit.wikimedia.org/r/534828
[15:24:29] <wikibugs>	 (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1002/18202/" [puppet] - 10https://gerrit.wikimedia.org/r/534828 (owner: 10Vgutierrez)
[15:29:12] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[15:32:22] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST
[15:35:31] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: b4-eqiad pdu refresh (Thursday 10/24 @11am UTC) - https://phabricator.wikimedia.org/T227540 (10RobH) a:05RobH→03None
[15:36:48] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:38:24] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:42:30] <wikibugs>	 (03PS1) 10Vgutierrez: ATS: Disable server session sharing across clients [puppet] - 10https://gerrit.wikimedia.org/r/534831
[15:42:58] <wikibugs>	 (03PS1) 10Jhedden: openstack: add haproxy health check path support [puppet] - 10https://gerrit.wikimedia.org/r/534832 (https://phabricator.wikimedia.org/T223907)
[15:44:35] <wikibugs>	 (03PS1) 10CRusnov: python-build: add buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/534833
[15:45:49] <wikibugs>	 (03PS2) 10Vgutierrez: ATS: Disable keep-alive on outgoing connections on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534828
[15:45:51] <wikibugs>	 (03PS2) 10Vgutierrez: ATS: Disable server session sharing across clients on TLS instance [puppet] - 10https://gerrit.wikimedia.org/r/534831
[15:55:50] <wikibugs>	 (03CR) 10Phamhi: [C: 03+2] openstack: add haproxy health check path support [puppet] - 10https://gerrit.wikimedia.org/r/534832 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden)
[15:56:38] <wikibugs>	 (03PS2) 10Jhedden: openstack: add haproxy health check path support [puppet] - 10https://gerrit.wikimedia.org/r/534832 (https://phabricator.wikimedia.org/T223907)
[15:58:30] <icinga-wm>	 PROBLEM - Apache HTTP on mw1276 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[15:59:52] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[15:59:56] <icinga-wm>	 RECOVERY - Apache HTTP on mw1276 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 590 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[16:01:26] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[16:02:21] <wikibugs>	 (03PS1) 10Jhedden: Revert "openstack: add haproxy health check path support" [puppet] - 10https://gerrit.wikimedia.org/r/534837
[16:03:24] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] Revert "openstack: add haproxy health check path support" [puppet] - 10https://gerrit.wikimedia.org/r/534837 (owner: 10Jhedden)
[16:03:29] <wikibugs>	 (03CR) 10Jhedden: [V: 03+2 C: 03+2] Revert "openstack: add haproxy health check path support" [puppet] - 10https://gerrit.wikimedia.org/r/534837 (owner: 10Jhedden)
[16:10:39] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] python-build: add buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/534833 (owner: 10CRusnov)
[16:11:03] <wikibugs>	 (03PS1) 10Jhedden: openstack: add haproxy health check path support [puppet] - 10https://gerrit.wikimedia.org/r/534839 (https://phabricator.wikimedia.org/T223907)
[16:12:41] <wikibugs>	 (03CR) 10CRusnov: [V: 03+2 C: 03+2] python-build: add buster [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/534833 (owner: 10CRusnov)
[16:14:52] <wikibugs>	 10Operations, 10ops-eqiad, 10DC-Ops: backup1001 can't address the disk shelf's drives - https://phabricator.wikimedia.org/T227335 (10akosiaris) @Cmjohnson @wiki_willy What can we do to help get this unstuck? I am not at all sure why something like this would happen.   Output of  ` sudo megacli -AdpAllInfo -a...
[16:16:25] <James_F>	 paladox: Is there a Phabricator project for "blockers to Buster migration"?
[16:16:48] <paladox>	 I don't think so, i just experienced this issue when upgrading a server that ran debian 9
[16:16:52] <paladox>	 (upgraded to 10)
[16:16:53] <James_F>	 Right.
[16:17:08] <paladox>	 my work around is to install the deb from puppet :)
[16:22:10] <wikibugs>	 (03CR) 10Jhedden: [C: 03+2] openstack: add haproxy health check path support [puppet] - 10https://gerrit.wikimedia.org/r/534839 (https://phabricator.wikimedia.org/T223907) (owner: 10Jhedden)
[16:22:24] <wikibugs>	 (03PS2) 10Jhedden: openstack: add haproxy health check path support [puppet] - 10https://gerrit.wikimedia.org/r/534839 (https://phabricator.wikimedia.org/T223907)
[16:27:51] <wikibugs>	 (03PS1) 10CRusnov: Update to buster & upstream 2.6.3 (via v2.6.3-wmf1) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/534841
[16:35:02] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:37:04] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] "LGTM, but I'm concerned about doing a software upgrade at the same time of the migration (eg. DB migration complications)." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/534841 (owner: 10CRusnov)
[16:40:44] <wikibugs>	 (03CR) 10Volans: "> Patch Set 1:" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/534507 (owner: 10Ayounsi)
[16:43:08] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "post-merge LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/534153 (owner: 10CDanis)
[16:47:12] <icinga-wm>	 PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:47:21] <chaomodus>	 ah of course
[16:47:29] <chaomodus>	 still working on it :)
[16:49:26] <chaomodus>	 also i wish the icinga ui was a bit less slow
[17:04:19] <wikibugs>	 (03PS1) 10Bstorm: tagging: Add the tag to the templates [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/534846 (https://phabricator.wikimedia.org/T229058)
[17:07:05] <wikibugs>	 (03CR) 10Bstorm: "When talking yesterday, I realized why we weren't able to get the latest built version with tagging.  It's because the ancestor images wer" [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/534846 (https://phabricator.wikimedia.org/T229058) (owner: 10Bstorm)
[17:14:54] <wikibugs>	 (03CR) 10CRusnov: [V: 03+2 C: 03+2] "your concerns are noted!" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/534841 (owner: 10CRusnov)
[17:18:08] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/conftool] - 10https://gerrit.wikimedia.org/r/534818 (owner: 10CDanis)
[17:24:23] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux
[17:24:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:24:26] <stashbot>	 T223291: Netbox: move it to dedicated Ganeti VMs - https://phabricator.wikimedia.org/T223291
[17:25:33] <wikibugs>	 (03CR) 10Volans: "LGTM in general, I've although a couple of questions and a nit inline." (033 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/534819 (https://phabricator.wikimedia.org/T229677) (owner: 10CDanis)
[17:25:52] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux (duration: 01m 29s)
[17:25:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:58] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux 2
[17:26:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:26:35] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux 2 (duration: 00m 37s)
[17:26:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:27:11] <wikibugs>	 (03PS1) 10Phamhi: toollabs: update maintain-kubeusers timer command to use timeout [puppet] - 10https://gerrit.wikimedia.org/r/534848
[17:30:39] <wikibugs>	 (03CR) 10Phamhi: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/18204/console" [puppet] - 10https://gerrit.wikimedia.org/r/534848 (owner: 10Phamhi)
[17:32:01] <wikibugs>	 (03PS1) 10Andrew Bogott: codfw1dev: disable the mwyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/534850
[17:33:07] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] codfw1dev: disable the mwyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/534850 (owner: 10Andrew Bogott)
[17:34:24] <wikibugs>	 (03PS2) 10Andrew Bogott: codfw1dev: disable the mwyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/534850 (https://phabricator.wikimedia.org/T229441)
[17:35:36] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] codfw1dev: disable the mwyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/534850 (https://phabricator.wikimedia.org/T229441) (owner: 10Andrew Bogott)
[17:35:43] <wikibugs>	 (03PS3) 10Andrew Bogott: codfw1dev: disable the mwyaml backend [puppet] - 10https://gerrit.wikimedia.org/r/534850 (https://phabricator.wikimedia.org/T229441)
[17:37:54] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[17:38:58] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux 3
[17:39:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:39:01] <stashbot>	 T223291: Netbox: move it to dedicated Ganeti VMs - https://phabricator.wikimedia.org/T223291
[17:39:19] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux 3 (duration: 00m 21s)
[17:39:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:40:50] <logmsgbot>	 !log crusnov@deploy1001 Started deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux
[17:40:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:42:40] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[17:43:45] <logmsgbot>	 !log crusnov@deploy1001 Finished deploy [netbox/deploy@dea254a]: deploy for netbox split T223291 - buster redux (duration: 02m 55s)
[17:43:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:47:42] <wikibugs>	 (03CR) 10Bstorm: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/534848 (owner: 10Phamhi)
[17:48:07] <wikibugs>	 (03CR) 10Phamhi: [C: 03+2] toollabs: update maintain-kubeusers timer command to use timeout [puppet] - 10https://gerrit.wikimedia.org/r/534848 (owner: 10Phamhi)
[17:48:31] <wikibugs>	 (03PS2) 10Phamhi: toollabs: update maintain-kubeusers timer command to use timeout [puppet] - 10https://gerrit.wikimedia.org/r/534848
[17:48:36] <wikibugs>	 10Operations, 10ops-eqiad, 10Cloud-Services, 10cloud-services-team (Kanban): rack/setup/install cloudcephmon100[123] - https://phabricator.wikimedia.org/T228102 (10Cmjohnson)
[17:49:34] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:50:16] <icinga-wm>	 PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: 7415 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops
[17:50:50] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 55.7 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[17:50:52] <icinga-wm>	 PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:51:19] <wikibugs>	 (03PS1) 10Andrew Bogott: codfw1dev: update labtest.hiera.yaml to use codfw1dev resources [puppet] - 10https://gerrit.wikimedia.org/r/534851 (https://phabricator.wikimedia.org/T229441)
[17:51:38] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:52:14] <paladox>	 hmm, https://en.wikipedia.org is not loading
[17:52:22] <icinga-wm>	 RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:52:33] <paladox>	 i get "can't establish a secure connection"
[17:53:06] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[17:53:30] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:53:38] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:53:41] <bblack>	 esams link again?
[17:53:56] <XioNoX>	 ...
[17:53:58] <paladox>	 it doesn't work on my mobile either
[17:54:10] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:54:11] <chaomodus>	 blargh
[17:54:17] <paladox>	 and ping is failing.
[17:54:19] <bblack>	 paladox: it should start working shortly, if link traffic fails over, etc
[17:54:23] <vgutierrez>	 everything ok?
[17:54:24] <onimisionipe>	 wikipedia.org not working for me too
[17:54:44] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:54:46] <bblack>	 there was an excess RX alert too, could be dos
[17:54:50] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:54:51] <godog>	 yeah page
[17:55:12] <marostegui>	 I'm around if needed
[17:55:39] <bblack>	 pushing up a depool dns patch to have ready, not sure if that's the right move yet
[17:55:44] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:55:47] <cdanis>	 bblack: let's take to _security
[17:55:52] <XioNoX>	 checking the link
[17:56:16] <wikibugs>	 (03PS1) 10BBlack: depool esams in geodns [dns] - 10https://gerrit.wikimedia.org/r/534852
[17:56:39] <XioNoX>	 the packet loss seems to be *after* cr2-esams
[17:57:00] <XioNoX>	 no issues on eqiad-esams link
[17:57:20] <icinga-wm>	 PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[17:57:24] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:57:36] <XioNoX>	 see:
[17:57:36] <XioNoX>	 HOST: icinga1001                                Loss%   Snt   Last   Avg  Best  Wrst StDev
[17:57:36] <XioNoX>	   1. AS14907  ae3-1003.cr2-eqiad.wikimedia.org   0.0%    10    0.4   0.5   0.4   0.6   0.0
[17:57:36] <XioNoX>	   2. AS14907  xe-0-1-3.cr2-esams.wikimedia.org   0.0%    10   83.5  83.6  83.4  84.3   0.0
[17:57:36] <XioNoX>	   3. AS14907  text-lb.esams.wikimedia.org       80.0%    10   84.5  84.4  84.4  84.5   0.0
[17:58:14] <librenms-wmf>	 04Critical Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80%
[17:58:20] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] depool esams in geodns [dns] - 10https://gerrit.wikimedia.org/r/534852 (owner: 10BBlack)
[17:58:21] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 2.105 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[17:58:47] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.460 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:00:12] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[18:00:18] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[18:00:48] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:00:50] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1753 days) https://wikitech.wikimedia.org/wiki/Logs
[18:01:10] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:01:12] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 715 days) https://wikitech.wikimedia.org/wiki/Logs
[18:01:24] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:01:25] <godog>	 !log silence esams pages for 30m
[18:01:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:01:34] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:01:39] <multichill>	 godog: Is Esams down?
[18:01:47] <multichill>	 I'm unable to reach any site
[18:01:58] <sjoerddebruin>	 getting timeouts from time to time
[18:02:14] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:02:26] <godog>	 multichill: yeah it is unhappy, being depooled now
[18:02:46] <multichill>	 Those OSPF/BGP/BFD warnings don't look good
[18:03:23] <Steinsplitter>	 seems to be back
[18:03:40] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary outbound port utilisation over 80%
[18:03:40] <icinga-wm>	 RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:03:47] <wikibugs>	 Operations: ERR_CONNECTION_TIMED_OUT on multiple WikiMedia sites - https://phabricator.wikimedia.org/T232224 (RhinosF1)
[18:04:09] <Krenair>	 Steinsplitter, esams specifically or the site in general?
[18:04:11] <librenms-wmf>	 Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80%
[18:04:21] <wikibugs>	 Operations: ERR_CONNECTION_TIMED_OUT on multiple WikiMedia sites - https://phabricator.wikimedia.org/T232224 (RhinosF1) p:Triage→Unbreak! Oh and accessing from the UK
[18:04:22] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53_udp: Servers nescio.wikimedia.org are marked down but pooled: dns_rec6_53_udp: Servers maerlant.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:04:35] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80%
[18:04:54] <Steinsplitter>	 Krenair: the site
[18:04:59] <librenms-wmf>	 Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80%
[18:05:26] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:05:56] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[18:07:10] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:07:14] <multichill>	 godog: The depooling fixed it for me, dyna.wikimedia.org switched
[18:07:32] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:07:37] <wikibugs>	 Operations: ERR_CONNECTION_TIMED_OUT on multiple WikiMedia sites - https://phabricator.wikimedia.org/T232224 (RhinosF1) Also confirmed by @ShakespeareFan00 so not just me
[18:07:49] <godog>	 multichill: sweet! same here now
[18:07:54] <icinga-wm>	 PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: 4882 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops
[18:07:54] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat
[18:07:54] <icinga-wm>	  formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v
[18:07:55] <icinga-wm>	 title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/
[18:07:55] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[18:08:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:09:02] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[18:09:10] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 7.641 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:09:11] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:09:14] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&va
[18:09:14] <icinga-wm>	 server&var-method=GET
[18:09:28] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[18:09:52] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[18:09:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:09:56] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:10:18] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[18:10:20] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3043.esams.wmnet, cp3042.esams.wmnet, cp3030.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[18:10:20] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at eqsin on icinga1001 is CRITICAL: job=varnish-text site=eqsin https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[18:10:23] <icinga-wm>	 PROBLEM - recommendation_api endpoints health on scb2004 is CRITICAL: /{domain}/v1/caption/addition/{target} (Caption addition suggestions) timed out before a response was received: /{domain}/v1/caption/translation/from/{source}/to/{target} (Caption translation suggestions) timed out before a response was received: /{domain}/v1/description/translation/from/{source}/to/{target} (Description translation suggestions) timed out befor
[18:10:23] <icinga-wm>	 received: /{domain}/v1/description/addition/{target} (Description addition suggestions) timed out before a response was received: /{domain}/v1/article/creation/morelike/{seed} (article.creation.morelike - good article title) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/recommendation_api
[18:10:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:10:43] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:10:43] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200): /en.wikipedia.org/v1/pag
[18:10:43] <icinga-wm>	 et structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[18:21:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:38:54] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:39:20] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:39:22] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:39:40] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:39:44] <icinga-wm>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 5 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[18:40:30] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:40:34] <Urbanecm>	 bblack: btw, I'm a deployer with access, so i see logs anyway, at least which are in logstash
[18:40:39] <Urbanecm>	 *nda
[18:40:48] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 0.506 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:40:58] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:41:02] <icinga-wm>	 RECOVERY - rsyslog in eqiad is failing to deliver messages on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Rsyslog https://grafana.wikimedia.org/d/000000596/rsyslog?var-datasource=eqiad+prometheus/ops
[18:41:16] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:42:15] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%
[18:42:39] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80%
[18:44:41] <Urbanecm>	 So if there is anything I can do to help, I would be happy to do it
[18:46:34] <librenms-wmf>	 Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80%
[18:47:50] <ShakespeareFan00>	 Is there a way to see where the traffic is coming from?
[18:48:26] <Krenair>	 ShakespeareFan00, I expect the network engineers are able to do that
[18:48:54] <Krenair>	 am not aware of any public graphs etc. about it
[18:48:59] <greg-g>	 no help needed, we have all the info we need right now, thanks
[18:49:00] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:49:14] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:49:16] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:49:28] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:49:52] <icinga-wm>	 PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:50:08] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:50:25] <Lofhi>	 ugh
[18:50:32] <Urbanecm>	 ShakespeareFan00: you can explore grafana.wikimedia.org, if you want, but the ability to help without sysadmin power is limited
[18:50:37] <Urbanecm>	 Lofhi: major outage, engineers work on that
[18:50:45] <Lofhi>	 I know
[18:50:54] <Lofhi>	 + you can't access Grafana 
[18:51:03] <Lofhi>	 There are no data points
[18:51:08] <Urbanecm>	 I can, I'm in `nda` ;)
[18:51:14] <Lofhi>	 Lucky
[18:51:16] <Krenair>	 grafana is not a restricted site
[18:51:18] <paladox>	 ^
[18:51:25] <Lofhi>	 No one said that
[18:51:26] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat
[18:51:26] <icinga-wm>	  formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v
[18:51:26] <icinga-wm>	 title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/
[18:51:26] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[18:51:28] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[18:51:31] <bennofs[m]>	 yeah but it is also affected by network problems
[18:51:34] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l Device asw-esams.mgmt.esams.wmnet recovered from Primary inbound port utilisation over 80%
[18:51:34] <Lofhi>	 ^
[18:51:36] <Krenair>	 yes
[18:53:04] <icinga-wm>	 RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:54:40] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:54:54] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:55:24] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:56:04] <icinga-wm>	 PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:56:10] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:56:12] <multichill>	 Nemo_bis: https://twitter.com/WikimediaItalia/status/1170042749166542849 <- ahum?
[18:57:34] <icinga-wm>	 RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 564 bytes in 1.187 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:58:00] <icinga-wm>	 PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:58:39] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary outbound port utilisation over 80%
[18:58:42] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat
[18:58:42] <icinga-wm>	  formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v
[18:58:43] <icinga-wm>	 title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/
[18:58:43] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[18:58:44] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:58:51] <librenms-wmf>	 Critical Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80%
[18:58:56] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:59:00] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS1299/IPv4: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[18:59:08] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 4.118 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[18:59:15] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80%
[18:59:22] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 7.114 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[18:59:30] <icinga-wm>	 RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[18:59:39] <librenms-wmf>	 Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80%
[18:59:50] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[18:59:52] <icinga-wm>	 RECOVERY - Check if active EventStreams endpoint is delivering messages. on icinga1001 is OK: OK: An EventStreams message was consumed from https://stream.wikimedia.org/v2/stream/recentchange within 10 seconds. https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams/Administration
[19:00:06] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[19:03:04] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3030.esams.wmnet are marked down but pooled: textlb_443: Servers cp3032.esams.wmnet, cp3033.esams.wmnet, cp3041.esams.wmnet, cp3040.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3043.esams.wmnet, cp3042.esams.wmnet, cp3032.esams.wmnet, cp3041.esams.wmnet, cp3030.esams.wmnet are marke
[19:03:04] <icinga-wm>	 : textlb_80: Servers cp3043.esams.wmnet, cp3032.esams.wmnet, cp3040.esams.wmnet, cp3030.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[19:03:34] <librenms-wmf>	 Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80%
[19:03:58] <icinga-wm>	 PROBLEM - HHVM rendering on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1928 bytes in 0.033 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:04:02] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:04:22] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:04:24] <icinga-wm>	 PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:04:38] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:05:06] <icinga-wm>	 RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netbox1001 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[19:05:12] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:05:28] <icinga-wm>	 PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: No response from remote host 91.198.174.246 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:05:42] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[19:05:50] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 551 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:05:54] <icinga-wm>	 RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:05:58] <icinga-wm>	 PROBLEM - pybal on lvs3001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[19:06:14] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:06:16] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:06:16] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - Bad Response from pybal: 500 Cant connect to localhost:9090 https://wikitech.wikimedia.org/wiki/PyBal
[19:06:20] <icinga-wm>	 PROBLEM - Nginx local proxy to apache on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1928 bytes in 0.060 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:06:34] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 0.500 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:06:44] <icinga-wm>	 PROBLEM - Apache HTTP on mw1317 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1928 bytes in 0.048 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:07:00] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:07:14] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 4.781 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:09:24] <icinga-wm>	 RECOVERY - pybal on lvs3001 is OK: PROCS OK: 1 process with UID = 0 (root), args /usr/sbin/pybal https://wikitech.wikimedia.org/wiki/PyBal
[19:10:20] <icinga-wm>	 RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:10:38] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 271.9 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[19:10:56] <icinga-wm>	 PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:11:04] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:11:06] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:11:36] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:11:56] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[19:12:32] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:13:04] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat
[19:13:04] <icinga-wm>	  formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v
[19:13:04] <icinga-wm>	 title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response w
[19:13:04] <icinga-wm>	 /rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[19:13:52] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 1.488 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[19:13:56] <icinga-wm>	 PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:14:06] <icinga-wm>	 RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:15:38] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 1.415 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:17:08] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:17:40] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:18:28] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:18:53] <librenms-wmf>	 ̶C̶r̶i̶t̶i̶c̶a̶l Device asw-esams.mgmt.esams.wmnet recovered from Primary inbound port utilisation over 80%
[19:19:06] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat
[19:19:06] <icinga-wm>	  formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v
[19:19:06] <icinga-wm>	 title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/
[19:19:06] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[19:19:54] <icinga-wm>	 PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: 3596 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops
[19:20:42] <icinga-wm>	 PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:21:42] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[19:21:56] <icinga-wm>	 RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 562 bytes in 0.787 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:22:00] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 107.6 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[19:22:50] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat
[19:22:50] <icinga-wm>	  formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v
[19:22:50] <icinga-wm>	 title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/
[19:22:50] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[19:23:39] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary outbound port utilisation over 80%
[19:24:03] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80%
[19:24:53] <librenms-wmf>	 Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80%
[19:25:02] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[19:25:24] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is CRITICAL: cluster=cache_text site=eqsin https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:26:12] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:26:32] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[19:26:36] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:26:41] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[19:26:42] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15795 bytes in 1.441 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:27:03] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:27:14] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[19:27:22] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:27:32] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:27:34] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[19:27:50] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:28:03] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:28:04] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:28:04] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[19:28:10] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:28:10] <icinga-wm>	 PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[19:28:12] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:28:14] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[19:28:14] <librenms-wmf>	 Critical Alert for device cr1-esams.wikimedia.org - Primary inbound port utilisation over 80%
[19:28:42] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[19:28:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:28:52] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:28:58] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[19:29:02] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat
[19:29:02] <icinga-wm>	  formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/metadata/{title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page
[19:29:02] <icinga-wm>	 et rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received h
[19:29:02] <icinga-wm>	 ikimedia.org/wiki/RESTBase
[19:29:02] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:29:08] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:29:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:29:10] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedi
[19:29:10] <icinga-wm>	 se
[19:29:12] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[19:29:12] <icinga-wm>	 PROBLEM - SSH on lvs2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:29:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:29:24] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:29:27] <librenms-wmf>	 Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80%
[19:29:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:29:28] <icinga-wm>	 PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[19:29:33] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:30:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:30:26] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[19:30:31] <abian>	 Any clues about when this might stabilize?
[19:30:33] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:30:34] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[19:30:36] <icinga-wm>	 RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[19:30:38] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15798 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:30:39] <marostegui>	 abian: we are on it
[19:30:50] <abian>	 Okay, thanks :)
[19:30:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:26] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:31:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:48] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[19:31:54] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:31:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:56] <icinga-wm>	 PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid
[19:31:56] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[19:31:56] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[19:32:05] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:32:08] <icinga-wm>	 PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:32:18] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:32:32] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:32:40] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on ncredir-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:32:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:32:44] <icinga-wm>	 RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[19:32:46] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 6.193 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:33:02] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[19:33:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:06] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:33:06] <icinga-wm>	 RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[19:33:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:08] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:33:26] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[19:33:30] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[19:33:30] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:33:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:33] <icinga-wm>	 RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid
[19:33:34] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 117 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[19:33:34] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[19:33:38] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.479 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:33:40] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[19:33:42] <icinga-wm>	 RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:34:00] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:34:04] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:34:04] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/media/image/featured/{year}/{month}/{day} (retrieve featured image data for April 29, 2016) is CRITICAL: Test retrieve featured image data for April 29, 2016 returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:34:06] <icinga-wm>	 RECOVERY - SSH on lvs2001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:34:09] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:34:09] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[19:34:12] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on ncredir-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 0.152 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[19:34:12] <icinga-wm>	 PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: 7235 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops
[19:34:26] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:34:40] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:35:38] <icinga-wm>	 PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 191 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[19:35:44] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[19:36:05] <Krenair>	 abian, as a general rule it's best not to ask for these sorts of estimates, especially while people are under pressure trying to deal with incidents like this
[19:36:21] <librenms-wmf>	 Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80%
[19:36:39] <librenms-wmf>	 Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[19:36:56] <bblack>	 if we could estimate we would, but usually if a problem is so tractable that it can be estimated, it would've been fixed long ago :)
[19:37:22] <icinga-wm>	 RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is OK: (C)3200 ge (W)1600 ge 1032 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops
[19:37:39] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l̶ Device cr2-knams.wikimedia.org recovered from Primary outbound port utilisation over 80%
[19:38:29] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l̶ Device asw-esams.mgmt.esams.wmnet recovered from Primary inbound port utilisation over 80%
[19:38:53] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l̶ Device cr2-knams.wikimedia.org recovered from Primary inbound port utilisation over 80%
[19:39:08] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 24 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[19:39:18] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l̶ Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80%
[19:39:29] <librenms-wmf>	 Critical Alert for device cr1-codfw.wikimedia.org - Primary inbound port utilisation over 80%
[19:39:40] <librenms-wmf>	 Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80%
[19:39:52] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l̶ Device cr1-esams.wikimedia.org recovered from Primary inbound port utilisation over 80%
[19:41:12] <icinga-wm>	 RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[19:41:13] <abian>	 Maybe you had finished the work and now it was up to the system to stabilize, we cannot know
[19:41:39] <abian>	 A "we don't know" or "can't tell" is also a valid answer :)
[19:41:55] <Waggie>	 bblack: Thank you for all your hard work. I trust you're doing your best.
[19:42:09] <abian>	 +1
[19:42:21] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l̶ Device cr2-eqdfw.wikimedia.org recovered from Primary outbound port utilisation over 80%
[19:42:39] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l̶ Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80%
[19:43:14] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l̶ Device cr1-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80%
[19:43:15] <Krenair>	 maybe we should avoid pinging them while this is ongoing
[19:43:26] <librenms-wmf>	 C̶r̶i̶t̶i̶c̶a̶l̶ Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80%
[19:48:22] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[19:48:41] <cdanis>	 <3
[20:00:30] <Lofhi>	 Good news :)
[20:00:56] <paladox>	 it's still down for me
[20:01:25] <andre__>	 I also still face problems in Europe
[20:01:48] <icinga-wm>	 PROBLEM - Host lvs3001 is DOWN: PING CRITICAL - Packet loss = 100%
[20:01:51] <Krenair>	 marostegui, 
[20:02:01] <Lofhi>	 Ugh
[20:02:02] <Krenair>	 I'm also having problems getting to a prod wiki
[20:02:09] <Krenair>	 cdanis, bblack 
[20:02:26] <paladox>	 well lvs3001 going down would explain it i guess?
[20:02:32] <icinga-wm>	 RECOVERY - Host lvs3001 is UP: PING WARNING - Packet loss = 93%, RTA = 84.31 ms
[20:03:11] <abian>	 I can't access the projects yet either
[20:03:39] <icinga-wm>	 PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:03:41] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:03:53] <icinga-wm>	 PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid
[20:03:53] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:03:54] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:03:56] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:03:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:03:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:03:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:03:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:04:00] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[20:04:06] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:04:12] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:04:18] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:04:22] <icinga-wm>	 PROBLEM - SSH on lvs2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:04:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:04:28] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on ncredir-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:04:28] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:04:30] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:04:36] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:04:46] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:04:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:04:48] <icinga-wm>	 PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[20:04:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:04:50] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat
[20:04:50] <icinga-wm>	  formula) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections
[20:04:50] <icinga-wm>	 ile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:04:52] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:04:56] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:04:58] <marostegui>	 we are on it 
[20:05:04] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:05:08] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:05:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:05:08] <icinga-wm>	 PROBLEM - proton endpoints health on proton2001 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[20:05:12] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:05:14] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:05:16] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:05:20] <icinga-wm>	 RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 5.561 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:05:24] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:05:24] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:05:30] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2003 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:05:44] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:05:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:06:04] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:06:04] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:06:22] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 5.207 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:06:30] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2001 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal
[20:06:48] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:06:52] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 2.137 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:06:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:07:00] <icinga-wm>	 PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 296 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[20:07:12] <icinga-wm>	 PROBLEM - LVS HTTP IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:07:38] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 1.257 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:07:52] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) is CRITICAL: Test retrieve test page via mobile-sections returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:08:10] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat
[20:08:10] <icinga-wm>	  formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v
[20:08:10] <icinga-wm>	 title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/
[20:08:10] <icinga-wm>	 ons/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:08:22] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 9 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:08:32] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:08:38] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 6.031 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:08:50] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 7.187 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:08:50] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:09:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:09:22] <icinga-wm>	 PROBLEM - LVS HTTP IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:09:34] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on ncredir-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 4.137 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:09:34] <icinga-wm>	 RECOVERY - SSH on lvs2001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:09:36] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:10:12] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:10:24] <icinga-wm>	 RECOVERY - LVS HTTP IPv6 #page on ncredir-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 0.075 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:10:30] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 108 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[20:10:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:10:58] <icinga-wm>	 RECOVERY - LVS HTTP IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 4.005 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:10:58] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:11:08] <icinga-wm>	 PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: 9351 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops
[20:11:18] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:11:34] <librenms-wmf>	 Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80%
[20:11:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:11:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:12:21] <librenms-wmf>	 Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80%
[20:12:22] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:12:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:12:26] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2001 is OK: OK: 8 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal
[20:12:58] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary outbound port utilisation over 80%
[20:13:15] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80%
[20:13:20] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:13:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:13:33] <librenms-wmf>	 Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[20:13:40] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 0.846 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:13:57] <librenms-wmf>	 Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80%
[20:14:15] <librenms-wmf>	 Critical Alert for device cr1-codfw.wikimedia.org - Primary inbound port utilisation over 80%
[20:14:28] <icinga-wm>	 PROBLEM - SSH on lvs2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:14:33] <librenms-wmf>	 Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80%
[20:14:36] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:14:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:15:08] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:15:38] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:15:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:15:50] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:16:14] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:16:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:16:33] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs3001 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[20:17:06] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2001 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:17:11] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 0.803 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:17:14] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:17:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:17:32] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on ncredir-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:17:40] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:17:40] <icinga-wm>	 RECOVERY - SSH on lvs2001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:17:44] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 1 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:17:50] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:18:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:18:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:18:47] <icinga-wm>	 PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:18:49] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:19:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:20:11] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:20:18] <icinga-wm>	 RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.167 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:20:19] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 0.512 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:20:41] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:20:45] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:20:50] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on ncredir-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 159 bytes in 4.013 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:20:53] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:21:15] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:21:51] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:22:11] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:22:25] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:22:25] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:22:33] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:22:39] <icinga-wm>	 PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: 9391 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops
[20:22:43] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:22:47] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:23:36] <Cyberpower678>	 Wikipedia is taking a really long time to load for me.  It's happening on my iPhone and MacBook, on Safari.  All other websites load fine.
[20:23:47] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 8.480 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:23:47] <|404>	 Cyberpower678, already known
[20:23:49] <|404>	 and working on
[20:23:56] <Cyberpower678>	 In many instances it won't even load.
[20:24:01] <Cyberpower678>	 It just hangs
[20:24:09] <Zppix>	 Operations, I'm all but a normal user, but if there's anything I can do please feel free to let me know :)
[20:24:20] <Nemo_bis>	 btw #wikimedia-tech is supposed to be used for such reports
[20:24:23] <Zppix>	 Cyberpower678: SRE is working on it
[20:24:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:24:49] <Cyberpower678>	 Zppix: Who's SRE?
[20:24:59] <Cyberpower678>	 Nemo_bis: thanks.
[20:25:02] <TheBanner>	 ddos or technical trouble?
[20:25:04] <Zppix>	 Site Reliability Engineers (IIRC)
[20:25:05] <onimisionipe>	 SREs
[20:25:20] <Zppix>	 TheBanner:  It's been confirmed as a DDoS
[20:25:20] <Krenair>	 TheBanner, DDoS
[20:25:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:25:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:25:43] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:25:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:25:45] <Cyberpower678>	 Zppix: is the source of the DDoS known?
[20:26:01] <Zppix>	 Cyberpower678:  I am not sure, I only know as much as has been publicly stated
[20:26:08] <Krenair>	 if it is I imagine they won't be stating it here
[20:26:13] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:26:21] <Cyberpower678>	 Zppix: link?
[20:26:22] <Krenair>	 I'm sure there will be an incident report published in the coming days
[20:26:23] <icinga-wm>	 PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: 3274 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops
[20:26:27] <Cyberpower678>	 That's not on Wikipedia?
[20:26:30] <andre__>	 Please keep this channel focused on dealing with the incident. Please see the channel topic: "Status: Incident on-going". No estimates etc as people are busy trying to deal with this incident. There will be an incident report later. Thanks.
[20:26:31] <Krenair>	 in the mean time let's leave them to it huh?
[20:26:33] <Zppix>	 Cyberpower678:  bblac.k stated it in here
[20:26:45] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:26:48] <andre__>	 Please move "curiosity talk" to #wikimedia-tech or such. Thanks.
[20:27:03] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:27:05] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs3001 is OK: OK: 4 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[20:27:19] <icinga-wm>	 PROBLEM - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:27:29] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:27:40] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:27:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:27:51] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:27:51] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3043.esams.wmnet, cp3042.esams.wmnet, cp3041.esams.wmnet are marked down but pooled: textlb_443: Servers cp3043.esams.wmnet, cp3032.esams.wmnet, cp3033.esams.wmnet, cp3040.esams.wmnet, cp3042.esams.wmnet, cp3041.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3043.esams.wmnet, cp3040.esams.wmnet, cp3042.e
[20:27:51] <icinga-wm>	 0.esams.wmnet are marked down but pooled: textlb_80: Servers cp3042.esams.wmnet, cp3032.esams.wmnet, cp3040.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[20:28:04] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 6.027 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:28:23] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:28:33] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:28:46] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:29:01] <icinga-wm>	 PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:29:13] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:29:13] <icinga-wm>	 PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: 5663 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops
[20:30:09] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:30:17] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 4.166 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:30:17] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[20:30:17] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:30:19] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mathoid - check test formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm
[20:30:19] <icinga-wm>	 aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/metadata/{title} (Get metadata from storage) timed out before a response was received: /api/rest_v1
[20:30:19] <icinga-wm>	 e} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response w
[20:30:19] <icinga-wm>	 /rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:30:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:30:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:30:49] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:30:53] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 3.302 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:31:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:31:56] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 2.516 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:31:57] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 714 days) https://wikitech.wikimedia.org/wiki/Logs
[20:31:59] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:32:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:32:51] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:33:07] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:33:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:33:40] <icinga-wm>	 PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:33:51] <icinga-wm>	 RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid
[20:34:07] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:34:09] <icinga-wm>	 PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: 9044 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops
[20:34:15] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:34:19] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:34:23] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:34:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:35:01] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 83.05 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:35:09] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:35:19] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:35:29] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat
[20:35:29] <icinga-wm>	  formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v
[20:35:29] <icinga-wm>	 le} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response 
[20:35:29] <icinga-wm>	 i/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:35:33] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:35:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:36:03] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 90.75 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:36:07] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on ncredir-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:36:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:36:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:36:53] <icinga-wm>	 RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 562 bytes in 3.792 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:36:56] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 4.180 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:37:13] <icinga-wm>	 RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:37:43] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on ncredir-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 230 bytes in 2.611 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:38:09] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 9 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:38:27] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:38:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:38:51] <icinga-wm>	 PROBLEM - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Graphoid
[20:38:55] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:39:19] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:39:19] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:39:25] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:39:31] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedi
[20:39:31] <icinga-wm>	 se
[20:39:35] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 0.516 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:39:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:40:29] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:40:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:40:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:40:33] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:40:35] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:40:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:40:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:40:37] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:40:45] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:40:49] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 0.248 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:40:49] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:40:49] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:40:51] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:41:01] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:41:01] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:41:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:41:05] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:41:05] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:41:07] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[20:41:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:41:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:41:25] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:41:25] <icinga-wm>	 RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[20:41:25] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:41:31] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:41:34] <librenms-wmf>	 Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80%
[20:41:37] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:41:41] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:41:59] <icinga-wm>	 PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:42:01] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:42:20] <librenms-wmf>	 Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80%
[20:42:57] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary outbound port utilisation over 80%
[20:43:15] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80%
[20:43:32] <librenms-wmf>	 Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[20:43:37] <icinga-wm>	 PROBLEM - Restbase edge codfw on text-lb.codfw.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat
[20:43:37] <icinga-wm>	  formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is WARNING: Test Retrieve aggregated feed content for April 29, 2016 responds with unexpected value at path = Missing keys: [utfa] https://wikitech.wikimedia.org/wiki/RESTBase
[20:43:57] <librenms-wmf>	 Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80%
[20:43:57] <icinga-wm>	 RECOVERY - LVS HTTP IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 550 bytes in 9.712 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:44:37] <icinga-wm>	 PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: 9175 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops
[20:45:01] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:45:21] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:45:35] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:45:35] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:45:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:37] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2006 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:45:39] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[20:45:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:45:45] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:45:53] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:45:56] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:45:57] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:46:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:46:03] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:46:05] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs2001 is CRITICAL: CRITICAL: 0 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal
[20:46:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb6_80: Servers cp3043.esams.wmnet, cp3041.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb_443: Servers cp3043.esams.wmnet, cp3032.esams.wmnet, cp3040.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3032.esams.wmnet, cp3033.esams.wmnet, cp3040.esams.wmnet, cp3042.esams.wmnet, cp3041.e
[20:46:09] <icinga-wm>	 0.esams.wmnet are marked down but pooled: textlb_80: Servers cp3043.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[20:46:09] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:46:09] <icinga-wm>	 PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedi
[20:46:09] <icinga-wm>	 se
[20:46:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:46:23] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:46:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:46:29] <icinga-wm>	 PROBLEM - proton endpoints health on proton2002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro
[20:46:29] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:46:33] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:46:45] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2002 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:46:49] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:46:51] <icinga-wm>	 RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.459 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:46:57] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:47:17] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:47:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:47:19] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2005 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:47:39] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.codfw.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 7.774 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:47:40] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-knams.wikimedia.org recovered from Primary outbound port utilisation over 80%
[20:47:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:48:11] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2016 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:48:11] <librenms-wmf>	 08̶W̶a̶r̶n̶i̶n̶g Device cr1-eqiad.wikimedia.org recovered from Memory over 85%
[20:48:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:48:59] <icinga-wm>	 PROBLEM - SSH on lvs3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[20:49:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:49:39] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 2.466 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:50:01] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:50:09] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:50:51] <icinga-wm>	 PROBLEM - graphoid endpoints health on scb2004 is CRITICAL: /{domain}/v1/{format}/{title}/{revid}/{id} (retrieve PNG from mediawiki.org) is CRITICAL: Test retrieve PNG from mediawiki.org returned the unexpected status 400 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:51:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp3043.esams.wmnet, cp3032.esams.wmnet, cp3033.esams.wmnet, cp3040.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3032.esams.wmnet, cp3033.esams.wmnet, cp3040.esams.wmnet, cp3042.esams.wmnet, cp3041.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb_80: Servers cp3043.es
[20:51:09] <icinga-wm>	 .esams.wmnet, cp3030.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[20:51:53] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15797 bytes in 2.473 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:52:01] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on text-lb.codfw.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15810 bytes in 4.575 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:52:19] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:52:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:52:21] <librenms-wmf>	 Critical Alert for device cr2-eqdfw.wikimedia.org - Primary outbound port utilisation over 80%
[20:52:27] <icinga-wm>	 PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is CRITICAL: 5941 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops
[20:52:38] <librenms-wmf>	 Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[20:52:43] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 52.05 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:52:53] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{d
[20:52:53] <icinga-wm>	 egated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /api/rest_v1/pa
[20:52:53] <icinga-wm>	 le}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /api/rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out befo
[20:52:53] <icinga-wm>	  received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:52:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:53:17] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 35.79 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:53:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:53:39] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary outbound port utilisation over 80%
[20:53:53] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:53:57] <librenms-wmf>	 Critical Alert for device cr1-codfw.wikimedia.org - Primary inbound port utilisation over 80%
[20:54:05] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:54:09] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[20:54:15] <librenms-wmf>	 Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80%
[20:54:15] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:54:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:54:35] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:54:42] <icinga-wm>	 PROBLEM - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:55:01] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:55:05] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:55:21] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:55:27] <icinga-wm>	 RECOVERY - Graphoid LVS codfw on graphoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Graphoid
[20:55:27] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:55:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:55:29] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:55:29] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:55:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:55:31] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:55:33] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:55:35] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[20:55:43] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2004 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:55:45] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:55:57] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:55:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:55:59] <icinga-wm>	 RECOVERY - Restbase edge codfw on text-lb.codfw.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:56:01] <icinga-wm>	 RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[20:56:03] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:56:17] <icinga-wm>	 PROBLEM - Restbase edge esams on text-lb.esams.wikimedia.org is CRITICAL: /api/rest_v1/page/summary/{title} (Get summary from storage) timed out before a response was received: /api/rest_v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /api/rest_v1/page/references/{title} (Get references from storage) timed out before a response was received: /api/rest_v1/media/math/check/{type} (Mat
[20:56:17] <icinga-wm>	  formula) timed out before a response was received: /api/rest_v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received: /api/rest_v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received: /api/rest_v1/transform/wikitext/to/html/{title} (Transform wikitext to html) timed out before a response was received: /api/rest_v
[20:56:17] <icinga-wm>	 title} (Get metadata from storage) timed out before a response was received: /api/rest_v1/page/graph/png/{title}/{revision}/{graph_id} (Get a graph from Graphoid) timed out before a response was received: /api/rest_v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /api/rest_v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response w
[20:56:17] <icinga-wm>	 /rest_v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /api/rest_v1/feed/announcements (Retrieve announcements) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[20:56:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:56:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:56:21] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs2001 is OK: OK: 8 connections established with conf2001.codfw.wmnet:2379 (min=8) https://wikitech.wikimedia.org/wiki/PyBal
[20:56:21] <icinga-wm>	 RECOVERY - proton endpoints health on proton2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[20:56:33] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 114 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:56:34] <librenms-wmf>	 Critical Alert for device asw-esams.mgmt.esams.wmnet - Primary inbound port utilisation over 80%
[20:56:35] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:56:37] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:56:39] <icinga-wm>	 RECOVERY - proton endpoints health on proton2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton
[20:56:39] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps
[20:56:55] <icinga-wm>	 PROBLEM - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL - No data received from host https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[20:56:55] <icinga-wm>	 RECOVERY - graphoid endpoints health on scb2001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/graphoid
[20:57:37] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 103.7 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[20:58:23] <icinga-wm>	 PROBLEM - PyBal connections to etcd on lvs3001 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[20:58:29] <icinga-wm>	 PROBLEM - BFD status on cr2-eqdfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:00:03] <icinga-wm>	 RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:00:11] <icinga-wm>	 PROBLEM - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:00:11] <icinga-wm>	 PROBLEM - rsyslog TLS listener on port 6514 on centrallog1001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs
[21:00:33] <icinga-wm>	 RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs2001 is OK: (C)3200 ge (W)1600 ge 441 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs2001&var-datasource=codfw+prometheus/ops
[21:00:47] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:00:55] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3001 is CRITICAL: PYBAL CRITICAL - CRITICAL - textlb_443: Servers cp3032.esams.wmnet, cp3040.esams.wmnet, cp3030.esams.wmnet are marked down but pooled: textlb6_443: Servers cp3032.esams.wmnet, cp3033.esams.wmnet, cp3040.esams.wmnet, cp3042.esams.wmnet, cp3041.esams.wmnet, cp3030.esams.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[21:01:06] <icinga-wm>	 RECOVERY - LVS HTTPS IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15809 bytes in 3.489 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:01:38] <icinga-wm>	 RECOVERY - LVS HTTP IPv6 #page on text-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 563 bytes in 0.815 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:01:40] <icinga-wm>	 RECOVERY - LVS HTTPS IPv4 #page on text-lb.esams.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15796 bytes in 2.726 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
[21:01:47] <icinga-wm>	 RECOVERY - rsyslog TLS listener on port 6514 on centrallog1001 is OK: SSL OK - Certificate centrallog1001.eqiad.wmnet valid until 2024-06-25 15:42:33 +0000 (expires in 1753 days) https://wikitech.wikimedia.org/wiki/Logs
[21:01:59] <icinga-wm>	 RECOVERY - SSH on lvs3001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[21:02:13] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:02:20] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary outbound port utilisation over 80%
[21:02:27] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:02:31] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3001 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[21:02:43] <icinga-wm>	 PROBLEM - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is CRITICAL: 5858 ge 3200 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops
[21:02:51] <icinga-wm>	 RECOVERY - Restbase edge esams on text-lb.esams.wikimedia.org is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[21:02:53] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:03:29] <icinga-wm>	 RECOVERY - PyBal connections to etcd on lvs3001 is OK: OK: 4 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[21:05:53] <icinga-wm>	 RECOVERY - Excessive RX traffic on an LVS -units megabits/sec- on lvs3001 is OK: (C)3200 ge (W)1600 ge 1027 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/000000377/host-overview?var-server=lvs3001&var-datasource=esams+prometheus/ops
[21:06:59] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:07:05] <icinga-wm>	 PROBLEM - BFD status on cr4-ulsfo is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:07:39] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:09:13] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:11:39] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-knams.wikimedia.org recovered from Primary outbound port utilisation over 80%
[21:12:01] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:12:10] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw-esams.mgmt.esams.wmnet recovered from Primary inbound port utilisation over 80%
[21:12:28] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-knams.wikimedia.org recovered from Primary inbound port utilisation over 80%
[21:13:59] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:15:11] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 33.57 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:15:11] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:16:47] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 181.8 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:17:27] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80%
[21:17:33] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:18:27] <librenms-wmf>	 Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80%
[21:18:39] <librenms-wmf>	 Critical Alert for device cr1-codfw.wikimedia.org - Primary inbound port utilisation over 80%
[21:20:01] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:22:20] <librenms-wmf>	 Critical Alert for device cr2-eqdfw.wikimedia.org - Primary inbound port utilisation over 80%
[21:23:03] <icinga-wm>	 RECOVERY - BFD status on cr4-ulsfo is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:23:14] <librenms-wmf>	 Critical Alert for device cr2-codfw.wikimedia.org - Primary inbound port utilisation over 80%
[21:25:01] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:25:01] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:25:02] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:26:21] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:26:45] <icinga-wm>	 PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:26:55] <James_F>	 !log mw1317 seems corrupted (Fatal error: Class undefined: stdClass in /srv/mediawiki/php-1.34.0-wmf.21/includes/libs/rdbms/database/DatabaseMysqli.php); running scap pull
[21:27:18] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-text site=ulsfo https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:27:18] <icinga-wm>	 PROBLEM - Host ncredir-lb.eqiad.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100%
[21:27:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:41] <icinga-wm>	 RECOVERY - Host ncredir-lb.eqiad.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 16%, RTA = 0.32 ms
[21:27:45] <icinga-wm>	 PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-text site=codfw https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:28:03] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:28:09] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:28:33] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:29:15] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 23 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1791212/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[21:29:23] <icinga-wm>	 RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d
[21:29:33] <icinga-wm>	 PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 77 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[21:30:05] <icinga-wm>	 RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[21:31:23] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 59.32 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:31:23] <icinga-wm>	 RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1791210/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[21:31:23] <icinga-wm>	 PROBLEM - IPv4 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 78 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[21:31:45] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 71.68 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:33:14] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr1-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80%
[21:33:25] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-codfw.wikimedia.org recovered from Primary inbound port utilisation over 80%
[21:33:29] <cdanis>	 !log  cdanis@mw1317.eqiad.wmnet ~ 🕠🍺 sudo -i depool
[21:33:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:33:37] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqdfw.wikimedia.org recovered from Primary inbound port utilisation over 80%
[21:34:33] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:36:31] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 50.05 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:36:57] <icinga-wm>	 RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 497 (alerts on 35) - https://atlas.ripe.net/measurements/1790945/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[21:37:59] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:38:09] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:39:41] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 70.46 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[21:40:33] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:40:41] <icinga-wm>	 RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 23 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts
[21:44:52] <wAtNeuwe4>	 It's going on https://twitter.com/UKDrillas/status/1170089580458065920
[21:45:26] <abian>	 I assumed it was fake
[21:45:46] <Zppix>	 We have been asked to move all chatter to #wikimedia-tech please
[21:47:38] <koyu>	 found this beauty, might be helpful
[21:47:39] <koyu>	 https://twitter.com/UKDrillas/status/1170089580458065920
[21:47:51] <Lofhi>	 "We have been asked to move all chatter to #wikimedia-tech please"
[21:48:27] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-knams.wikimedia.org recovered from Primary inbound port utilisation over 80%
[21:48:51] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80%
[21:52:29] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:53:17] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:53:37] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:53:55] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:55:57] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Connect https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[21:58:35] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53: Servers nescio.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers maerlant.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:00:11] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:01:35] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:02:26] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80%
[22:03:27] <librenms-wmf>	 Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80%
[22:04:57] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53_udp: Servers maerlant.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:06:33] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:09:09] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:09:33] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:09:47] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:14:29] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:15:53] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:16:05] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 46.14 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:17:01] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:17:27] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:22:17] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:22:27] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80%
[22:22:31] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 72.7 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:23:27] <librenms-wmf>	 Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80%
[22:23:53] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:27:09] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3002 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53_udp: Servers nescio.wikimedia.org are marked down but pooled: dns_rec6_53_udp: Servers nescio.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:28:43] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3002 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:31:49] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:34:33] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:35:01] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:35:11] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:39:17] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:39:45] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:39:55] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:43:01] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 41.52 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:43:39] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[22:47:27] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80%
[22:47:45] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 79.4 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[22:48:05] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:48:11] <librenms-wmf>	 Warning Alert for device cr1-eqiad.wikimedia.org - Memory over 85%
[22:48:47] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:49:23] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:52:29] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 8 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:53:21] <librenms-wmf>	 Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80%
[22:53:35] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:54:11] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[22:54:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53: Servers maerlant.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[22:54:29] <icinga-wm>	 PROBLEM - OSPF status on cr2-esams is CRITICAL: OSPFv2: 3/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:55:49] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[22:56:03] <icinga-wm>	 RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[22:57:17] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:02:13] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec_53: Servers nescio.wikimedia.org are marked down but pooled: dns_rec_53_udp: Servers nescio.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:03:39] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 9 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:03:49] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:06:47] <icinga-wm>	 PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 57.73 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:09:57] <icinga-wm>	 RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 74.51 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[23:10:05] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs3004 is CRITICAL: PYBAL CRITICAL - CRITICAL - dns_rec6_53: Servers nescio.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[23:11:41] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs3004 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[23:13:09] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:17:12] <librenms-wmf>	 Warning Alert for device cr2-esams.wikimedia.org - Processor usage over 85%
[23:17:35] <librenms-wmf>	 Critical Alert for device cr2-knams.wikimedia.org - Primary inbound port utilisation over 80%
[23:19:40] <multichill>	 mark: Shit still hitting fan? :-(
[23:21:37] <bblack>	 yes, although at this current point in time, the damage is limited to esams (EU / EMEA region users, luckily most of whom are in their dark hours!)
[23:21:47] <mark>	 yup
[23:22:01] <Zppix>	 I assume you guys are still working on resolving it
[23:22:31] <bblack>	 as best we can!
[23:22:49] <Zppix>	 Awesome, keep up the great effort!
[23:22:58] <andre__>	 +1, thanks
[23:22:58] <IRC-Source_77>	 what kind of hardware are the AMS core routers?
[23:23:22] <andre__>	 IRC-Source_77: https://wikitech.wikimedia.org/wiki/Esams_cluster
[23:23:27] <librenms-wmf>	 Critical Alert for device cr2-esams.wikimedia.org - Primary inbound port utilisation over 80%
[23:26:21] <IRC-Source_77>	 already read those articles, but no info and the pics are somewhat... "old" ;)
[23:28:30] <paladox>	 it's back up for me in the EU
[23:30:39] <icinga-wm>	 PROBLEM - BFD status on cr2-esams is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:32:57] <icinga-wm>	 PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - AS6939/IPv6: Active https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:35:23] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:35:27] <icinga-wm>	 RECOVERY - BFD status on cr2-esams is OK: OK: UP: 11 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:35:47] <icinga-wm>	 RECOVERY - OSPF status on cr2-esams is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:36:25] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:42:27] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-knams.wikimedia.org recovered from Primary inbound port utilisation over 80%
[23:48:27] <librenms-wmf>	 04̶C̶r̶i̶t̶i̶c̶a̶l Device cr2-esams.wikimedia.org recovered from Primary inbound port utilisation over 80%
[23:57:11] <librenms-wmf>	 08̶W̶a̶r̶n̶i̶n̶g Device cr2-esams.wikimedia.org recovered from Processor usage over 85%