[00:26:30] ACKNOWLEDGEMENT - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. Ayounsi Cas is aware - The acknowledgement expires at: 2019-07-23 00:26:06. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[00:26:48] chaomodus: ^
[00:28:28] roger roger
[00:28:35] thanks!
[00:51:17] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts
[01:00:36] at least the alert is valid
[01:01:59] Tks4Fish: that is going to be hard to find. your best bet is probably 'git bisect' :\
[01:02:05] seems like afrinic rsync server is not happy, but only from codfw
[01:02:48] bah, I don't have shell access :/
[01:03:10] I'm doing it by hand, going from commit to commit and trying to pinpoint it :/
[01:17:53] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts
[02:02:23] (Abandoned) Ayounsi: pmacct: add tags to aggregated netflow based on the source device [puppet] - https://gerrit.wikimedia.org/r/410369 (owner: Ayounsi)
[03:02:16] (PS1) Ayounsi: pmacct, send more netflow data to analytics [puppet] - https://gerrit.wikimedia.org/r/524628
[03:03:06] (CR) jerkins-bot: [V: -1] pmacct, send more netflow data to analytics [puppet] - https://gerrit.wikimedia.org/r/524628 (owner: Ayounsi)
[03:04:42] (CR) Ayounsi: "> Patch Set 1: Verified-1" [puppet] - https://gerrit.wikimedia.org/r/524628 (owner: Ayounsi)
[03:05:56] (CR) Ayounsi: "recheck" [puppet] - https://gerrit.wikimedia.org/r/524628 (owner: Ayounsi)
[03:06:44] (CR) jerkins-bot: [V: -1] pmacct, send more netflow data to analytics [puppet] - https://gerrit.wikimedia.org/r/524628 (owner: Ayounsi)
[03:09:18] (PS2) Ayounsi: pmacct, send more netflow data to analytics [puppet] - https://gerrit.wikimedia.org/r/524628
[03:34:19] Operations, Analytics, Traffic: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (faidon) Note that they do not say that we will stop getting updates but merely that we won't be able to benefit from this "security feature". It does sound scary o...
[03:42:45] PROBLEM - puppet last run on db1123 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[04:10:59] RECOVERY - puppet last run on db1123 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[04:28:21] PROBLEM - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is CRITICAL: 39.12 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:29:05] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 31.74 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:29:39] PROBLEM - Varnish traffic drop between 30min ago and now at esams on icinga1001 is CRITICAL: 36.81 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:30:17] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 58.97 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:31:21] RECOVERY - Varnish traffic drop between 30min ago and now at esams on icinga1001 is OK: (C)60 le (W)70 le 115.4 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:31:41] RECOVERY - Varnish traffic drop between 30min ago and now at eqiad on icinga1001 is OK: (C)60 le (W)70 le 90.92 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:31:59] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 102 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:32:25] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 101.9 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1
[04:36:43] Hello, can I get my account back? https://en.wikipedia.org/wiki/User_talk:Benjaminzyg
[05:33:54] Operations, Analytics, LDAP-Access-Requests, SRE-Access-Requests, wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (Nuria)
[05:34:39] Operations, Analytics, LDAP-Access-Requests, wikimediafoundation.org: Access to WikimediaFoundation.org analytics for Deb - https://phabricator.wikimedia.org/T227496 (Nuria)
[05:51:20] (PS1) ArielGlenn: replace all hiera clls with lookup() for dumps generation manifests [puppet] - https://gerrit.wikimedia.org/r/524632 (https://phabricator.wikimedia.org/T227742)
[05:53:11] RECOVERY - Maps - OSM synchronization lag - eqiad on icinga1001 is OK: (C)2.592e+05 ge (W)1.764e+05 ge 2.119e+04 https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=11&fullscreen&orgId=1
[06:19:29] PROBLEM - puppet last run on kraz is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:32:29] PROBLEM - puppet last run on elastic1047 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:47:45] RECOVERY - puppet last run on kraz is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[06:54:43] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 30225536 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[06:58:37] Operations, Wikimedia-Mailing-lists: Mailing list for Azwiki Admins - https://phabricator.wikimedia.org/T228560 (Mardetanha)
[07:00:43] RECOVERY - puppet last run on elastic1047 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[07:01:17] Operations, Wikimedia-Mailing-lists: Mailing list for Azwiki Admins - https://phabricator.wikimedia.org/T228560 (Mardetanha) it is duplicate of this [[ https://phabricator.wikimedia.org/T228542 | task ]]
[07:03:27] Operations, Wikimedia-Mailing-lists: Mailing list for Azwiki Admins - https://phabricator.wikimedia.org/T228560 (Peachey88)
[07:03:31] Operations, Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (Peachey88)
[07:11:19] RECOVERY - Postgres Replication Lag on maps1003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1047080 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[12:44:37] PROBLEM - Host cp5004 is DOWN: PING CRITICAL - Packet loss = 100%
[12:44:41] PROBLEM - Host cp5001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:44:49] PROBLEM - Host cp5006 is DOWN: PING CRITICAL - Packet loss = 100%
[12:44:59] PROBLEM - Host cp5010 is DOWN: PING CRITICAL - Packet loss = 100%
[12:47:07] PROBLEM - Host cp5004.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:47:19] PROBLEM - Host cp5003.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:47:19] PROBLEM - Host cp5006.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:47:33] PROBLEM - Host cp5007.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[12:47:51] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:48:19] RECOVERY - Host cp5001 is UP: PING WARNING - Packet loss = 93%, RTA = 235.90 ms
[12:48:21] PROBLEM - BFD status on cr1-eqsin is CRITICAL: CRIT: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:48:23] RECOVERY - Host cp5004 is UP: PING WARNING - Packet loss = 54%, RTA = 231.28 ms
[12:48:23] RECOVERY - Host cp5010 is UP: PING WARNING - Packet loss = 66%, RTA = 231.24 ms
[12:48:25] PROBLEM - OSPF status on cr1-eqsin is CRITICAL: OSPFv2: 3/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:48:29] RECOVERY - Host cp5006 is UP: PING OK - Packet loss = 0%, RTA = 232.29 ms
[12:49:33] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:50:01] RECOVERY - BFD status on cr1-eqsin is OK: OK: UP: 4 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:50:05] RECOVERY - OSPF status on cr1-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:51:15] RECOVERY - Host cp5004.mgmt is UP: PING OK - Packet loss = 0%, RTA = 232.04 ms
[12:53:03] RECOVERY - Host cp5003.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.94 ms
[12:53:03] RECOVERY - Host cp5006.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.89 ms
[12:53:17] RECOVERY - Host cp5007.mgmt is UP: PING OK - Packet loss = 0%, RTA = 231.84 ms
[14:00:23] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:00:45] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:04:13] Operations, Wikimedia-Mailing-lists: New Mailing lists for AzWiki sysops - https://phabricator.wikimedia.org/T228542 (Eldarado) * requested name of the mailing list, ending in @lists.wikimedia.org. Wikimedia-AZ@lists.wikimedia.org * reasoning/explanation of purpose (and link to community consensus, if a...
[14:10:27] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[14:18:43] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[14:25:17] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[14:27:19] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[16:09:05] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[16:14:03] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[16:26:23] PROBLEM - HHVM rendering on mw2202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[16:27:53] RECOVERY - HHVM rendering on mw2202 is OK: HTTP OK: HTTP/1.1 200 OK - 76180 bytes in 0.344 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[19:04:11] (CR) Effie Mouzeli: "> is this the expected behaviour for random host mw1307?" [puppet] - https://gerrit.wikimedia.org/r/524336 (https://phabricator.wikimedia.org/T219148) (owner: Effie Mouzeli)
[19:27:14] hi Niharika - you there?
[19:49:17] PROBLEM - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 177 bytes in 0.145 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[19:50:47] :-/
[20:00:31] RECOVERY - toolschecker: All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.136 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[20:19:35] PROBLEM - puppet last run on lvs1016 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[20:47:51] RECOVERY - puppet last run on lvs1016 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
[22:56:32] (PS1) QChris: Add .gitreview [software/cassandra-table-properties] - https://gerrit.wikimedia.org/r/524661
[22:56:34] (CR) QChris: [V: +2 C: +2] Add .gitreview [software/cassandra-table-properties] - https://gerrit.wikimedia.org/r/524661 (owner: QChris)
[23:56:35] PROBLEM - HHVM rendering on mw1230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[23:58:05] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 76200 bytes in 0.776 second response time https://wikitech.wikimedia.org/wiki/Application_servers