[01:15:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[01:15:49] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[01:19:49] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:21:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[01:21:59] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:22:28] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[01:26:29] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:28:39] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[02:05:59] PROBLEM - Nginx local proxy to apache on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:06:49] RECOVERY - Nginx local proxy to apache on mw2223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.180 second response time
[03:13:09] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:27:59] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 824.28 seconds
[03:30:39] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:34:09] PROBLEM - Filesystem available is greater than filesystem size on ms-be2041 is CRITICAL: cluster=swift device=/dev/sdl1 fstype=xfs instance=ms-be2041:9100 job=node mountpoint=/srv/swift-storage/sdl1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops
[03:34:49] PROBLEM - puppet last run on mw1346 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test]
[03:56:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 263.22 seconds
[04:05:18] RECOVERY - puppet last run on mw1346 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:32:18] PROBLEM - HHVM rendering on mw2261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:33:09] RECOVERY - HHVM rendering on mw2261 is OK: HTTP OK: HTTP/1.1 200 OK - 76323 bytes in 1.168 second response time
[06:00:59] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:03:41] (PS1) Legoktm: sre.switchdc.mediawiki: Use localized read-only message [cookbooks] - https://gerrit.wikimedia.org/r/460730
[06:05:16] (PS1) Legoktm: Use standard version of plain-text GPL [cookbooks] - https://gerrit.wikimedia.org/r/460731
[06:05:19] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:22:29] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:24:48] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:27:58] PROBLEM - Host ms-be2030 is DOWN: PING CRITICAL - Packet loss = 100%
[06:29:38] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/biocLite.R]
[06:59:49] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:26:09] Operations, Phabricator: Phabricator is slow - https://phabricator.wikimedia.org/T204421 (Legoktm) I'm seeing slowdowns in autocomplete menus and previews; instead of being instantaneous like normal, it's taking 3-5 seconds to get a response.
[08:55:59] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[08:56:59] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
[09:32:49] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:35:08] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:50:28] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:52:38] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:23:29] (PS5) Nehajha: Removing gridengine as default and encouraging the use of Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504)
[10:41:29] PROBLEM - HHVM rendering on mw2256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:42:29] RECOVERY - HHVM rendering on mw2256 is OK: HTTP OK: HTTP/1.1 200 OK - 76961 bytes in 0.258 second response time
[12:18:45] Operations, Wikimedia-Mailing-lists: Add admins to mailing list engineering@ - https://phabricator.wikimedia.org/T204393 (MarcoAurelio) If you need another one, I can volunteer as well. I have experience at some public and private Wikimedia mailing lists too. Thanks.
[12:20:01] Operations, Wikimedia-Mailing-lists: Add admins to mailing list engineering@ - https://phabricator.wikimedia.org/T204393 (MarcoAurelio) CCing the current admin.
[12:23:40] Operations, Phabricator: Phabricator is slow - https://phabricator.wikimedia.org/T204421 (MarcoAurelio) For me it is working fine, but if it is not for others it is indeed worth investigating.
[12:30:49] Operations, Phabricator: Phabricator is slow - https://phabricator.wikimedia.org/T204421 (Zoranzoki21) >>! In T204421#4586666, @MarcoAurelio wrote: > For me it is working fine, but if it is not for others indeed this is worth investigating.
[14:57:28] PROBLEM - High lag on wdqs2003 is CRITICAL: 3640 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:26:58] PROBLEM - Filesystem available is greater than filesystem size on ms-be2042 is CRITICAL: cluster=swift device=/dev/sdn1 fstype=xfs instance=ms-be2042:9100 job=node mountpoint=/srv/swift-storage/sdn1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2042&var-datasource=codfw%2520prometheus%252Fops
[15:33:51] (Abandoned) Krinkle: webperf: Add 'Server: ' header to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/452689 (https://phabricator.wikimedia.org/T158837) (owner: Krinkle)
[16:27:19] PROBLEM - High lag on wdqs2003 is CRITICAL: 3658 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:50:00] (PS3) MarcoAurelio: Use translated MetaNamespace for fy.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/455249 (https://phabricator.wikimedia.org/T202769)
[17:00:00] (CR) MarcoAurelio: [C: +1] Use standard version of plain-text GPL [cookbooks] - https://gerrit.wikimedia.org/r/460731 (owner: Legoktm)
[17:06:14] (CR) MarcoAurelio: [C: +1] "Needed for git-review." [software/keyholder] - https://gerrit.wikimedia.org/r/460698 (owner: Hashar)
[17:13:09] PROBLEM - High lag on wdqs2003 is CRITICAL: 3647 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[17:15:29] PROBLEM - High lag on wdqs2003 is CRITICAL: 3644 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[17:30:39] PROBLEM - High lag on wdqs2003 is CRITICAL: 3600 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[17:32:58] PROBLEM - High lag on wdqs2003 is CRITICAL: 3665 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[18:02:26] Operations, Puppet: Why doesn't profile::mediawiki::nutcracker create /var/run/nutcracker/ ? - https://phabricator.wikimedia.org/T204450 (Krenair)
[18:32:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:36:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:40:10] Operations, DBA, Wikidata, Wikidata-Query-Service, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Krinkle) @Marostegui So to confirm, recentchanges db hosts are the same within and between eqiad/codfw. But the api db hos...
[18:44:03] (PS1) Mathew.onipe: icinga check for old elasticsearch servers [puppet] - https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361)
[18:48:20] (PS2) Mathew.onipe: Icinga disk space check for old elasticsearch servers [puppet] - https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361)
[18:50:08] Operations, DBA, Wikidata, Wikidata-Query-Service, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Marostegui) >>! In T202764#4587022, @Krinkle wrote: > @Marostegui So to confirm, recentchanges db hosts are the same withi...
[19:27:14] (PS27) Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885)
[19:50:48] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[19:57:28] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[20:03:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[20:06:08] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[20:32:49] (CR) Gehel: "Minor comment inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361) (owner: Mathew.onipe)
[20:41:26] (CR) Gehel: [C: -1] Elasticsearch module is coming up. (7 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[21:12:19] PROBLEM - Device not healthy -SMART- on db1069 is CRITICAL: cluster=mysql device=megaraid,7 instance=db1069:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
[21:59:19] PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdj1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sdj1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops
[22:41:55] (PS1) Odder: Update logos for Lezgian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/460773