[01:15:19] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[01:15:49] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[01:19:49] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:21:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[01:21:59] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:22:28] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5
[01:26:29] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[01:28:39] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[02:05:59] PROBLEM - Nginx local proxy to apache on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:06:49] RECOVERY - Nginx local proxy to apache on mw2223 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.180 second response time
[03:13:09] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:27:59] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 824.28 seconds
[03:30:39] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[03:34:09] PROBLEM - Filesystem available is greater than filesystem size on ms-be2041 is CRITICAL: cluster=swift device=/dev/sdl1 fstype=xfs instance=ms-be2041:9100 job=node mountpoint=/srv/swift-storage/sdl1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops
[03:34:49] PROBLEM - puppet last run on mw1346 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test]
[03:56:29] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 263.22 seconds
[04:05:18] RECOVERY - puppet last run on mw1346 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[05:32:18] PROBLEM - HHVM rendering on mw2261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:33:09] RECOVERY - HHVM rendering on mw2261 is OK: HTTP OK: HTTP/1.1 200 OK - 76323 bytes in 1.168 second response time
[06:00:59] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:03:41] (PS1) Legoktm: sre.switchdc.mediawiki: Use localized read-only message [cookbooks] - https://gerrit.wikimedia.org/r/460730
[06:05:16] (PS1) Legoktm: Use standard version of plain-text GPL [cookbooks] - https://gerrit.wikimedia.org/r/460731
[06:05:19] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:22:29] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:24:48] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[06:27:58] PROBLEM - Host ms-be2030 is DOWN: PING CRITICAL - Packet loss = 100%
[06:29:38] PROBLEM - puppet last run on analytics1048 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/biocLite.R]
[06:59:49] RECOVERY - puppet last run on analytics1048 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:26:09] Operations, Phabricator: Phabricator is slow - https://phabricator.wikimedia.org/T204421 (Legoktm) I'm seeing slowdowns in autocomplete menus and previews; instead of being instantaneous like normal, it's taking 3-5 seconds to get a response.
[08:55:59] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received
[08:56:59] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy
[09:32:49] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:35:08] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:50:28] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:52:38] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[10:23:29] (PS5) Nehajha: Removing gridengine as default and encouraging the use of Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504)
[10:41:29] PROBLEM - HHVM rendering on mw2256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:42:29] RECOVERY - HHVM rendering on mw2256 is OK: HTTP OK: HTTP/1.1 200 OK - 76961 bytes in 0.258 second response time
[12:18:45] Operations, Wikimedia-Mailing-lists: Add admins to mailing list engineering@ - https://phabricator.wikimedia.org/T204393 (MarcoAurelio) If you need another one, I can volunteer as well. I have experience at some public and private Wikimedia mailing lists too. Thanks.
[12:20:01] Operations, Wikimedia-Mailing-lists: Add admins to mailing list engineering@ - https://phabricator.wikimedia.org/T204393 (MarcoAurelio) CCing the current admin.
[12:23:40] Operations, Phabricator: Phabricator is slow - https://phabricator.wikimedia.org/T204421 (MarcoAurelio) For me it is working fine, but if it is not for others it is indeed worth investigating.
[12:30:49] Operations, Phabricator: Phabricator is slow - https://phabricator.wikimedia.org/T204421 (Zoranzoki21) >>! In T204421#4586666, @MarcoAurelio wrote: > For me it is working fine, but if it is not for others indeed this is worth investigating.
[14:57:28] PROBLEM - High lag on wdqs2003 is CRITICAL: 3640 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[15:26:58] PROBLEM - Filesystem available is greater than filesystem size on ms-be2042 is CRITICAL: cluster=swift device=/dev/sdn1 fstype=xfs instance=ms-be2042:9100 job=node mountpoint=/srv/swift-storage/sdn1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2042&var-datasource=codfw%2520prometheus%252Fops
[15:33:51] (Abandoned) Krinkle: webperf: Add 'Server: ' header to performance.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/452689 (https://phabricator.wikimedia.org/T158837) (owner: Krinkle)
[16:27:19] PROBLEM - High lag on wdqs2003 is CRITICAL: 3658 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[16:50:00] (PS3) MarcoAurelio: Use translated MetaNamespace for fy.wiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/455249 (https://phabricator.wikimedia.org/T202769)
[17:00:00] (CR) MarcoAurelio: [C: +1] Use standard version of plain-text GPL [cookbooks] - https://gerrit.wikimedia.org/r/460731 (owner: Legoktm)
[17:06:14] (CR) MarcoAurelio: [C: +1] "Needed for git-review." [software/keyholder] - https://gerrit.wikimedia.org/r/460698 (owner: Hashar)
[17:13:09] PROBLEM - High lag on wdqs2003 is CRITICAL: 3647 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[17:15:29] PROBLEM - High lag on wdqs2003 is CRITICAL: 3644 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[17:30:39] PROBLEM - High lag on wdqs2003 is CRITICAL: 3600 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[17:32:58] PROBLEM - High lag on wdqs2003 is CRITICAL: 3665 ge 3600 https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[18:02:26] Operations, Puppet: Why doesn't profile::mediawiki::nutcracker create /var/run/nutcracker/ ? - https://phabricator.wikimedia.org/T204450 (Krenair)
[18:32:19] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:36:39] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[18:40:10] Operations, DBA, Wikidata, Wikidata-Query-Service, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Krinkle) @Marostegui So to confirm, recentchanges db hosts are the same within and between eqiad/codfw. But the api db hos...
[18:44:03] (PS1) Mathew.onipe: icinga check for old elasticsearch servers [puppet] - https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361)
[18:48:20] (PS2) Mathew.onipe: Icinga disk space check for old elasticsearch servers [puppet] - https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361)
[18:50:08] Operations, DBA, Wikidata, Wikidata-Query-Service, and 3 others: Wikidata produces a lot of failed requests for recentchanges API - https://phabricator.wikimedia.org/T202764 (Marostegui) >>! In T202764#4587022, @Krinkle wrote: > @Marostegui So to confirm, recentchanges db hosts are the same withi...
[19:27:14] (PS27) Mathew.onipe: Elasticsearch module is coming up. [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885)
[19:50:48] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[19:57:28] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[20:03:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[20:06:08] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[20:32:49] (CR) Gehel: "Minor comment inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/460763 (https://phabricator.wikimedia.org/T204361) (owner: Mathew.onipe)
[20:41:26] (CR) Gehel: [C: -1] Elasticsearch module is coming up. (7 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/456322 (https://phabricator.wikimedia.org/T202885) (owner: Mathew.onipe)
[21:12:19] PROBLEM - Device not healthy -SMART- on db1069 is CRITICAL: cluster=mysql device=megaraid,7 instance=db1069:9100 job=node site=eqiad https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1069&var-datasource=eqiad%2520prometheus%252Fops
[21:59:19] PROBLEM - Filesystem available is greater than filesystem size on ms-be2040 is CRITICAL: cluster=swift device=/dev/sdj1 fstype=xfs instance=ms-be2040:9100 job=node mountpoint=/srv/swift-storage/sdj1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2040&var-datasource=codfw%2520prometheus%252Fops
[22:41:55] (PS1) Odder: Update logos for Lezgian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/460773