[02:57:34] Operations, Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (Evad37) {T137939} (and then displaying the changes promptly) probably needs to be a higher priority... otherwise this issue will just repeat itself next time OSM gets vandalised. See...
[03:26:40] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 681.76 seconds
[03:33:50] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-ISP.mmdb.gz]
[03:33:50] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:40:00] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 217.94 seconds
[04:04:00] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:04:00] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:15:51] Operations, Commons, Multimedia, media-storage: Damaged uploads interrupted with reaching of 5 MB - https://phabricator.wikimedia.org/T201379 (SJu) Cross-wiki uploads from uk.wikipedia.org are also affected: I found 3 images uploaded on 2017-09-27 and 2017-09-28: - [[ https://commons.wikimedia...
[06:28:21] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/gen_fingerprints]
[06:29:20] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/prometheus/ops/targets/mysql-dbstore_esams.yaml]
[06:58:30] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:21] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:05:00] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Active
[07:06:00] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 85 probes of 313 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:06:21] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 76 probes of 339 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[07:11:00] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 15 probes of 313 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:11:30] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 6 probes of 339 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[07:20:31] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 41, down: 57, shutdown: 4
[07:23:03] Operations, Commons, Multimedia, media-storage: Damaged uploads interrupted with reaching of 5 MB - https://phabricator.wikimedia.org/T201379 (SJu) [[ https://commons.wikimedia.org/wiki/Category:Incomplete_JPG_files_(5_MB_interruption) | Category:Incomplete JPG files (5 MB interruption) ]] create...
[08:03:25] Operations, Security: Have a check to prevent non-existent accounts from being added to LDAP groups - https://phabricator.wikimedia.org/T201779 (Legoktm) p:Triage>High
[08:20:22] Operations, Continuous-Integration-Infrastructure: docker-registry is returnning HTTP 403 Forbidden for all requests - https://phabricator.wikimedia.org/T201737 (Legoktm) p:Triage>Unbreak!
[08:23:36] Operations, Continuous-Integration-Infrastructure: docker-registry is returnning HTTP 403 Forbidden for all requests - https://phabricator.wikimedia.org/T201737 (Legoktm) This is unbreak now from a CI perspective, I can't deploy or pull any new images. The only recent puppet change I could find mentioni...
[08:44:18] Operations, Performance-Team, Traffic: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (ema) p:Triage>Normal The increase is very visible in the tests performed from [[https://grafana.wikimedia.org/dashboard/db/...
[08:56:02] Operations, Continuous-Integration-Infrastructure: docker-registry is returnning HTTP 403 Forbidden for all requests - https://phabricator.wikimedia.org/T201737 (ema) This is due to the move of cache_misc sites to cache_text T164609. There seems to be some type of ACL on darmstadtium.eqiad.wmnet for do...
[09:06:02] (PS1) Ema: profile::docker::registry: whitelist cache_text nodes [puppet] - https://gerrit.wikimedia.org/r/452182 (https://phabricator.wikimedia.org/T201737)
[09:08:24] (PS2) Ema: profile::docker::registry: whitelist cache_text nodes [puppet] - https://gerrit.wikimedia.org/r/452182 (https://phabricator.wikimedia.org/T201737)
[09:08:51] Operations, Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (Gehel) Logs indicate that the previous update ran without issue. As seen in [[ https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 | grafana ]],...
[09:09:08] (CR) Legoktm: [C: 1] profile::docker::registry: whitelist cache_text nodes [puppet] - https://gerrit.wikimedia.org/r/452182 (https://phabricator.wikimedia.org/T201737) (owner: Ema)
[09:10:03] (CR) Ema: [C: 2] profile::docker::registry: whitelist cache_text nodes [puppet] - https://gerrit.wikimedia.org/r/452182 (https://phabricator.wikimedia.org/T201737) (owner: Ema)
[09:16:39] Operations, Continuous-Integration-Infrastructure, Patch-For-Review: docker-registry is returnning HTTP 403 Forbidden for all requests - https://phabricator.wikimedia.org/T201737 (ema) Open>Resolved a:ema @Legoktm confirmed that the issue is now solved, closing.
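Context for the T201737 entries above: after the cache_misc sites moved to cache_text (T164609), requests to docker-registry.wikimedia.org started arriving from cache hosts that were not covered by the registry's access list, so every pull and push was rejected with HTTP 403 until the whitelist was extended. A minimal, purely illustrative sketch of how the symptom could be reproduced against the standard Docker Registry HTTP API v2 endpoint follows; this probe is an assumption for illustration, not the check actually used at the time.

    #!/usr/bin/env python3
    # Illustrative probe (not the tooling used in T201737): a 403 from /v2/
    # matches the reported symptom, while 200 (or 401 requesting auth)
    # indicates the registry's ACL accepts the client.
    import urllib.error
    import urllib.request

    REGISTRY = "https://docker-registry.wikimedia.org/v2/"

    try:
        with urllib.request.urlopen(REGISTRY, timeout=10) as resp:
            print(f"{REGISTRY} -> HTTP {resp.status}")
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx; err.code carries the status.
        print(f"{REGISTRY} -> HTTP {err.code}")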
[09:31:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:32:50] PROBLEM - Host mr1-codfw.oob is DOWN: CRITICAL - Time to live exceeded (216.117.46.36)
[09:35:50] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:38:00] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 34.98 ms
[10:55:51] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:57:00] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[11:41:21] PROBLEM - HHVM rendering on mw2268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:42:20] RECOVERY - HHVM rendering on mw2268 is OK: HTTP OK: HTTP/1.1 200 OK - 73022 bytes in 0.304 second response time
[12:32:59] Operations, Performance-Team, Traffic: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (Krinkle) The increase on 8/9 can be seen in data from WebPageTest runs in both Chrome and Firefox, on all enwiki page urls I checke...
[12:44:17] Operations, LDAP, Security: Have a check to prevent non-existent accounts from being added to LDAP groups - https://phabricator.wikimedia.org/T201779 (Aklapper)
[13:21:21] PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:22:20] RECOVERY - HHVM rendering on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 73010 bytes in 0.297 second response time
[13:38:11] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 50404 MB (10% inode=99%)
[13:38:36] Operations, Commons, Multimedia, media-storage: Damaged uploads interrupted with reaching of 5 MB - https://phabricator.wikimedia.org/T201379 (Urbanecm)
[13:57:41] RECOVERY - Disk space on elastic1018 is OK: DISK OK
[14:40:41] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[14:41:41] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:41:41] Operations, Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (Gehel) It looks like the cache is starting to invalidate. I don't have a precise timeline on when this issue happened, when we synced the problematic data and when we synced the corr...
[15:43:37] !log full cache invalidation of maps tiles - T201772
[15:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:44] T201772: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772
[16:07:59] Operations, Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (Gehel) Invalidating varnish cache (see P7451) seems to work. Browser cache might need refreshing, but not much we can do about that. Full tile invalidation did generate high load on...
[16:37:46] (CR) Andrew Bogott: [C: 1] "This looks good. My one concern is that there may be places where 'nova_controller' and 'nova_controller_standby' are still getting used " [puppet] - https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) (owner: Arturo Borrero Gonzalez)
[16:59:53] (PS3) Andrew Bogott: Providing users more clue when kuberenetes is unable to delete all the objects [software/tools-webservice] - https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415) (owner: Nehajha)
[17:02:31] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:04:40] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:08:27] (CR) Andrew Bogott: [C: 2] Providing users more clue when kuberenetes is unable to delete all the objects [software/tools-webservice] - https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415) (owner: Nehajha)
[17:09:09] (Merged) jenkins-bot: Providing users more clue when kuberenetes is unable to delete all the objects [software/tools-webservice] - https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415) (owner: Nehajha)
[17:56:16] (CR) Andrew Bogott: [C: -1] Removing gridengine as default and encouraging the use of Kubernetes (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: Nehajha)
[18:07:32] (PS4) Nehajha: Removing gridengine as default and encouraging the use of Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504)
[18:07:38] (CR) jerkins-bot: [V: -1] Removing gridengine as default and encouraging the use of Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: Nehajha)
[19:07:13] Operations, Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (TheDJ) Why do we cache 24 hours ? That seems like a lot for clients to cache. 1 hour would seem more than sufficient shouldn't it ? varnish could even use stale-while-revalidate to k...
[19:28:43] Operations, Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (Bawolff) Once people realize what an effective vandalism vector this is, I think we can expect to see a lot more of this.
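Context for the "full cache invalidation of maps tiles" logged at 15:43: the actual commands are in paste P7451, which is not reproduced here. As a hedged sketch only, a Varnish ban covering every cached object for the maps.wikimedia.org host could be issued on each cache server roughly as below; the host names are placeholders and the real procedure and ban expression may have differed.

    #!/usr/bin/env python3
    # Hypothetical sketch (not P7451): invalidate all cached maps.wikimedia.org
    # objects by issuing a Varnish ban on each cache host over SSH.
    import subprocess

    CACHE_HOSTS = ["cp-example-1.wmnet", "cp-example-2.wmnet"]  # placeholder names
    BAN_EXPR = "req.http.host == maps.wikimedia.org"  # standard varnishadm ban syntax

    for host in CACHE_HOSTS:
        # Runs: varnishadm ban req.http.host == maps.wikimedia.org
        # A ban marks matching objects invalid, so the next request for a tile
        # is fetched fresh from the backend instead of served from cache.
        subprocess.run(["ssh", host, "sudo", "varnishadm", f"ban {BAN_EXPR}"], check=True)

On TheDJ's later question about the 24-hour TTL: a shorter client max-age combined with a longer surrogate TTL and stale-while-revalidate (for example "Cache-Control: public, max-age=3600, s-maxage=86400, stale-while-revalidate=300", values purely illustrative) would keep backend load low while limiting how long clients hold onto a vandalized tile.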
[21:37:22] (PS1) Legoktm: Add php-imagick and php-redis to thirdparty/php72 [puppet] - https://gerrit.wikimedia.org/r/452274 (https://phabricator.wikimedia.org/T200666)
[21:38:11] (CR) Legoktm: "I'm basing this on the other entries in this file." [puppet] - https://gerrit.wikimedia.org/r/452274 (https://phabricator.wikimedia.org/T200666) (owner: Legoktm)
[23:05:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:07:41] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen