[02:57:34] Operations, Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (Evad37) {T137939} (and then displaying the changes promptly) probably needs to be a higher priority... otherwise this issue will just repeat itself next time OSM gets vandalised. See...
[03:26:40] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 681.76 seconds
[03:33:50] PROBLEM - puppet last run on analytics1055 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-ISP.mmdb.gz]
[03:33:50] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:40:00] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 217.94 seconds
[04:04:00] RECOVERY - puppet last run on analytics1055 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:04:00] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:15:51] Operations, Commons, Multimedia, media-storage: Damaged uploads interrupted with reaching of 5 MB - https://phabricator.wikimedia.org/T201379 (SJu) Cross-wiki uploads from uk.wikipedia.org are also affected: I found 3 images uploaded on 2017-09-27 and 2017-09-28: - [[ https://commons.wikimedia...
[06:28:21] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/gen_fingerprints]
[06:29:20] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/prometheus/ops/targets/mysql-dbstore_esams.yaml]
[06:58:30] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:59:21] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:05:00] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS6939/IPv4: Connect, AS6939/IPv6: Active
[07:06:00] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 85 probes of 313 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:06:21] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 76 probes of 339 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[07:11:00] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 15 probes of 313 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[07:11:30] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 6 probes of 339 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map
[07:20:31] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 41, down: 57, shutdown: 4
[07:23:03] Operations, Commons, Multimedia, media-storage: Damaged uploads interrupted with reaching of 5 MB - https://phabricator.wikimedia.org/T201379 (SJu) [[ https://commons.wikimedia.org/wiki/Category:Incomplete_JPG_files_(5_MB_interruption) | Category:Incomplete JPG files (5 MB interruption) ]] create...
[08:03:25] Operations, Security: Have a check to prevent non-existent accounts from being added to LDAP groups - https://phabricator.wikimedia.org/T201779 (Legoktm) p:Triage>High
[08:20:22] Operations, Continuous-Integration-Infrastructure: docker-registry is returnning HTTP 403 Forbidden for all requests - https://phabricator.wikimedia.org/T201737 (Legoktm) p:Triage>Unbreak!
[08:23:36] Operations, Continuous-Integration-Infrastructure: docker-registry is returnning HTTP 403 Forbidden for all requests - https://phabricator.wikimedia.org/T201737 (Legoktm) This is unbreak now from a CI perspective, I can't deploy or pull any new images. The only recent puppet change I could find mentioni...
[08:44:18] Operations, Performance-Team, Traffic: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (ema) p:Triage>Normal The increase is very visible in the tests performed from [[https://grafana.wikimedia.org/dashboard/db/...
[08:56:02] Operations, Continuous-Integration-Infrastructure: docker-registry is returnning HTTP 403 Forbidden for all requests - https://phabricator.wikimedia.org/T201737 (ema) This is due to the move of cache_misc sites to cache_text T164609. There seems to be some type of ACL on darmstadtium.eqiad.wmnet for do...
[09:06:02] (PS1) Ema: profile::docker::registry: whitelist cache_text nodes [puppet] - https://gerrit.wikimedia.org/r/452182 (https://phabricator.wikimedia.org/T201737)
[09:08:24] (PS2) Ema: profile::docker::registry: whitelist cache_text nodes [puppet] - https://gerrit.wikimedia.org/r/452182 (https://phabricator.wikimedia.org/T201737)
[09:08:51] Operations, Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (Gehel) Logs indicate that the previous update ran without issue. As seen in [[ https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 | grafana ]],...
[09:09:08] (CR) Legoktm: [C: 1] profile::docker::registry: whitelist cache_text nodes [puppet] - https://gerrit.wikimedia.org/r/452182 (https://phabricator.wikimedia.org/T201737) (owner: Ema)
[09:10:03] (CR) Ema: [C: 2] profile::docker::registry: whitelist cache_text nodes [puppet] - https://gerrit.wikimedia.org/r/452182 (https://phabricator.wikimedia.org/T201737) (owner: Ema)
[09:16:39] Operations, Continuous-Integration-Infrastructure, Patch-For-Review: docker-registry is returnning HTTP 403 Forbidden for all requests - https://phabricator.wikimedia.org/T201737 (ema) Open>Resolved a:ema @Legoktm confirmed that the issue is now solved, closing.
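Context for the T201737 entries above: after the cache_misc sites moved to cache_text (T164609), requests to docker-registry.wikimedia.org started arriving from cache hosts that were not covered by the registry's access list, so every pull and push was rejected with HTTP 403 until the whitelist was extended. A minimal, purely illustrative sketch of how the symptom could be reproduced against the standard Docker Registry HTTP API v2 endpoint follows; this probe is an assumption for illustration, not the check actually used at the time.

    #!/usr/bin/env python3
    # Illustrative probe (not the tooling used in T201737): a 403 from /v2/
    # matches the reported symptom, while 200 (or 401 requesting auth)
    # indicates the registry's ACL accepts the client.
    import urllib.error
    import urllib.request

    REGISTRY = "https://docker-registry.wikimedia.org/v2/"

    try:
        with urllib.request.urlopen(REGISTRY, timeout=10) as resp:
            print(f"{REGISTRY} -> HTTP {resp.status}")
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx; err.code carries the status.
        print(f"{REGISTRY} -> HTTP {err.code}")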
[09:31:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:32:50] PROBLEM - Host mr1-codfw.oob is DOWN: CRITICAL - Time to live exceeded (216.117.46.36)
[09:35:50] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[09:38:00] RECOVERY - Host mr1-codfw.oob is UP: PING OK - Packet loss = 0%, RTA = 34.98 ms
[10:55:51] PROBLEM - kubelet operational latencies on kubernetes1003 is CRITICAL: instance=kubernetes1003.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[10:57:00] RECOVERY - kubelet operational latencies on kubernetes1003 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[11:41:21] PROBLEM - HHVM rendering on mw2268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:42:20] RECOVERY - HHVM rendering on mw2268 is OK: HTTP OK: HTTP/1.1 200 OK - 73022 bytes in 0.304 second response time
[12:32:59] Operations, Performance-Team, Traffic: Significant increase in Time To First Byte on 2018-08-08, between 16:00 and 20:00 UTC - https://phabricator.wikimedia.org/T201769 (Krinkle) The increase on 8/9 can be seen in data from WebPageTest runs in both Chrome and Firefox, on all enwiki page urls I checke...
[12:44:17] Operations, LDAP, Security: Have a check to prevent non-existent accounts from being added to LDAP groups - https://phabricator.wikimedia.org/T201779 (Aklapper)
[13:21:21] PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:22:20] RECOVERY - HHVM rendering on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 73010 bytes in 0.297 second response time
[13:38:11] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 50404 MB (10% inode=99%)
[13:38:36] Operations, Commons, Multimedia, media-storage: Damaged uploads interrupted with reaching of 5 MB - https://phabricator.wikimedia.org/T201379 (Urbanecm)
[13:57:41] RECOVERY - Disk space on elastic1018 is OK: DISK OK
[14:40:41] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: instance=kubernetes1004.eqiad.wmnet operation_type={create_container,start_container} https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[14:41:41] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[15:41:41] Operations, Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (Gehel) It looks like the cache is starting to invalidate. I don't have a precise timeline on when this issue happened, when we synced the problematic data and when we synced the corr...
[15:43:37] !log full cache invalidation of maps tiles - T201772
[15:43:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:43:44] T201772: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772
[16:07:59] Operations, Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (Gehel) Invalidating varnish cache (see P7451) seems to work. Browser cache might need refreshing, but not much we can do about that. Full tile invalidation did generate high load on...
[16:37:46] (CR) Andrew Bogott: [C: 1] "This looks good. My one concern is that there may be places where 'nova_controller' and 'nova_controller_standby' are still getting used " [puppet] - https://gerrit.wikimedia.org/r/451314 (https://phabricator.wikimedia.org/T201504) (owner: Arturo Borrero Gonzalez)
[16:59:53] (PS3) Andrew Bogott: Providing users more clue when kuberenetes is unable to delete all the objects [software/tools-webservice] - https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415) (owner: Nehajha)
[17:02:31] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:04:40] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[17:08:27] (CR) Andrew Bogott: [C: 2] Providing users more clue when kuberenetes is unable to delete all the objects [software/tools-webservice] - https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415) (owner: Nehajha)
[17:09:09] (Merged) jenkins-bot: Providing users more clue when kuberenetes is unable to delete all the objects [software/tools-webservice] - https://gerrit.wikimedia.org/r/444879 (https://phabricator.wikimedia.org/T140415) (owner: Nehajha)
[17:56:16] (CR) Andrew Bogott: [C: -1] Removing gridengine as default and encouraging the use of Kubernetes (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: Nehajha)
[18:07:32] (PS4) Nehajha: Removing gridengine as default and encouraging the use of Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504)
[18:07:38] (CR) jerkins-bot: [V: -1] Removing gridengine as default and encouraging the use of Kubernetes [software/tools-webservice] - https://gerrit.wikimedia.org/r/443190 (https://phabricator.wikimedia.org/T154504) (owner: Nehajha)
[19:07:13] Operations, Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (TheDJ) Why do we cache 24 hours ? That seems like a lot for clients to cache. 1 hour would seem more than sufficient shouldn't it ? varnish could even use stale-while-revalidate to k...
[19:28:43] Operations, Maps: maps.wikimedia.org is showing old vandalized version of OSM - https://phabricator.wikimedia.org/T201772 (Bawolff) Once people realize what an effective vandalism vector this is, I think we can expect to see a lot more of this.
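Context for the "full cache invalidation of maps tiles" logged at 15:43: the actual commands are in paste P7451, which is not reproduced here. As a hedged sketch only, a Varnish ban covering every cached object for the maps.wikimedia.org host could be issued on each cache server roughly as below; the host names are placeholders and the real procedure and ban expression may have differed.

    #!/usr/bin/env python3
    # Hypothetical sketch (not P7451): invalidate all cached maps.wikimedia.org
    # objects by issuing a Varnish ban on each cache host over SSH.
    import subprocess

    CACHE_HOSTS = ["cp-example-1.wmnet", "cp-example-2.wmnet"]  # placeholder names
    BAN_EXPR = "req.http.host == maps.wikimedia.org"  # standard varnishadm ban syntax

    for host in CACHE_HOSTS:
        # Runs: varnishadm ban req.http.host == maps.wikimedia.org
        # A ban marks matching objects invalid, so the next request for a tile
        # is fetched fresh from the backend instead of served from cache.
        subprocess.run(["ssh", host, "sudo", "varnishadm", f"ban {BAN_EXPR}"], check=True)

On TheDJ's later question about the 24-hour TTL: a shorter client max-age combined with a longer surrogate TTL and stale-while-revalidate (for example "Cache-Control: public, max-age=3600, s-maxage=86400, stale-while-revalidate=300", values purely illustrative) would keep backend load low while limiting how long clients hold onto a vandalized tile.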
[21:37:22] (PS1) Legoktm: Add php-imagick and php-redis to thirdparty/php72 [puppet] - https://gerrit.wikimedia.org/r/452274 (https://phabricator.wikimedia.org/T200666)
[21:38:11] (CR) Legoktm: "I'm basing this on the other entries in this file." [puppet] - https://gerrit.wikimedia.org/r/452274 (https://phabricator.wikimedia.org/T200666) (owner: Legoktm)
[23:05:41] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[23:07:41] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen