[01:01:21] (CR) DannyS712: [C: +1] "Looks good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/511932 (owner: Legoktm)
[01:33:45] PROBLEM - puppet last run on lvs4006 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[02:00:59] RECOVERY - puppet last run on lvs4006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:17:49] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 44566136 and 1 seconds
[02:19:15] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 52 seconds
[03:34:13] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:35:47] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:36:13] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:36:55] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:37:11] PROBLEM - puppet last run on analytics1067 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test]
[03:38:17] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[04:01:25] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[04:04:05] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:04:21] RECOVERY - puppet last run on analytics1067 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:05:27] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[05:05:57] PROBLEM - Device not healthy -SMART- on db2049 is CRITICAL: cluster=mysql device=cciss,5 instance=db2049:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2049&var-datasource=codfw+prometheus/ops
[05:43:18] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[05:50:55] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[05:51:41] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[05:56:05] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:56:39] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:14:55] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[06:15:37] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[06:38:21] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/cron - 177 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[06:55:45] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[07:15:52] Operations, Commons, MediaWiki-File-management, Multimedia, Thumbor: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (TheDJ) Pretty sure this is a wider operations issue for esams connections. I noticed i...
[07:17:06] Operations, Commons, MediaWiki-File-management, Multimedia, Thumbor: Increased failure rate of varnish be fetches - https://phabricator.wikimedia.org/T226318 (TheDJ)
[07:17:27] Krenair: another report, I reworked the ticket into one for operations
[07:54:47] Operations, Commons, MediaWiki-File-management, Multimedia, Thumbor: Increased failure rate of varnish be fetches - https://phabricator.wikimedia.org/T226318 (Aklapper) >>! In T226318#5276594, @TheDJ wrote: > Pretty sure this is a wider operations issue for esams connections. > [...] > We cou...
[07:56:42] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2049 is CRITICAL: cluster=mysql device=cciss,5 instance=db2049:9100 job=node site=codfw Marostegui T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2049&var-datasource=codfw+prometheus/ops
[09:13:05] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[09:17:31] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman
[10:18:53] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Wurgl) Resolved→Open Sorry to reopen that issue, but the behaviour is back :-( I see the slown...
[10:37:28] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Paladox) I just experienced this too. One minute it’s fast, the next it’s really slow.
[10:56:28] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (PM3) I don't experience any problems since Thursday. Today also everything is running smoothly. One ou...
[10:56:58] Operations, Wikidata, wikidata-tech-focus: Move dispatching of wikidata to a dedicated node - https://phabricator.wikimedia.org/T193733 (Ladsgroup) >>! In T193733#5276330, @Addshore wrote: > Going to mark this as stalled. > > Also we havn't had performance issues with dispatching for quite some time...
[11:00:08] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Aklapper) For the records, {T226318} might be a duplicate of this task.
[13:20:04] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Krinkle) New reports coming in also at
Operations, Performance-Team, Traffic, Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (Krinkle) >>! In T225998#5264757, @Gilles wrote: > loadEventEnd seems to have regressed around the time the change was deployed. In t...
[14:05:04] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Wurgl) It is strange, really strange. I have seen that slowness three times within a few minutes on my...
[17:35:45] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (MaxBioHazard)
[17:39:20] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Vort) Did not read all comments, but want to say that this problem is way older than several weeks. It...
[19:31:03] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:31:27] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:48:31] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[20:17:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:18:15] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:16:41] PROBLEM - Host analytics1060 is DOWN: PING CRITICAL - Packet loss = 100%
[21:29:44] (CR) Paladox: [V: +2 C: +2] Update plugins for stable-2.16 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/507991 (owner: Paladox)
[21:32:24] (PS1) Paladox: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/518447
[21:34:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[22:02:32] (Abandoned) Reedy: Prevent $wgFlaggedRevsNamespaces from having NS listed twice [mediawiki-config] - https://gerrit.wikimedia.org/r/516443 (https://phabricator.wikimedia.org/T225276) (owner: Reedy)