[01:01:21] (CR) DannyS712: [C: +1] "Looks good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/511932 (owner: Legoktm)
[01:33:45] PROBLEM - puppet last run on lvs4006 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. Could be an interrupted request or a dependency cycle.
[02:00:59] RECOVERY - puppet last run on lvs4006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[02:17:49] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 44566136 and 1 seconds
[02:19:15] RECOVERY - Postgres Replication Lag on maps2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 52 seconds
[03:34:13] PROBLEM - puppet last run on mw2137 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:35:47] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:36:13] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:36:55] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:37:11] PROBLEM - puppet last run on analytics1067 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz],File[/usr/share/GeoIP/GeoIP2-City.mmdb.test]
[03:38:17] PROBLEM - puppet last run on mw2234 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[04:01:25] RECOVERY - puppet last run on mw2137 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[04:04:05] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[04:04:21] RECOVERY - puppet last run on analytics1067 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[04:05:27] RECOVERY - puppet last run on mw2234 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures
[05:05:57] PROBLEM - Device not healthy -SMART- on db2049 is CRITICAL: cluster=mysql device=cciss,5 instance=db2049:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2049&var-datasource=codfw+prometheus/ops
[05:43:18] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[05:50:55] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 30.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[05:51:41] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 40.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[05:56:05] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 55, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:56:39] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:14:55] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[06:15:37] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1
[06:38:21] PROBLEM - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/cron - 177 bytes in 0.003 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[06:55:45] RECOVERY - toolschecker: check mtime mod from tools cron job on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 158 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Portal:Toolforge/Admin/Toolschecker
[07:15:52] Operations, Commons, MediaWiki-File-management, Multimedia, Thumbor: Thumbnail rendering of complex SVG file leads to Error 500 or Error 429 instead of Error 408 - https://phabricator.wikimedia.org/T226318 (TheDJ) Pretty sure this is a wider operations issue for esams connections. I noticed i...
[07:17:06] Operations, Commons, MediaWiki-File-management, Multimedia, Thumbor: Increased failure rate of varnish be fetches - https://phabricator.wikimedia.org/T226318 (TheDJ)
[07:17:27] Krenair: another report, I reworked the ticket into one for operations
[07:54:47] Operations, Commons, MediaWiki-File-management, Multimedia, Thumbor: Increased failure rate of varnish be fetches - https://phabricator.wikimedia.org/T226318 (Aklapper) >>! In T226318#5276594, @TheDJ wrote: > Pretty sure this is a wider operations issue for esams connections. > [...] > We cou...
[07:56:42] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2049 is CRITICAL: cluster=mysql device=cciss,5 instance=db2049:9100 job=node site=codfw Marostegui T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2049&var-datasource=codfw+prometheus/ops
[09:13:05] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 virgin: 25) https://wikitech.wikimedia.org/wiki/Mailman
[09:17:31] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits. https://wikitech.wikimedia.org/wiki/Mailman
[10:18:53] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Wurgl) Resolved→Open Sorry to reopen that issue, but the behaviour is back :-( I see the slown...
[10:37:28] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Paladox) I just experienced this too. One minute it’s fast, the next it’s really slow.
[10:56:28] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (PM3) I don't experience any problems since Thursday. Today also everything is running smoothly. One ou...
[10:56:58] Operations, Wikidata, wikidata-tech-focus: Move dispatching of wikidata to a dedicated node - https://phabricator.wikimedia.org/T193733 (Ladsgroup) >>! In T193733#5276330, @Addshore wrote: > Going to mark this as stalled. > > Also we havn't had performance issues with dispatching for quite some time...
[11:00:08] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Aklapper) For the records, {T226318} might be a duplicate of this task.
[13:20:04] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Krinkle) New reports coming in also at
Operations, Performance-Team, Traffic, Performance: Study performance impact of disabling TCP selective acknowledgments - https://phabricator.wikimedia.org/T225998 (Krinkle) >>! In T225998#5264757, @Gilles wrote: > loadEventEnd seems to have regressed around the time the change was deployed. In t...
[14:05:04] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Wurgl) It is strange, really strange. I have seen that slowness three times within a few minutes on my...
[17:35:45] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (MaxBioHazard)
[17:39:20] Operations, Performance-Team, Traffic, Performance: Sometimes pages load slowly for European users (due to some factor outside of Wikimedia cluster) - https://phabricator.wikimedia.org/T226048 (Vort) Did not read all comments, but want to say that this problem is way older than several weeks. It...
[19:31:03] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 229, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[19:31:27] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[19:48:31] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[20:17:51] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 231, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:18:15] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:16:41] PROBLEM - Host analytics1060 is DOWN: PING CRITICAL - Packet loss = 100%
[21:29:44] (CR) Paladox: [V: +2 C: +2] Update plugins for stable-2.16 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/507991 (owner: Paladox)
[21:32:24] (PS1) Paladox: Merge branch 'stable-2.16' into wmf/stable-2.16 [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/518447
[21:34:27] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen
[22:02:32] (Abandoned) Reedy: Prevent $wgFlaggedRevsNamespaces from having NS listed twice [mediawiki-config] - https://gerrit.wikimedia.org/r/516443 (https://phabricator.wikimedia.org/T225276) (owner: Reedy)