[00:25:35] PROBLEM - configured eth on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:36] PROBLEM - SSH on ms-be1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:25:45] PROBLEM - MD RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:45] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:27:25] PROBLEM - very high load average likely xfs on ms-be1016 is CRITICAL: CRITICAL - load average: 184.39, 110.14, 56.82
[00:32:05] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:42:25] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:10:25] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[01:23:00] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3144779 (10Pokefan95) 404 error: https://commons.wikimedia.org/wiki/File:School_Gyrls_at_Paramount_Studios.jpg
[01:27:00] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3144779 (10GermanJoe) Possibly another related case reported by a different editor: https://commons.wikimedia.org/wiki/File:Yar...
[01:28:07] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147812 (10Pokefan95) p:05High>03Unbreak! Setting back to UBN!, due to the number of duplicate tasks and that there may be...
[01:29:45] PROBLEM - NTP on ms-be1016 is CRITICAL: NTP CRITICAL: No response from NTP server
[01:29:55] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 81605.599124 Seconds
[01:30:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 81607.166043 Seconds
[01:30:35] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 81637.348128 Seconds
[01:32:15] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 82364.681359 Seconds
[01:32:15] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 82364.686004 Seconds
[01:32:15] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 82364.686071 Seconds
[01:34:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:35:55] PROBLEM - swift-container-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:05] PROBLEM - swift-object-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:05] PROBLEM - swift-account-reaper on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:05] PROBLEM - dhclient process on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:06] PROBLEM - swift-object-updater on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:06] PROBLEM - Check size of conntrack table on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
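[Editor's note: the burst of "CHECK_NRPE: Socket timeout after 10 seconds" alerts above means Icinga got no answer from the NRPE agent on ms-be1016 within its 10-second budget. Below is a minimal illustrative sketch, not the actual check_nrpe plugin, of probing whether the agent's TCP port even accepts a connection; the internal FQDN and the conventional NRPE port 5666 are assumptions, not values taken from this log.]

```python
#!/usr/bin/env python3
"""Rough connectivity probe for an unresponsive NRPE agent (illustrative only)."""
import socket
import sys

HOST = "ms-be1016.eqiad.wmnet"   # assumption: internal FQDN of the host in the alerts
PORT = 5666                      # conventional NRPE port (assumption)
TIMEOUT = 10                     # matches the timeout quoted in the alerts


def probe(host: str, port: int, timeout: float) -> bool:
    """Return True if a TCP connection succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"CRITICAL: cannot reach {host}:{port} - {exc}", file=sys.stderr)
        return False


if __name__ == "__main__":
    sys.exit(0 if probe(HOST, PORT, TIMEOUT) else 2)
```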
[01:36:06] PROBLEM - DPKG on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:15] PROBLEM - swift-account-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:16] PROBLEM - swift-container-updater on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:25] PROBLEM - swift-container-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:26] PROBLEM - swift-object-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:26] PROBLEM - swift-container-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:35] PROBLEM - swift-object-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:35] PROBLEM - swift-account-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:35] PROBLEM - swift-account-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:45] PROBLEM - salt-minion processes on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:37:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 82027.368571 Seconds
[01:40:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:41:35] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:43:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 82387.120761 Seconds
[01:44:35] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 82477.398481 Seconds
[01:49:15] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:49:15] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds
[01:49:15] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:52:15] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 83564.601739 Seconds
[01:52:15] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83564.602942 Seconds
[01:52:15] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 83564.602264 Seconds
[01:55:15] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 38.869574 Seconds
[01:55:15] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 38.871256 Seconds
[01:55:15] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 38.875634 Seconds
[01:55:35] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 13.950231 Seconds
[01:55:55] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 41.998827 Seconds
[01:56:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 43.714312 Seconds
[01:59:38] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3144779 (10De728631) This seems to be a gradual process. When I [[ https://commons.wikimedia.org/w/index.php?title=Commons:Admi...
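[Editor's note: the "Rep Delay is: NNNNN Seconds" checks above report how far each maps replica has fallen behind its primary. A minimal sketch of the usual way such a delay is measured on a Postgres standby follows - comparing now() with pg_last_xact_replay_timestamp(). The use of psycopg2, the DSN, and the alert threshold are assumptions for illustration; the actual Icinga plugin may be implemented differently.]

```python
#!/usr/bin/env python3
"""Sketch: measure replication delay on a Postgres standby (assumptions noted inline)."""
import psycopg2  # assumption: psycopg2 is available where this runs

# COALESCE covers the corner case where nothing has been replayed yet (NULL).
LAG_QUERY = """
SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)
"""


def replication_lag_seconds(dsn: str) -> float:
    """Return the replay delay in seconds as seen on the standby itself."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(LAG_QUERY)
            return float(cur.fetchone()[0])


if __name__ == "__main__":
    # Hypothetical DSN; real connection details for maps100x are not in this log.
    lag = replication_lag_seconds("host=maps1004 dbname=postgres user=monitor")
    state = "OK" if lag < 1800 else "CRITICAL"  # threshold is illustrative only
    print(f"{state} - Rep Delay is: {lag:.6f} Seconds")
```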
[02:03:26] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:05:15] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[02:07:35] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:32:11] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 13m 27s)
[02:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:35] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[02:37:31] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Apr 1 02:37:30 UTC 2017 (duration 5m 20s)
[02:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:15] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:05:26] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:06:26] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:07:16] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.005 second response time
[03:08:15] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[03:08:16] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time
[03:30:42] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147879 (10zhuyifei1999) Also https://commons.wikimedia.org/wiki/Commons:Village_pump#Something_funny_with_this_image https://c...
[03:59:25] PROBLEM - puppet last run on elastic1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:27:25] RECOVERY - puppet last run on elastic1050 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[04:46:42] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3144779 (10MaxBioHazard) Also https://commons.wikimedia.org/wiki/File:Vladimir_Frolochkin.JPG
[04:52:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-3/3/3: down - Peering: Equinix Ashburn Exchange {#2648} [10Gbps]
[04:57:35] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 85 probes of 429 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[04:58:06] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 47 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[05:02:35] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 429 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[05:03:06] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[05:08:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[06:24:15] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:25:50] (03PS1) 10ArielGlenn: disable full dumps cron job for a bit [puppet] - 10https://gerrit.wikimedia.org/r/345956
[06:29:01] (03CR) 10ArielGlenn: [C: 032] disable full dumps cron job for a bit [puppet] - 10https://gerrit.wikimedia.org/r/345956 (owner: 10ArielGlenn)
[06:30:15] sigh now mw1259 alarms more than mw1169..
[06:34:31] 06Operations, 13Patch-For-Review: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3147929 (10elukey) >>! In T161918#3147750, @brion wrote: > Note there are probably a lot of jobs in the non-prioritized queue still backed up last I looked; Note I also have a f...
[06:53:15] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[07:04:46] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147934 (10Ladsgroup)
[07:14:22] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147937 (10Ankry) >>! In T161836#3147844, @De728631 wrote: > Update: on the other hand, https://commons.wikimedia.org/wiki/File...
[08:02:15] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:16:09] Error deleting file: Could not delete file "mwstore://local-swift-eqiad/local-public/e/ef/Ubra_aviao.JPG". :/
[08:16:33] Hey, just throwing this out there, but I recently reset a bunch of ‘stuck’ transcodes… ones that had supposedly been running anywhere from 10-20 hours… most of them should have run in a matter of seconds. Many of them then failed with ‘source not found’ in the error field, and when checked, the ‘original file’ gave a 404.
[08:17:33] It seems, however, that after a half-hour or so at least some of them are reappearing, so I’m throwing them back in again once the ‘original’ is playable.
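[Editor's note: the "Could not delete file mwstore://local-swift-eqiad/local-public/e/ef/Ubra_aviao.JPG" error above uses MediaWiki's hashed upload layout: the directory components are the first one and first two hex digits of the MD5 of the underscore-normalized file name. A small sketch reproducing that mapping follows; it only illustrates where the "e/ef/" component comes from and says nothing about why the underlying Swift object went missing.]

```python
#!/usr/bin/env python3
"""Reproduce MediaWiki's hashed upload path for an original file.

With hashed upload directories enabled, originals live under
<first md5 hex digit>/<first two md5 hex digits>/<name>.
"""
import hashlib


def hashed_path(title: str) -> str:
    """Return the relative storage path MediaWiki uses for an original."""
    name = title.replace(" ", "_")  # titles are stored with underscores
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[0:2]}/{name}"


if __name__ == "__main__":
    # Expected to print e/ef/Ubra_aviao.JPG, matching the mwstore error above.
    print(hashed_path("Ubra aviao.JPG"))
```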
[08:19:08] (and they indeed seem to be actually running, if they go into the ‘priority’ queue… looks like maybe something related to the occasional bug that Josve05a is nagging about. :P)
[08:19:43] I don't nag... too much... :P
[08:19:50] Teasing.
[08:20:32] I kinda messed up about a week ago, and threw ‘way’ too many of the uninitialized transcodes on the fire at once.
[08:24:25] PROBLEM - puppet last run on mc1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:31:15] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[08:41:35] PROBLEM - puppet last run on ms-be1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:52:25] RECOVERY - puppet last run on mc1032 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[08:57:47] hi
[09:08:25] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:10:35] RECOVERY - puppet last run on ms-be1038 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:14:25] (03PS1) 10ArielGlenn: update dumpadmin to use the new json reporting routines [dumps] - 10https://gerrit.wikimedia.org/r/345958
[09:36:25] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[09:40:50] (03CR) 10ArielGlenn: [C: 032] update dumpadmin to use the new json reporting routines [dumps] - 10https://gerrit.wikimedia.org/r/345958 (owner: 10ArielGlenn)
[09:50:40] (03PS1) 10ArielGlenn: convert RunInfoFile to RunInfo for class, methods, attrs [dumps] - 10https://gerrit.wikimedia.org/r/345961
[09:53:10] (03CR) 10ArielGlenn: [C: 032] convert RunInfoFile to RunInfo for class, methods, attrs [dumps] - 10https://gerrit.wikimedia.org/r/345961 (owner: 10ArielGlenn)
[09:56:35] (03PS1) 10ArielGlenn: convert "NoticeFile" to "Notice" for class, methods, attrs [dumps] - 10https://gerrit.wikimedia.org/r/345962
[09:57:56] (03CR) 10ArielGlenn: [C: 032] convert "NoticeFile" to "Notice" for class, methods, attrs [dumps] - 10https://gerrit.wikimedia.org/r/345962 (owner: 10ArielGlenn)
[10:05:06] (03PS2) 10ArielGlenn: move xml content dump jobs out to separate module [dumps] - 10https://gerrit.wikimedia.org/r/343539
[10:15:27] (03PS3) 10ArielGlenn: move xml content dump jobs out to separate module [dumps] - 10https://gerrit.wikimedia.org/r/343539
[10:16:35] PROBLEM - parsoid on wtp1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:17:25] RECOVERY - parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.014 second response time
[10:18:15] 06Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3148017 (10Esc3300)
[10:18:53] 06Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3134452 (10Esc3300) https://www.wikidata.org/wiki/Wikidata:Bot_requests#Import_interwikis_from_Doteli_Wikipedia
[10:20:09] (03CR) 10ArielGlenn: [C: 032] move xml content dump jobs out to separate module [dumps] - 10https://gerrit.wikimedia.org/r/343539 (owner: 10ArielGlenn)
[10:21:25] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:23:31] (03PS2) 10ArielGlenn: move prefetch-finding code out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343540
[10:23:51] (03CR) 10jerkins-bot: [V: 04-1] move prefetch-finding code out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343540 (owner: 10ArielGlenn)
[10:32:15] (03PS3) 10ArielGlenn: move prefetch-finding code out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343540
[10:39:31] (03PS1) 10Urbanecm: Add rollback user group in fawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345964 (https://phabricator.wikimedia.org/T161946)
[10:44:06] (03CR) 10ArielGlenn: [C: 032] move prefetch-finding code out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343540 (owner: 10ArielGlenn)
[10:46:50] (03PS1) 10Urbanecm: Add autopatrolled group to svwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345965 (https://phabricator.wikimedia.org/T161919)
[10:48:01] (03PS2) 10ArielGlenn: move stub-related code in page content dumps out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343541
[10:48:28] (03CR) 10jerkins-bot: [V: 04-1] move stub-related code in page content dumps out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343541 (owner: 10ArielGlenn)
[10:49:25] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[10:50:37] (03PS3) 10ArielGlenn: move stub-related code in page content dumps out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343541
[10:58:31] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148059 (10Paladox)
[10:58:37] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148063 (10Aklapper)
[11:01:06] 06Operations, 06Commons, 10media-storage: Commons File:Assemblea_Costituente_1946_(2).svg missing after file move - https://phabricator.wikimedia.org/T161476#3132721 (10Paladox) Could this be related to T161836 ?
[11:03:09] (03CR) 10ArielGlenn: [C: 032] move stub-related code in page content dumps out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343541 (owner: 10ArielGlenn)
[11:07:42] (03PS1) 10ArielGlenn: DumpFile class name change; too confusing when we also have DumpFilename [dumps] - 10https://gerrit.wikimedia.org/r/345966
[11:10:05] mutante: Since you is on duty, poking you…
[11:10:10] Perhaps you can explain.
[11:10:43] https://commons.wikimedia.org/wiki/File:Wolno%C5%9B%C4%87_panoramy.webm for example...
[11:11:48] The 480p transcode had failed, as ‘source file not found’… an hour or two ago, when trying to play the ‘original file’, it gave a 404… now it has reappeared, and the transcode works.
[11:12:36] This… (the original disappearing, causing a transcode to fail, and then reappearing) has happened with about 40 files that I have seen so far.
[11:12:54] Presumably it has also happened with others that did not have pending transcodes.
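[Editor's note: the check being described by hand here is simply "does the 'original file' URL on upload.wikimedia.org currently return a 404, and does it later come back". A small sketch of that probe follows; the requests library and the User-Agent string are assumptions, and the sample URL is the original-file link pasted a little further down in this conversation.]

```python
#!/usr/bin/env python3
"""Sketch: probe whether an 'original file' URL currently 404s."""
import requests  # assumption: requests is available

ORIGINALS = [
    # Original-file URL quoted later in this log.
    "https://upload.wikimedia.org/wikipedia/commons/9/98/"
    "%D0%A1%D0%92-%D0%94%D0%9D%D0%A0-564._%D0%9D%D0%BE%D0%B2%D1%8B%D0%B9_"
    "%D0%B3%D0%BE%D0%B4_%D0%B2_%D0%94%D0%BE%D0%BD%D0%B5%D1%86%D0%BA%D0%B5.webm",
]


def check(url: str) -> None:
    """HEAD the URL and print its status code; scripted requests should identify themselves."""
    resp = requests.head(
        url,
        allow_redirects=True,
        timeout=10,
        headers={"User-Agent": "original-404-probe (example script)"},
    )
    print(f"{resp.status_code}  {url}")


if __name__ == "__main__":
    for url in ORIGINALS:
        check(url)
```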
[11:13:25] (03CR) 10ArielGlenn: [C: 032] DumpFile class name change; too confusing when we also have DumpFilename [dumps] - 10https://gerrit.wikimedia.org/r/345966 (owner: 10ArielGlenn)
[11:14:11] https://commons.wikimedia.org/wiki/File:X5Flare_AIA193.webm is still broken
[11:14:43] https://commons.wikimedia.org/wiki/File:Theodor_Kallifatides_fr.webm is another one that was ‘broken’, but has now reappeared.
[11:15:24] https://commons.wikimedia.org/wiki/File:T%C5%AFn%C4%9B_pod_mal%C3%BDm_vodop%C3%A1dem_Peri%C4%8Dn%C3%ADk_a_koup%C3%A1n%C3%AD.webm is another that is still broken.
[11:16:09] https://commons.wikimedia.org/wiki/File:%D0%A1%D0%92-%D0%94%D0%9D%D0%A0-564._%D0%9D%D0%BE%D0%B2%D1%8B%D0%B9_%D0%B3%D0%BE%D0%B4_%D0%B2_%D0%94%D0%BE%D0%BD%D0%B5%D1%86%D0%BA%D0%B5.webm also still broken.
[11:16:31] https://commons.wikimedia.org/wiki/File:%D0%A1%D0%92-%D0%94%D0%9D%D0%A0-564._%D0%9D%D0%BE%D0%B2%D1%8B%D0%B9_%D0%B3%D0%BE%D0%B4_%D0%B2_%D0%94%D0%BE%D0%BD%D0%B5%D1%86%D0%BA%D0%B5.webm also broken.
[11:31:09] Revent hi it's a known issue
[11:31:24] paladox: Good.
[11:31:25] oh sorry, you're talking about webm
[11:31:32] those haven't been reported yet
[11:31:43] paladox: It’s not specific to those, afaik
[11:31:46] Revent see https://phabricator.wikimedia.org/T161836
[11:32:18] paladox: They seem to reappear after a while.
[11:32:27] Revent the video you're reporting as broken is working for me https://commons.wikimedia.org/wiki/File:СВ-ДНР-564._Новый_год_в_Донецке.webm
[11:33:17] paladox: Don't just ‘play it’, you'll see one of the transcodes (per whatever your options are). Hit the ‘original file’ link
[11:33:32] https://upload.wikimedia.org/wikipedia/commons/9/98/%D0%A1%D0%92-%D0%94%D0%9D%D0%A0-564._%D0%9D%D0%BE%D0%B2%D1%8B%D0%B9_%D0%B3%D0%BE%D0%B4_%D0%B2_%D0%94%D0%BE%D0%BD%D0%B5%D1%86%D0%BA%D0%B5.webm
[11:34:34] Revent ah, that's being reported in https://phabricator.wikimedia.org/T161836
[11:34:39] could you also add it there too
[11:34:48] paladox: When I woke up earlier, I checked on TMH… there were a bunch of files that had 10, 20 hour transcode times.
[11:34:56] oh
[11:35:02] I reset them, and they failed.
[11:35:10] there was high load last night on two of the video transcoders I think
[11:35:24] When I investigated, they were all ‘source file missing’ in the db...
[11:35:46] ‘most’ have since reappeared, and I then reset them to
[11:35:46] Ah, yeah, that sounds similar to what's being reported in https://phabricator.wikimedia.org/T161836
[11:36:12] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148077 (10Paladox) Users are reporting webm videos as breaking too when trying to re-transcode them.
[11:36:44] (nods) I think what is interesting is that they seem to (eventually) reappear… afaik without operator intervention.
[11:37:10] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148078 (10Paladox)
[11:37:39] It seems similar to what people have in the past reported about errors when deleting and then restoring files, that they are ‘not found’, but if you wait a while it works.
[11:37:45] Revent yep. We have had a ton of duplicate tasks because it seems files disappear then reappear
[11:38:26] paladox: I try to bug you guys before actually opening tasks, for that reason.
[11:38:53] As far as the scaler load...
[11:38:55] ok thanks :)
[11:38:56] Umm…
[11:39:13] I messed up about a week ago, the ‘not priority’ backlog is my fault.
[11:39:14] There's a task for the transcodes being high load
[11:39:16] let me find it
[11:39:33] In that I reset way too many of the ‘uninitialized’ at once.
[11:39:35] Revent https://phabricator.wikimedia.org/T161918
[11:40:29] They have been running, but I've been trying to somewhat babysit them, as in watching for any that got stuck or anything.
[11:40:44] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148080 (10Paladox) @Revent reported these urls https://commons.wikimedia.org/wiki/File:T%C5%AFn%C4%9B_pod_mal%C3%BDm_vo...
[11:41:04] Revent thanks :)
[11:41:16] Hence noticing the ’10-20 hour’ tasks, which I strongly suspect got stuck due to the source files going *poof* temporarily.
[11:41:42] Yep, I guess that's one of the reasons for high load
[11:42:03] https://quarry.wmflabs.org/query/17726
[11:42:05] FYI
[11:42:25] Everything above “Circumcision” is some other problem.
[11:42:43] The ones below that are due to this 404 bug.
[11:43:16] oh
[11:43:32] Hey, the list ‘was’ like 90 or so… :P
[11:43:56] 06Operations, 13Patch-For-Review: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3148083 (10Paladox) The high load may be caused by T161836 as @Revent has noticed transcodes showing as 10 - 20 hours. He also found some videos were failing to transcode with err...
[11:43:57] Like I said, they seem to come back after a while.
[11:44:08] Yep, other users have noticed that too
[11:44:34] Revent it may be related to maint work here https://phabricator.wikimedia.org/T161836#3146271
[11:47:09] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148087 (10Revent) The video 'works' because when you simply play it, you view some transcode based on your preferences....
[11:47:16] ^ that
[11:48:22] thanks
[11:48:59] paladox: Since they reappeared, I suspected it was simply a transient issue due to some aspect of how the complete file store is split between physical drives, that showed up because of the continuous load.
[11:49:19] Yep, I think it's something to do with swift.
[11:49:28] But… I don't know any details about how the system is set up, so that's just a guess.
[11:49:31] per someone saying that on the task
[11:50:23] Revent I wonder, should a notice go on commons for files not showing up?
[11:51:26] 06Operations, 13Patch-For-Review: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3148090 (10Revent) I reset the transcodes that had been running for egregiously long periods... this included ones running for 10+ hours that should have run in a minute or two....
[11:52:15] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:52:32] Umm… there have been a number of comments, over time. I think most people are aware that if you just wait a while, it then works.
[11:55:01] ok
[11:55:39] I suspect this happens to lots of files where it is never noticed, since the thumbs still show up.
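[Editor's note: the working theory above is that the disappearing originals are "something to do with swift". Below is a hedged sketch of how one could ask Swift directly whether an object exists, using python-swiftclient. Everything specific here - the auth URL, credentials, and the container/object names (sharded by the file-name hash, matching the mwstore path scheme) - is an assumption for illustration, not a description of the production setup.]

```python
#!/usr/bin/env python3
"""Hedged sketch: check whether an original exists in Swift."""
from swiftclient.client import Connection
from swiftclient.exceptions import ClientException


def object_exists(conn: Connection, container: str, obj: str) -> bool:
    """HEAD the object; a 404 means Swift cannot currently find it."""
    try:
        conn.head_object(container, obj)
        return True
    except ClientException as exc:
        if exc.http_status == 404:
            return False
        raise


if __name__ == "__main__":
    conn = Connection(
        authurl="https://swift.example.internal/auth/v1.0",  # hypothetical
        user="mw:media",                                      # hypothetical
        key="secret",                                         # hypothetical
        auth_version="1.0",
    )
    # Hypothetical container/object names following the hashed mwstore layout.
    print(object_exists(conn, "wikipedia-commons-local-public.ef",
                        "e/ef/Ubra_aviao.JPG"))
```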
[11:56:45] oh
[12:20:15] RECOVERY - puppet last run on etcd1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[12:23:05] PROBLEM - puppet last run on db1093 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:43:20] paladox: https://commons.wikimedia.org/wiki/File:Wolno%C5%9B%C4%87_panoramy.webm
[12:43:33] oh
[12:43:45] Has ‘reappeared’, as the original file.
[12:44:04] It was 404 earlier.
[12:44:13] yep
[12:47:27] https://commons.wikimedia.org/wiki/File:Theodor_Kallifatides_fr.webm has also recently reappeared.
[12:50:23] paladox: I see you is poking them. :P
[12:50:45] Oh yep
[12:52:00] Any idea what the problem is?
[12:52:05] RECOVERY - puppet last run on db1093 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[12:52:20] Nope, but most likely something with swift?
[12:52:59] Something I know nothing at all about. :P
[13:05:21] 06Operations, 06Commons, 06Multimedia, 10media-storage: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704#3148177 (10Revent) @ankry Just now, for me at least, that link works. The errors seem to be intermittent.
[13:08:46] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 311 bytes in 3.087 second response time
[13:09:05] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.120 second response time
[13:10:35] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:11:45] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.201 second response time
[13:13:05] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.120 second response time
[13:17:15] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:20:59] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148189 (10Revent) The current list of affected files (at least, of ones with a failed transcode that make them apparent)...
[13:21:16] paladox: ^ another one came back.
[13:21:35] Weird-ass issue
[13:22:21] oh
[13:22:59] That specific file was 404 for at least 5-6 hours...
[13:23:08] oh
[13:23:13] Yeah.
[13:26:46] At least the ‘queued’ backlog is going down at a noticeable rate again.
[13:28:26] yep
[13:29:09] And, yeah, shoving 7k on it at once was a stupid move on my part, I admit that.
[13:29:46] I was expecting that most would be short.
[13:38:35] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[13:40:15] paladox: https://wikitech.wikimedia.org/wiki/Media_storage <- OMFG, have fun with that. :P
[13:40:39] lol
[13:44:20] (03PS1) 10ArielGlenn: get rid of the file_obj, fobj, finfo, fileobj var names [dumps] - 10https://gerrit.wikimedia.org/r/345972
[13:44:44] (03CR) 10jerkins-bot: [V: 04-1] get rid of the file_obj, fobj, finfo, fileobj var names [dumps] - 10https://gerrit.wikimedia.org/r/345972 (owner: 10ArielGlenn)
[13:46:15] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[13:50:27] (03PS2) 10ArielGlenn: get rid of the file_obj, fobj, finfo, fileobj var names [dumps] - 10https://gerrit.wikimedia.org/r/345972
[13:59:25] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:59:45] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:01:16] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[14:01:36] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:16:58] (03PS1) 10Rush: tools: persistent iowait issues [puppet] - 10https://gerrit.wikimedia.org/r/345975
[14:23:53] (03CR) 10Andrew Bogott: [C: 031] tools: persistent iowait issues [puppet] - 10https://gerrit.wikimedia.org/r/345975 (owner: 10Rush)
[14:25:09] (03CR) 10Rush: [C: 032] tools: persistent iowait issues [puppet] - 10https://gerrit.wikimedia.org/r/345975 (owner: 10Rush)
[15:09:08] (03PS3) 10ArielGlenn: get rid of the file_obj, fobj, finfo, fileobj var names [dumps] - 10https://gerrit.wikimedia.org/r/345972
[15:17:28] (03PS4) 10ArielGlenn: get rid of the file_obj, fobj, finfo, fileobj var names [dumps] - 10https://gerrit.wikimedia.org/r/345972
[15:34:05] (03CR) 10ArielGlenn: [C: 032] get rid of the file_obj, fobj, finfo, fileobj var names [dumps] - 10https://gerrit.wikimedia.org/r/345972 (owner: 10ArielGlenn)
[15:45:48] Revent: thanks for T161918
[15:45:48] T161918: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918
[15:46:48] load on the videoscalers seems good from early this morning EU time, good :)
[15:49:48] one thing that we should work on is a reliable setting for all the videoscalers to avoid hhvm choking after saturating all its threads
[15:54:18] 06Operations: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3148299 (10Reedy)
[16:03:37] (03PS1) 10ArielGlenn: clean up use of fdesc, fhandle varnames [dumps] - 10https://gerrit.wikimedia.org/r/345977
[16:05:05] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 25 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:10:15] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4741.40 Read Requests/Sec=3191.20 Write Requests/Sec=279.10 KBytes Read/Sec=30852.40 KBytes_Written/Sec=3872.00
[16:21:48] (03PS2) 10ArielGlenn: clean up use of fdesc, fhandle varnames [dumps] - 10https://gerrit.wikimedia.org/r/345977
[16:23:40] (03PS3) 10ArielGlenn: clean up use of fdesc, fhandle varnames [dumps] - 10https://gerrit.wikimedia.org/r/345977
[16:24:15] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=118.40 Read Requests/Sec=90.90 Write Requests/Sec=0.70 KBytes Read/Sec=706.50 KBytes_Written/Sec=42.00
[16:29:25] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:36:41] 06Operations, 06Office-IT, 07LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3148359 (10mmodell) Relatedly: when an @wikimedia.org email account gets disabled, phabricator mail queues get backed up with attempts to deliver notifications. Example: it appea...
[16:41:42] (03CR) 10ArielGlenn: [C: 032] clean up use of fdesc, fhandle varnames [dumps] - 10https://gerrit.wikimedia.org/r/345977 (owner: 10ArielGlenn)
[16:45:52] (03PS1) 10ArielGlenn: clean up some unused escaped command strings [dumps] - 10https://gerrit.wikimedia.org/r/345980
[16:46:16] (03CR) 10jerkins-bot: [V: 04-1] clean up some unused escaped command strings [dumps] - 10https://gerrit.wikimedia.org/r/345980 (owner: 10ArielGlenn)
[16:47:59] (03PS2) 10ArielGlenn: clean up some unused escaped command strings [dumps] - 10https://gerrit.wikimedia.org/r/345980
[16:52:04] (03CR) 10ArielGlenn: [C: 032] clean up some unused escaped command strings [dumps] - 10https://gerrit.wikimedia.org/r/345980 (owner: 10ArielGlenn)
[16:57:25] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[17:07:25] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:18:25] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:33:15] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:36:25] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[17:42:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[17:46:25] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[17:47:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[17:52:32] (03PS2) 10ArielGlenn: convert to using page ranges for all checkpoint stuff and no checkpoints whatsoever, for checkpoint-file enabled wikis [dumps] - 10https://gerrit.wikimedia.org/r/343542
[17:52:49] (03CR) 10jerkins-bot: [V: 04-1] convert to using page ranges for all checkpoint stuff and no checkpoints whatsoever, for checkpoint-file enabled wikis [dumps] - 10https://gerrit.wikimedia.org/r/343542 (owner: 10ArielGlenn)
[17:55:29] (03PS3) 10ArielGlenn: convert to using page ranges for all checkpoint stuff and no checkpoints whatsoever, for checkpoint-file enabled wikis [dumps] - 10https://gerrit.wikimedia.org/r/343542
[18:01:15] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[18:29:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[18:33:45] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:34:35] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
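[Editor's note: the "IPv6 ping to ulsfo/codfw" checks above count failed RIPE Atlas probes against a threshold of 19. A sketch of pulling the latest results for one of the linked measurements and counting probes that received no replies follows. The API path and the "rcvd" result field are assumptions about the RIPE Atlas v2 API; the measurement ID and threshold are the ones quoted in the alerts.]

```python
#!/usr/bin/env python3
"""Sketch: count failing probes for a RIPE Atlas ping measurement."""
import requests  # assumption: requests is available

MEASUREMENT_ID = 1791309   # the 'IPv6 ping to ulsfo' measurement linked above
ALERT_THRESHOLD = 19       # copied from the alert text


def failed_probe_count(measurement_id):
    """Fetch the latest result per probe and count those with zero replies."""
    # Assumed endpoint; see the RIPE Atlas API documentation for the exact path.
    url = f"https://atlas.ripe.net/api/v2/measurements/{measurement_id}/latest/"
    results = requests.get(url, timeout=30).json()
    failed = sum(1 for r in results if r.get("rcvd", 0) == 0)
    return failed, len(results)


if __name__ == "__main__":
    failed, total = failed_probe_count(MEASUREMENT_ID)
    state = "CRITICAL" if failed > ALERT_THRESHOLD else "OK"
    print(f"{state} - failed {failed} probes of {total} (alerts on {ALERT_THRESHOLD})")
```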
[18:34:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[18:37:25] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[18:37:36] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:59:15] PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:01:33] !log restart hhvm on mw1191 (dump debug in /tmp/hhvm.16619.bt.) - threads stuck in HPHP::Treadmill::getAgeOldestRequest
[19:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:45] RECOVERY - Nginx local proxy to apache on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.033 second response time
[19:03:45] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 71701 bytes in 0.175 second response time
[19:03:45] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.166 second response time
[19:09:55] (03PS1) 10Dereckson: Enable NewUserMessage on tr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345982 (https://phabricator.wikimedia.org/T161962)
[19:14:14] (03PS1) 10Dereckson: Enable AbuseFilter blocks on tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345983 (https://phabricator.wikimedia.org/T161960)
[19:27:15] RECOVERY - puppet last run on rdb1005 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[19:37:55] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:41:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[19:46:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[19:48:32] (03PS4) 10ArielGlenn: convert to using page ranges for all checkpoint stuff and no checkpoints whatsoever, for checkpoint-file enabled wikis [dumps] - 10https://gerrit.wikimedia.org/r/343542 (https://phabricator.wikimedia.org/T160507)
[19:49:32] (03CR) 10ArielGlenn: [C: 032] convert to using page ranges for all checkpoint stuff and no checkpoints whatsoever, for checkpoint-file enabled wikis [dumps] - 10https://gerrit.wikimedia.org/r/343542 (https://phabricator.wikimedia.org/T160507) (owner: 10ArielGlenn)
[19:53:49] (03PS1) 10ArielGlenn: retry failed page content pieces immediately after page content step completes [dumps] - 10https://gerrit.wikimedia.org/r/345985 (https://phabricator.wikimedia.org/T160507)
[20:03:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:04:45] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:05:55] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[20:13:51] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:32:45] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[20:33:15] PROBLEM - puppet last run on ms-be1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:35:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:40:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[21:02:15] RECOVERY - puppet last run on ms-be1031 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[21:35:05] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[21:44:06] RainbowSprinkles hi, the index on gerrit seems inconsistent: my patch on https://gerrit.wikimedia.org/r/#/q/status:open shows a merge conflict, but I just rebased it and it doesn't show a merge conflict when viewing the patch.
[21:49:23] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit Index inconsistent - https://phabricator.wikimedia.org/T161966#3148604 (10Paladox)
[21:50:27] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit Index inconsistent - https://phabricator.wikimedia.org/T161966#3148617 (10Paladox)
[21:50:35] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:52:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[21:54:42] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit Index inconsistent - https://phabricator.wikimedia.org/T161966#3148622 (10Paladox) 05Open>03declined It seems gerrit has super powers and managed to figure out a merge conflict just as a change was being merged.
[21:57:05] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:02:05] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:09:05] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:14:05] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:18:26] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:18:45] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[22:22:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[22:31:05] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:36:05] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:47:25] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[23:01:26] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148639 (10GermanJoe) Welcome back: https://commons.wikimedia.org/wiki/File:Yaroslava_Shvedova.JPG
[23:39:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[23:44:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[23:54:15] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=766.40 Read Requests/Sec=5490.70 Write Requests/Sec=13.80 KBytes Read/Sec=22134.80 KBytes_Written/Sec=229.20
[23:54:49] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148644 (10Revent) Of the ones I listed before (videos), these are now back... https://commons.wikimedia.org/wiki/File:X5...