[00:25:35] PROBLEM - configured eth on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:36] PROBLEM - SSH on ms-be1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:25:45] PROBLEM - MD RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:25:45] PROBLEM - puppet last run on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:27:25] PROBLEM - very high load average likely xfs on ms-be1016 is CRITICAL: CRITICAL - load average: 184.39, 110.14, 56.82
[00:32:05] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:42:25] PROBLEM - puppet last run on mc1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:10:25] RECOVERY - puppet last run on mc1001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[01:23:00] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3144779 (10Pokefan95) 404 error: https://commons.wikimedia.org/wiki/File:School_Gyrls_at_Paramount_Studios.jpg
[01:27:00] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3144779 (10GermanJoe) Possibly another related case reported by a different editor: https://commons.wikimedia.org/wiki/File:Yar...
[01:28:07] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147812 (10Pokefan95) p:05High>03Unbreak! Setting back to UBN!, due to the number of duplicate tasks and that there may be...
[01:29:45] PROBLEM - NTP on ms-be1016 is CRITICAL: NTP CRITICAL: No response from NTP server
[01:29:55] PROBLEM - Postgres Replication Lag on maps1003 is CRITICAL: CRITICAL - Rep Delay is: 81605.599124 Seconds
[01:30:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 81607.166043 Seconds
[01:30:35] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 81637.348128 Seconds
[01:32:15] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 82364.681359 Seconds
[01:32:15] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 82364.686004 Seconds
[01:32:15] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 82364.686071 Seconds
[01:34:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:35:55] PROBLEM - swift-container-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:05] PROBLEM - swift-object-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:05] PROBLEM - swift-account-reaper on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:05] PROBLEM - dhclient process on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:06] PROBLEM - swift-object-updater on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:06] PROBLEM - Check size of conntrack table on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
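[Editor's note: the burst of "CHECK_NRPE: Socket timeout after 10 seconds" alerts above means Icinga got no answer from the NRPE agent on ms-be1016 within its 10-second budget. Below is a minimal illustrative sketch, not the actual check_nrpe plugin, of probing whether the agent's TCP port even accepts a connection; the internal FQDN and the conventional NRPE port 5666 are assumptions, not values taken from this log.]

```python
#!/usr/bin/env python3
"""Rough connectivity probe for an unresponsive NRPE agent (illustrative only)."""
import socket
import sys

HOST = "ms-be1016.eqiad.wmnet"   # assumption: internal FQDN of the host in the alerts
PORT = 5666                      # conventional NRPE port (assumption)
TIMEOUT = 10                     # matches the timeout quoted in the alerts


def probe(host: str, port: int, timeout: float) -> bool:
    """Return True if a TCP connection succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print(f"CRITICAL: cannot reach {host}:{port} - {exc}", file=sys.stderr)
        return False


if __name__ == "__main__":
    sys.exit(0 if probe(HOST, PORT, TIMEOUT) else 2)
```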
[01:36:06] PROBLEM - DPKG on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:15] PROBLEM - swift-account-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:16] PROBLEM - swift-container-updater on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:25] PROBLEM - swift-container-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:26] PROBLEM - swift-object-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:26] PROBLEM - swift-container-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:35] PROBLEM - swift-object-replicator on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:35] PROBLEM - swift-account-auditor on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:35] PROBLEM - swift-account-server on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:36:45] PROBLEM - salt-minion processes on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:37:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 82027.368571 Seconds
[01:40:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:41:35] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:43:05] PROBLEM - Postgres Replication Lag on maps1004 is CRITICAL: CRITICAL - Rep Delay is: 82387.120761 Seconds
[01:44:35] PROBLEM - Postgres Replication Lag on maps1002 is CRITICAL: CRITICAL - Rep Delay is: 82477.398481 Seconds
[01:49:15] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 0.0 Seconds
[01:49:15] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 0.0 Seconds
[01:49:15] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 0.0 Seconds
[01:52:15] PROBLEM - Postgres Replication Lag on maps2003 is CRITICAL: CRITICAL - Rep Delay is: 83564.601739 Seconds
[01:52:15] PROBLEM - Postgres Replication Lag on maps2004 is CRITICAL: CRITICAL - Rep Delay is: 83564.602942 Seconds
[01:52:15] PROBLEM - Postgres Replication Lag on maps2002 is CRITICAL: CRITICAL - Rep Delay is: 83564.602264 Seconds
[01:55:15] RECOVERY - Postgres Replication Lag on maps2004 is OK: OK - Rep Delay is: 38.869574 Seconds
[01:55:15] RECOVERY - Postgres Replication Lag on maps2002 is OK: OK - Rep Delay is: 38.871256 Seconds
[01:55:15] RECOVERY - Postgres Replication Lag on maps2003 is OK: OK - Rep Delay is: 38.875634 Seconds
[01:55:35] RECOVERY - Postgres Replication Lag on maps1002 is OK: OK - Rep Delay is: 13.950231 Seconds
[01:55:55] RECOVERY - Postgres Replication Lag on maps1003 is OK: OK - Rep Delay is: 41.998827 Seconds
[01:56:05] RECOVERY - Postgres Replication Lag on maps1004 is OK: OK - Rep Delay is: 43.714312 Seconds
[01:59:38] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3144779 (10De728631) This seems to be a gradual process. When I [[ https://commons.wikimedia.org/w/index.php?title=Commons:Admi...
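[Editor's note: the "Rep Delay is: NNNNN Seconds" checks above report how far each maps replica has fallen behind its primary. A minimal sketch of the usual way such a delay is measured on a Postgres standby follows - comparing now() with pg_last_xact_replay_timestamp(). The use of psycopg2, the DSN, and the alert threshold are assumptions for illustration; the actual Icinga plugin may be implemented differently.]

```python
#!/usr/bin/env python3
"""Sketch: measure replication delay on a Postgres standby (assumptions noted inline)."""
import psycopg2  # assumption: psycopg2 is available where this runs

# COALESCE covers the corner case where nothing has been replayed yet (NULL).
LAG_QUERY = """
SELECT COALESCE(EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()), 0)
"""


def replication_lag_seconds(dsn: str) -> float:
    """Return the replay delay in seconds as seen on the standby itself."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(LAG_QUERY)
            return float(cur.fetchone()[0])


if __name__ == "__main__":
    # Hypothetical DSN; real connection details for maps100x are not in this log.
    lag = replication_lag_seconds("host=maps1004 dbname=postgres user=monitor")
    state = "OK" if lag < 1800 else "CRITICAL"  # threshold is illustrative only
    print(f"{state} - Rep Delay is: {lag:.6f} Seconds")
```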
[02:03:26] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:05:15] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.002 second response time
[02:07:35] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:32:11] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.18) (duration: 13m 27s)
[02:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:35:35] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[02:37:31] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Apr 1 02:37:30 UTC 2017 (duration 5m 20s)
[02:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:40:15] PROBLEM - puppet last run on puppetmaster2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:05:26] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:06:26] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:07:16] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.005 second response time
[03:08:15] RECOVERY - puppet last run on puppetmaster2002 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[03:08:16] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.003 second response time
[03:30:42] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147879 (10zhuyifei1999) Also https://commons.wikimedia.org/wiki/Commons:Village_pump#Something_funny_with_this_image https://c...
[03:59:25] PROBLEM - puppet last run on elastic1050 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:27:25] RECOVERY - puppet last run on elastic1050 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[04:46:42] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3144779 (10MaxBioHazard) Also https://commons.wikimedia.org/wiki/File:Vladimir_Frolochkin.JPG
[04:52:25] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0; xe-3/3/3: down - Peering: Equinix Ashburn Exchange {#2648} [10Gbps]
[04:57:35] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 85 probes of 429 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[04:58:06] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 47 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[05:02:35] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 429 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map
[05:03:06] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 16 probes of 273 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[05:08:35] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0
[06:24:15] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[06:25:50] (03PS1) 10ArielGlenn: disable full dumps cron job for a bit [puppet] - 10https://gerrit.wikimedia.org/r/345956
[06:29:01] (03CR) 10ArielGlenn: [C: 032] disable full dumps cron job for a bit [puppet] - 10https://gerrit.wikimedia.org/r/345956 (owner: 10ArielGlenn)
[06:30:15] sigh now mw1259 alarms more than mw1169..
[06:34:31] 06Operations, 13Patch-For-Review: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3147929 (10elukey) >>! In T161918#3147750, @brion wrote: > Note there are probably a lot of jobs in the non-prioritized queue still backed up last I looked; Note I also have a f...
[06:53:15] RECOVERY - puppet last run on netmon1001 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[07:04:46] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147934 (10Ladsgroup)
[07:14:22] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3147937 (10Ankry) >>! In T161836#3147844, @De728631 wrote: > Update: on the other hand, https://commons.wikimedia.org/wiki/File...
[08:02:15] PROBLEM - puppet last run on analytics1037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:16:09] Error deleting file: Could not delete file "mwstore://local-swift-eqiad/local-public/e/ef/Ubra_aviao.JPG". :/
[08:16:33] Hey, just throwing this out there, but I recently reset a bunch of ‘stuck’ transcodes… ones that had supposedly been running anywhere from 10-20 hours… most of them should have run in a matter of seconds. Many of them then failed with ‘source not found’ in the error field, and when checked, the ‘original file’ gave a 404.
[08:17:33] It seems, however, that after a half-hour or so at least some of them are reappearing, so I’m throwing them back in again once the ‘original’ is playable.
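[Editor's note: the "Could not delete file mwstore://local-swift-eqiad/local-public/e/ef/Ubra_aviao.JPG" error above uses MediaWiki's hashed upload layout: the directory components are the first one and first two hex digits of the MD5 of the underscore-normalized file name. A small sketch reproducing that mapping follows; it only illustrates where the "e/ef/" component comes from and says nothing about why the underlying Swift object went missing.]

```python
#!/usr/bin/env python3
"""Reproduce MediaWiki's hashed upload path for an original file.

With hashed upload directories enabled, originals live under
<first md5 hex digit>/<first two md5 hex digits>/<name>.
"""
import hashlib


def hashed_path(title: str) -> str:
    """Return the relative storage path MediaWiki uses for an original."""
    name = title.replace(" ", "_")  # titles are stored with underscores
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[0:2]}/{name}"


if __name__ == "__main__":
    # Expected to print e/ef/Ubra_aviao.JPG, matching the mwstore error above.
    print(hashed_path("Ubra aviao.JPG"))
```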
[08:19:08] (and they indeed seem to be actually running, if they go into the ‘priority’ queue… looks like maybe something related to the occasional bug that Josve05a is nagging about. :P)
[08:19:43] I don't nag... too much... :P
[08:19:50] Teasing.
[08:20:32] I kinda messed up about a week ago, and threw ‘way’ too many of the uninitialized transcodes on the fire at once.
[08:24:25] PROBLEM - puppet last run on mc1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:31:15] RECOVERY - puppet last run on analytics1037 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[08:41:35] PROBLEM - puppet last run on ms-be1038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[08:52:25] RECOVERY - puppet last run on mc1032 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[08:57:47] hi
[09:08:25] PROBLEM - puppet last run on elastic1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:10:35] RECOVERY - puppet last run on ms-be1038 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:14:25] (03PS1) 10ArielGlenn: update dumpadmin to use the new json reporting routines [dumps] - 10https://gerrit.wikimedia.org/r/345958
[09:36:25] RECOVERY - puppet last run on elastic1030 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[09:40:50] (03CR) 10ArielGlenn: [C: 032] update dumpadmin to use the new json reporting routines [dumps] - 10https://gerrit.wikimedia.org/r/345958 (owner: 10ArielGlenn)
[09:50:40] (03PS1) 10ArielGlenn: convert RunInfoFile to RunInfo for class, methods, attrs [dumps] - 10https://gerrit.wikimedia.org/r/345961
[09:53:10] (03CR) 10ArielGlenn: [C: 032] convert RunInfoFile to RunInfo for class, methods, attrs [dumps] - 10https://gerrit.wikimedia.org/r/345961 (owner: 10ArielGlenn)
[09:56:35] (03PS1) 10ArielGlenn: convert "NoticeFile" to "Notice" for class, methods, attrs [dumps] - 10https://gerrit.wikimedia.org/r/345962
[09:57:56] (03CR) 10ArielGlenn: [C: 032] convert "NoticeFile" to "Notice" for class, methods, attrs [dumps] - 10https://gerrit.wikimedia.org/r/345962 (owner: 10ArielGlenn)
[10:05:06] (03PS2) 10ArielGlenn: move xml content dump jobs out to separate module [dumps] - 10https://gerrit.wikimedia.org/r/343539
[10:15:27] (03PS3) 10ArielGlenn: move xml content dump jobs out to separate module [dumps] - 10https://gerrit.wikimedia.org/r/343539
[10:16:35] PROBLEM - parsoid on wtp1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:17:25] RECOVERY - parsoid on wtp1015 is OK: HTTP OK: HTTP/1.1 200 OK - 1014 bytes in 0.014 second response time
[10:18:15] 06Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3148017 (10Esc3300)
[10:18:53] 06Operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 13Patch-For-Review: Create Wikipedia Doteli - https://phabricator.wikimedia.org/T161529#3134452 (10Esc3300) https://www.wikidata.org/wiki/Wikidata:Bot_requests#Import_interwikis_from_Doteli_Wikipedia
[10:20:09] (03CR) 10ArielGlenn: [C: 032] move xml content dump jobs out to separate module [dumps] - 10https://gerrit.wikimedia.org/r/343539 (owner: 10ArielGlenn)
[10:21:25] PROBLEM - puppet last run on cp1062 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:23:31] (03PS2) 10ArielGlenn: move prefetch-finding code out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343540
[10:23:51] (03CR) 10jerkins-bot: [V: 04-1] move prefetch-finding code out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343540 (owner: 10ArielGlenn)
[10:32:15] (03PS3) 10ArielGlenn: move prefetch-finding code out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343540
[10:39:31] (03PS1) 10Urbanecm: Add rollback user group in fawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345964 (https://phabricator.wikimedia.org/T161946)
[10:44:06] (03CR) 10ArielGlenn: [C: 032] move prefetch-finding code out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343540 (owner: 10ArielGlenn)
[10:46:50] (03PS1) 10Urbanecm: Add autopatrolled group to svwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345965 (https://phabricator.wikimedia.org/T161919)
[10:48:01] (03PS2) 10ArielGlenn: move stub-related code in page content dumps out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343541
[10:48:28] (03CR) 10jerkins-bot: [V: 04-1] move stub-related code in page content dumps out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343541 (owner: 10ArielGlenn)
[10:49:25] RECOVERY - puppet last run on cp1062 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[10:50:37] (03PS3) 10ArielGlenn: move stub-related code in page content dumps out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343541
[10:58:31] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148059 (10Paladox)
[10:58:37] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148063 (10Aklapper)
[11:01:06] 06Operations, 06Commons, 10media-storage: Commons File:Assemblea_Costituente_1946_(2).svg missing after file move - https://phabricator.wikimedia.org/T161476#3132721 (10Paladox) Could this be related to T161836 ?
[11:03:09] (03CR) 10ArielGlenn: [C: 032] move stub-related code in page content dumps out to separate class [dumps] - 10https://gerrit.wikimedia.org/r/343541 (owner: 10ArielGlenn)
[11:07:42] (03PS1) 10ArielGlenn: DumpFile class name change; too confusing when we also have DumpFilename [dumps] - 10https://gerrit.wikimedia.org/r/345966
[11:10:05] mutante: Since you is on duty, poking you…
[11:10:10] Perhaps you can explain.
[11:10:43] https://commons.wikimedia.org/wiki/File:Wolno%C5%9B%C4%87_panoramy.webm for example...
[11:11:48] The 480p transcode had failed, as ‘source file not found’… an hour or two ago, when trying to play the ‘original file’, it gave a 404… now it has reappeared, and the transcode works.
[11:12:36] This… (the original disappearing, causing a transcode to fail, and then reappearing) has happened with about 40 files that I have seen so far.
[11:12:54] Presumably it has also happened with others that did not have pending transcodes.
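[Editor's note: the check being described by hand here is simply "does the 'original file' URL on upload.wikimedia.org currently return a 404, and does it later come back". A small sketch of that probe follows; the requests library and the User-Agent string are assumptions, and the sample URL is the original-file link pasted a little further down in this conversation.]

```python
#!/usr/bin/env python3
"""Sketch: probe whether an 'original file' URL currently 404s."""
import requests  # assumption: requests is available

ORIGINALS = [
    # Original-file URL quoted later in this log.
    "https://upload.wikimedia.org/wikipedia/commons/9/98/"
    "%D0%A1%D0%92-%D0%94%D0%9D%D0%A0-564._%D0%9D%D0%BE%D0%B2%D1%8B%D0%B9_"
    "%D0%B3%D0%BE%D0%B4_%D0%B2_%D0%94%D0%BE%D0%BD%D0%B5%D1%86%D0%BA%D0%B5.webm",
]


def check(url: str) -> None:
    """HEAD the URL and print its status code; scripted requests should identify themselves."""
    resp = requests.head(
        url,
        allow_redirects=True,
        timeout=10,
        headers={"User-Agent": "original-404-probe (example script)"},
    )
    print(f"{resp.status_code}  {url}")


if __name__ == "__main__":
    for url in ORIGINALS:
        check(url)
```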
[11:13:25] (03CR) 10ArielGlenn: [C: 032] DumpFile class name change; too confusing when we also have DumpFilename [dumps] - 10https://gerrit.wikimedia.org/r/345966 (owner: 10ArielGlenn)
[11:14:11] https://commons.wikimedia.org/wiki/File:X5Flare_AIA193.webm is still broken
[11:14:43] https://commons.wikimedia.org/wiki/File:Theodor_Kallifatides_fr.webm is another one that was ‘broken’, but has now reappeared.
[11:15:24] https://commons.wikimedia.org/wiki/File:T%C5%AFn%C4%9B_pod_mal%C3%BDm_vodop%C3%A1dem_Peri%C4%8Dn%C3%ADk_a_koup%C3%A1n%C3%AD.webm is another that is still broken.
[11:16:09] https://commons.wikimedia.org/wiki/File:%D0%A1%D0%92-%D0%94%D0%9D%D0%A0-564._%D0%9D%D0%BE%D0%B2%D1%8B%D0%B9_%D0%B3%D0%BE%D0%B4_%D0%B2_%D0%94%D0%BE%D0%BD%D0%B5%D1%86%D0%BA%D0%B5.webm also still broken.
[11:16:31] https://commons.wikimedia.org/wiki/File:%D0%A1%D0%92-%D0%94%D0%9D%D0%A0-564._%D0%9D%D0%BE%D0%B2%D1%8B%D0%B9_%D0%B3%D0%BE%D0%B4_%D0%B2_%D0%94%D0%BE%D0%BD%D0%B5%D1%86%D0%BA%D0%B5.webm also broken.
[11:31:09] Revent hi it's a known issue
[11:31:24] paladox: Good.
[11:31:25] oh sorry, you're talking about webm
[11:31:32] those haven't been reported yet
[11:31:43] paladox: It’s not specific to those, afaik
[11:31:46] Revent see https://phabricator.wikimedia.org/T161836
[11:32:18] paladox: They seem to reappear after a while.
[11:32:27] Revent the video you're reporting as broken is working for me https://commons.wikimedia.org/wiki/File:СВ-ДНР-564._Новый_год_в_Донецке.webm
[11:33:17] paladox: Don't just ‘play it’, you'll see one of the transcodes (per whatever your options are). Hit the ‘original file’ link
[11:33:32] https://upload.wikimedia.org/wikipedia/commons/9/98/%D0%A1%D0%92-%D0%94%D0%9D%D0%A0-564._%D0%9D%D0%BE%D0%B2%D1%8B%D0%B9_%D0%B3%D0%BE%D0%B4_%D0%B2_%D0%94%D0%BE%D0%BD%D0%B5%D1%86%D0%BA%D0%B5.webm
[11:34:34] Revent ah, that's being reported in https://phabricator.wikimedia.org/T161836
[11:34:39] could you also add it there too
[11:34:48] paladox: When I woke up earlier, I checked on TMH… there were a bunch of files that had 10, 20 hour transcode times.
[11:34:56] oh
[11:35:02] I reset them, and they failed.
[11:35:10] there was high load last night on two of the video transcoders I think
[11:35:24] When I investigated, they were all ‘source file missing’ in the db...
[11:35:46] ‘most’ have since reappeared, and I then reset them to
[11:35:46] Ah, yeah, that sounds similar to what's being reported in https://phabricator.wikimedia.org/T161836
[11:36:12] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148077 (10Paladox) Users are reporting webm videos as breaking too when trying to re-transcode them.
[11:36:44] (nods) I think what is interesting is that they seem to (eventually) reappear… afaik without operator intervention.
[11:37:10] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148078 (10Paladox)
[11:37:39] It seems similar to what people have in the past reported about errors when deleting and then restoring files, that they are ‘not found’, but if you wait a while it works.
[11:37:45] Revent yep. We have had a ton of duplicate tasks because it seems files disappear then reappear
[11:38:26] paladox: I try to bug you guys before actually opening tasks, for that reason.
[11:38:53] As far as the scaler load...
[11:38:55] ok thanks :)
[11:38:56] Umm…
[11:39:13] I messed up about a week ago, the ‘not priority’ backlog is my fault.
[11:39:14] There's a task for the transcodes being high load
[11:39:16] let me find it
[11:39:33] In that I reset way too many of the ‘uninitialized’ at once.
[11:39:35] Revent https://phabricator.wikimedia.org/T161918
[11:40:29] They have been running, but I've been trying to somewhat babysit them, as in watching for any that got stuck or anything.
[11:40:44] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148080 (10Paladox) @Revent reported these urls https://commons.wikimedia.org/wiki/File:T%C5%AFn%C4%9B_pod_mal%C3%BDm_vo...
[11:41:04] Revent thanks :)
[11:41:16] Hence noticing the ’10-20 hour’ tasks, which I strongly suspect got stuck due to the source files going *poof* temporarily.
[11:41:42] Yep, I guess that's one of the reasons for high load
[11:42:03] https://quarry.wmflabs.org/query/17726
[11:42:05] FYI
[11:42:25] Everything above “Circumcision” is some other problem.
[11:42:43] The ones below that are due to this 404 bug.
[11:43:16] oh
[11:43:32] Hey, the list ‘was’ like 90 or so… :P
[11:43:56] 06Operations, 13Patch-For-Review: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3148083 (10Paladox) The high load may be caused by T161836 as @Revent has noticed transcodes showing as 10 - 20 hours. He also found some videos were failing to transcode with err...
[11:43:57] Like I said, they seem to come back after a while.
[11:44:08] Yep, other users have noticed that too
[11:44:34] Revent it may be related to maint work here https://phabricator.wikimedia.org/T161836#3146271
[11:47:09] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148087 (10Revent) The video 'works' because when you simply play it, you view some transcode based on your preferences....
[11:47:16] ^ that
[11:48:22] thanks
[11:48:59] paladox: Since they reappeared, I suspected it was simply a transient issue due to some aspect of how the complete file store is split between physical drives, that showed up because of the continuous load.
[11:49:19] Yep, I think it's something to do with swift.
[11:49:28] But… I don't know any details about how the system is set up, so that's just a guess.
[11:49:31] per someone saying that on the task
[11:50:23] Revent I wonder, should a notice go on commons for files not showing up?
[11:51:26] 06Operations, 13Patch-For-Review: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3148090 (10Revent) I reset the transcodes that had been running for egregiously long periods... this included ones running for 10+ hours that should have run in a minute or two....
[11:52:15] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[11:52:32] Umm… there have been a number of comments, over time. I think most people are aware that if you just wait a while, it then works.
[11:55:01] ok
[11:55:39] I suspect this happens to lots of files where it is never noticed, since the thumbs still show up.
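[Editor's note: the working theory above is that the disappearing originals are "something to do with swift". Below is a hedged sketch of how one could ask Swift directly whether an object exists, using python-swiftclient. Everything specific here - the auth URL, credentials, and the container/object names (sharded by the file-name hash, matching the mwstore path scheme) - is an assumption for illustration, not a description of the production setup.]

```python
#!/usr/bin/env python3
"""Hedged sketch: check whether an original exists in Swift."""
from swiftclient.client import Connection
from swiftclient.exceptions import ClientException


def object_exists(conn: Connection, container: str, obj: str) -> bool:
    """HEAD the object; a 404 means Swift cannot currently find it."""
    try:
        conn.head_object(container, obj)
        return True
    except ClientException as exc:
        if exc.http_status == 404:
            return False
        raise


if __name__ == "__main__":
    conn = Connection(
        authurl="https://swift.example.internal/auth/v1.0",  # hypothetical
        user="mw:media",                                      # hypothetical
        key="secret",                                         # hypothetical
        auth_version="1.0",
    )
    # Hypothetical container/object names following the hashed mwstore layout.
    print(object_exists(conn, "wikipedia-commons-local-public.ef",
                        "e/ef/Ubra_aviao.JPG"))
```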
[11:56:45] oh
[12:20:15] RECOVERY - puppet last run on etcd1001 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[12:23:05] PROBLEM - puppet last run on db1093 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:43:20] paladox: https://commons.wikimedia.org/wiki/File:Wolno%C5%9B%C4%87_panoramy.webm
[12:43:33] oh
[12:43:45] Has ‘reappeared’, as the original file.
[12:44:04] It was 404 earlier.
[12:44:13] yep
[12:47:27] https://commons.wikimedia.org/wiki/File:Theodor_Kallifatides_fr.webm has also recently reappeared.
[12:50:23] paladox: I see you is poking them. :P
[12:50:45] Oh yep
[12:52:00] Any idea what the problem is?
[12:52:05] RECOVERY - puppet last run on db1093 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[12:52:20] Nope, but most likely something with swift?
[12:52:59] Something I know nothing at all about. :P
[13:05:21] 06Operations, 06Commons, 06Multimedia, 10media-storage: Storage backend errors on commons when deleting/restoring pages - https://phabricator.wikimedia.org/T141704#3148177 (10Revent) @ankry Just now, for me at least, that link works. The errors seem to be intermittent.
[13:08:46] PROBLEM - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/etcd/k8s - 311 bytes in 3.087 second response time
[13:09:05] PROBLEM - All k8s worker nodes are healthy on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/k8s/nodes/ready - 185 bytes in 0.120 second response time
[13:10:35] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:11:45] RECOVERY - All k8s etcd nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 1.201 second response time
[13:13:05] RECOVERY - All k8s worker nodes are healthy on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.120 second response time
[13:17:15] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[13:20:59] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148189 (10Revent) The current list of affected files (at least, of ones with a failed transcode that make them apparent)...
[13:21:16] paladox: ^ another one came back.
[13:21:35] Weird-ass issue
[13:22:21] oh
[13:22:59] That specific file was 404 for at least 5-6 hours...
[13:23:08] oh
[13:23:13] Yeah.
[13:26:46] At least the ‘queued’ backlog is going down at a noticeable rate again.
[13:28:26] yep
[13:29:09] And, yeah, shoving 7k on it at once was a stupid move on my part, I admit that.
[13:29:46] I was expecting that most would be short.
[13:38:35] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[13:40:15] paladox: https://wikitech.wikimedia.org/wiki/Media_storage <- OMFG, have fun with that. :P
[13:40:39] lol
[13:44:20] (03PS1) 10ArielGlenn: get rid of the file_obj, fobj, finfo, fileobj var names [dumps] - 10https://gerrit.wikimedia.org/r/345972
[13:44:44] (03CR) 10jerkins-bot: [V: 04-1] get rid of the file_obj, fobj, finfo, fileobj var names [dumps] - 10https://gerrit.wikimedia.org/r/345972 (owner: 10ArielGlenn)
[13:46:15] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[13:50:27] (03PS2) 10ArielGlenn: get rid of the file_obj, fobj, finfo, fileobj var names [dumps] - 10https://gerrit.wikimedia.org/r/345972
[13:59:25] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:59:45] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:01:16] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[14:01:36] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:16:58] (03PS1) 10Rush: tools: persistent iowait issues [puppet] - 10https://gerrit.wikimedia.org/r/345975
[14:23:53] (03CR) 10Andrew Bogott: [C: 031] tools: persistent iowait issues [puppet] - 10https://gerrit.wikimedia.org/r/345975 (owner: 10Rush)
[14:25:09] (03CR) 10Rush: [C: 032] tools: persistent iowait issues [puppet] - 10https://gerrit.wikimedia.org/r/345975 (owner: 10Rush)
[15:09:08] (03PS3) 10ArielGlenn: get rid of the file_obj, fobj, finfo, fileobj var names [dumps] - 10https://gerrit.wikimedia.org/r/345972
[15:17:28] (03PS4) 10ArielGlenn: get rid of the file_obj, fobj, finfo, fileobj var names [dumps] - 10https://gerrit.wikimedia.org/r/345972
[15:34:05] (03CR) 10ArielGlenn: [C: 032] get rid of the file_obj, fobj, finfo, fileobj var names [dumps] - 10https://gerrit.wikimedia.org/r/345972 (owner: 10ArielGlenn)
[15:45:48] Revent: thanks for T161918
[15:45:48] T161918: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918
[15:46:48] load on the videoscalers seems good from early this morning EU time, good :)
[15:49:48] one thing that we should work on is a reliable setting for all the videoscalers to avoid hhvm choking after saturating all its threads
[15:54:18] 06Operations: videoscalers (mw1168, mw1169) - high load / overheating - https://phabricator.wikimedia.org/T161918#3148299 (10Reedy)
[16:03:37] (03PS1) 10ArielGlenn: clean up use of fdesc, fhandle varnames [dumps] - 10https://gerrit.wikimedia.org/r/345977
[16:05:05] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 25 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[16:10:15] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=4741.40 Read Requests/Sec=3191.20 Write Requests/Sec=279.10 KBytes Read/Sec=30852.40 KBytes_Written/Sec=3872.00
[16:21:48] (03PS2) 10ArielGlenn: clean up use of fdesc, fhandle varnames [dumps] - 10https://gerrit.wikimedia.org/r/345977
[16:23:40] (03PS3) 10ArielGlenn: clean up use of fdesc, fhandle varnames [dumps] - 10https://gerrit.wikimedia.org/r/345977
[16:24:15] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=118.40 Read Requests/Sec=90.90 Write Requests/Sec=0.70 KBytes Read/Sec=706.50 KBytes_Written/Sec=42.00
[16:29:25] PROBLEM - puppet last run on wtp1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:36:41] 06Operations, 06Office-IT, 07LDAP: Remove disabled users from internal mailing lists - https://phabricator.wikimedia.org/T161004#3148359 (10mmodell) Relatedly: when an @wikimedia.org email account gets disabled, phabricator mail queues get backed up with attempts to deliver notifications. Example: it appea...
[16:41:42] (03CR) 10ArielGlenn: [C: 032] clean up use of fdesc, fhandle varnames [dumps] - 10https://gerrit.wikimedia.org/r/345977 (owner: 10ArielGlenn)
[16:45:52] (03PS1) 10ArielGlenn: clean up some unused escaped command strings [dumps] - 10https://gerrit.wikimedia.org/r/345980
[16:46:16] (03CR) 10jerkins-bot: [V: 04-1] clean up some unused escaped command strings [dumps] - 10https://gerrit.wikimedia.org/r/345980 (owner: 10ArielGlenn)
[16:47:59] (03PS2) 10ArielGlenn: clean up some unused escaped command strings [dumps] - 10https://gerrit.wikimedia.org/r/345980
[16:52:04] (03CR) 10ArielGlenn: [C: 032] clean up some unused escaped command strings [dumps] - 10https://gerrit.wikimedia.org/r/345980 (owner: 10ArielGlenn)
[16:57:25] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[17:07:25] PROBLEM - puppet last run on labvirt1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:18:25] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:33:15] PROBLEM - puppet last run on rutherfordium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[17:36:25] RECOVERY - puppet last run on labvirt1006 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[17:42:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[17:46:25] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures
[17:47:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[17:52:32] (03PS2) 10ArielGlenn: convert to using page ranges for all checkpoint stuff and no checkpoints whatsoever, for checkpoint-file enabled wikis [dumps] - 10https://gerrit.wikimedia.org/r/343542
[17:52:49] (03CR) 10jerkins-bot: [V: 04-1] convert to using page ranges for all checkpoint stuff and no checkpoints whatsoever, for checkpoint-file enabled wikis [dumps] - 10https://gerrit.wikimedia.org/r/343542 (owner: 10ArielGlenn)
[17:55:29] (03PS3) 10ArielGlenn: convert to using page ranges for all checkpoint stuff and no checkpoints whatsoever, for checkpoint-file enabled wikis [dumps] - 10https://gerrit.wikimedia.org/r/343542
[18:01:15] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[18:29:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[18:33:45] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:34:35] PROBLEM - dhclient process on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
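[Editor's note: the "IPv6 ping to ulsfo/codfw" checks above count failed RIPE Atlas probes against a threshold of 19. A sketch of pulling the latest results for one of the linked measurements and counting probes that received no replies follows. The API path and the "rcvd" result field are assumptions about the RIPE Atlas v2 API; the measurement ID and threshold are the ones quoted in the alerts.]

```python
#!/usr/bin/env python3
"""Sketch: count failing probes for a RIPE Atlas ping measurement."""
import requests  # assumption: requests is available

MEASUREMENT_ID = 1791309   # the 'IPv6 ping to ulsfo' measurement linked above
ALERT_THRESHOLD = 19       # copied from the alert text


def failed_probe_count(measurement_id):
    """Fetch the latest result per probe and count those with zero replies."""
    # Assumed endpoint; see the RIPE Atlas API documentation for the exact path.
    url = f"https://atlas.ripe.net/api/v2/measurements/{measurement_id}/latest/"
    results = requests.get(url, timeout=30).json()
    failed = sum(1 for r in results if r.get("rcvd", 0) == 0)
    return failed, len(results)


if __name__ == "__main__":
    failed, total = failed_probe_count(MEASUREMENT_ID)
    state = "CRITICAL" if failed > ALERT_THRESHOLD else "OK"
    print(f"{state} - failed {failed} probes of {total} (alerts on {ALERT_THRESHOLD})")
```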
[18:34:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[18:37:25] RECOVERY - dhclient process on thumbor1001 is OK: PROCS OK: 0 processes with command name dhclient
[18:37:36] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[18:59:15] PROBLEM - puppet last run on rdb1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:01:33] !log restart hhvm on mw1191 (dump debug in /tmp/hhvm.16619.bt.) - threads stuck in HPHP::Treadmill::getAgeOldestRequest
[19:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:03:45] RECOVERY - Nginx local proxy to apache on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.033 second response time
[19:03:45] RECOVERY - HHVM rendering on mw1191 is OK: HTTP OK: HTTP/1.1 200 OK - 71701 bytes in 0.175 second response time
[19:03:45] RECOVERY - Apache HTTP on mw1191 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 614 bytes in 0.166 second response time
[19:09:55] (03PS1) 10Dereckson: Enable NewUserMessage on tr.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345982 (https://phabricator.wikimedia.org/T161962)
[19:14:14] (03PS1) 10Dereckson: Enable AbuseFilter blocks on tr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/345983 (https://phabricator.wikimedia.org/T161960)
[19:27:15] RECOVERY - puppet last run on rdb1005 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[19:37:55] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:41:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[19:46:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[19:48:32] (03PS4) 10ArielGlenn: convert to using page ranges for all checkpoint stuff and no checkpoints whatsoever, for checkpoint-file enabled wikis [dumps] - 10https://gerrit.wikimedia.org/r/343542 (https://phabricator.wikimedia.org/T160507)
[19:49:32] (03CR) 10ArielGlenn: [C: 032] convert to using page ranges for all checkpoint stuff and no checkpoints whatsoever, for checkpoint-file enabled wikis [dumps] - 10https://gerrit.wikimedia.org/r/343542 (https://phabricator.wikimedia.org/T160507) (owner: 10ArielGlenn)
[19:53:49] (03PS1) 10ArielGlenn: retry failed page content pieces immediately after page content step completes [dumps] - 10https://gerrit.wikimedia.org/r/345985 (https://phabricator.wikimedia.org/T160507)
[20:03:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:04:45] PROBLEM - puppet last run on phab2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:05:55] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[20:13:51] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:32:45] RECOVERY - puppet last run on phab2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[20:33:15] PROBLEM - puppet last run on ms-be1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[20:35:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[20:40:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[21:02:15] RECOVERY - puppet last run on ms-be1031 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[21:35:05] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[21:44:06] RainbowSprinkles hi, the index on gerrit seems inconsistent: my patch on https://gerrit.wikimedia.org/r/#/q/status:open shows a merge conflict, but I just rebased it and it doesn't show a merge conflict when viewing the patch.
[21:49:23] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit Index inconsistent - https://phabricator.wikimedia.org/T161966#3148604 (10Paladox)
[21:50:27] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit Index inconsistent - https://phabricator.wikimedia.org/T161966#3148617 (10Paladox)
[21:50:35] PROBLEM - puppet last run on analytics1056 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:52:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 21 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[21:54:42] 06Operations, 10Gerrit, 06Release-Engineering-Team: Gerrit Index inconsistent - https://phabricator.wikimedia.org/T161966#3148622 (10Paladox) 05Open>03declined It seems gerrit has super powers and managed to figure out a merge conflict just as a change was being merged.
[21:57:05] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:02:05] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:09:05] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 20 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:14:05] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 19 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:18:26] PROBLEM - puppet last run on labvirt1009 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:18:45] RECOVERY - puppet last run on analytics1056 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[22:22:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[22:31:05] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 21 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:36:05] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 18 probes of 272 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map
[22:47:25] RECOVERY - puppet last run on labvirt1009 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[23:01:26] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148639 (10GermanJoe) Welcome back: https://commons.wikimedia.org/wiki/File:Yaroslava_Shvedova.JPG
[23:39:45] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 20 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[23:44:45] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 19 probes of 277 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map
[23:54:15] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=766.40 Read Requests/Sec=5490.70 Write Requests/Sec=13.80 KBytes Read/Sec=22134.80 KBytes_Written/Sec=229.20
[23:54:49] 06Operations, 06Multimedia, 10media-storage: 404 error while accessing some images files (e.g. djvu, jpg, png, webm) on Commons and other sites - https://phabricator.wikimedia.org/T161836#3148644 (10Revent) Of the ones I listed before (videos), these are now back... https://commons.wikimedia.org/wiki/File:X5...