[02:24:49] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 08m 46s)
[02:24:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:59:41] PROBLEM - Apache HTTP on mw1193 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.092 second response time
[03:00:41] RECOVERY - Apache HTTP on mw1193 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 612 bytes in 0.130 second response time
[03:33:21] PROBLEM - puppet last run on analytics1042 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.test],File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[04:00:21] RECOVERY - puppet last run on analytics1042 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[04:08:31] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=295.80 Read Requests/Sec=2797.20 Write Requests/Sec=11.30 KBytes Read/Sec=29702.00 KBytes_Written/Sec=108.00
[04:16:31] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=2.50 Read Requests/Sec=0.00 Write Requests/Sec=0.40 KBytes Read/Sec=0.00 KBytes_Written/Sec=5.20
[06:01:51] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[06:02:01] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[06:04:52] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0
[06:04:52] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0
[06:07:51] PROBLEM - haproxy failover on dbproxy1009 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[06:07:51] PROBLEM - haproxy failover on dbproxy1004 is CRITICAL: CRITICAL check_failover servers up 1 down 1
[06:21:51] RECOVERY - haproxy failover on dbproxy1009 is OK: OK check_failover servers up 2 down 0
[06:21:51] RECOVERY - haproxy failover on dbproxy1004 is OK: OK check_failover servers up 2 down 0
[08:38:51] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:39:52] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:41:51] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 3.152 second response time
[08:42:41] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.074 second response time
[09:37:41] RECOVERY - MariaDB Slave Lag: s2 on db2049 is OK: OK slave_sql_lag Replication lag: 0.15 seconds
[09:48:38] 06Operations: Puppet: test non stringified facts across the fleet - https://phabricator.wikimedia.org/T166372#3296715 (10Volans) All but two diffs are related to `$::processorcount`: ``` californium.wikimedia.org - "no_workers": "8", + "no_workers": 8, ...SNIP... - "processes": "8...
[09:48:50] (03PS1) 10Volans: Monitoring: remove spaces from list of interfaces [puppet] - 10https://gerrit.wikimedia.org/r/355896 (https://phabricator.wikimedia.org/T166372)
[10:35:01] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:35:51] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:36:51] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.074 second response time
[10:39:51] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 9.715 second response time
[10:41:51] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:44:41] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 1.576 second response time
[10:52:01] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:53:51] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:56:41] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.608 second response time
[10:58:51] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.074 second response time
[11:02:01] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:02:01] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:02:51] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
[11:03:01] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy
[11:03:51] PROBLEM - HHVM jobrunner on mw1169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:04:41] RECOVERY - HHVM jobrunner on mw1169 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.080 second response time
[11:06:01] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:07:51] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[11:13:05] (03CR) 10Volans: "Compiler results available here: https://puppet-compiler.wmflabs.org/6550/" [puppet] - 10https://gerrit.wikimedia.org/r/355896 (https://phabricator.wikimedia.org/T166372) (owner: 10Volans)
[11:14:51] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:16:41] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 1.350 second response time
[11:19:01] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:19:51] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:51] PROBLEM - HHVM jobrunner on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:24:41] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.074 second response time
[11:25:01] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 7.892 second response time
[11:28:51] PROBLEM - HHVM jobrunner on mw1259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:30:41] RECOVERY - HHVM jobrunner on mw1259 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.074 second response time
[11:35:01] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:36:01] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 9.730 second response time
[11:38:01] PROBLEM - HHVM jobrunner on mw1260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:40:41] RECOVERY - HHVM jobrunner on mw1168 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 0.074 second response time
[11:41:01] RECOVERY - HHVM jobrunner on mw1260 is OK: HTTP OK: HTTP/1.1 200 OK - 202 bytes in 9.414 second response time
[11:45:01] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:45:02] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:45:02] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:47:01] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
[11:48:01] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[11:48:01] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy
[12:01:01] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:01:01] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:01:02] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:02:01] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[12:02:01] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy
[12:03:01] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
[12:06:01] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:07:01] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:08:01] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:09:01] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[12:12:01] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy
[12:12:01] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
[12:19:01] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:19:01] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[12:20:01] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[12:20:03] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy
[13:19:27] !log restart db1069:3313 mysql instance, stuck on replication
[13:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:11] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[13:37:12] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:37:12] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:38:01] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[13:38:11] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy
[13:38:14] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
[13:59:11] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[14:02:11] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:03:01] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
[14:03:01] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[14:22:12] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[14:24:11] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
[15:02:11] PROBLEM - restbase endpoints health on restbase-dev1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:02:12] PROBLEM - restbase endpoints health on restbase-dev1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:04:11] RECOVERY - restbase endpoints health on restbase-dev1002 is OK: All endpoints are healthy
[15:04:11] RECOVERY - restbase endpoints health on restbase-dev1001 is OK: All endpoints are healthy
[15:24:01] (03PS2) 10Alexandros Kosiaris: role::kubernetes::worker: upgrade calico everywhere [puppet] - 10https://gerrit.wikimedia.org/r/355394 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto)
[15:25:33] (03PS3) 10Alexandros Kosiaris: role::kubernetes::worker: upgrade calico everywhere [puppet] - 10https://gerrit.wikimedia.org/r/355394 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto)
[15:29:37] (03CR) 10Alexandros Kosiaris: [C: 032] role::kubernetes::worker: upgrade calico everywhere [puppet] - 10https://gerrit.wikimedia.org/r/355394 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto)
[18:39:43] Ping?
[18:39:53] See https://quarry.wmflabs.org/query/18947
[18:41:37] Whoops, broke the query for a moment, fixed...
[18:42:23] Some recent uploads on Commons are giving a transcode_error of “* An unknown error occurred in storage backend "local-swift-eqiad". * An unknown error occurred in storage backend "local-swift-codfw".”, but only for some resolutions.
[18:44:20] https://quarry.wmflabs.org/query/18950 is another example.
[22:08:55] akosiaris: Ping?
[22:09:03] Since you are ‘on call’...
[22:47:51] PROBLEM - Disk space on labstore1005 is CRITICAL: DISK CRITICAL - free space: /srv/tools 474054 MB (5% inode=83%)
[23:44:23] I need a script that pokes this channel every few hours until someone wakes up…
[23:50:08] Revent: what issue do you have?
[23:50:28] Easiest to just point at examples...
[23:50:30] Revent: did you file a ticket on Phabricator?
[23:50:34] https://quarry.wmflabs.org/query/18951
[23:50:38] https://quarry.wmflabs.org/query/18950
[23:50:43] https://quarry.wmflabs.org/query/18947
[23:51:01] It might be related to some known issue (shrugs)
[23:51:49] Revent: open a task on Phabricator, add Filippo Giunchedi as cc
[23:52:31] Dereckson: Those files (and there are a substantial number more, I’m estimating about a hundred transcodes over the last few days) are ‘persistent’ about which transcodes fail that way, even if reset.
[23:54:53] Revent: add also 'operations' as project (and media-storage)
[23:55:14] (nods) Working on it.
[23:57:52] 06Operations, 06Commons, 10media-storage: More missing 'original' files on Commons - https://phabricator.wikimedia.org/T163068#3185065 (10Dereckson) The first two and the last now work from esams. https://upload.wikimedia.org/wikipedia/commons/6/69/Autonomous_bus_trials_South_Perth_-_3.ogv is still 404.
[23:58:27] Dereckson: ^ That was a different issue.
[23:59:32] I fixed two of those by uploading a new copy from YouTube (the source), and got the author (Anna Frodesiak) to upload a new copy of another one.
[23:59:58] ok