[00:01:06] PROBLEM - puppet last run on francium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:05:26] hi! does anybody know what c/c2 in this url means?
[00:05:29] https://upload.wikimedia.org/wikipedia/commons/c/c2/MK_30955-71_Kloster_Eberbach_Dachstuhl.jpg
[00:05:57] think those are the first and first two characters of a hash
[00:07:04] yes, seems so
[00:07:14] is it generated during upload?
[00:23:31] If they are from a hash, it’s not SHA1.
[00:23:55] (that’s not the SHA1 of that file)
[00:26:32] hash of the filename
[00:28:30] and it's md5
[00:28:37] https://www.mediawiki.org/wiki/Manual:$wgHashedUploadDirectory
[00:28:49] yep
[00:28:52] php > var_dump( md5( "MK_30955-71_Kloster_Eberbach_Dachstuhl.jpg" ) );
[00:28:52] string(32) "c2e3d2802d6733a4e0133aeaa6b421f4"
[00:29:06] RECOVERY - puppet last run on francium is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[00:31:36] probably made more sense when the backend was nfs
[00:33:19] tldr; most file systems can't handle millions of files in a directory, so we use the hash to divide them up into folders in a somewhat logical way
[00:35:46] meh
[00:38:54] although iirc swift containers are separated using this
[00:45:06] PROBLEM - puppet last run on db1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[00:54:26] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[01:07:56] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:13:06] RECOVERY - puppet last run on db1033 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[01:22:52] (PS17) Zppix: Add support for searching gerrit using bug:T1 [puppet] - https://gerrit.wikimedia.org/r/308753 (https://phabricator.wikimedia.org/T85002) (owner: Paladox)
[01:35:56] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[02:18:11] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.1) (duration: 06m 03s)
[02:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:22:41] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Nov 6 02:22:41 UTC 2016 (duration 4m 30s)
[02:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:37:36] PROBLEM - puppet last run on elastic1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:51:56] PROBLEM - puppet last run on db1071 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[02:53:25] https://commons.wikimedia.org/wiki/File:20161003_Panel_LA_Case_Study_HD.webm <- wow….
[02:54:02] WebM 480P - 485 kbps -
[02:54:02] Completed 10:16, 2016 October 5 - 10 h 15 min 33 s
[02:54:20] over 10 hours to transcode....
[02:56:55] Revent : Original file (WebM audio/video file, VP8/Vorbis, length 39 min 2 s, 1,280 × 720 pixels, 5.9 Mbps overall), where's 480p?
[02:57:15] Under transcode status
[03:00:31] arseny92: It ‘looks like’ someone used video2commons, on the 5th and 6th, to upload a ‘lot’ of gig+ videos… many failed, but that 10 hour ‘encode time’ is the longest I have seen, probably because the servers were so overloaded.
[03:02:46] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
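(For reference: a minimal sketch of how the c/c2 segment discussed above at 00:05–00:33 is derived, assuming the two-level $wgHashedUploadDirectory layout from the manual page linked there — the first one and the first two hex characters of the md5 of the filename become the directory names. This is an illustrative snippet, not MediaWiki's actual helper code.)

php > $name = "MK_30955-71_Kloster_Eberbach_Dachstuhl.jpg";
php > $hash = md5( $name );  // "c2e3d2802d6733a4e0133aeaa6b421f4", as in the var_dump above
php > echo substr( $hash, 0, 1 ) . "/" . substr( $hash, 0, 2 ) . "/" . $name;
c/c2/MK_30955-71_Kloster_Eberbach_Dachstuhl.jpg

(The same two-character shard is what the 00:38 comment about Swift containers refers to, if memory serves.)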
[03:05:36] RECOVERY - puppet last run on elastic1043 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[03:08:46] PROBLEM - puppet last run on cp3038 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:14:48] Revent : T147153#2691403
[03:14:49] T147153: Please upload large file to Wikimedia Commons - https://phabricator.wikimedia.org/T147153
[03:15:59] Ah… yeah, there were lots of others besides just those, but I think he did them another way.
[03:16:31] the above file you linked to is on that list
[03:16:40] https://commons.wikimedia.org/wiki/File:20161003_POTUS_Panel_HD.webm <- for instance, not in that list
[03:17:03] But a 1.2G video from that same time period.
[03:17:41] It is
[03:17:44] 5th
[03:19:42] 5th one is a press briefing...
[03:19:56] RECOVERY - puppet last run on db1071 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[03:20:13] https://phabricator.wikimedia.org/p/Jasonanaggie/ tho…. quite a few lists.
[03:20:32] see the anchor
[03:20:42] #2691403
[03:21:12] or scroll down the page ;)
[03:21:16] Oh, gotcha
[03:22:36] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 749.63 seconds
[03:31:46] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[03:33:46] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz]
[03:34:06] PROBLEM - puppet last run on analytics1054 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIPCity.dat.gz]
[03:36:46] RECOVERY - puppet last run on cp3038 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[03:37:36] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 277.68 seconds
[04:00:16] RECOVERY - Last backup of the maps filesystem on labstore1001 is OK: OK - Last run for unit replicate-maps was successful
[04:01:46] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[04:02:06] RECOVERY - puppet last run on analytics1054 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[04:04:16] PROBLEM - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-maps was exit-code
[04:45:36] PROBLEM - puppet last run on cp3010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:14:36] RECOVERY - puppet last run on cp3010 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:11:06] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
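(A quick back-of-the-envelope check on the transcode discussed above at 02:53–03:00: the source runs 39 min 2 s, so the 10 h 15 min 33 s encode of the 480p WebM works out to roughly 16× real time. A sketch of that arithmetic, using only the figures quoted above:)

php > $encode = 10*3600 + 15*60 + 33;  // 36933 s spent on the 480p transcode
php > $runtime = 39*60 + 2;            // 2342 s of source video
php > printf( "%.1fx real time\n", $encode / $runtime );
15.8x real time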
[06:12:06] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy
[06:30:26] PROBLEM - Disk space on logstash1003 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[06:31:06] PROBLEM - Disk space on logstash1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[06:31:16] PROBLEM - Disk space on logstash1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%)
[06:38:46] PROBLEM - puppet last run on elastic2003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 3 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[lldpd],Package[dstat]
[06:44:06] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:06:46] RECOVERY - puppet last run on elastic2003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[07:33:16] PROBLEM - puppet last run on logstash1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:48:16] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:56:46] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:58:36] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[08:46:16] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[08:54:10] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774444 (AlexMonk-WMF) is this a duplicate of T49515?
[08:55:46] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[08:57:36] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[09:17:06] PROBLEM - puppet last run on cp3042 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[09:18:16] PROBLEM - puppet last run on logstash1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[tmux]
[09:46:06] RECOVERY - puppet last run on cp3042 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures
[09:55:46] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[09:57:46] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[10:00:46] PROBLEM - MD RAID on relforge1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[10:01:36] RECOVERY - MD RAID on relforge1001 is OK: OK: Active: 8, Working: 8, Failed: 0, Spare: 0
[10:10:53] logstash host disks are again filled up
[10:13:06] RECOVERY - Disk space on logstash1001 is OK: DISK OK
[10:13:16] RECOVERY - Disk space on logstash1002 is OK: DISK OK
[10:13:16] !log removing logstash.log.1 from logstash100[123] to free some space
[10:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:26] RECOVERY - Disk space on logstash1003 is OK: DISK OK
[10:17:16] RECOVERY - puppet last run on logstash1002 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[10:19:30] it will buy some time but it seems the same problem again
[10:24:16] PROBLEM - puppet last run on ms-be1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:25:00] (will re-check later on)
[10:31:26] RECOVERY - puppet last run on logstash1003 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[10:38:46] PROBLEM - puppet last run on mc1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[10:39:00] elukey: > /dev/null 2>&1 will fix it for much longer. :P
[10:43:06] RECOVERY - puppet last run on logstash1001 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[10:52:16] RECOVERY - puppet last run on ms-be1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[11:00:15] Revent: seems stable now, no need :P
[11:06:46] RECOVERY - puppet last run on mc1033 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[12:05:06] PROBLEM - puppet last run on db1061 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:33:06] RECOVERY - puppet last run on db1061 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[13:13:06] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:13:56] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:14:36] PROBLEM - puppet last run on mw1266 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[14:29:06] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 627 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3072659 keys, up 6 days 6 hours - replication_delay is 627
[14:33:06] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3058005 keys, up 6 days 6 hours - replication_delay is 0
[14:42:36] RECOVERY - puppet last run on mw1266 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[14:49:26] PROBLEM - Disk space on thumbor1001 is CRITICAL: DISK CRITICAL - free space: / 1227 MB (2% inode=97%)
[14:52:26] RECOVERY - Disk space on thumbor1001 is OK: DISK OK
[15:01:14] (PS7) Ori.livneh: Add an Icinga check for Graphite metric freshness [puppet] - https://gerrit.wikimedia.org/r/251675
[15:06:46] (CR) Ori.livneh: [C: 2] Add an Icinga check for Graphite metric freshness [puppet] - https://gerrit.wikimedia.org/r/251675 (owner: Ori.livneh)
[16:21:26] PROBLEM - puppet last run on seaborgium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[16:49:26] RECOVERY - puppet last run on seaborgium is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[17:23:04] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774942 (bd808) >>! In T150092#2774143, @Andrew wrote: >>>! In T150092#2773987, @bd808 wrote: >> What about using https://blueprints.launc...
[17:45:24] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2774962 (Andrew)
[18:06:38] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2775006 (Andrew) > in general the point of OAuth would be use easily revoked tokens for > authentication requests coming from any extern...
[18:08:05] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2775007 (Andrew) > Can probably usurp Keystone's own password authentication plugin, subclass > it, add a check against context.remote_ad...
[18:33:19] Operations, Labs, Labs-Infrastructure, netops, wikitech.wikimedia.org: Provide public access to OpenStack APIs - https://phabricator.wikimedia.org/T150092#2775047 (bd808) >>! In T150092#2775006, @Andrew wrote: >> in general the point of OAuth would be use easily revoked tokens for >> authen...
[18:36:03] (PS2) BBlack: nginx (1.11.4-1+wmf14) jessie-wikimedia; urgency=medium [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/319776
[18:36:05] (PS1) BBlack: remove stapling_proxy patch [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/320113
[18:36:07] (PS1) BBlack: remove readahead patch [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/320114
[18:36:09] (PS1) BBlack: add stapling-multi-file patch [software/nginx] (wmf-1.11.4) - https://gerrit.wikimedia.org/r/320115
[19:15:36] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:30:06] PROBLEM - puppet last run on alsafi is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:43:36] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures
[19:58:06] RECOVERY - puppet last run on alsafi is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[20:21:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:24:46] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:25:46] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[20:34:46] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:34:46] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[20:43:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[20:53:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[20:57:06] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 629 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3086964 keys, up 6 days 12 hours - replication_delay is 629
[20:59:36] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0]
[21:09:06] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3072555 keys, up 6 days 12 hours - replication_delay is 0
[21:11:16] PROBLEM - puppet last run on ms-be1012 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:31:06] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479
[21:32:06] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3072601 keys, up 6 days 13 hours - replication_delay is 0
[21:40:26] RECOVERY - puppet last run on ms-be1012 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[22:13:06] PROBLEM - puppet last run on restbase1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:13:31] !log Run namespacesDupe maintenance script on gl.wikisource (T150143)
[22:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:13:38] T150143: Check double redirects / namespace issue on gl.wikisource - https://phabricator.wikimedia.org/T150143
[22:39:56] PROBLEM - puppet last run on mw1214 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:41:06] RECOVERY - puppet last run on restbase1010 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[22:52:56] PROBLEM - Host puppetmaster2001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:53:16] RECOVERY - Host puppetmaster2001 is UP: PING OK - Packet loss = 0%, RTA = 36.71 ms
[22:54:56] PROBLEM - puppet last run on mw2219 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:55:06] PROBLEM - puppet last run on db2011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:55:26] PROBLEM - puppet last run on lvs2003 is CRITICAL: CRITICAL: Puppet has 9 failures. Last run 2 minutes ago with 9 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml],File[/etc/modprobe.d/blacklist-linux44.conf],File[/usr/local/bin/phaste],File[/usr/local/lib/nagios/plugins/get-raid-status-hpssacli]
[22:55:26] PROBLEM - puppet last run on cp2017 is CRITICAL: CRITICAL: Puppet has 6 failures. Last run 2 minutes ago with 6 failures. Failed resources (up to 3 shown): File[/usr/local/sbin/varnish-backend-restart],File[/usr/local/bin/varnishmedia],File[/etc/logrotate.d/confd],File[/usr/local/bin/confd-lint-wrap]
[23:07:56] RECOVERY - puppet last run on mw1214 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[23:08:56] PROBLEM - puppet last run on mw2186 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/etcd-manage]
[23:21:26] RECOVERY - puppet last run on cp2017 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[23:21:26] RECOVERY - puppet last run on lvs2003 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[23:21:56] RECOVERY - puppet last run on mw2186 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[23:22:06] RECOVERY - puppet last run on db2011 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[23:22:56] RECOVERY - puppet last run on mw2219 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[23:39:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]