[00:00:36] Know of any other extensions which have correct deployment config? [00:00:47] Cutting from non-master, you mean? [00:00:55] y [00:00:59] Yeah, is anybody else even doing that? [00:01:22] Hmm so it's kind of almost already set up that way! But not completely [00:01:25] "@": "Set value to true to copy last branch, or a string for the specific commit, branch, or tag", [00:01:26] "special_extensions": { [00:01:27] "CentralNotice": "wmf_deploy", [00:01:29] "DonationInterface": "deployment", [00:01:36] Maybe we just need to change what make-wmf-branch does with that information [00:02:02] I'm curious about why we would have some wmf/* branches getting cut but not others? [00:02:14] oh nvm, none are being cut [00:02:15] Ideally what this would mean is "CN still gets a wmf/* branch, but it's cut from wmf_deploy instead of master" [00:02:25] yes! [00:02:31] Wikidata is apparently also a special extension, frozen at wmf2 [00:02:33] 2 [00:02:35] *wmf22 [00:02:35] I can make a task, if you haven't already [00:02:43] Please do [00:02:48] cool [00:03:36] !log Running FlowFixLinks.php on testwiki [00:03:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious [00:04:36] Well that went well [00:04:39] It fataled after a few seconds :S [00:06:40] RoanKattouw: wikidata cuts branches every 2 weeks instead of once [00:06:43] one* [00:08:59] Oh OK [00:15:52] oh, by the way awight [00:16:05] yessir [00:16:09] since this deployment branch is improperly named, it's not restricted to deployers like it should be [00:16:24] Good call. [00:16:31] any mediawiki dev can merge there [00:16:51] I'm thinking we should change the name for sure. And confirm gerrit permissions like you pointed out. [00:17:43] Hopefully the config.json patch makes the biggest headache go away [00:19:16] awight, so you guys manually merge everything into deployment branch? [00:19:30] We kind of have to. [00:19:43] ...payments code. [00:19:47] There are rules. [00:20:51] We deploy using a different infrastructure, on the fundraising cluster. It follows the named deployment branches... [00:20:57] mmm, so basically there are two levels of oversight: CR on master + manual merges? [00:21:02] Yep. [00:21:16] booo-riiiing! :P [00:21:19] The deployment branch thing is important to us, so we can schedule releases [00:21:41] Manual merges to deployment are pretty great IMO, cos it allows us to roll back to stable code, etc. [00:21:43] Well, also, I think we get punished for things like automatic merges on the i18n. [00:21:46] By PCI. [00:21:59] And, you know. I rather like that. [00:22:26] we used to do manual branches in mobile [00:22:49] ...the i18n. Not being punished. Letting anything auto-merge into deployment for a payments box would be Bad. [00:23:43] do things merged to the deployment branch get automatically deployed somehow? [00:23:48] Nope. [00:23:54] We do that too. [00:24:35] But the idea is that some human on this team has to vouch for all the actual changes before they go. [00:24:44] as someone who spent a horrible amount of time deploying MF, I pity thee:) [00:25:06] It's kind of nice sometimes. [00:25:10] ...kind of. [00:26:01] I guess it's been a while since we've cherry-picked around anything, so maybe we're not really taking advantage of that anymore. Probably a good sign. 
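A minimal sketch of the idea being discussed: a branch-cut tool reading special_extensions from make-wmf-branch's config.json and cutting each listed extension's wmf/* branch from the configured ref instead of master. The jq invocation, clone URLs, and branch name are illustrative assumptions, not make-wmf-branch's actual code.

```bash
#!/bin/bash
# Hypothetical sketch only: honor special_extensions when cutting a
# new wmf/* deployment branch.
set -e

NEW_BRANCH="wmf/1.26wmf24"              # example branch name (assumption)
CONFIG="make-wmf-branch/config.json"    # assumed config location

# Per the comment quoted above: "true" means copy the last branch,
# any other string names the commit, branch, or tag to cut from.
jq -r '.special_extensions | to_entries[] | "\(.key)\t\(.value)"' "$CONFIG" |
while IFS=$'\t' read -r ext ref; do
    [ "$ref" = "true" ] && continue     # "copy last branch" case elided here
    git clone "https://gerrit.wikimedia.org/r/mediawiki/extensions/${ext}"
    # e.g. CentralNotice gets its wmf/* branch cut from wmf_deploy
    git -C "$ext" checkout -b "$NEW_BRANCH" "origin/${ref}"
    git -C "$ext" push origin "$NEW_BRANCH"
done
```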
[00:26:12] 6operations: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1664840 (10Dzahn) p:5Triage>3Normal [00:26:26] PROBLEM - HHVM rendering on mw1244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:26] PROBLEM - HHVM rendering on mw1245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:26] PROBLEM - HHVM rendering on mw1242 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:26] PROBLEM - HHVM rendering on mw1237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:26] PROBLEM - HHVM rendering on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:35] PROBLEM - HHVM rendering on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:35] PROBLEM - HHVM rendering on mw1258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:35] PROBLEM - Apache HTTP on mw1247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:36] PROBLEM - HHVM rendering on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:36] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:36] PROBLEM - Apache HTTP on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:37] PROBLEM - Apache HTTP on mw1047 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:38] uh oh... [00:26:46] is that a single rack? [00:26:46] PROBLEM - Apache HTTP on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:47] PROBLEM - HHVM rendering on mw1218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:47] PROBLEM - Apache HTTP on mw1065 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:47] PROBLEM - Apache HTTP on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:47] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:47] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:47] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:48] PROBLEM - Apache HTTP on mw1220 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:48] PROBLEM - HHVM rendering on mw1215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:49] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:49] PROBLEM - Apache HTTP on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:50] PROBLEM - HHVM rendering on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:50] PROBLEM - HHVM rendering on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:51] PROBLEM - HHVM rendering on mw1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:55] PROBLEM - HHVM rendering on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:55] PROBLEM - Apache HTTP on mw1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:55] PROBLEM - HHVM rendering on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:55] PROBLEM - HHVM rendering on mw1068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:26:55] PROBLEM - Apache HTTP on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:27:07] 503s [00:27:09] site down [00:28:14] or... not? 
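Each PROBLEM line above means one check's HTTP request failed to complete within 10 seconds. A rough bash equivalent of the probe's semantics (the real check is icinga's check_http; the host and URL here are just examples):

```bash
# Fetch the rendering endpoint, cap the whole request at 10 seconds,
# and exit CRITICAL (2) on timeout or HTTP error, like the alerts above.
check_render() {
    local host=$1
    if curl -sf --max-time 10 -o /dev/null \
        -H 'Host: en.wikipedia.org' "http://${host}/wiki/Main_Page"; then
        echo "OK - ${host} rendered within 10s"
    else
        echo "CRITICAL - ${host} timed out or errored"
        return 2
    fi
}

check_render mw1244.eqiad.wmnet   # one of the app servers paging above
```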
[00:28:15] hm [00:28:15] RECOVERY - HHVM rendering on mw1244 is OK: HTTP OK: HTTP/1.1 200 OK - 64996 bytes in 0.090 second response time [00:28:16] RECOVERY - HHVM rendering on mw1237 is OK: HTTP OK: HTTP/1.1 200 OK - 64996 bytes in 0.087 second response time [00:28:16] RECOVERY - HHVM rendering on mw1242 is OK: HTTP OK: HTTP/1.1 200 OK - 64996 bytes in 0.092 second response time [00:28:16] RECOVERY - HHVM rendering on mw1245 is OK: HTTP OK: HTTP/1.1 200 OK - 64996 bytes in 0.101 second response time [00:28:16] RECOVERY - HHVM rendering on mw1248 is OK: HTTP OK: HTTP/1.1 200 OK - 64996 bytes in 0.085 second response time [00:28:16] RECOVERY - Apache HTTP on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [00:28:16] RECOVERY - HHVM rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 64996 bytes in 0.088 second response time [00:28:16] RECOVERY - HHVM rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 64996 bytes in 0.104 second response time [00:28:26] RECOVERY - Apache HTTP on mw1238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.033 second response time [00:28:26] RECOVERY - Apache HTTP on mw1047 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [00:28:27] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [00:28:27] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [00:28:27] RECOVERY - HHVM rendering on mw1218 is OK: HTTP OK: HTTP/1.1 200 OK - 64996 bytes in 0.101 second response time [00:28:27] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [00:28:36] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.029 second response time [00:28:36] RECOVERY - Apache HTTP on mw1094 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [00:28:36] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.042 second response time [00:28:36] RECOVERY - Apache HTTP on mw1220 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.028 second response time [00:28:36] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [00:28:36] RECOVERY - Apache HTTP on mw1150 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [00:28:37] RECOVERY - HHVM rendering on mw1215 is OK: HTTP OK: HTTP/1.1 200 OK - 64997 bytes in 0.124 second response time [00:28:37] RECOVERY - HHVM rendering on mw1089 is OK: HTTP OK: HTTP/1.1 200 OK - 64997 bytes in 0.124 second response time [00:28:37] RECOVERY - HHVM rendering on mw1106 is OK: HTTP OK: HTTP/1.1 200 OK - 64997 bytes in 0.128 second response time [00:28:38] RECOVERY - HHVM rendering on mw1059 is OK: HTTP OK: HTTP/1.1 200 OK - 64997 bytes in 0.126 second response time [00:28:38] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 75.00% of data above the critical threshold [5000.0] [00:28:39] RECOVERY - Apache HTTP on mw1085 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [00:28:39] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 64997 bytes in 0.155 second response time [00:28:40] RECOVERY - Apache HTTP on mw1248 is OK: HTTP OK: HTTP/1.1 
301 Moved Permanently - 440 bytes in 0.035 second response time [00:28:55] RECOVERY - Apache HTTP on mw1255 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.033 second response time [00:28:55] RECOVERY - HHVM rendering on mw1250 is OK: HTTP OK: HTTP/1.1 200 OK - 64996 bytes in 0.097 second response time [00:29:28] Krenair: legoktm is the site up? [00:29:46] diamond thieves just swapped out the video feed? [00:29:48] Seems OK to me [00:29:52] It is now [00:29:54] It wasn't before [00:29:55] I did a couple of requests earlier and just got 503s, but now it seems fine? [00:31:09] Yeah lots of catchpoint failures [00:31:35] getting 502 Bad Gateway for en.wiki [00:31:35] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Puppet last ran 1 day ago [00:31:46] kaldari: just now? [00:31:53] yeah for https://en.wikipedia.org/w/index.php?title=Homing_pigeon&diff=prev&oldid=682284024 [00:31:54] up for me [00:32:24] Yup wfm too [00:32:29] Looks like another short ddos [00:32:38] works now [00:32:45] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [500.0] [00:32:57] also got an internal server error at https://en.wikipedia.org/w/index.php?title=Research_in_lithium-ion_batteries&diff=prev&oldid=682254959 about the same time [00:33:10] "(Cannot access the database: Can't connect to MySQL server on '10.64.16.28' (4) (10.64.16.28))" [00:33:52] you're probably wondering why I'm looking up Homing pigeons and lithium-ion batteries at the same time. I can explain! [00:34:17] kaldari: terrorist! [00:50:37] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:50:44] Ha! best scrollback I've glanced at in a while. :> [00:54:24] kaldari: https://xkcd.com/214/ ... [00:56:26] !log legoktm@tin Synchronized php-1.26wmf24/extensions/Echo/Hooks.php: Remove duplicate 'MediaWiki' prefix from echo.unseen stats (duration: 00m 12s) [00:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:01:36] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [01:05:25] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] [01:29:26] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v6, cp3044_v6 [01:30:16] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp2008_v6 [01:31:26] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [01:34:06] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [01:41:06] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 22 not-conn: cp2009_v6, cp4019_v6 [01:41:36] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2005_v6, cp3043_v6 [01:43:35] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [01:46:46] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK [01:48:12] (03CR) 10MZMcBride: "Hi. Thank you for submitting this changeset. What's needed here?" 
[software] - 10https://gerrit.wikimedia.org/r/225218 (https://phabricator.wikimedia.org/T59617) (owner: 10coren) [01:52:25] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 connecting: (unnamed) not-conn: cp4013_v6 [01:54:15] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [02:08:46] PROBLEM - puppet last run on mw2173 is CRITICAL: CRITICAL: puppet fail [02:10:13] (03PS2) 10Thcipriani: Add config deployment [tools/scap] - 10https://gerrit.wikimedia.org/r/240292 (https://phabricator.wikimedia.org/T109512) [02:14:05] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3048_v6 [02:15:55] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [02:30:55] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 633 MB (2% inode=58%) [02:31:06] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3039_v6 [02:33:05] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [02:35:16] !log l10nupdate@tin Synchronized php-1.26wmf23/cache/l10n: l10nupdate for 1.26wmf23 (duration: 06m 21s) [02:35:26] RECOVERY - puppet last run on mw2173 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [02:35:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:38:44] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf23) at 2015-09-23 02:38:44+00:00 [02:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:45:52] I'm getting 503s on en.wikipedia.org. [02:47:27] PROBLEM - Apache HTTP on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:27] PROBLEM - Apache HTTP on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:28] PROBLEM - HHVM rendering on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:28] PROBLEM - Apache HTTP on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:28] PROBLEM - HHVM rendering on mw1256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:28] PROBLEM - Apache HTTP on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:35] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:35] PROBLEM - Apache HTTP on mw1037 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:35] PROBLEM - Apache HTTP on mw1034 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:35] PROBLEM - HHVM rendering on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:35] PROBLEM - HHVM rendering on mw1062 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:35] PROBLEM - HHVM rendering on mw1110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:35] PROBLEM - HHVM rendering on mw1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:36] PROBLEM - Apache HTTP on mw1241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:47:36] PROBLEM - HHVM rendering on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:48:45] yeahhhhhhhhhhh [02:48:55] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:48:56] PROBLEM - Apache HTTP on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:48:56] PROBLEM - HHVM rendering on mw1236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:48:57] PROBLEM - HHVM rendering on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:48:57] PROBLEM - HHVM rendering on mw1247 
is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:49:06] PROBLEM - Apache HTTP on mw1237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:49:15] RECOVERY - Apache HTTP on mw1249 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.025 second response time [02:49:15] RECOVERY - Apache HTTP on mw1173 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.033 second response time [02:49:15] RECOVERY - HHVM rendering on mw1167 is OK: HTTP OK: HTTP/1.1 200 OK - 65107 bytes in 0.110 second response time [02:49:15] RECOVERY - Apache HTTP on mw1178 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.035 second response time [02:49:15] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [02:49:16] RECOVERY - HHVM rendering on mw1256 is OK: HTTP OK: HTTP/1.1 200 OK - 65107 bytes in 0.090 second response time [02:49:16] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.039 second response time [02:49:17] RECOVERY - Apache HTTP on mw1241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.027 second response time [02:49:17] RECOVERY - Apache HTTP on mw1034 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [02:49:18] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [02:49:18] RECOVERY - Apache HTTP on mw1236 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.036 second response time [02:49:19] RECOVERY - HHVM rendering on mw1062 is OK: HTTP OK: HTTP/1.1 200 OK - 65107 bytes in 0.118 second response time [02:49:30] RECOVERY - HHVM rendering on mw1178 is OK: HTTP OK: HTTP/1.1 200 OK - 65107 bytes in 0.098 second response time [02:49:30] RECOVERY - HHVM rendering on mw1212 is OK: HTTP OK: HTTP/1.1 200 OK - 65107 bytes in 0.095 second response time [02:49:31] RECOVERY - HHVM rendering on mw1255 is OK: HTTP OK: HTTP/1.1 200 OK - 65107 bytes in 0.100 second response time [02:49:31] RECOVERY - HHVM rendering on mw1185 is OK: HTTP OK: HTTP/1.1 200 OK - 65108 bytes in 0.123 second response time [02:49:32] RECOVERY - HHVM rendering on mw1220 is OK: HTTP OK: HTTP/1.1 200 OK - 65107 bytes in 0.111 second response time [02:49:32] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.045 second response time [02:49:33] RECOVERY - HHVM rendering on mw1182 is OK: HTTP OK: HTTP/1.1 200 OK - 65107 bytes in 0.099 second response time [02:49:33] RECOVERY - HHVM rendering on mw1077 is OK: HTTP OK: HTTP/1.1 200 OK - 65107 bytes in 0.118 second response time [02:49:34] RECOVERY - HHVM rendering on mw1175 is OK: HTTP OK: HTTP/1.1 200 OK - 65107 bytes in 0.102 second response time [02:49:34] RECOVERY - HHVM rendering on mw1087 is OK: HTTP OK: HTTP/1.1 200 OK - 65108 bytes in 0.136 second response time [02:49:35] RECOVERY - HHVM rendering on mw1063 is OK: HTTP OK: HTTP/1.1 200 OK - 65108 bytes in 0.131 second response time [02:49:35] RECOVERY - HHVM rendering on mw1035 is OK: HTTP OK: HTTP/1.1 200 OK - 65108 bytes in 0.122 second response time [02:49:36] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.039 second response time [02:49:36] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2003_v6 [02:49:37] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.029 
second response time [02:49:37] RECOVERY - Apache HTTP on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.030 second response time [02:49:38] Better now. [02:50:31] I just came in to say I was getting some errors trying to view/edit enwiki, I'm curious what happened [02:50:51] RECOVERY - HHVM rendering on mw1247 is OK: HTTP OK: HTTP/1.1 200 OK - 65107 bytes in 0.089 second response time [02:50:56] RECOVERY - Apache HTTP on mw1237 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.027 second response time [02:52:06] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10501 bytes in 1.108 second response time [02:52:26] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 28.57% of data above the critical threshold [500.0] [02:53:26] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - 24 ESP OK [02:53:46] ok, got paged for the ipv6 and see we have this =P [02:56:50] Seems to be fine again [02:57:01] you hate that word ;] [02:57:12] but indeed, it seems ok again [02:57:35] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 408 MB (1% inode=58%) [02:57:44] Seems to be fine, Ahab. How is the white whale? [03:04:26] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied [03:05:13] !log l10nupdate@tin Synchronized php-1.26wmf24/cache/l10n: l10nupdate for 1.26wmf24 (duration: 10m 12s) [03:05:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:07:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [03:23:26] RECOVERY - Disk space on labstore1002 is OK: DISK OK [04:04:53] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL: CRITICAL: 18.52% of data above the critical threshold [100000000.0] [04:05:05] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied [04:06:44] Coren: I've filed https://phabricator.wikimedia.org/T113435 do take a look (about these alerts)
I've verified the code used on dbtree uses " (031 comment) [software/dbtree] - 10https://gerrit.wikimedia.org/r/239568 (https://phabricator.wikimedia.org/T96499) (owner: 10Reedy) [04:38:14] (03PS1) 10Yuvipanda: elasticsearch: Setup nobelium with standard roles [puppet] - 10https://gerrit.wikimedia.org/r/240302 (https://phabricator.wikimedia.org/T113282) [04:40:54] RECOVERY - Incoming network saturation on labstore1003 is OK: OK: Less than 10.00% above the threshold [75000000.0] [04:41:18] (03PS2) 10Yuvipanda: elasticsearch: Setup nobelium with standard roles [puppet] - 10https://gerrit.wikimedia.org/r/240302 (https://phabricator.wikimedia.org/T113282) [04:41:34] (03CR) 10Yuvipanda: [C: 032 V: 032] elasticsearch: Setup nobelium with standard roles [puppet] - 10https://gerrit.wikimedia.org/r/240302 (https://phabricator.wikimedia.org/T113282) (owner: 10Yuvipanda) [04:43:10] 6operations, 5Patch-For-Review: setup / deploy nobelium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1665047 (10yuvipanda) [05:13:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 75 data above and 8 below the confidence bounds [05:40:44] (03CR) 1020after4: [C: 031] Adjust SpecialVersionVersionUrl hook handler for upcoming Semantic versioning in wmf branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239752 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow) [05:45:21] (03CR) 1020after4: [C: 031] Add config deployment [tools/scap] - 10https://gerrit.wikimedia.org/r/240292 (https://phabricator.wikimedia.org/T109512) (owner: 10Thcipriani) [05:46:04] (03CR) 1020after4: [C: 031] Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 (owner: 10Chad) [05:55:33] PROBLEM - Disk space on mw1152 is CRITICAL: DISK CRITICAL - free space: /tmp 0 MB (0% inode=99%) [05:56:13] (03PS1) 10EBernhardson: Add read-only reverse proxy for labs ES [puppet] - 10https://gerrit.wikimedia.org/r/240305 [05:56:56] (03CR) 10jenkins-bot: [V: 04-1] Add read-only reverse proxy for labs ES [puppet] - 10https://gerrit.wikimedia.org/r/240305 (owner: 10EBernhardson) [05:57:02] (03PS2) 10EBernhardson: Add read-only reverse proxy for labs ES [puppet] - 10https://gerrit.wikimedia.org/r/240305 [05:57:52] (03CR) 10jenkins-bot: [V: 04-1] Add read-only reverse proxy for labs ES [puppet] - 10https://gerrit.wikimedia.org/r/240305 (owner: 10EBernhardson) [05:57:55] :( [05:58:06] mean ol jenkins [06:00:51] _joe_: /tmp on mw1152 has 16 gigs transcode_*webm files [06:02:02] and ogv, dating back to the 16th [06:02:14] (03PS3) 10EBernhardson: Add read-only reverse proxy for labs ES [puppet] - 10https://gerrit.wikimedia.org/r/240305 [06:02:28] <_joe_> moritzm: uhm, I'll take a look now [06:03:01] (03CR) 10jenkins-bot: [V: 04-1] Add read-only reverse proxy for labs ES [puppet] - 10https://gerrit.wikimedia.org/r/240305 (owner: 10EBernhardson) [06:04:28] (03PS4) 10EBernhardson: Add read-only reverse proxy for labs ES [puppet] - 10https://gerrit.wikimedia.org/r/240305 [06:05:27] yuvipanda: btw i dunno if i need to go through some process, but it would be useful if i could log into nobelium :) [06:06:00] ebernhardson: I can't yet either. Not sure what's going on [06:06:11] ahh, ok [06:06:12] <_joe_> what's nobelium? [06:06:15] ebernhardson: do you already have access to the es servers? 
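A hedged sketch of triaging the leaked files moritzm describes above (16 GB of transcode_* webm/ogv in /tmp since the 16th); the patterns and the one-week cutoff are assumptions, and the proper fix belongs in the videoscaling job itself, per the task _joe_ files below:

```bash
# Inventory what is leaking: date, size, path of each leftover file.
find /tmp -maxdepth 1 -name 'transcode_*' \
    -printf '%TY-%Tm-%Td %s %p\n' | sort | head -20

# Clean up conservatively: only week-old files that no running
# transcode job still holds open (fuser succeeds if a process uses it).
find /tmp -maxdepth 1 -name 'transcode_*' -mtime +7 \
    ! -exec fuser -s {} \; -delete
```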
[06:06:22] _joe_: testing a labs replica of es [06:06:32] _joe_: roughly similar to mysql labs replica, open access to query the data [06:06:57] yuvipanda: yes i have root on the es cluster [06:07:32] ebernhardson: ya then I think this can be considered part of that [06:07:51] You should have root here too once puppet decides to work [06:07:58] sweet [06:08:54] RECOVERY - Disk space on mw1152 is OK: DISK OK [06:14:34] 6operations, 10MediaWiki-General-or-Unknown, 10Wikimedia-Video: videoscaling doesn't clean locally transcoded files from the filesystem - https://phabricator.wikimedia.org/T113447#1665198 (10Joe) 3NEW [06:14:44] <_joe_> moritzm: filed a bug ^^ [06:25:28] 6operations, 7Graphite: Upgrade Graphite from 0.9.12 to 0.9.13 - https://phabricator.wikimedia.org/T104536#1665237 (10Krinkle) p:5Triage>3Normal [06:30:42] (03PS1) 10Yuvipanda: elasticsearch: provision nobelium as labsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/240308 (https://phabricator.wikimedia.org/T113282) [06:31:28] ebernhardson: ^ should give you access [06:31:39] PROBLEM - puppet last run on db2060 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:02] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:08] (03PS2) 10Yuvipanda: elasticsearch: provision nobelium as labsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/240308 (https://phabricator.wikimedia.org/T113282) [06:32:10] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:21] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:22] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:26] (03CR) 10Yuvipanda: [C: 032 V: 032] elasticsearch: provision nobelium as labsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/240308 (https://phabricator.wikimedia.org/T113282) (owner: 10Yuvipanda) [06:32:31] PROBLEM - puppet last run on db1059 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:32] wah wah [06:32:54] oh [06:33:03] that's just the normal _joe_ is here puppet failures [06:33:11] sigh, I was expecting these in the morning [06:33:20] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:40] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:59] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:01] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:10] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:12] (03PS1) 10Yuvipanda: elasticsearch: Actually provision nobelium [puppet] - 10https://gerrit.wikimedia.org/r/240309 (https://phabricator.wikimedia.org/T113282) [06:34:20] (03CR) 10jenkins-bot: [V: 04-1] elasticsearch: Actually provision nobelium [puppet] - 10https://gerrit.wikimedia.org/r/240309 (https://phabricator.wikimedia.org/T113282) (owner: 10Yuvipanda) [06:34:41] (03PS2) 10Yuvipanda: elasticsearch: Actually provision nobelium [puppet] - 10https://gerrit.wikimedia.org/r/240309 (https://phabricator.wikimedia.org/T113282) [06:35:10] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:36] (03CR) 10Yuvipanda: [C: 032] elasticsearch: Actually provision nobelium [puppet] - 10https://gerrit.wikimedia.org/r/240309 (https://phabricator.wikimedia.org/T113282) (owner: 10Yuvipanda) [06:43:09] PROBLEM - puppet last run on nobelium is CRITICAL: CRITICAL: puppet fail 
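Since the point of the replica is open query access, the read-only proxy from ebernhardson's patch above has to pass searches and refuse writes. A sketch of verifying that from outside, with a hypothetical endpoint and index name:

```bash
PROXY="http://nobelium.eqiad.wmnet"   # assumed proxy address

# Reads (searches) should pass through to Elasticsearch...
curl -s "${PROXY}/enwiki_content/_search?q=test&size=1"

# ...while mutating verbs should be rejected by the proxy itself.
curl -s -o /dev/null -w 'PUT    -> HTTP %{http_code}\n' \
    -XPUT "${PROXY}/scratch_index"
curl -s -o /dev/null -w 'DELETE -> HTTP %{http_code}\n' \
    -XDELETE "${PROXY}/enwiki_content"
# Expect 403/405 on the last two if the proxy filters correctly.
```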
[06:45:35] (03PS1) 10Muehlenhoff: Fix definition of deployable networks [puppet] - 10https://gerrit.wikimedia.org/r/240310 (https://phabricator.wikimedia.org/T113351) [06:47:05] (03CR) 10Muehlenhoff: "That's a bug in the previously existing ferm service http_deployment_server, I made" [puppet] - 10https://gerrit.wikimedia.org/r/240083 (owner: 10Muehlenhoff) [06:55:51] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [06:56:30] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [06:56:40] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [06:56:50] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [06:57:00] RECOVERY - puppet last run on db1059 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:57:21] RECOVERY - puppet last run on db2060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:59] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [06:58:10] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:11] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:19] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:40] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:58:40] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:08:08] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1665304 (10Joe) This should be a bit better now given I raised a few HHVM timeouts. [07:12:39] (03PS1) 10Yuvipanda: elasticsearch: New role for labsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/240311 (https://phabricator.wikimedia.org/T113282) [07:15:03] (03CR) 10Yuvipanda: [C: 032] elasticsearch: New role for labsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/240311 (https://phabricator.wikimedia.org/T113282) (owner: 10Yuvipanda) [07:15:09] (03CR) 10Chad: elasticsearch: New role for labsearch cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/240311 (https://phabricator.wikimedia.org/T113282) (owner: 10Yuvipanda) [07:15:15] Meh, was mid-comment. [07:15:29] ostriches: oops [07:15:36] ostriches: do complete comment! [07:15:41] I did. [07:15:44] After you merged. [07:15:45] oh [07:15:57] $things_not_going_my_way_today++ [07:16:01] heh [07:16:06] ostriches: you don't have hiera for those. 
[07:16:17] ostriches: those values are set in hiera for role elasticsearch::server [07:16:22] this is a different role so those won't apply [07:16:43] I guess I'm missing why we need multiple roles :) [07:17:08] * ostriches spent a lot of puppet work to clean up that role and make it more portable :p [07:17:48] (03PS1) 10Yuvipanda: elasticsearch: Rename role to fit current convention [puppet] - 10https://gerrit.wikimedia.org/r/240312 [07:18:16] ostriches: so the elasticsearch::server role requires it to be either in labs, or have row/rack specified [07:18:40] also assumes that there is going to be more than one server [07:18:51] (no default for elasticsearch::cluster_hosts, for example) [07:19:00] cluster_hosts should be smarter. [07:19:55] yup [07:20:14] (03CR) 10Yuvipanda: [C: 032] elasticsearch: Rename role to fit current convention [puppet] - 10https://gerrit.wikimedia.org/r/240312 (owner: 10Yuvipanda) [07:20:46] should probably get rid of the role I introduced [07:22:26] Actually, cluster_name is also janky default. It assumes one cluster per DC. [07:22:44] yup [07:23:01] the labs / prod checks in the role also assume that [07:23:10] RECOVERY - puppet last run on nobelium is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [07:23:46] yuvipanda: Yeah. That's leftover cruft I haven't cleaned up yet. [07:23:48] ebernhardson: ostriches you both should have root on nobelium too, I think. [07:23:50] It was worse :p [07:23:54] ostriches: I can imagine :D [07:24:11] I don't understand elasticsearch enough to touch it yet... [07:24:22] err [07:24:26] touch 'it' as in the puppet cleanup [07:25:16] ostriches: how would we fix the rack / row thing? [07:25:23] explicitly disable via param? [07:25:55] also how do I check if ES is actually working fine? [07:26:48] `curl -s -XGET localhost:9200` [07:26:55] Should respond 200 and with a minimal json payload [07:27:33] Actually, you can set rack/row on the host. We just want awareness_attributes to be set to nothing instead so you don't actually /do/ anything with the data. [07:27:55] wonderful [07:27:56] that works [07:28:00] Which is actually undef by default. [07:28:17] (03CR) 10Muehlenhoff: [C: 031] "(Since this is now a mixed work I thought I'd give it re-review)" [puppet] - 10https://gerrit.wikimedia.org/r/237335 (owner: 10Muehlenhoff) [07:28:27] I guess we're not using rack/row detection at the moment. I know we've gone back and forth playing with it. [07:30:57] Yep, you should be fine there. [07:31:04] * ostriches just re-read too much puppet manifest [07:32:47] ostriches: ok, so I'll figure out which rack this is on tomorrow and use that [07:33:12] ostriches: so I guess with that + setting cluster_name I should be good [07:33:35] Yeah. cluster_name is the important bit that will keep you from joining the prod cluster. [07:33:44] (since they use the same multicast addy) [07:34:17] <_joe_> we don't use rack/row detection because we don't have a reliable way to do that [07:34:59] A reliable way to do what? [07:35:54] Ah, detect the rack/row automagically. Gotcha. [07:36:21] _joe_: I meant using the attributes for awareness when I said "detection" [07:36:23] Hello [07:36:24] <_joe_> yeah, s/that/it/ [07:36:33] Is eqiad ldap server down now? [07:37:05] devunt: Works for me. 
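Putting ostriches' two points above together, a quick sanity script for the test node: confirm it answers, then confirm its cluster_name kept it off the production cluster. The name "labsearch" is inferred from the role patches above and is an assumption:

```bash
# Node up? A 200 with a small JSON payload means yes.
curl -s -XGET localhost:9200

# Did cluster_name keep us out of prod (same multicast address)?
name=$(curl -s localhost:9200/_cluster/health | jq -r .cluster_name)
if [ "$name" = "labsearch" ]; then      # assumed labs cluster name
    echo "OK: standalone labs cluster ($name)"
else
    echo "WARNING: joined cluster '$name', check the hiera cluster_name"
fi
```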
[07:37:40] 6operations: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1665381 (10fgiunchedi) FWIW the reason we check `shunt` is that we're also interested when mailman fails to process messages, also looking again at the script it might be a small effort to rewrite it in something safer l... [07:38:51] I have created the new vm a few minutes ago, but I can't login into it [07:38:57] console output says "Sep 23 07:32:08 vm-test1 nslcd[1124]: [334873] failed to bind to LDAP server ldap://ldap-eqiad.wikimedia.org:389: Can't contact LDAP server" [07:39:09] (on wikitech) [07:39:52] Hmmmm [07:40:24] <_joe_> yuvipanda: ^^ any idea? [07:40:45] <_joe_> devunt: I'm not sure it's not a red herring [07:41:20] <_joe_> I guess yuvi is gone to bed finally :) [07:41:26] http://pastebin.com/MS9J0QNH [07:41:30] here's the full log [07:42:06] <_joe_> devunt: Sep 23 07:36:18 vm-test1 puppet-agent[1441]: Could not request certificate: Connection refused - connect(2) [07:42:11] <_joe_> this ^^ is your problem [07:42:19] is it nfs issue? [07:42:24] <_joe_> nope [07:42:40] <_joe_> devunt: can you open a ticket? I can't really take a look right now [07:42:43] (03CR) 10Chad: [C: 032] Adjust SpecialVersionVersionUrl hook handler for upcoming Semantic versioning in wmf branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239752 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow) [07:42:52] <_joe_> please specify the project as well [07:43:15] apergos: FYI snapshot1001 is out of disk space on / [07:43:15] (03Merged) 10jenkins-bot: Adjust SpecialVersionVersionUrl hook handler for upcoming Semantic versioning in wmf branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239752 (https://phabricator.wikimedia.org/T67306) (owner: 10Florianschmidtwelzow) [07:43:29] 6operations, 7Icinga, 7Monitoring: check_puppetrun: print "agent disabled" reason - https://phabricator.wikimedia.org/T98481#1665401 (10Gage) The script source doesn't say so, but I've noticed that it's written by ripienaar. Latest upstream implements this feature: https://github.com/ripienaar/monitoring-sc... [07:43:53] !log demon@tin Synchronized wmf-config/CommonSettings.php: semver the special:version hook (duration: 00m 12s) [07:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:44:54] _joe_, in phab? [07:45:50] uhoh helium baculasd2 is also almost full too akosiaris [07:46:50] <_joe_> devunt: yes [07:47:05] okay I'm filing the issue nwo [07:47:06] now [07:47:36] <_joe_> jgage: u around man? [07:47:42] yeah [07:47:51] <_joe_> how's it going? [07:47:59] good! and you? [07:48:11] so glad to be out of SF :P [07:53:41] yo jgage [07:53:55] <_joe_> we're alright :) [07:53:56] godog: yesterday it was fine, today it's out of space because: mediawiki takes up an additional 4 gb. awesome [07:54:10] however I will be reinstalling it today so that will take care of that [07:54:24] apergos: \o/ sounds good [07:54:47] and how can I arrange an rfc meeting? [07:55:11] yodog :) [08:02:40] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [08:03:21] 6operations, 6Discovery, 7Elasticsearch: Ferm doesn't update @resolve hostnames on IP change - https://phabricator.wikimedia.org/T113380#1665434 (10MoritzMuehlenhoff) @resolve is working fine, it's rather a problem in ferm: A standard restart of ferm only does an iptables-restore on the files in /var/cache/ 
[08:10:31] (03PS5) 10Phuedx: Replicate browser test config for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: 10Jdlrobson) [08:11:29] (03CR) 10Phuedx: [C: 032] Replicate browser test config for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: 10Jdlrobson) [08:11:35] (03Merged) 10jenkins-bot: Replicate browser test config for QuickSurveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: 10Jdlrobson) [08:12:15] (03CR) 10Phuedx: "OK. I'm going to create a task for us to talk about how to share this configuration between the per-commit integration tests and the night" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: 10Jdlrobson) [08:26:23] (03CR) 10Phuedx: "Done. See T113455." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239158 (https://phabricator.wikimedia.org/T112204) (owner: 10Jdlrobson) [08:26:57] 6operations, 10Annual-Report, 5Patch-For-Review: create git/gerrit repo for annual report 2015 - https://phabricator.wikimedia.org/T112928#1665505 (10Aklapper) >>! In T112928#1663454, @Stephmonette wrote: > I'll talk to Liam today and confirm that he logged into Phabricator. Thanks @Stephmonette! For docume... [08:28:34] 6operations, 10Annual-Report, 5Patch-For-Review: create git/gerrit repo for annual report 2015 - https://phabricator.wikimedia.org/T112928#1665508 (10hashar) [08:30:40] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [08:35:31] (03PS1) 10Filippo Giunchedi: WIP: cassandra: stop setting cluster_name as %{::site} [puppet] - 10https://gerrit.wikimedia.org/r/240321 (https://phabricator.wikimedia.org/T112257) [08:43:10] RECOVERY - Disk space on helium is OK: DISK OK [08:50:20] PROBLEM - spamassassin on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:50:39] PROBLEM - RAID on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:50:39] PROBLEM - Check size of conntrack table on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:00] PROBLEM - configured eth on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:00] PROBLEM - dhclient process on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:31] PROBLEM - Disk space on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:39] PROBLEM - salt-minion processes on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:41] PROBLEM - DPKG on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:51:49] PROBLEM - puppet last run on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [08:53:59] PROBLEM - Exim SMTP on mx2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:59:28] (03PS1) 10ArielGlenn: enlarge root partition for snapshot hosts, no separate /srv [puppet] - 10https://gerrit.wikimedia.org/r/240323 [09:04:46] mutante: how much work is it typically to backport a package from sid to precise/trusty/jessie? 
We'd like to have https://packages.debian.org/sid/composer on tool labs [09:05:12] <_joe_> valhallasw`cloud: I can answer that [09:05:19] <_joe_> and the answer is: it depends [09:05:26] :-D [09:05:32] <_joe_> to precise, I guess it's going to be a _lot_ of work typically [09:05:34] (03PS2) 10Faidon Liambotis: Switch MX to mx1001/mx2001 (wikimedia.org) [dns] - 10https://gerrit.wikimedia.org/r/240103 [09:05:40] (03CR) 10Faidon Liambotis: [C: 032] Switch MX to mx1001/mx2001 (wikimedia.org) [dns] - 10https://gerrit.wikimedia.org/r/240103 (owner: 10Faidon Liambotis) [09:05:44] <_joe_> is that composer as the php tool? [09:05:47] yeah [09:05:58] valhallasw`cloud: for CI we use integration/composer which is manually updated from time to time [09:06:16] valhallasw`cloud: seems composer fails to tag stable version, so that is not very suitable for .deb packaging unfortunately: -/ [09:07:11] 6operations, 10Possible-Tech-Projects, 10Wikimedia-Apache-configuration: Make it possible to quickly and programmatically pool and depool application servers - https://phabricator.wikimedia.org/T73212#1665578 (10Qgil) This is a message posted to all tasks under "Need Discussion" at #Possible-Tech-Projects. #... [09:07:11] <_joe_> hashar: there are tricks you can use [09:07:16] hashar: ah, so that's just a git checkout combined with setting a PATH? [09:07:19] (03CR) 10ArielGlenn: [C: 032] enlarge root partition for snapshot hosts, no separate /srv [puppet] - 10https://gerrit.wikimedia.org/r/240323 (owner: 10ArielGlenn) [09:07:47] valhallasw`cloud: yeah that is what we do. but apparently from https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=714118 it is in sid nowadays [09:08:49] valhallasw`cloud: so maybe it can be backported to Jessie without too much troubles. But I would forget about Precise/Trusty [09:09:01] hashar: unfortunately all of tool labs is on precise/trusty :( [09:09:14] time to migrate! :-D [09:09:36] I have a feeling our users wouldn't be happy about that :-p [09:10:49] legoktm: ^ you hack doesn't seem to work :( [09:10:54] your* [09:11:37] (03CR) 10Hashar: [C: 031] Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 (owner: 10Chad) [09:12:04] hashar: hm, so the contint version doesn't have a composer.phar? [09:12:15] it might not matter, I guess [09:13:24] valhallasw`cloud: I think we run composer update on a precise box, then send the result to git [09:17:00] RECOVERY - Exim SMTP on mx2001 is OK: SMTP OK - 0.143 sec. 
response time [09:17:10] RECOVERY - spamassassin on mx2001 is OK: PROCS OK: 3 processes with args spamd [09:17:27] hashar: ok, thanks :-) [09:17:30] _joe_: thank you as well [09:17:30] RECOVERY - RAID on mx2001 is OK: OK: no RAID installed [09:17:30] RECOVERY - Check size of conntrack table on mx2001 is OK: OK: nf_conntrack is 0 % full [09:17:50] RECOVERY - configured eth on mx2001 is OK: OK - interfaces up [09:18:00] RECOVERY - dhclient process on mx2001 is OK: PROCS OK: 0 processes with command name dhclient [09:18:21] RECOVERY - salt-minion processes on mx2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:18:21] RECOVERY - Disk space on mx2001 is OK: DISK OK [09:18:39] RECOVERY - DPKG on mx2001 is OK: All packages OK [09:18:40] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [09:18:40] <_joe_> valhallasw`cloud: you can try this: download the source for sid, and try to build the package [09:18:50] <_joe_> valhallasw`cloud: or, I can try to do that on our buildserver [09:23:00] (03CR) 10Alexandros Kosiaris: [C: 031] "Eeeeew. I think for now it will suffice however. That being said my OCD says "fiiiix it"" [puppet] - 10https://gerrit.wikimedia.org/r/240321 (https://phabricator.wikimedia.org/T112257) (owner: 10Filippo Giunchedi) [09:27:45] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: rename cassandra cluster - https://phabricator.wikimedia.org/T112257#1665652 (10akosiaris) >>! In T112257#1664371, @mobrovac wrote: >>>! In T112257#1663602, @Eevans wrote: >> I prefer the latter: > > I tend to agree. Entering risky procedur... [09:28:22] (03CR) 10Hashar: [C: 04-1] "Can you please add a test case covering that behavior in tests/multiversion/MWRealmTest.php ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239754 (owner: 10Alex Monk) [09:36:35] (03Abandoned) 10Mark Bergsma: Reexec procps on changes [puppet] - 10https://gerrit.wikimedia.org/r/94949 (owner: 10Mark Bergsma) [09:39:04] 6operations, 6Discovery, 7Elasticsearch: Ferm doesn't update @resolve hostnames on IP change - https://phabricator.wikimedia.org/T113380#1665727 (10faidon) The cache is a Debian-ism and it can get quite annoying — I've been bitten in the past multiple times by it, e.g. when executing processes from within fe... [09:40:07] (03Abandoned) 10Alexandros Kosiaris: Default to UNKNOWN when NRPE checks timeout [puppet] - 10https://gerrit.wikimedia.org/r/165732 (owner: 10Alexandros Kosiaris) [09:42:34] (03Abandoned) 10Alexandros Kosiaris: Permanent mountpoints for heze/helium bacula SDs [puppet] - 10https://gerrit.wikimedia.org/r/180805 (owner: 10Alexandros Kosiaris) [09:42:44] (03Abandoned) 10Filippo Giunchedi: graphite: archive received metrics on disk [puppet] - 10https://gerrit.wikimedia.org/r/183537 (https://phabricator.wikimedia.org/T85908) (owner: 10Filippo Giunchedi) [09:43:14] https://phabricator.wikimedia.org/T112738 this seems to be happening again [09:46:03] (03Abandoned) 10Giuseppe Lavagetto: mediawiki: harmonize HHVM settings with Zend ones [puppet] - 10https://gerrit.wikimedia.org/r/189505 (owner: 10Giuseppe Lavagetto) [09:51:05] (03Abandoned) 10Filippo Giunchedi: base: add checks for 127.0.1.1 in /etc/hosts [puppet] - 10https://gerrit.wikimedia.org/r/157795 (owner: 10Filippo Giunchedi) [09:52:20] (03CR) 10Hashar: "Any update on this? Maybe it can be raised again during one of the deployment cabal meetings?" 
[puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [09:52:58] _joe_: *nod* It probably needs a bit more work, I realized, because several of the dependencies are unavailable on trusty... so it might not be worth the effort (especially considering everyone else is moving to jessie) [09:53:37] (03Abandoned) 10Muehlenhoff: tmh (videoscaler): add base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/223244 (https://phabricator.wikimedia.org/T104970) (owner: 10Dzahn) [09:58:42] (03Abandoned) 10Muehlenhoff: enable firewalling on tin [puppet] - 10https://gerrit.wikimedia.org/r/229151 (owner: 10Dzahn) [10:11:52] 6operations: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1665767 (10JohnLewis) the shunt directory is a really bad measure of this though. Files are shunted for hundreds of reasons, few of which actually equate to 'failed to process'. If we really do want to measure the shu [10:42:19] (03Abandoned) 10Mark Bergsma: Repartition eqiad LVS service IPs [dns] - 10https://gerrit.wikimedia.org/r/92343 (owner: 10Mark Bergsma) [10:43:37] 6operations, 6Commons, 6Multimedia, 10Traffic: Commons API fails (413 error) to upload file within 100MB threshold - https://phabricator.wikimedia.org/T86436#1665790 (10Raymond) Same problem while overwriting an [[ https://commons.wikimedia.org/wiki/File:Portr%C3%A4t_von_Diederick_Hoeufft,_Kapit%C3%A4n_der... [10:45:22] 6operations, 6Discovery, 7Elasticsearch: Ferm doesn't update @resolve hostnames on IP change - https://phabricator.wikimedia.org/T113380#1665792 (10MoritzMuehlenhoff) Disabling the cache seems indeed like the cleanest solution. I ran some tests in labs on a relatively standard rule set (default rules plus a... [10:50:09] (03CR) 10MarcoAurelio: [C: 04-1] Add $wgMassMessageWikiAliases configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237687 (owner: 10Legoktm) [10:59:08] (03CR) 10MarcoAurelio: "Maybe a bit silly but this log is going to be private, right? I mean, just visible to security team users and others with need?. I think i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [11:02:54] (03CR) 10MarcoAurelio: Allow 'block' AbuseFilterAction on eswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239455 (https://phabricator.wikimedia.org/T113096) (owner: 10Platonides) [11:14:48] (03PS1) 10ArielGlenn: make datasets uid fixed, it must match across hosts [puppet] - 10https://gerrit.wikimedia.org/r/240334 [11:14:50] PROBLEM - DPKG on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:14:59] PROBLEM - puppet last run on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:15:19] PROBLEM - spamassassin on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:15:40] PROBLEM - RAID on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:15:40] PROBLEM - Check size of conntrack table on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:15:45] grumble [11:16:00] PROBLEM - configured eth on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:16:01] PROBLEM - dhclient process on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:16:29] PROBLEM - salt-minion processes on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
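The patch moritzm uploads just below (r/240335) takes the "disable the cache" route agreed on above. The presumed manual equivalent on a single host, assuming the Debian packaging's /etc/default/ferm knob:

```bash
# Stop ferm from replaying its cached ruleset, so @resolve() hostnames
# are re-queried against DNS on every restart (assumed CACHE= knob).
sed -i 's/^CACHE=.*/CACHE=no/' /etc/default/ferm

# A restart now rebuilds the rules from ferm.conf with fresh answers.
service ferm restart

# Sanity check: a moved host's rule should show its new address.
iptables -S | grep -F "$NEW_IP"   # NEW_IP: placeholder for the new IP
```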
[11:16:29] PROBLEM - Disk space on mx2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [11:16:51] PROBLEM - Exim SMTP on mx2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:16:57] how are you today! [11:19:21] PROBLEM - SSH on mx2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:19:46] (03CR) 10ArielGlenn: [C: 032] make datasets uid fixed, it must match across hosts [puppet] - 10https://gerrit.wikimedia.org/r/240334 (owner: 10ArielGlenn) [11:21:46] (03PS1) 10Muehlenhoff: Disable the ferm rules cache [puppet] - 10https://gerrit.wikimedia.org/r/240335 (https://phabricator.wikimedia.org/T113380) [11:24:52] RECOVERY - SSH on mx2001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [11:27:42] apergos: system users' uid must be between 100 and 999 [11:27:51] apergos: (see /etc/adduser.conf) [11:27:53] it's been 10003 for a long time [11:27:58] and I shall be adding it to um [11:28:07] /usr/local/sbin/enforce-users-groups [11:28:09] as an exclusion [11:28:19] can't we just renumber them? [11:28:21] PROBLEM - puppet last run on cp3012 is CRITICAL: CRITICAL: puppet fail [11:28:32] that exclusion list isn't great, I was hoping we'd ditch it at some point [11:30:49] that's a lot of dump files to chown [11:35:30] good that we have this concept of computers, so we don't have to do it manually! [11:36:29] mark: if we didn't have that concept - we wouldn't have this situation at all :) [11:37:07] only parts of the tree get chowned, because other parts are rsynced over by different users. so actually, yes there's a manual piece of it where I get to hunt arounr [11:37:09] around [11:37:19] so I'd like to not do this now and get the reinstalls done [11:37:51] (03CR) 10Alex Monk: [C: 04-1] Add $wgMassMessageWikiAliases configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237687 (owner: 10Legoktm) [11:43:24] (03CR) 10Alex Monk: "If I understand correctly, this sends info to fluorine and logstash, both of which require NDAs? I think most users with access would alre" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222025 (https://phabricator.wikimedia.org/T94774) (owner: 10CSteipp) [11:44:57] (03PS1) 10ArielGlenn: add dataset user to list of users we don't autoremove [puppet] - 10https://gerrit.wikimedia.org/r/240337 [11:46:02] (03PS2) 10ArielGlenn: add dataset user to list of users we don't autoremove [puppet] - 10https://gerrit.wikimedia.org/r/240337 [11:47:16] (03CR) 10ArielGlenn: [C: 032] add dataset user to list of users we don't autoremove [puppet] - 10https://gerrit.wikimedia.org/r/240337 (owner: 10ArielGlenn) [11:48:55] 6operations, 10Dumps-Generation: fix up datasets uid - https://phabricator.wikimedia.org/T113467#1665979 (10ArielGlenn) 3NEW a:3ArielGlenn [11:55:02] RECOVERY - puppet last run on cp3012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:58:05] 6operations: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1666017 (10fgiunchedi) @johnlewis that sounds good to me as well! 
Comment was mostly for context on why the shunt check is there since I added it, I don't feel strongly about it either way though [12:11:29] RECOVERY - Restbase root url on restbase-test2001 is OK: HTTP OK: HTTP/1.1 200 - 15229 bytes in 0.115 second response time [12:11:40] RECOVERY - Restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [12:11:41] RECOVERY - Restbase root url on restbase-test2002 is OK: HTTP OK: HTTP/1.1 200 - 15229 bytes in 0.112 second response time [12:11:49] PROBLEM - Host mx2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:11:51] that's me ^ [12:11:57] (restbase, not mx) [12:12:30] RECOVERY - Restbase root url on restbase-test2003 is OK: HTTP OK: HTTP/1.1 200 - 15229 bytes in 0.130 second response time [12:12:50] RECOVERY - Restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [12:13:00] RECOVERY - Restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [12:14:18] mx is me :) [12:14:20] (03PS1) 10Muehlenhoff: Add ferm rules for Spark [puppet] - 10https://gerrit.wikimedia.org/r/240341 (https://phabricator.wikimedia.org/T83597) [12:15:39] RECOVERY - salt-minion processes on mx2001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [12:15:39] RECOVERY - Disk space on mx2001 is OK: DISK OK [12:15:50] RECOVERY - Host mx2001 is UP: PING OK - Packet loss = 0%, RTA = 34.74 ms [12:16:11] RECOVERY - DPKG on mx2001 is OK: All packages OK [12:16:11] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 1 hour ago with 0 failures [12:16:13] RECOVERY - Exim SMTP on mx2001 is OK: SMTP OK - 0.143 sec. response time [12:16:49] RECOVERY - RAID on mx2001 is OK: OK: no RAID installed [12:16:51] RECOVERY - Check size of conntrack table on mx2001 is OK: OK: nf_conntrack is 0 % full [12:17:19] RECOVERY - dhclient process on mx2001 is OK: PROCS OK: 0 processes with command name dhclient [12:17:20] RECOVERY - configured eth on mx2001 is OK: OK - interfaces up [12:30:39] 6operations, 10Wikimedia-Media-storage: Update rsvg on the image scalers - https://phabricator.wikimedia.org/T112421#1666117 (10matmarex) [12:37:39] 6operations, 5Patch-For-Review: Backup and decom home_pmtpa - https://phabricator.wikimedia.org/T113265#1666147 (10akosiaris) /srv/baculasd2 is the Archive pool storage. That is the long term archive (still subject to our privacy policy). It's mostly there for the "let's take this thing a backup one last time"... [12:42:02] (03CR) 10Reedy: [C: 04-1] "Thanks :)" [software/dbtree] - 10https://gerrit.wikimedia.org/r/239568 (https://phabricator.wikimedia.org/T96499) (owner: 10Reedy) [12:44:05] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Including roles in roles can turn out to be a major pain. We've done that in CI and when something goes wrong with the order of resource r" [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [12:44:31] (03CR) 10Alexandros Kosiaris: "Actually, that was a reply to @ottomata, not a -1..." 
[puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [12:45:43] (03PS2) 10Alex Monk: Make extension optional in getRealmSpecificFilename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239754 [12:46:10] (03CR) 10jenkins-bot: [V: 04-1] Make extension optional in getRealmSpecificFilename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239754 (owner: 10Alex Monk) [12:46:50] (03CR) 10Alexandros Kosiaris: "T87870 is in a better state these days but still unresolved and with a blocking Task. Will ping on that one" [puppet] - 10https://gerrit.wikimedia.org/r/160628 (owner: 10Matanya) [12:47:50] (03PS1) 10Faidon Liambotis: Switch wiki-mail to mx1001 [dns] - 10https://gerrit.wikimedia.org/r/240345 [12:47:51] (03PS1) 10Faidon Liambotis: Add A/AAAA/PTR for wiki-mail-codfw [dns] - 10https://gerrit.wikimedia.org/r/240346 [12:48:18] (03PS3) 10Alex Monk: Make extension optional in getRealmSpecificFilename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239754 [12:48:21] (03CR) 10Faidon Liambotis: [C: 032] Switch wiki-mail to mx1001 [dns] - 10https://gerrit.wikimedia.org/r/240345 (owner: 10Faidon Liambotis) [12:48:45] (03CR) 10Faidon Liambotis: [C: 032] Add A/AAAA/PTR for wiki-mail-codfw [dns] - 10https://gerrit.wikimedia.org/r/240346 (owner: 10Faidon Liambotis) [12:49:53] heeelloo jenkins [12:52:16] (03CR) 10Mobrovac: "T109727 has been resolved and deployed, so we may continue here. An in-lined comment though." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/232722 (owner: 10Alexandros Kosiaris) [12:52:42] what is it doing [12:53:52] (03CR) 10Mobrovac: service::node: change logrotate parameters (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/232722 (owner: 10Alexandros Kosiaris) [12:54:35] hashar: any idea? [12:55:25] (03CR) 10Mobrovac: "I agree with @akosiaris, these shouldn't be roles at all..." 
[puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [12:56:36] (03CR) 10Faidon Liambotis: [V: 032] Switch wiki-mail to mx1001 [dns] - 10https://gerrit.wikimedia.org/r/240345 (owner: 10Faidon Liambotis) [12:56:43] (03CR) 10Faidon Liambotis: [V: 032] Add A/AAAA/PTR for wiki-mail-codfw [dns] - 10https://gerrit.wikimedia.org/r/240346 (owner: 10Faidon Liambotis) [12:57:29] (03PS1) 10Faidon Liambotis: Assign wiki-mail-codfw to mx2001 [puppet] - 10https://gerrit.wikimedia.org/r/240348 [12:57:35] (03CR) 10Mobrovac: Add an Analytics specific instance of RESTBase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [12:57:57] (03CR) 10Mobrovac: "Minor nit, LGTM otherwise" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/240321 (https://phabricator.wikimedia.org/T112257) (owner: 10Filippo Giunchedi) [12:58:40] hashar: something's broken with jenkins [12:58:54] over 10 minutes now [12:59:16] only for the DNS job it looks like, the puppet one was V+2ed fine [12:59:21] (03CR) 10Faidon Liambotis: [C: 032] Assign wiki-mail-codfw to mx2001 [puppet] - 10https://gerrit.wikimedia.org/r/240348 (owner: 10Faidon Liambotis) [12:59:58] 6operations, 7Technical-Debt: Retire Torrus - https://phabricator.wikimedia.org/T87840#1666196 (10mark) a:5mark>3None [13:01:06] (03PS1) 10Faidon Liambotis: mail: add inbound TLS support for main MXes [puppet] - 10https://gerrit.wikimedia.org/r/240351 (https://phabricator.wikimedia.org/T101452) [13:01:31] (03CR) 10Faidon Liambotis: [C: 032] mail: add inbound TLS support for main MXes [puppet] - 10https://gerrit.wikimedia.org/r/240351 (https://phabricator.wikimedia.org/T101452) (owner: 10Faidon Liambotis) [13:07:26] paravoid: which change / job ? [13:07:30] PROBLEM - very high load average likely xfs on ms-be1012 is CRITICAL: CRITICAL - load average: 274.40, 169.85, 80.88 [13:07:47] hashar: 240345 / 240346 [13:07:52] paravoid: ah the job operations-dns-lint doesn't seem to trigger :-/ [13:07:54] hashar: I see them in the frontpage of https://integration.wikimedia.org/zuul/ [13:08:41] the slave is disconnected https://integration.wikimedia.org/ci/computer/integration-lightslave-jessie-1002/ [13:10:22] paravoid: filed https://phabricator.wikimedia.org/T113474 [13:10:26] looking at the instance [13:10:36] bah [13:10:41] /dev/mapper/vd-second--local--disk 484M 168M 288M 37% /mnt [13:14:57] paravoid: unblocked.
Going to build a larger slave [13:19:22] (03PS2) 10Reedy: Bundle jquery 1.11.3 [software/dbtree] - 10https://gerrit.wikimedia.org/r/239568 (https://phabricator.wikimedia.org/T96499) [13:22:11] !log upgrading cr1-codfw with newer junos [13:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:26:39] PROBLEM - salt-minion processes on ms-be1012 is CRITICAL: Connection refused by host [13:27:19] PROBLEM - swift-account-auditor on ms-be1012 is CRITICAL: Connection refused by host [13:27:30] PROBLEM - swift-object-replicator on ms-be1012 is CRITICAL: Connection refused by host [13:27:31] PROBLEM - Check size of conntrack table on ms-be1012 is CRITICAL: Connection refused by host [13:27:39] PROBLEM - swift-object-updater on ms-be1012 is CRITICAL: Connection refused by host [13:27:39] PROBLEM - puppet last run on ms-be1012 is CRITICAL: Connection refused by host [13:27:50] PROBLEM - swift-account-replicator on ms-be1012 is CRITICAL: Connection refused by host [13:27:59] PROBLEM - swift-account-reaper on ms-be1012 is CRITICAL: Connection refused by host [13:27:59] PROBLEM - SSH on ms-be1012 is CRITICAL: Connection refused [13:28:10] PROBLEM - RAID on ms-be1012 is CRITICAL: Connection refused by host [13:28:19] PROBLEM - swift-account-server on ms-be1012 is CRITICAL: Connection refused by host [13:28:20] PROBLEM - swift-container-replicator on ms-be1012 is CRITICAL: Connection refused by host [13:28:44] !log reboot ms-be1012, xfs hosed [13:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:31:20] RECOVERY - swift-account-auditor on ms-be1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [13:31:30] RECOVERY - Check size of conntrack table on ms-be1012 is OK: OK: nf_conntrack is 2 % full [13:31:30] RECOVERY - swift-object-replicator on ms-be1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [13:31:52] (03PS1) 10Filippo Giunchedi: WIP: report swift containers aggregated stats [puppet] - 10https://gerrit.wikimedia.org/r/240358 [13:32:09] RECOVERY - swift-container-replicator on ms-be1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [13:32:29] RECOVERY - swift-object-updater on ms-be1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [13:32:38] (03CR) 10Ottomata: "Ok, lets do it. I will be in meetings for the next 4.5 hours. Feel free to go ahead and apply this on standby." [puppet] - 10https://gerrit.wikimedia.org/r/237335 (owner: 10Muehlenhoff) [13:32:39] RECOVERY - very high load average likely xfs on ms-be1012 is OK: OK - load average: 23.64, 8.37, 3.00 [13:32:45] (03CR) 10jenkins-bot: [V: 04-1] WIP: report swift containers aggregated stats [puppet] - 10https://gerrit.wikimedia.org/r/240358 (owner: 10Filippo Giunchedi) [13:33:09] RECOVERY - RAID on ms-be1012 is OK: OK: optimal, 14 logical, 14 physical [13:33:28] RECOVERY - SSH on ms-be1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2wmfprecise2 (protocol 2.0) [13:33:39] RECOVERY - puppet last run on ms-be1012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:33:49] RECOVERY - salt-minion processes on ms-be1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [13:34:01] (03CR) 10Ottomata: "That's fine, then we can just include both of these roles separately on the aqs nodes themselves, and not even have an aqs role? 
That wou" [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [13:34:08] RECOVERY - swift-account-reaper on ms-be1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [13:34:18] RECOVERY - swift-account-replicator on ms-be1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [13:34:29] RECOVERY - swift-account-server on ms-be1012 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [13:34:51] ottomata: hey, forgot to reply the other day, likely running the collector under gdb makes it slow enough to stop being racy [13:37:26] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/240321 (https://phabricator.wikimedia.org/T112257) (owner: 10Filippo Giunchedi) [13:39:17] (03PS2) 10Filippo Giunchedi: cassandra: stop setting cluster_name as %{::site} [puppet] - 10https://gerrit.wikimedia.org/r/240321 (https://phabricator.wikimedia.org/T112257) [13:40:52] godog, you think it's gdb, or maybe just because diamond isn't executing it [13:40:57] the exec path is different on CLI than in diamond [13:41:44] ottomata: ah yeah you are right, could be both really, wishful thinking perhaps.. [13:43:04] 6operations, 7Mail: Protect incoming emails with SMTP STARTTLS - https://phabricator.wikimedia.org/T101452#1666305 (10faidon) [13:43:16] 6operations: Encrypt all the things - https://phabricator.wikimedia.org/T111653#1666311 (10faidon) [13:43:17] 6operations, 7Mail: Protect incoming emails with SMTP STARTTLS - https://phabricator.wikimedia.org/T101452#1666308 (10faidon) 5Open>3Resolved a:3faidon This is now done :) [13:43:46] (03PS1) 10Muehlenhoff: Lower the conntrack tracking time for TIME_WAIT connections [puppet] - 10https://gerrit.wikimedia.org/r/240361 (https://phabricator.wikimedia.org/T105307) [13:44:23] (03CR) 10jenkins-bot: [V: 04-1] Lower the conntrack tracking time for TIME_WAIT connections [puppet] - 10https://gerrit.wikimedia.org/r/240361 (https://phabricator.wikimedia.org/T105307) (owner: 10Muehlenhoff) [13:44:42] 6operations, 7Mail: Replace primary mail relays (polonium/lead) - https://phabricator.wikimedia.org/T113211#1666315 (10faidon) All of the above are done. polonium still gets a fair share of emails (spammers don't really obey DNS TTLs); I'll be monitoring it over the next few days, find any stray email flows an... [13:51:27] (03PS3) 10Filippo Giunchedi: cassandra: stop setting cluster_name as %{::site} [puppet] - 10https://gerrit.wikimedia.org/r/240321 (https://phabricator.wikimedia.org/T112257) [13:51:33] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: stop setting cluster_name as %{::site} [puppet] - 10https://gerrit.wikimedia.org/r/240321 (https://phabricator.wikimedia.org/T112257) (owner: 10Filippo Giunchedi) [13:55:40] puppet-lint was broken on some of the Jenkins slaves. I forgot to apt-get upgrade to get the latest puppet-lint package :-D
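A quick way to confirm that the inbound STARTTLS change above (13:01, gerrit 240351; T101452 resolved at 13:43) took effect, as a hedged sketch: smtplib is stdlib Python 3, the hostname is one of the relays named in the log, and probing port 25 only works from networks that allow outbound SMTP:

    import smtplib

    def offers_starttls(host, port=25, timeout=10):
        # EHLO, then check whether the server advertises the STARTTLS extension
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            smtp.ehlo()
            return smtp.has_extn("starttls")

    print(offers_starttls("mx1001.wikimedia.org"))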
[13:55:41] (03PS2) 10Muehlenhoff: Lower the conntrack tracking time for TIME_WAIT connections [puppet] - 10https://gerrit.wikimedia.org/r/240361 (https://phabricator.wikimedia.org/T105307) [13:56:56] !log force puppet run on restbase2001 to deploy new cassandra config [13:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:00:13] (03CR) 10Chad: [C: 032] Make extension optional in getRealmSpecificFilename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239754 (owner: 10Alex Monk) [14:00:36] (03Merged) 10jenkins-bot: Make extension optional in getRealmSpecificFilename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239754 (owner: 10Alex Monk) [14:01:03] !log demon@tin Synchronized multiversion/MWRealm.php: (no message) (duration: 00m 11s) [14:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:01:29] RECOVERY - Cassandra database on restbase2001 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [14:02:09] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [14:03:08] (03PS4) 10Chad: Split langlist for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: 10Alex Monk) [14:03:32] (03CR) 10Chad: [C: 032] Split langlist for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: 10Alex Monk) [14:03:38] (03Merged) 10jenkins-bot: Split langlist for beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/239748 (https://phabricator.wikimedia.org/T112006) (owner: 10Alex Monk) [14:04:35] (03CR) 10Filippo Giunchedi: [C: 031] Lower the conntrack tracking time for TIME_WAIT connections [puppet] - 10https://gerrit.wikimedia.org/r/240361 (https://phabricator.wikimedia.org/T105307) (owner: 10Muehlenhoff) [14:04:39] !log demon@tin Synchronized docroot/noc/conf/: (no message) (duration: 00m 13s) [14:04:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:04:50] !log demon@tin Synchronized langlist-labs: (no message) (duration: 00m 11s) [14:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:05:03] !log demon@tin Synchronized wmf-config/CommonSettings.php: (no message) (duration: 00m 12s) [14:05:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:07:36] (03CR) 10Chad: [C: 032] Enabling 'flood' flag at scowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237049 (https://phabricator.wikimedia.org/T111753) (owner: 10MarcoAurelio) [14:08:06] (03Merged) 10jenkins-bot: Enabling 'flood' flag at scowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237049 (https://phabricator.wikimedia.org/T111753) (owner: 10MarcoAurelio) [14:08:33] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 12s) [14:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:09:10] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [14:11:30] ostriches, thanks for reviewing the getRealmSpecificFilename changes [14:11:43] (03CR) 10Alex Monk: "DNS issues as well, IIRC.
Since that breaks LDAP" [puppet] - 10https://gerrit.wikimedia.org/r/160628 (owner: 10Matanya) [14:11:46] yw [14:12:54] (03CR) 10Chad: [C: 032] Change default AbuseFilter IP block duration to not indefinite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240053 (https://phabricator.wikimedia.org/T113164) (owner: 10MarcoAurelio) [14:13:17] (03Merged) 10jenkins-bot: Change default AbuseFilter IP block duration to not indefinite [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240053 (https://phabricator.wikimedia.org/T113164) (owner: 10MarcoAurelio) [14:13:56] !log demon@tin Synchronized wmf-config/abusefilter.php: (no message) (duration: 00m 12s) [14:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:17:31] (03PS4) 10Chad: Do not rewrite https -> http for IRC notifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (owner: 10Faidon Liambotis) [14:17:51] (03CR) 10Chad: "PS4 was a manual rebase. Also: what's the status here?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (owner: 10Faidon Liambotis) [14:18:09] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:18] PROBLEM - Host cp2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:18] PROBLEM - Host cp2005 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:18] PROBLEM - Host cp2017 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:18] PROBLEM - Host cp2008 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:19] PROBLEM - Host mw2103 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:19] PROBLEM - Host mw2139 is DOWN: PING CRITICAL - Packet loss = 100% [14:18:19] PROBLEM - Host cp2012 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:05] mmhh paravoid ^ the fw upgrade? [14:19:15] yeah :/ [14:19:35] RECOVERY - Host cp2003 is UP: PING WARNING - Packet loss = 86%, RTA = 51.68 ms [14:19:35] RECOVERY - Host labstore2001 is UP: PING WARNING - Packet loss = 93%, RTA = 53.56 ms [14:19:38] RECOVERY - Host cp2002 is UP: PING WARNING - Packet loss = 86%, RTA = 47.40 ms [14:19:38] RECOVERY - Host cp2001 is UP: PING WARNING - Packet loss = 86%, RTA = 53.91 ms [14:19:38] RECOVERY - Host cp2017 is UP: PING WARNING - Packet loss = 86%, RTA = 47.22 ms [14:19:49] PROBLEM - Host mw2025 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:49] PROBLEM - Host lvs2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:49] PROBLEM - Host mw2065 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:49] PROBLEM - Host mw2043 is DOWN: PING CRITICAL - Packet loss = 100% [14:19:49] PROBLEM - Host ms-be2009 is DOWN: PING CRITICAL - Packet loss = 100% [14:20:02] :| heh [14:20:03] I did an ISSU but LACP didn't come back up [14:20:28] and not sure why it would make those hosts unreachable, there's the other router [14:21:11] PROBLEM - Host pybal-test2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:16] PROBLEM - Host mobile-lb.codfw.wikimedia.org_ipv6 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:26] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:28] yay pages [14:21:38] paravoid: not traffic affecting though [14:21:38] ffs [14:21:44] so, wassup ? [14:21:49] not end-user traffic affecting, no [14:21:57] router misbehaving even more now ? 
[14:22:05] I'm upgrading it [14:22:07] PROBLEM - Varnish HTTP mobile-frontend on cp2003 is CRITICAL: Connection timed out [14:22:07] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 60 connecting: (unnamed) [14:22:08] RECOVERY - Host labstore2001 is UP: PING WARNING - Packet loss = 80%, RTA = 34.81 ms [14:22:16] RECOVERY - Host cp2018 is UP: PING OK - Packet loss = 0%, RTA = 34.93 ms [14:22:16] RECOVERY - Host mw2186 is UP: PING OK - Packet loss = 0%, RTA = 34.46 ms [14:22:16] RECOVERY - Host suhail is UP: PING OK - Packet loss = 0%, RTA = 34.54 ms [14:22:17] RECOVERY - Host mw2015 is UP: PING OK - Packet loss = 0%, RTA = 34.44 ms [14:22:17] RECOVERY - Host mw2051 is UP: PING OK - Packet loss = 0%, RTA = 34.63 ms [14:22:17] RECOVERY - Host wtp2003 is UP: PING OK - Packet loss = 0%, RTA = 34.62 ms [14:22:17] RECOVERY - Host db2062 is UP: PING OK - Packet loss = 0%, RTA = 34.52 ms [14:22:21] NOOO [14:22:22] Ohh [14:22:23] Lol [14:22:24] well, vrrp.. this should not have happened, no ? [14:22:27] It's okay! [14:22:43] plus it was the backup, not the active one [14:22:50] plus I did an ISSU [14:22:54] so multiple failures at once, yes [14:23:09] ahaha [14:23:17] so the preferred path from eqiad -> codfw was via cr1 [14:23:44] and cr1 has ae1 down (that LACP never worked properly for some reason) *and* it was blackholing the traffic instead of sending it over to cr2 [14:23:54] RECOVERY - Host text-lb.codfw.wikimedia.org_ipv6 is UP: PING OK - Packet loss = 0%, RTA = 34.66 ms [14:24:02] by the time I tried show route/ospf etc. it was already too late [14:24:24] RECOVERY - Host ns1-v4 is UP: PING OK - Packet loss = 0%, RTA = 34.24 ms [14:24:24] RECOVERY - Host upload-lb.codfw.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 34.47 ms [14:24:33] RECOVERY - Host 208.80.153.12 is UP: PING OK - Packet loss = 0%, RTA = 34.33 ms [14:24:43] RECOVERY - Host ns1-v6 is UP: PING OK - Packet loss = 0%, RTA = 34.44 ms [14:24:43] RECOVERY - Varnish HTTP mobile-frontend on cp2003 is OK: HTTP OK: HTTP/1.1 200 OK - 378 bytes in 0.069 second response time [14:24:44] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [14:25:26] wow, none of the 40G ports have come up [14:26:54] PROBLEM - puppet last run on mc2014 is CRITICAL: CRITICAL: puppet fail [14:27:18] some are [14:27:44] PROBLEM - puppet last run on db2007 is CRITICAL: CRITICAL: Puppet has 1 failures [14:27:47] but all the aes are down... 
[14:27:54] PROBLEM - puppet last run on mw2110 is CRITICAL: CRITICAL: Puppet has 2 failures [14:28:04] PROBLEM - puppet last run on mw2131 is CRITICAL: CRITICAL: Puppet has 1 failures [14:28:24] PROBLEM - puppet last run on mw2127 is CRITICAL: CRITICAL: Puppet has 1 failures [14:28:24] PROBLEM - puppet last run on mw2130 is CRITICAL: CRITICAL: Puppet has 2 failures [14:28:34] PROBLEM - puppet last run on ms-be2010 is CRITICAL: CRITICAL: puppet fail [14:28:34] PROBLEM - puppet last run on mw2055 is CRITICAL: CRITICAL: Puppet has 1 failures [14:28:44] PROBLEM - puppet last run on mw2056 is CRITICAL: CRITICAL: Puppet has 1 failures [14:28:44] PROBLEM - puppet last run on mw2084 is CRITICAL: CRITICAL: Puppet has 2 failures [14:28:53] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: Puppet has 2 failures [14:29:13] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 118, down: 0, dormant: 0, excluded: 0, unused: 0 [14:29:14] PROBLEM - puppet last run on mw2092 is CRITICAL: CRITICAL: Puppet has 2 failures [14:29:34] PROBLEM - puppet last run on mw2090 is CRITICAL: CRITICAL: Puppet has 1 failures [14:29:52] fantastic [14:30:04] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Puppet has 1 failures [14:36:44] PROBLEM - puppet last run on mw2062 is CRITICAL: CRITICAL: puppet fail [14:37:05] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: puppet fail [14:37:23] PROBLEM - puppet last run on mw2111 is CRITICAL: CRITICAL: puppet fail [14:37:54] PROBLEM - puppet last run on nembus is CRITICAL: CRITICAL: puppet fail [14:38:03] PROBLEM - puppet last run on ms-be2009 is CRITICAL: CRITICAL: puppet fail [14:38:04] PROBLEM - puppet last run on cp2010 is CRITICAL: CRITICAL: puppet fail [14:38:14] PROBLEM - puppet last run on mw2047 is CRITICAL: CRITICAL: Puppet has 1 failures [14:38:23] PROBLEM - puppet last run on mw2101 is CRITICAL: CRITICAL: puppet fail [14:41:14] PROBLEM - puppet last run on mw2030 is CRITICAL: CRITICAL: Puppet has 2 failures [14:43:33] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1666433 (10mark) Let's start with giving Papaul bastiononly access, so he can do his current work more easily and get started learning the ropes of overall operations tasks / Gerri... 
[14:43:43] 10Ops-Access-Requests, 6operations: Requesting access to bastiononly or ops for papaul - https://phabricator.wikimedia.org/T111123#1666434 (10mark) a:5mark>3None [14:47:34] RECOVERY - puppet last run on cp2010 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [14:47:44] RECOVERY - puppet last run on mw2127 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:47:54] RECOVERY - puppet last run on mw2055 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [14:48:04] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:48:24] RECOVERY - puppet last run on mw2092 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [14:48:35] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:48:53] RECOVERY - puppet last run on mw2030 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [14:48:53] RECOVERY - puppet last run on mw2090 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:48:54] RECOVERY - puppet last run on db2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:49:04] RECOVERY - puppet last run on mw2110 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:49:14] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:49:23] RECOVERY - puppet last run on mw2131 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:49:24] RECOVERY - puppet last run on nembus is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:49:25] RECOVERY - puppet last run on ms-be2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:49:35] RECOVERY - puppet last run on mw2130 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:49:44] RECOVERY - puppet last run on mw2047 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:49:44] RECOVERY - puppet last run on mw2101 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:49:45] RECOVERY - puppet last run on ms-be2010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:49:54] RECOVERY - puppet last run on mw2084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:49:54] RECOVERY - puppet last run on mw2056 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:50:04] RECOVERY - puppet last run on mc2014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:50:05] RECOVERY - puppet last run on mw2062 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [14:59:41] 6operations, 10netops: cr1/cr2-codfw QSFP+ errors every second for qsfp-0/0/0 - https://phabricator.wikimedia.org/T92616#1666475 (10faidon) 5Resolved>3Open This appeared again after a cr1-codfw reboot due to a JunOS upgrade. Let's finally ditch that CU5M. Procurement of fiber QSFP+s is tracked with RT #9548. [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150923T1500). Please do the needful. 
[15:00:05] jzerebecki irc-nickname: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:25] . [15:00:44] irc-nickname is such a clever nick! [15:01:06] jzerebecki: Merging your change now [15:06:41] 6operations, 10Possible-Tech-Projects, 10Wikimedia-Apache-configuration: Make it possible to quickly and programmatically pool and depool application servers - https://phabricator.wikimedia.org/T73212#1666493 (10mmodell) Much of the infrastructure that's needed for this is in place. I'm not sure it's really... [15:08:09] 6operations, 10Possible-Tech-Projects, 10Wikimedia-Apache-configuration: Make it possible to quickly and programmatically pool and depool application servers - https://phabricator.wikimedia.org/T73212#1666496 (10mark) Indeed, Ops and RelEng intend to work together on this in the upcoming quarter, using scap3. [15:11:11] 6operations, 6Release-Engineering-Team, 10Wikimedia-Apache-configuration: Make it possible to quickly and programmatically pool and depool application servers - https://phabricator.wikimedia.org/T73212#1666506 (10demon) [15:11:23] PROBLEM - puppet last run on wtp2011 is CRITICAL: CRITICAL: Puppet has 1 failures [15:11:33] PROBLEM - puppet last run on mw2128 is CRITICAL: CRITICAL: Puppet has 1 failures [15:12:24] PROBLEM - puppet last run on mw2007 is CRITICAL: CRITICAL: Puppet has 1 failures [15:12:34] PROBLEM - puppet last run on mw2171 is CRITICAL: CRITICAL: Puppet has 1 failures [15:13:21] 6operations, 6Release-Engineering-Team, 10Wikimedia-Apache-configuration: Make it possible to quickly and programmatically pool and depool application servers - https://phabricator.wikimedia.org/T73212#1666511 (10mmodell) Probably duplicate of T104352 [15:13:25] (03Abandoned) 10Greg Grossmeier: Revert "Disable anonymous page creation on swWiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236045 (https://phabricator.wikimedia.org/T44894) (owner: 10Greg Grossmeier) [15:15:30] !log demon@tin Synchronized php-1.26wmf22/extensions/Wikidata: (no message) (duration: 00m 20s) [15:15:34] jzerebecki: ^^^ [15:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:15:44] testing [15:16:30] (03PS2) 10Tim Landscheidt: Tools: Remove gridengine aliases for some hosts [puppet] - 10https://gerrit.wikimedia.org/r/235157 (https://phabricator.wikimedia.org/T109485) [15:17:04] 6operations, 6Release-Engineering-Team, 10Wikimedia-Apache-configuration: Make it possible to quickly and programmatically pool and depool application servers - https://phabricator.wikimedia.org/T73212#1666526 (10demon) Actually, I think that depends on this. It asks for an API to use, and this task proposes... [15:17:47] Krenair: are we mid-SWAT? Want to do https://gerrit.wikimedia.org/r/#/c/236491/ this morning if you have the time? [15:19:43] RECOVERY - puppet last run on mw2111 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [15:22:45] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:26:34] (03PS1) 10coren: Add new shell users [puppet] - 10https://gerrit.wikimedia.org/r/240368 (https://phabricator.wikimedia.org/T113302) [15:28:49] ostriches: still need more time to actually see if the fix worked but it didn't break anything. [15:28:51] (03CR) 10Alex Monk: "Waiting for someone to announce, I think. 
Maybe it just needs a ticket against #developer-notice/#user-notice?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/217858 (owner: 10Faidon Liambotis) [15:29:04] RECOVERY - Disk space on labstore1002 is OK: DISK OK [15:29:35] ostriches: but you probably also need to sync wmf23? [15:29:39] (03PS1) 10coren: Add asherman to researchers [puppet] - 10https://gerrit.wikimedia.org/r/240369 (https://phabricator.wikimedia.org/T113118) [15:30:07] jzerebecki: I thought it was just wmf22? [15:30:20] ostriches: erm yes. you need to sync 23 and 24 also. [15:30:27] (03CR) 10jenkins-bot: [V: 04-1] Add asherman to researchers [puppet] - 10https://gerrit.wikimedia.org/r/240369 (https://phabricator.wikimedia.org/T113118) (owner: 10coren) [15:30:41] !log demon@tin Started scap: (no message) [15:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:56] jzerebecki: ^ [15:31:14] Aw, poop. [15:31:18] ostriches: wikidata is branched manually so the wmf22 from wikidata exists in 23 and 24 also [15:31:27] Ahhh, gotcha [15:32:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 0 below the confidence bounds [15:32:34] PROBLEM - Last backup of the maps filesystem on labstore1002 is CRITICAL: CRITICAL - Last run result was exit-code [15:32:48] !log demon@tin scap aborted: (no message) (duration: 02m 07s) [15:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:32:57] 10Ops-Access-Requests, 6operations, 10Wikimedia-Blog, 5Patch-For-Review: stat1003/EventLogging access for asherman - https://phabricator.wikimedia.org/T113118#1666581 (10coren) @asherman: Sorry this wasn't asked earlier, but nobody noticed that you didn't have shell access at all to begin with. In order t... [15:34:22] !log unbanning elastic1005 for T112559 [15:34:27] !log demon@tin Synchronized php-1.26wmf23/extensions/Wikidata: (no message) (duration: 00m 21s) [15:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:34:52] !log demon@tin Synchronized php-1.26wmf24/extensions/Wikidata: (no message) (duration: 00m 20s) [15:34:53] jzerebecki: 23 and 24 done too. sorry about that. [15:34:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:35:29] thx [15:35:34] RECOVERY - puppet last run on mw2171 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [15:35:44] 10Ops-Access-Requests, 6operations: RESTBase Admin access on aqs1001, aqs1002, and aqs1003 for Joseph and Dan - https://phabricator.wikimedia.org/T113416#1666592 (10coren) @kevinator: Please add approval language to the ticket at your convenience. [15:36:15] RECOVERY - puppet last run on wtp2011 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:36:24] RECOVERY - puppet last run on mw2128 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:37:23] RECOVERY - puppet last run on mw2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:37:23] (03CR) 10coren: [C: 032] Add new shell users [puppet] - 10https://gerrit.wikimedia.org/r/240368 (https://phabricator.wikimedia.org/T113302) (owner: 10coren) [15:40:04] PROBLEM - puppet last run on mw2022 is CRITICAL: CRITICAL: puppet fail [15:40:43] What a non-useful commit message :( [15:42:08] revert! 
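Because Wikidata is branched manually, the same build lives in several wmf branches, which is what ostriches trips over above (15:30-15:34): each copy has to be synced separately. A sketch of the repetitive part, assuming Python and tin's sync-dir command (the tool behind the "Synchronized php-1.26wmfNN/..." log lines):

    import subprocess

    # The wmf22 Wikidata build also exists in wmf23 and wmf24, so all
    # three copies need syncing; branch names are taken from the log.
    BRANCHES = ["php-1.26wmf22", "php-1.26wmf23", "php-1.26wmf24"]

    for branch in BRANCHES:
        subprocess.check_call(["sync-dir", "%s/extensions/Wikidata" % branch])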
[15:42:09] (03PS1) 10coren: Add chedasaurus to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/240371 (https://phabricator.wikimedia.org/T113302) [15:43:10] hashar: With what commit summary? "Remove new shell users" :D :D [15:43:53] ostriches: Sorry, while in the middle of doing it, the task numbers are "obvious" but in hindsight that's completely useless out of context. [15:44:21] * Coren flagellates self in penance. [15:44:47] Actually the punishment is an hour in the village stocks :) [15:47:53] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [15:50:38] yuvipanda: it seems like the later week puppet swat is often far less popular than the early week one (so far, just my impression) [15:53:03] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [15:55:55] ostriches: works. thx again. [15:57:34] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 0 below the confidence bounds [16:01:34] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [16:04:59] !log deploy restbase daacf4d on restbase2* [16:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:47] ostriches, yeah that getRealmSpecificFilename change hasn't really worked [16:07:21] Boo [16:07:29] Looked like it would. [16:08:05] > var_dump( getRealmSpecificFilename( "$IP/../langlist" ) ); [16:08:07] string(49) "/mnt/srv/mediawiki-staging/php-master/../langlist" [16:08:17] Despite /mnt/srv/mediawiki-staging/php-master/../langlist-labs existing [16:08:59] (03CR) 10Alexandros Kosiaris: "Indeed it can be handled at a host level, but we are talking about changing the default in production here. I am fine with root logins on " [puppet] - 10https://gerrit.wikimedia.org/r/160628 (owner: 10Matanya) [16:09:05] RECOVERY - puppet last run on mw2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [16:10:26] bblack: question: did we bump up the logging limit of varnish on the beta labs varnish instance? [16:11:25] (03CR) 10Alexandros Kosiaris: "Actually, it would be better if both of these roles were merged into one, named "aqs" or something similar and that one applied to the hos" [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [16:14:32] (03PS1) 10Shanmugamp7: Adding "*.nps.gov" to wgCopyUploadsDomains. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240376 (https://phabricator.wikimedia.org/T113188) [16:14:48] string(39) "/mnt/srv/mediawiki-staging/php-master/." [16:14:48] string(10) "./langlist" [16:14:49] string(70) "/mnt/srv/mediawiki-staging/php-master/.-labs./langlist does not exist!" [16:15:04] That's $base, $ext, and then the "realm take precedence over datacenter." value of new_filename [16:15:22] ugh, ircloud killing my cpu :/ [16:15:54] It's just because getRealmSpecificFilename can't handle any dots in the filename other than the extension one [16:17:25] to be honest, I'm not sure why we use $IP/../ for this when plenty of other code already assumes /srv/mediawiki/ works [16:17:31] ostriches, ^ [16:17:55] Hrm, so getRealmSpecificFilename( "/srv/mediawiki/fooooooo" ) instead? [16:18:00] yeah [16:19:46] uh. did I just take beta down?
:/ [16:20:51] (03PS1) 10Chad: Use /srv/mediawiki directly instead of $IP/../ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240378 [16:21:11] I haven't merged anything in a bit... [16:21:58] uh, right. [16:22:22] that was because my live hack to test stuff on deployment-bastion got auto-deployed and broke everything, I think [16:22:30] lol. [16:22:58] !log start cassandra on restbase2002 [16:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:23:56] ostriches, http://meta.wikimedia.beta.wmflabs.org/wiki/Special:SiteMatrix looks a lot better with this applied [16:24:04] PROBLEM - HHVM rendering on mw1064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:06] still need to update WikimediaMaintenance to use it [16:24:14] PROBLEM - HHVM rendering on mw1256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:14] PROBLEM - HHVM rendering on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:14] PROBLEM - Apache HTTP on mw1252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:14] PROBLEM - Apache HTTP on mw1250 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:15] PROBLEM - HHVM rendering on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:15] PROBLEM - HHVM rendering on mw1100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:23] PROBLEM - HHVM rendering on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:23] PROBLEM - Apache HTTP on mw1245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:23] PROBLEM - Apache HTTP on mw1094 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:23] PROBLEM - HHVM rendering on mw1237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:23] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:23] PROBLEM - Apache HTTP on mw1251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:23] PROBLEM - Apache HTTP on mw1256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:24] mw.org is down [16:24:24] PROBLEM - Apache HTTP on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:24] PROBLEM - Apache HTTP on mw1258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:25] PROBLEM - HHVM rendering on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:25] PROBLEM - HHVM rendering on mw1174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:26] PROBLEM - HHVM rendering on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:32] ! 
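The failure Krenair dissects above (16:08-16:18) comes down to how getRealmSpecificFilename splits a path into $base and $ext: "langlist" has no extension, so the split latches onto the dots in "..". A Python rendering of the two behaviours; the PHP function name and paths are from the log, but this logic is only an illustration, not the MediaWiki source:

    import os

    def realm_name_buggy(path, realm="labs"):
        # Split on the last dot anywhere in the path, as in the broken case
        i = path.rfind(".")
        if i < 0:  # no dot at all: the "extension optional" branch works fine
            return path + "-" + realm
        base, ext = path[:i], path[i:]
        return base + "-" + realm + ext

    def realm_name_fixed(path, realm="labs"):
        # Split only the final path component's extension
        base, ext = os.path.splitext(path)
        return base + "-" + realm + ext

    p = "/mnt/srv/mediawiki-staging/php-master/../langlist"
    print(realm_name_buggy(p))  # .../php-master/.-labs./langlist -- the bad name
    print(realm_name_fixed(p))  # .../php-master/../langlist-labs -- what was wanted

This is also why Chad's follow-up patch above (gerrit 240378) switches callers to /srv/mediawiki directly: with no ".." in the path, even the naive split lands where intended.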
[16:24:44] PROBLEM - HHVM rendering on mw1240 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:44] PROBLEM - HHVM rendering on mw1253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:44] PROBLEM - HHVM rendering on mw1166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:45] PROBLEM - HHVM rendering on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:45] PROBLEM - HHVM rendering on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:45] PROBLEM - HHVM rendering on mw1066 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:45] PROBLEM - HHVM rendering on mw1044 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:45] PROBLEM - HHVM rendering on mw1084 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:45] PROBLEM - HHVM rendering on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:46] PROBLEM - HHVM rendering on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:46] PROBLEM - HHVM rendering on mw1042 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:47] PROBLEM - HHVM rendering on mw1186 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:58] (for the record, I'm not touching anything in production at the moment.) [16:24:58] PROBLEM - HHVM rendering on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:58] PROBLEM - HHVM rendering on mw1089 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:59] PROBLEM - Apache HTTP on mw1085 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:24:59] PROBLEM - HHVM rendering on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:00] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:00] PROBLEM - Apache HTTP on mw1176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:01] PROBLEM - Apache HTTP on mw1254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:01] PROBLEM - Apache HTTP on mw1247 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:02] PROBLEM - Apache HTTP on mw1236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:02] PROBLEM - HHVM rendering on mw1238 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:03] Glaisher: And enwiki [16:25:03] PROBLEM - HHVM rendering on mw1246 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:03] PROBLEM - HHVM rendering on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:04] PROBLEM - HHVM rendering on mw1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:04] PROBLEM - HHVM rendering on mw1258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:05] PROBLEM - HHVM rendering on mw1244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:06] PROBLEM - Apache HTTP on mw1181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:06] PROBLEM - HHVM rendering on mw1049 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:06] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:25:09] meta giving me 503 [16:25:28] yikes [16:25:52] Error 503 [16:25:53] "Request: GET http://en.wikipedia.org/wiki/Special:Watchlist, from 10.20.0.105 via cp1065 cp1065 ([10.64.0.102]:3128), Varnish XID 289324704 [16:25:55] multiple sites [16:25:55] Forwarded for: 80.176.129.180, 10.20.0.114, 10.20.0.114, 10.20.0.105 [16:25:55] what? 
[16:25:57] Error: 503, Service Unavailable at Wed, 23 Sep 2015 16:24:43 GMT " [16:26:02] urm [16:26:02] Again? [16:26:10] last week was a DDoS [16:26:12] 503's, yes. [16:26:22] What's the Ganglia link? [16:26:27] PROBLEM - Apache HTTP on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:27] PROBLEM - Apache HTTP on mw1180 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:27] PROBLEM - Apache HTTP on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:27] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8967 bytes in 0.494 second response time [16:26:33] PROBLEM - HHVM rendering on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:34] PROBLEM - Apache HTTP on mw1215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:34] PROBLEM - HHVM rendering on mw1179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:34] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:34] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:34] PROBLEM - Apache HTTP on mw1213 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:34] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:34] PROBLEM - Apache HTTP on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:34] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:35] PROBLEM - Apache HTTP on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:45] PROBLEM - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8844 bytes in 0.034 second response time [16:26:46] paravoid ^^ maybe something you want to respond to [16:26:53] PROBLEM - Apache HTTP on mw1038 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:53] PROBLEM - Apache HTTP on mw1043 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:53] PROBLEM - HHVM rendering on mw1107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:53] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:53] PROBLEM - Apache HTTP on mw1035 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:53] PROBLEM - Apache HTTP on mw1162 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:53] PROBLEM - Apache HTTP on mw1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:54] PROBLEM - Apache HTTP on mw1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:55] PROBLEM - Apache HTTP on mw1184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:55] PROBLEM - Apache HTTP on mw1150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:56] PROBLEM - Apache HTTP on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:26:56] PROBLEM - Apache HTTP on mw1103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:11] PROBLEM - LVS HTTPS IPv6 on text-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8892 bytes in 0.029 second response time [16:27:14] DEWP DOWN :O [16:27:19] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8828 bytes in 0.030 second response time [16:27:20] Woop [16:27:25] PROBLEM - LVS HTTPS IPv6 on 
mobile-lb.esams.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8882 bytes in 0.456 second response time [16:27:31] logstash shows loads of database issues but no Jaime [16:27:31] I'm guessing ops is being paged now [16:27:32] enwiki down as well [16:27:32] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8950 bytes in 0.488 second response time [16:27:36] Wikipedia, y u no work? [16:27:38] RECOVERY - Apache HTTP on mw1210 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.720 second response time [16:27:39] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.816 second response time [16:27:39] RECOVERY - Apache HTTP on mw1176 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 7.631 second response time [16:27:39] PROBLEM - HHVM rendering on mw1249 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:39] PROBLEM - HHVM rendering on mw1243 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:27:43] 503 errors hitting [16:27:48] :( [16:27:52] RECOVERY - HHVM rendering on mw1213 is OK: HTTP OK: HTTP/1.1 200 OK - 65058 bytes in 0.109 second response time [16:27:52] RECOVERY - HHVM rendering on mw1077 is OK: HTTP OK: HTTP/1.1 200 OK - 65058 bytes in 0.114 second response time [16:27:53] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.046 second response time [16:27:53] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.050 second response time [16:27:53] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.098 second response time [16:27:53] RECOVERY - Apache HTTP on mw1181 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time [16:27:53] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.039 second response time [16:27:53] RECOVERY - Apache HTTP on mw1036 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [16:27:54] RECOVERY - Apache HTTP on mw1151 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [16:27:55] RECOVERY - HHVM rendering on mw1049 is OK: HTTP OK: HTTP/1.1 200 OK - 65059 bytes in 0.121 second response time [16:27:55] RECOVERY - HHVM rendering on mw1170 is OK: HTTP OK: HTTP/1.1 200 OK - 65059 bytes in 0.115 second response time [16:27:55] RECOVERY - HHVM rendering on mw1149 is OK: HTTP OK: HTTP/1.1 200 OK - 65059 bytes in 0.152 second response time [16:27:56] RECOVERY - Apache HTTP on mw1070 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [16:27:56] RECOVERY - Apache HTTP on mw1107 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [16:28:04] yes we know, working on it [16:28:08] RECOVERY - HHVM rendering on mw1172 is OK: HTTP OK: HTTP/1.1 200 OK - 65059 bytes in 0.119 second response time [16:28:08] RECOVERY - HHVM rendering on mw1178 is OK: HTTP OK: HTTP/1.1 200 OK - 65059 bytes in 0.124 second response time [16:28:09] RECOVERY - HHVM rendering on mw1182 is OK: HTTP OK: HTTP/1.1 200 OK - 65059 bytes in 0.116 second response time [16:28:09] RECOVERY - Apache HTTP on mw1037 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.038 second response time [16:28:10] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: 
HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [16:28:11] RECOVERY - Apache HTTP on mw1044 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.046 second response time [16:28:11] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [16:28:11] RECOVERY - Apache HTTP on mw1045 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [16:28:12] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [16:28:12] PROBLEM - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 8900 bytes in 0.410 second response time [16:28:13] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [16:28:20] heh, I said "temp" because the RECOVERIES started when I was typing :) [16:28:22] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.034 second response time [16:28:22] RECOVERY - Apache HTTP on mw1219 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.032 second response time [16:28:24] RECOVERY - Apache HTTP on mw1065 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.036 second response time [16:28:24] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.036 second response time [16:28:24] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.033 second response time [16:28:24] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.036 second response time [16:28:24] RECOVERY - Apache HTTP on mw1088 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [16:28:24] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.028 second response time [16:28:24] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.039 second response time [16:28:24] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.044 second response time [16:28:24] RECOVERY - HHVM rendering on mw1180 is OK: HTTP OK: HTTP/1.1 200 OK - 65058 bytes in 0.114 second response time [16:28:25] hoo|busy: :) [16:28:27] Good luck guys [16:28:35] It's back. [16:28:44] yeah [16:28:46] yay! 
[16:28:48] yep [16:28:49] greg-g: irony, Jaime's session is right now [16:28:49] The only thing that isn't back is icinga-wm ;) [16:29:37] RECOVERY - Apache HTTP on mw1242 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.026 second response time [16:29:37] RECOVERY - Apache HTTP on mw1046 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [16:29:37] RECOVERY - HHVM rendering on mw1212 is OK: HTTP OK: HTTP/1.1 200 OK - 65058 bytes in 0.123 second response time [16:29:37] RECOVERY - HHVM rendering on mw1039 is OK: HTTP OK: HTTP/1.1 200 OK - 65059 bytes in 0.117 second response time [16:29:38] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 15169 bytes in 0.595 second response time [16:29:47] RECOVERY - HHVM rendering on mw1183 is OK: HTTP OK: HTTP/1.1 200 OK - 65058 bytes in 0.108 second response time [16:29:47] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 65059 bytes in 0.127 second response time [16:29:47] RECOVERY - HHVM rendering on mw1047 is OK: HTTP OK: HTTP/1.1 200 OK - 65058 bytes in 0.121 second response time [16:29:57] RECOVERY - Apache HTTP on mw1038 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.036 second response time [16:29:57] RECOVERY - Apache HTTP on mw1043 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [16:29:57] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.039 second response time [16:29:57] RECOVERY - HHVM rendering on mw1107 is OK: HTTP OK: HTTP/1.1 200 OK - 65059 bytes in 0.128 second response time [16:29:57] RECOVERY - HHVM rendering on mw1093 is OK: HTTP OK: HTTP/1.1 200 OK - 65059 bytes in 0.126 second response time [16:29:57] RECOVERY - HHVM rendering on mw1105 is OK: HTTP OK: HTTP/1.1 200 OK - 65059 bytes in 0.124 second response time [16:29:57] RECOVERY - HHVM rendering on mw1083 is OK: HTTP OK: HTTP/1.1 200 OK - 65059 bytes in 0.164 second response time [16:29:58] RECOVERY - Apache HTTP on mw1162 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [16:29:58] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [16:29:59] RECOVERY - Apache HTTP on mw1035 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.041 second response time [16:29:59] RECOVERY - Apache HTTP on mw1067 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.044 second response time [16:30:00] RECOVERY - Apache HTTP on mw1241 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.027 second response time [16:30:17] RECOVERY - HHVM rendering on mw1240 is OK: HTTP OK: HTTP/1.1 200 OK - 65046 bytes in 0.089 second response time [16:30:17] RECOVERY - HHVM rendering on mw1253 is OK: HTTP OK: HTTP/1.1 200 OK - 65046 bytes in 0.087 second response time [16:30:17] RECOVERY - HHVM rendering on mw1166 is OK: HTTP OK: HTTP/1.1 200 OK - 65046 bytes in 0.098 second response time [16:30:17] RECOVERY - HHVM rendering on mw1066 is OK: HTTP OK: HTTP/1.1 200 OK - 65047 bytes in 0.123 second response time [16:30:18] RECOVERY - HHVM rendering on mw1216 is OK: HTTP OK: HTTP/1.1 200 OK - 65046 bytes in 0.104 second response time [16:30:18] RECOVERY - HHVM rendering on mw1054 is OK: HTTP OK: HTTP/1.1 200 OK - 65046 bytes in 0.112 second response time [16:30:18] RECOVERY - HHVM rendering on mw1175 is OK: HTTP OK: HTTP/1.1 200 OK - 65046 bytes in 0.124 
second response time [16:30:18] RECOVERY - HHVM rendering on mw1186 is OK: HTTP OK: HTTP/1.1 200 OK - 65047 bytes in 0.117 second response time [16:30:18] RECOVERY - HHVM rendering on mw1099 is OK: HTTP OK: HTTP/1.1 200 OK - 65054 bytes in 0.127 second response time [16:30:19] RECOVERY - HHVM rendering on mw1164 is OK: HTTP OK: HTTP/1.1 200 OK - 65046 bytes in 0.108 second response time [16:30:19] RECOVERY - HHVM rendering on mw1044 is OK: HTTP OK: HTTP/1.1 200 OK - 65047 bytes in 0.119 second response time [16:30:20] RECOVERY - HHVM rendering on mw1217 is OK: HTTP OK: HTTP/1.1 200 OK - 65046 bytes in 0.108 second response time [16:30:32] RECOVERY - Apache HTTP on mw1110 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.035 second response time [16:30:33] RECOVERY - Apache HTTP on mw1085 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.036 second response time [16:30:33] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.037 second response time [16:30:33] RECOVERY - HHVM rendering on mw1249 is OK: HTTP OK: HTTP/1.1 200 OK - 65046 bytes in 0.082 second response time [16:30:34] RECOVERY - HHVM rendering on mw1113 is OK: HTTP OK: HTTP/1.1 200 OK - 65046 bytes in 0.119 second response time [16:30:34] RECOVERY - HHVM rendering on mw1089 is OK: HTTP OK: HTTP/1.1 200 OK - 65047 bytes in 0.131 second response time [16:30:35] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 65047 bytes in 0.116 second response time [16:30:35] RECOVERY - HHVM rendering on mw1060 is OK: HTTP OK: HTTP/1.1 200 OK - 65047 bytes in 0.158 second response time [16:30:36] RECOVERY - HHVM rendering on mw1091 is OK: HTTP OK: HTTP/1.1 200 OK - 65047 bytes in 0.152 second response time [16:30:36] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10501 bytes in 0.090 second response time [16:30:39] RECOVERY - Apache HTTP on mw1254 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.025 second response time [16:30:39] RECOVERY - Apache HTTP on mw1247 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.024 second response time [16:30:39] RECOVERY - Apache HTTP on mw1236 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.029 second response time [16:30:39] RECOVERY - HHVM rendering on mw1243 is OK: HTTP OK: HTTP/1.1 200 OK - 65046 bytes in 0.093 second response time [16:30:39] RECOVERY - LVS HTTPS IPv6 on mobile-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10544 bytes in 0.591 second response time [16:30:46] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 15140 bytes in 0.651 second response time [16:31:21] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 58.33% of data above the critical threshold [500.0] [16:31:41] RECOVERY - Apache HTTP on mw1238 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.027 second response time [16:32:30] RECOVERY - Apache HTTP on mw1257 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.043 second response time [16:32:30] RECOVERY - HHVM rendering on mw1238 is OK: HTTP OK: HTTP/1.1 200 OK - 65046 bytes in 0.087 second response time [16:33:30] RECOVERY - HHVM rendering on mw1246 is OK: HTTP OK: HTTP/1.1 200 OK - 65050 bytes in 0.093 second response time [16:33:30] RECOVERY - HHVM rendering on mw1258 is OK: HTTP OK: HTTP/1.1 200 OK - 65050 bytes in 0.090 second response time [16:33:41] RECOVERY - HHVM rendering on 
mw1244 is OK: HTTP OK: HTTP/1.1 200 OK - 65050 bytes in 0.100 second response time [16:33:51] RECOVERY - LVS HTTPS IPv4 on mobile-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 10570 bytes in 0.473 second response time [16:33:57] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 3.33% of data above the critical threshold [1000.0] [16:34:00] RECOVERY - Apache HTTP on mw1239 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.023 second response time [16:34:56] (03CR) 10MarcoAurelio: [C: 031] Adding "*.nps.gov" to wgCopyUploadsDomains. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240376 (https://phabricator.wikimedia.org/T113188) (owner: 10Shanmugamp7) [16:35:01] PROBLEM - Cassanda CQL query interface on restbase2003 is CRITICAL: Connection refused [16:35:21] PROBLEM - Cassandra database on restbase2003 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (cassandra), command name java, args CassandraDaemon [16:35:22] !log ori@tin Synchronized php-1.26wmf23/includes/page: I952068d2d: Reduced the DOS potential of 404 page floods (duration: 00m 12s) [16:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:35:56] !log ori@tin Started scap: (no message) [16:36:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:36:31] PROBLEM - Restbase endpoints health on restbase2003 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:36:45] Coren, wait, you added a labs key to prod...? [16:36:50] PROBLEM - Restbase root url on restbase2003 is CRITICAL: Connection refused [16:36:50] PROBLEM - Restbase endpoints health on restbase2002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [16:37:01] PROBLEM - Restbase root url on restbase2002 is CRITICAL: Connection refused [16:41:10] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1666744 (10Krenair) @Coren: It looks like @JUnikowski_WMF has added that key to their labs account, so I think a new one needs to be made now. [16:43:38] 6operations, 10Wikimedia-Mailing-lists, 7Documentation: Overhaul Mailman documentation - https://phabricator.wikimedia.org/T109534#1666748 (10JohnLewis) Looking over this a bit more. I'm going to try and get this done over the weekend starting Friday. I'm also going to aim to get https://meta.wikimedia.org/... [16:48:30] ori, Reedy: would be nice to get https://gerrit.wikimedia.org/r/#/c/87269/ in to shape [16:48:58] (03PS1) 10coren: Block POSTs to some wiki URLs [puppet] - 10https://gerrit.wikimedia.org/r/240389 [16:49:01] bblack: ^^ [16:49:32] (03Abandoned) 10Florianschmidtwelzow: Run suggested search query in wmf wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/237029 (https://phabricator.wikimedia.org/T105202) (owner: 10Florianschmidtwelzow) [16:49:46] Krenair: No I didn't - Chedasaurus had updated the task description. [16:52:24] (03CR) 10Merlijn van Deen: "Sorry, I should have formulated that more clearly. 
I assume there is a production-wide base hiera configuration, which could have" [puppet] - 10https://gerrit.wikimedia.org/r/160628 (owner: 10Matanya) [16:52:39] 6operations, 10ops-codfw, 10Incident-20150617-LabsNFSOutage: Labstore2001 controller or shelf failure - https://phabricator.wikimedia.org/T102626#1666778 (10Papaul) p:5High>3Normal [16:53:23] godog: what's up with rb200x ? [16:53:32] both rb and cass are crying and dying [16:53:55] cryin' and dyin' [16:54:01] you're setting it up i guess [16:54:43] 6operations, 10ops-codfw: audit and update all codfw server's racktables info - https://phabricator.wikimedia.org/T84891#1666785 (10Papaul) 5Open>3Resolved Closing this task, it was already done by Chris when he was here back in January. [16:54:50] (03PS2) 10Alex Monk: Use /srv/mediawiki directly instead of $IP/../ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240378 (https://phabricator.wikimedia.org/T112006) (owner: 10Chad) [16:55:29] (03CR) 10Alex Monk: "It only has issues with such paths when the filename has no extension..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240378 (https://phabricator.wikimedia.org/T112006) (owner: 10Chad) [16:57:11] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [16:58:14] 6operations, 10MediaWiki-General-or-Unknown, 10Wikimedia-Video: videoscaling doesn't clean locally transcoded files from the filesystem - https://phabricator.wikimedia.org/T113447#1666802 (10brion) Are these from completed jobs or jobs that timed out during the shell-out, halting HHVM PHP execution before it... [17:00:32] PROBLEM - HHVM rendering on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:01:24] PROBLEM - Apache HTTP on mw1122 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:02:03] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [17:05:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 818 [17:10:13] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1118 [17:11:10] !log ori@tin Synchronized php-1.26wmf23/includes/page/Article.php: (no message) (duration: 00m 17s) [17:11:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:34] !log ori@tin Synchronized php-1.26wmf24/includes/page/Article.php: (no message) (duration: 00m 13s) [17:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:12:54] (03CR) 10Dzahn: [C: 031] Disable the ferm rules cache [puppet] - 10https://gerrit.wikimedia.org/r/240335 (https://phabricator.wikimedia.org/T113380) (owner: 10Muehlenhoff) [17:14:17] after investigation it seems I was mistaken and the morning SWAT didn't have the intended effect. could someone check if that change made it to terbium for mwscript ...
--wiki wikidatawiki ?: https://gerrit.wikimedia.org/r/#/c/240365/1/extensions/Wikibase/client/includes/Changes/AffectedPagesFinder.php,cm [17:14:44] ostriches: ^^ [17:15:11] (03PS2) 10Dzahn: Fix definition of deployable networks [puppet] - 10https://gerrit.wikimedia.org/r/240310 (https://phabricator.wikimedia.org/T113351) (owner: 10Muehlenhoff) [17:15:13] RECOVERY - check_mysql on db1008 is OK: Uptime: 5272040 Threads: 1 Questions: 36284451 Slow queries: 35697 Opens: 86757 Flush tables: 2 Open tables: 64 Queries per second avg: 6.882 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0 [17:16:35] jzerebecki: Seems to be part of wmf24, but not wmf23 [17:16:45] Hmm, I sync'd 23 too [17:17:39] git log says it's there on tin. [17:17:58] 10Ops-Access-Requests, 6operations, 10Wikimedia-Blog, 5Patch-For-Review: stat1003/EventLogging access for asherman - https://phabricator.wikimedia.org/T113118#1666881 (10asherman) @coren: No problem, Will this work (below)? This is different from the SSH key I set up in labs. -----BEGIN RSA PRIVATE KEY---... [17:18:09] demon@tin Synchronized php-1.26wmf23/extensions/Wikidata: (no message) (duration: 00m 21s) [17:18:15] Er, or not. [17:18:16] can't see it on staging either (on tin) [17:18:22] wmf23 doesn't seem to have it [17:19:14] AaronSchulz: I at least did most of the updates today :P. Will see about getting the error part done later today [17:19:49] (03CR) 10Chad: [C: 032] Adding "*.nps.gov" to wgCopyUploadsDomains. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240376 (https://phabricator.wikimedia.org/T113188) (owner: 10Shanmugamp7) [17:19:52] (03CR) 10Dzahn: [C: 032] "fixes firewalling on mira, noop on tin but required either way" [puppet] - 10https://gerrit.wikimedia.org/r/240310 (https://phabricator.wikimedia.org/T113351) (owner: 10Muehlenhoff) [17:20:17] 10Ops-Access-Requests, 6operations, 10Wikimedia-Blog, 5Patch-For-Review: stat1003/EventLogging access for asherman - https://phabricator.wikimedia.org/T113118#1666889 (10JanZerebecki) That is the private key. Please generate a new one and only send out the public part. [17:20:20] !log ori@tin scap failed: OSError [Errno 2] No such file or directory: '/var/lock/scap' (duration: 44m 24s) [17:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:22:03] Oh wow... [17:22:20] ostriches: jzerebecki: Guess you need to manually bump the submodule... I wont do it, am not working today [17:22:53] ostriches: should I do that? [17:23:05] If you would, I'll gladly merge once you've got it up [17:23:23] Krenair: Heh... [17:23:37] (03PS1) 10Eevans: typo s/Cassanda/Cassandra/ [puppet] - 10https://gerrit.wikimedia.org/r/240397 [17:23:46] ori: Do you have a stacktrace for that failure in scap? I fixed something similar the other day... [17:25:38] (03CR) 10Subramanya Sastry: [C: 031] ":)" [puppet] - 10https://gerrit.wikimedia.org/r/240397 (owner: 10Eevans) [17:25:57] (03CR) 10Chad: [V: 032] Adding "*.nps.gov" to wgCopyUploadsDomains. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/240376 (https://phabricator.wikimedia.org/T113188) (owner: 10Shanmugamp7) [17:26:22] (03CR) 10Mobrovac: [C: 031] typo s/Cassanda/Cassandra/ [puppet] - 10https://gerrit.wikimedia.org/r/240397 (owner: 10Eevans) [17:26:35] nuria: only the production mobile, upload, and text clusters so far [17:27:01] !log demon@tin Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 13s) [17:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:19] (which is the bulk of everything in prod, but I wanted to time out the other few manually) [17:27:26] 10Ops-Access-Requests, 6operations, 10Wikimedia-Blog, 5Patch-For-Review: stat1003/EventLogging access for asherman - https://phabricator.wikimedia.org/T113118#1666930 (10asherman) Ahh I see what I did wrong. My apologies, I have generated a new SSH key and here is the public key: ssh-rsa AAAAB3NzaC1yc2EAA... [17:34:54] (03CR) 10Dzahn: "puppet applied the diff on mira, instead of the broken string it now uses +&R_SERVICE(tcp, 80, $DEPLOYABLE_NETWORKS); as expected. all goo" [puppet] - 10https://gerrit.wikimedia.org/r/240310 (https://phabricator.wikimedia.org/T113351) (owner: 10Muehlenhoff) [17:35:02] !log enabling, and forcing puppet run on restbase2003 [17:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:40:55] bblack: ok, if you can update ticket when you update the beta cluster it will be great for us to know when to test there [17:41:24] RECOVERY - Cassanda CQL query interface on restbase2003 is OK: TCP OK - 0.034 second response time on port 9042 [17:43:09] ostriches: https://gerrit.wikimedia.org/r/#/c/240427/ [17:45:10] !log demon@tin Synchronized php-1.26wmf23/extensions/Wikidata: (no message) (duration: 00m 25s) [17:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:45:18] jzerebecki: ^^^ [17:45:36] ostriches: thx [17:45:51] yw [17:48:07] akosiaris: yt? [17:48:31] 6operations, 5Patch-For-Review: setup / deploy nobelium for elastic-search testing in labs - https://phabricator.wikimedia.org/T113282#1667018 (10EBernhardson) We should move the mounted disk, it's currently mounted at /srv but for elasticsearch we should mount it to: /var/lib/elasticsearch/labsearch [17:49:17] (03PS1) 10John F. Lewis: mailman: sudo mailman_check_queue as list [puppet] - 10https://gerrit.wikimedia.org/r/240430 (https://phabricator.wikimedia.org/T113326) [17:49:24] !log starting Cassandra on restbase2003 [17:49:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:49:34] RECOVERY - Cassandra database on restbase2003 is OK: PROCS OK: 1 process with UID = 113 (cassandra), command name java, args CassandraDaemon [17:49:38] ottomata: yes [17:50:19] about aqs puppet patch [17:50:27] i'm thinking about submitting another patch that refactores those two roles [17:50:45] i think that we don't need to copy/paste cassandra and restbase puppet code into thse classes [17:50:48] i think the existing roles work fine. [17:51:03] they are configurable with hiera, so [17:51:12] should be able to just include them and use hiera, no? [17:51:29] the roles are not configurable and thank god for that [17:51:40] it's the modules that are configurable [17:52:20] what do you want to refactor in those roles ? [17:52:28] reusing roles btw is a bad idea [17:55:36] ori: Do you have a stacktrace for that failure in scap? I fixed something similar the other day... 
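For readers following the Wikidata thread above: a rough sketch of what the manual submodule bump looks like on the deployment host. The staging path, branch, and commit hash are illustrative placeholders, not the actual change that was merged.

    # Point the wmf23 Wikidata submodule at the commit carrying the fix:
    cd /srv/mediawiki-staging/php-1.26wmf23/extensions/Wikidata
    git fetch origin
    git checkout <commit-with-the-fix>
    cd ..
    git add Wikidata                      # stage the new submodule pointer
    git commit -m "Bump Wikidata submodule"
    # then sync it out, e.g.: sync-dir php-1.26wmf23/extensions/Wikidata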
[17:55:43] i sync-filed a change while a scap was in progress [17:56:06] Ah, started scap, sync file'd, scap failed? [17:56:13] i don't have the trace. but it was appropriate for scap to throw an error. it just wasn't the right error. [17:56:23] no: started scap; attempted to sync-file; sync-file failed. [17:56:30] * ostriches nods [17:59:54] (03CR) 10Dzahn: "there are no spaces between the IPs here:" [puppet] - 10https://gerrit.wikimedia.org/r/240310 (https://phabricator.wikimedia.org/T113351) (owner: 10Muehlenhoff) [17:59:59] akosiaris: reusing roles is a bad idea?! [18:00:03] the roles are configurable [18:00:04] twentyafterfour: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150923T1800). Please do the needful. [18:00:05] because they include modules [18:00:47] 6operations, 10MediaWiki-General-or-Unknown, 10Wikimedia-Video: videoscaling doesn't clean locally transcoded files from the filesystem - https://phabricator.wikimedia.org/T113447#1667083 (10brion) I notice a lot of the webms have 2-pass log files alongside them, so definitely looks like jobs got shut down w... [18:01:03] akosiaris: all this cassandra role does is include the cassandra module, set up firewalls and monitoring [18:01:05] ottomata: unfortunately yes. they rarely are reusable [18:01:12] it seems like a really nice WMF specific use of the module. [18:01:24] it's pretty generic [18:01:29] 6operations, 10MediaWiki-General-or-Unknown, 10Wikimedia-Video: HHVM timeouts mean videoscaling can't clean locally transcoded files from the filesystem - https://phabricator.wikimedia.org/T113447#1667084 (10brion) [18:01:44] !log restbase deploying f65313ed [18:01:49] 6operations, 10MediaWiki-extensions-TimedMediaHandler: Frequent job timeouts on HHVM video scalers - https://phabricator.wikimedia.org/T113284#1667093 (10brion) [18:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:01:51] 6operations, 10MediaWiki-General-or-Unknown, 10Wikimedia-Video: HHVM timeouts mean videoscaling can't clean locally transcoded files from the filesystem - https://phabricator.wikimedia.org/T113447#1665198 (10brion) [18:02:44] it is pretty generic and that's mostly because it's good puppet code. but let's say you want to add another firewall or monitoring rule. how do you go about it ? [18:03:07] if cassandra, you only add ones that are good for generic wmf use of cassandra [18:03:32] what does "if cassandra" mean ? [18:03:42] if the monitoring is for cassandra [18:03:48] not for say, restbase. [18:04:08] so what is you need the rule but it is not good for generic wmf use of cassandra ? [18:04:13] as in, "Oh, i want to monitor number of keyspaces (I know 0 about cassandra) then sure, put it in the role [18:04:14] what if* [18:04:33] then it goes elsewhere, i had never heard anyone else say that composing roles was bad [18:04:38] my point is that the team using the cassandra role might not want your check or vice versa [18:04:43] say it was an aqs + cassandra specific monitoring [18:04:44] then i would have [18:05:09] class ...aqs { [18:05:09] ... [18:05:10] include role::cassandra.
[18:05:10] monitor::specialthing [18:05:11] } [18:05:40] yeah, don't do that [18:05:50] roles including roles turns out to be a mess [18:06:17] CI did it on various occasions and then all sets of cross module/cross role dependencies started crawling up [18:06:27] we do it a lot in analytics, and it works well [18:06:42] it works, but no one wants to read those roles apart from you [18:06:50] haha [18:07:04] not to mention the fact that no one else uses them [18:07:09] it works so well though, no? it makes it really easy to configure things all on one instance in labs, or on multiple instances in prod [18:07:13] true. [18:07:38] akosiaris: but, then again 'role' is a wmf concept [18:07:50] i tried to argue for a wmf specific module path a couple of years ago [18:07:56] please don't include roles in roles :) [18:08:12] two types of modules, one for generic software configuration, and another for wmf's use of it [18:08:28] then what you say would make more sense, as in, the firewall rules could be in a module [18:08:29] ottomata: yeah I remember. the counter argument is that it was very difficult to get a really generic module [18:08:34] no way [18:08:41] you just have to be thorough! [18:09:06] it's fine if it doesn't fit users outside of wmf [18:09:15] but wmf has many environments, and it should be useable in all of them [18:09:19] ANNNYYYWAY [18:09:20] so [18:09:33] you think it is right to copy/paste everything from the existant cassandra/restbase roles? [18:09:46] existent! [18:09:51] or extant [18:10:00] ori: too much Halle Berry [18:10:05] :P [18:10:07] haha [18:10:22] thank you ori, I will now send myself to 3 days of solitary confinement for my transgression :) [18:10:23] * ori didn't get the reference [18:10:47] ottomata: so my take would be: a) the cassandra role should be merged into the restbase role, since it is basically powering that thing and nothing else [18:11:08] b) that the aqs role should be mostly copying that restbase role [18:11:22] akosiaris: what if we eventually don't want to colocate restbase + cassandra? [18:11:24] that should allow both teams enough flexibility while still allowing to reuse the bulk of the code [18:11:34] or decide to use main restbase for aqs? [18:11:52] ottomata: none of those 2 options seems very plausible at this point [18:12:10] quite the contrary [18:12:33] sorry, meant use the main *cassandra* for aqs [18:12:34] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [18:12:51] yeah, I got that [18:13:12] and I already said I don't really see it plausible at this point [18:13:19] ok, welp, i don't like it so much, but not enough to argue much more :) i'm fine with an aqs role that combines the existing two analytics roles [18:13:19] otherwise we would not be having this conversation [18:13:43] ottomata: thanks [18:13:50] glad I managed to convince you [18:14:03] btw, role/analytics/aqs.pp or role/aqs.pp ? [18:14:05] haha, i wouldn't say that! i think composing roles is good [18:14:08] good question [18:14:10] i don't know [18:14:13] maybe just aqs [18:14:13] I prefer the latter tbh [18:14:18] since it is outside of analytics cluster [18:14:30] ok. it's a deal [18:14:41] k i will submit a patch cause i'm in it already atm [18:14:42] :) [18:20:41] 6operations, 10RESTBase, 6Services: RESTBase and domain renames - https://phabricator.wikimedia.org/T113307#1667204 (10mobrovac) >>!
In T113307#1664054, @Eevans wrote: > Where the domain is a component of the partition key, this means completely rewriting the affected rows, no? Yup. > Could we maybe just r... [18:22:09] (03PS7) 10Ottomata: Add Analytics Query Service role [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [18:23:20] (03CR) 10Aude: "we still want to do this, but should announce beforehand and set a date/time to do this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/208655 (https://phabricator.wikimedia.org/T94416) (owner: 10Aude) [18:23:58] (03CR) 10Ori.livneh: "What I like about this change:" [puppet] - 10https://gerrit.wikimedia.org/r/239998 (owner: 10Ori.livneh) [18:25:09] akosiaris: ahh this is what i'm talking about! i shouldn't have to edit anything in the module to set up restbase! [18:25:16] there is a template file that lives there that I have to rename [18:26:27] (03PS2) 10Ori.livneh: Update mod_status configuration [puppet] - 10https://gerrit.wikimedia.org/r/239998 [18:26:56] (03PS8) 10Ottomata: Add Analytics Query Service role [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [18:28:40] (03PS1) 10Chad: Remove PHP localization cache code [tools/scap] - 10https://gerrit.wikimedia.org/r/240440 [18:30:44] ottomata: yes you shouldn't. fix it ? [18:31:03] why rename a template file btw ? [18:32:42] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1667264 (10JUnikowski_WMF) Do I need to gen a new key and post it here? [18:32:44] it was named analytics-restbase [18:32:52] i named it aqs [18:32:57] restbase/config.aqs.yaml.erb [18:32:59] vs [18:33:04] restbase/config.analytics.yaml.erb [18:35:23] 6operations, 10RESTBase, 6Services: RESTBase and domain renames - https://phabricator.wikimedia.org/T113307#1667269 (10Eevans) >>! In T113307#1667204, @mobrovac wrote: >>>! In T113307#1664054, @Eevans wrote: [ ... ] >> Could we maybe just rewrite all the way down into storage (in perpetuity of course)? >... 
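A purely hypothetical illustration of the row-rewriting the domain-rename thread above is worried about; the table and column names are invented, not the real RESTBase schema. CQL has no INSERT ... SELECT, so when the domain is part of the partition key every affected row has to be read back and re-inserted under the new key:

    # Read the rows stored under the old domain...
    cqlsh -e "SELECT key, value FROM storage.data WHERE domain = 'old.wikipedia.org';"
    # ...then, for each row returned, re-insert it under the new domain:
    cqlsh -e "INSERT INTO storage.data (domain, key, value) VALUES ('new.wikipedia.org', 'k1', 'v1');"
    # ...and finally drop the old partition:
    cqlsh -e "DELETE FROM storage.data WHERE domain = 'old.wikipedia.org';"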
[18:35:25] madhuvishy: it looks like we made the right decision regarding gobblin [18:35:31] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767 [18:35:31] https://cwiki.apache.org/confluence/display/KAFKA/Copycat+Data+API [18:36:00] ottomata: whaa nice [18:36:47] oops, wrong chat :) [18:40:29] (03PS2) 10Andrew Bogott: Remove the Graphite/Diamond based conntrack saturation check [puppet] - 10https://gerrit.wikimedia.org/r/239332 (owner: 10Muehlenhoff) [18:41:31] (03CR) 10Andrew Bogott: [C: 032] Remove the Graphite/Diamond based conntrack saturation check [puppet] - 10https://gerrit.wikimedia.org/r/239332 (owner: 10Muehlenhoff) [18:48:25] (03PS1) 1020after4: group1 wikis to 1.26wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240444 [18:48:50] (03CR) 1020after4: [C: 032] group1 wikis to 1.26wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240444 (owner: 1020after4) [18:49:44] PROBLEM - puppet last run on ms-be3003 is CRITICAL: CRITICAL: puppet fail [18:50:58] (03PS8) 10Andrew Bogott: Install Extension:Translate on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [18:51:12] (03Abandoned) 10Jgreen: Allocate frack management subnets [dns] - 10https://gerrit.wikimedia.org/r/175718 (owner: 10Mark Bergsma) [18:53:24] (03CR) 10Andrew Bogott: [C: 031] Install Extension:Translate on labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [18:53:45] (03Merged) 10jenkins-bot: group1 wikis to 1.26wmf24 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240444 (owner: 1020after4) [18:54:34] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.26wmf24 [18:54:38] (03CR) 10Ori.livneh: [C: 04-1] "Thanks Chad. Looks like I am owed a slap for this :) However, the third change (If3bd73f6b4) should not be reverted -- that bit was succes" [tools/scap] - 10https://gerrit.wikimedia.org/r/240440 (owner: 10Chad) [18:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:21] !log snapshot1001.eqiad.wmnet returned [12]: rsync: write failed on "/srv/mediawiki/wikiversions.cdb": No space left on device [18:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:55:43] I assume snapshot1001 being full is nothing I should worry about too much? [18:56:07] (03CR) 10Andrew Bogott: [C: 032] maintain-replicas: Do not record centralauth in meta_p.wiki [software] - 10https://gerrit.wikimedia.org/r/221042 (https://phabricator.wikimedia.org/T101750) (owner: 10Alex Monk) [18:57:03] (03CR) 10Andrew Bogott: [V: 032] maintain-replicas: Do not record centralauth in meta_p.wiki [software] - 10https://gerrit.wikimedia.org/r/221042 (https://phabricator.wikimedia.org/T101750) (owner: 10Alex Monk) [18:57:46] twentyafterfour, hmm... apergos? [18:58:00] do not worry [18:58:09] I will be reinstalling it tomorrow since it didn't happen today [18:58:14] I should just ack that [18:58:33] apergos: or remove it from mediawiki-installation dsh group [18:59:25] ACKNOWLEDGEMENT - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=58%): arielglenn to be reinstalled shortly [18:59:34] I should have done that earlier today. braindead [19:00:22] will it be installed with hhvm?
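A sketch of the "remove it from mediawiki-installation dsh group" suggestion, assuming the stock dsh layout where a group is just a plain-text list of hostnames; the file path may differ on the deployment host:

    # Is snapshot1001 still in the sync target list?
    grep -n 'snapshot1001' /etc/dsh/group/mediawiki-installation
    # Drop it so syncs skip the full host until it is reinstalled:
    sudo sed -i '/^snapshot1001\.eqiad\.wmnet$/d' /etc/dsh/group/mediawiki-installation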
[19:02:08] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 65 data above and 9 below the confidence bounds [19:03:29] (03PS1) 10Madhuvishy: analytics: Add cron to drop Eventlogging data older than 90 days from hadoop [puppet] - 10https://gerrit.wikimedia.org/r/240449 (https://phabricator.wikimedia.org/T106253) [19:10:49] (03Abandoned) 1020after4: Move maniphest status settings into custom/wmf-defaults.php [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) (owner: 1020after4) [19:16:12] What is going on with categories? [19:16:19] all broken [19:16:21] the formatting [19:16:23] Steinsplitter: example? [19:16:25] https://commons.wikimedia.org/wiki/Category:Images_requiring_rotation_by_bot [19:16:31] just pick a random cat on commons [19:16:59] or https://commons.wikimedia.org/wiki/Category:Tests [19:17:35] Steinsplitter: starting today? we just updated the MW version on commons: https://tools.wmflabs.org/sal/log/AU_7jqgx1oXzWjit5phG [19:17:46] just as in 30 minutes ago [19:17:59] yes [19:18:16] the cats are completely unusable now. [19:18:27] could be related to the lessphp change [19:18:30] i'll investigate [19:18:33] ori: thank you [19:18:36] Steinsplitter: could you file a bug meanwhile? [19:18:36] thanks [19:18:38] RECOVERY - puppet last run on ms-be3003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:18:42] ori: sure :) [19:18:45] thanks! [19:21:21] 6operations: Category design broken after 1.26wmf24 - https://phabricator.wikimedia.org/T113511#1667454 (10Steinsplitter) 3NEW [19:22:08] 6operations, 6Commons: Category design broken after 1.26wmf24 - https://phabricator.wikimedia.org/T113511#1667468 (10Steinsplitter) [19:22:16] (03PS2) 10Dzahn: mailman: sudo mailman_check_queue as list [puppet] - 10https://gerrit.wikimedia.org/r/240430 (https://phabricator.wikimedia.org/T113326) (owner: 10John F. Lewis) [19:22:51] (03CR) 10Dzahn: [C: 032] "yes, this must run as the list user, thanks for the fix. checked in compiler (917)" [puppet] - 10https://gerrit.wikimedia.org/r/240430 (https://phabricator.wikimedia.org/T113326) (owner: 10John F. Lewis) [19:25:47] icinga-wm: so tell me about it [19:26:10] JohnFLewis: fix works, that made it CRIT, actually one is above the 42 [19:26:42] mutante: cool. I'll make the next patch now to stop shunting :) [19:27:18] Steinsplitter, greg-g: I think it's https://gerrit.wikimedia.org/r/#/c/234592/ ; ping jdlrobson / FlorianSW for fix [19:28:47] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above 42 [19:28:50] (03PS1) 10John F. Lewis: mailman: don't monitor shunt, monitor bounces [puppet] - 10https://gerrit.wikimedia.org/r/240470 (https://phabricator.wikimedia.org/T113326) [19:28:54] JohnFLewis: :) ok! [19:29:06] acm [19:29:08] ack [19:29:43] (03CR) 10Dzahn: [C: 032] "per John's comment on ticket. i agree." [puppet] - 10https://gerrit.wikimedia.org/r/240470 (https://phabricator.wikimedia.org/T113326) (owner: 10John F. Lewis) [19:29:47] (03PS2) 10Dzahn: mailman: don't monitor shunt, monitor bounces [puppet] - 10https://gerrit.wikimedia.org/r/240470 (https://phabricator.wikimedia.org/T113326) (owner: 10John F. Lewis) [19:30:23] ori: the change affects only the minerva skin (mobile), not Vector, or at least it should. Maybe there are undesired side effects.
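One quick way to confirm the timing Steinsplitter and ori are lining up here is the siteinfo API, which reports the MediaWiki version a wiki is actually serving:

    curl -s 'https://commons.wikimedia.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json' \
      | grep -o '"generator":"[^"]*"'
    # expect something like: "generator":"MediaWiki 1.26wmf24"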
[19:30:37] FlorianSW: I think there was a related set of changes to core [19:30:56] ori: yeah, I see: https://gerrit.wikimedia.org/r/#/c/233084/ (looked at the associated task) [19:31:51] seems like reverting https://gerrit.wikimedia.org/r/#/c/233085/ for now is probably the quickest way to get this fixed [19:33:38] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/240470 (https://phabricator.wikimedia.org/T113326) (owner: 10John F. Lewis) [19:34:11] * FlorianSW is testing ori's change [19:35:53] 6operations, 5Patch-For-Review: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1667570 (10Dzahn) The first change fixed the existing monitoring. It turned CRIT because shunt was above 42. So proof that it worked. The second change removes the shunt queue and adds monitoring bo... [19:36:29] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below 42 [19:36:48] :) [19:37:06] 6operations, 5Patch-For-Review: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1667572 (10Dzahn) https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=fermium&service=mailman_queue_size is now OK again but as opposed to before we know it would trigger :) thanks [19:37:21] 6operations, 5Patch-For-Review: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1667573 (10Dzahn) 5Open>3Resolved [19:37:31] 6operations: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1661872 (10Dzahn) [19:37:50] 6operations: check mailman queue size monitoring - https://phabricator.wikimedia.org/T113326#1661872 (10Dzahn) a:5Dzahn>3JohnLewis [19:38:53] Steinsplitter: have you created a task for the problem? [19:39:08] yes [19:39:08] FlorianSW: T113511 [19:39:18] ori, Steinsplitter thanks :) [19:43:03] 6operations, 10Wikimedia-General-or-Unknown, 7Database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1667594 (10Krenair) >>! In T26675#1547385, @Krenair wrote: > Maybe we could restore the revision from a dump. Might it be in enwiki-2... [19:43:09] Steinsplitter: should be fixed now [19:43:18] thx [19:43:24] np, thanks for the report [19:43:58] and thanks FlorianSW for testing / reviewing [19:44:18] ori: np, I'll upload a fix for the "real" problem after your revert is merged in master :) [19:45:00] !log enabling puppet, and forcing a run on restbase2004 [19:45:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:47:20] ori: category pages are cached, too, right? (To be clear: output that's generated through CategoryViewer) [19:48:17] !log starting Cassandra on restbase2004 [19:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:50:18] FlorianSW: i think so; purging might be necessary [19:50:26] !log enabling puppet, and forcing a run on restbase2005 [19:50:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:50:52] ori: Wouldn't waiting 30 days again work, too? I'm not sure if purging all pages is a really good idea :] [19:51:29] yeah, i wasn't suggesting purging all pages. a manual purge of the odd page cached in a bad state seems ok.
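A minimal sketch of the "manual purge of the odd page" ori describes; action=purge generally wants a POST, and depending on wiki configuration a logged-in session or token may be required:

    curl -s -X POST 'https://commons.wikimedia.org/w/api.php' \
      --data 'action=purge&titles=Category:Tests&format=json'
    # or append ?action=purge to the page URL and confirm in the browser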
[19:52:04] !log starting Cassandra on restbase2005 [19:52:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:52:31] ah, ok, my fault :) [19:55:20] (03PS2) 10Madhuvishy: analytics: Add cron to drop Eventlogging data older than 90 days from hadoop [puppet] - 10https://gerrit.wikimedia.org/r/240449 (https://phabricator.wikimedia.org/T106253) [19:56:44] !log enabling puppet, and forcing a run on restbase2006 [19:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:56] (03CR) 10Florianschmidtwelzow: Update mediawiki version regex to support semantic version (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228039 (https://phabricator.wikimedia.org/T67306) (owner: 1020after4) [19:58:04] !log starting Cassandra on restbase2006 [19:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:00:04] gwicke cscott arlolra subbu mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150923T2000). [20:02:06] ebernhardson: btw you should have root on nobelium too [20:03:18] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.67% of data above the critical threshold [1000.0] [20:03:37] ebernhardson: did that patch for multi dc es stuff merge? [20:03:53] yuvipanda: yes, but it's not live anywhere [20:04:23] and due to the cross-extension dependency we would probably fail a few thousand requests trying to cherry pick it out [20:04:31] best to wait for the train deploy when it will switch the codebase atomically [20:04:46] ebernhardson: sure [20:04:47] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1667658 (10greg) >>! In T100519#1655357, @greg wrote: >>>! In T100519#1529708, @BBlack wrote: >> Basically, yeah. I ran down a similar plan... [20:04:53] (03PS3) 10Ottomata: analytics: Add cron to drop Eventlogging data older than 90 days from hadoop [puppet] - 10https://gerrit.wikimedia.org/r/240449 (https://phabricator.wikimedia.org/T106253) (owner: 10Madhuvishy) [20:04:55] ebernhardson: also awesome to have it merged etc ;) [20:05:03] ebernhardson: will also give us time to do the network ports stuff [20:05:19] ebernhardson: can you verify that you have root on nobelium? [20:05:28] yuvipanda: already did, works fine [20:05:30] Krenair: what's the gerrit search operator to check groups? [20:05:33] ebernhardson: \o/ awesome [20:05:51] yuvipanda: annoyingly the elasticsearch plugins didn't install, it looks like that's gated on some production flag [20:06:06] yuvipanda: so i thought ...it's only a test, fsck it, and cloned the repository into place... but es still isn't happy :P i'll poke later [20:06:07] (03CR) 10Ottomata: [C: 032] analytics: Add cron to drop Eventlogging data older than 90 days from hadoop [puppet] - 10https://gerrit.wikimedia.org/r/240449 (https://phabricator.wikimedia.org/T106253) (owner: 10Madhuvishy) [20:06:29] ebernhardson: no I applied a different role [20:06:32] yuvipanda: oh ok [20:06:34] which I should probably not do [20:06:39] ebernhardson: I'll clean that up :D [20:06:48] ebernhardson: it might also cause problems for trebuchet [20:07:07] ebernhardson: I also really hate that ES depends on trebuchet :D [20:07:13] ebernhardson: can we replace that with a git::clone?
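For reference, the group operator Krenair is being asked about is ownerin (the answer lands just below); a sketch of using it over Gerrit's ssh query interface, where <user> is a placeholder account name:

    # Open puppet changes not owned by anyone in the ops LDAP group:
    ssh -p 29418 <user>@gerrit.wikimedia.org gerrit query \
      'project:operations/puppet status:open -ownerin:ldap/ops' --format=JSON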
[20:07:27] is there some way we can use trebuchet, saltstack, puppet and ansible? [20:07:30] i think that would be better [20:07:34] :P [20:07:35] Krenair: what's the gerrit search operator to check groups? [20:07:36] ownerin? [20:07:40] ah ok [20:07:46] -ownerin:ldap/ops [20:07:46] ebernhardson: :P [20:07:55] will not list changes owned by people in ops [20:07:56] yuvipanda: as for git clone, i have no clue why it was done with trebuchet as it is [20:08:06] yuvipanda: maybe git-fat is the reason? [20:08:18] are these plugins really huge? [20:08:35] yuvipanda: not ridiculous, but the current checkout is 6.5M [20:08:43] ebernhardson: did you make a ticket with what we talked about? kafka topic + camus import? [20:08:43] (03PS2) 10Yuvipanda: typo s/Cassanda/Cassandra/ [puppet] - 10https://gerrit.wikimedia.org/r/240397 (owner: 10Eevans) [20:08:50] (03CR) 10Yuvipanda: [C: 032 V: 032] typo s/Cassanda/Cassandra/ [puppet] - 10https://gerrit.wikimedia.org/r/240397 (owner: 10Eevans) [20:09:00] yuvipanda: and pretty much every time we upgrade ES versions we replace all of them [20:09:20] ebernhardson: that's way smaller than MW which we just use git for :D [20:09:23] ottomata: doh no i didn't. also i found another log that might end up in kafka, pointed anomie your way [20:10:19] ottomata: sec doing it now [20:10:33] (03PS3) 10Yuvipanda: Kibana: Fix apache::site title [puppet] - 10https://gerrit.wikimedia.org/r/236727 (owner: 10BryanDavis) [20:10:42] (03CR) 10Yuvipanda: [C: 032 V: 032] Kibana: Fix apache::site title [puppet] - 10https://gerrit.wikimedia.org/r/236727 (owner: 10BryanDavis) [20:11:03] ebernhardson: oh yeah? whatcha mean another log? whatcha got? [20:11:52] ottomata: the api currently records a 150G log per day to fluorine that is a sampling of api requests [20:12:10] ottomata: this is obviously annoying to process, and a recent ask to add ~30% to its size with more info is being approached cautiously [20:12:25] ottomata: so i thought kafka might be a much better destination that can take the load? it's ~6k logs/sec [20:13:08] bblack: do all calls to Special:BannerLoader always hit PHP, rather than getting cached by Varnish, for logged-in users? The calls are background calls via $.ajax [20:13:33] (03PS1) 10Ottomata: Use $eventlogging_retention_days for eventlogging drop partition job [puppet] - 10https://gerrit.wikimedia.org/r/240566 [20:13:46] yes indeed ebernhardson! [20:13:53] (03PS2) 10Yuvipanda: Add missing space in README [software] - 10https://gerrit.wikimedia.org/r/205078 (owner: 10Nemo bis) [20:13:58] you gonna do that avro too? [20:14:01] (03CR) 10Yuvipanda: [C: 032 V: 032] Add missing space in README [software] - 10https://gerrit.wikimedia.org/r/205078 (owner: 10Nemo bis) [20:14:04] 6operations, 5Patch-For-Review: Backup and decom home_pmtpa - https://phabricator.wikimedia.org/T113265#1667721 (10Dzahn) checked if it's done, but doesn't look like it yet: Select the Client (1-51): 2 The defined FileSet resources are: 1: home 2: home_pmtpa Select FileSet resource (1-2): 2 No Full... [20:14:32] ottomata: i'm not sure, that will depend on the needs and how much this log changes over time.
I like avro for the strictness but anomie will certainly have opinions too :) Might just be json [20:15:04] aye [20:15:28] If you do JSON, it would be nice if you used a JSON schema [20:18:07] (03CR) 10Madhuvishy: [C: 031] Use $eventlogging_retention_days for eventlogging drop partition job [puppet] - 10https://gerrit.wikimedia.org/r/240566 (owner: 10Ottomata) [20:18:38] (03CR) 10Ottomata: [C: 032 V: 032] Use $eventlogging_retention_days for eventlogging drop partition job [puppet] - 10https://gerrit.wikimedia.org/r/240566 (owner: 10Ottomata) [20:18:57] (03PS5) 10Dzahn: Setup Gerrit role account for Phabricator actions [puppet] - 10https://gerrit.wikimedia.org/r/234332 (owner: 10Chad) [20:19:18] (03CR) 10Dzahn: [C: 032] "Gerrit cleanup day :)" [puppet] - 10https://gerrit.wikimedia.org/r/234332 (owner: 10Chad) [20:19:33] mutante: Did you grab the private bit of that? [20:19:38] ottomata: noted it on the ticket. thanks [20:19:40] greg-g, i need to deploy a few zero patches today, would it be ok to go after parsoid (in 40 min?) [20:19:49] there is an opening [20:19:53] i'll update the schedule [20:20:06] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1667748 (10MeganHernandez_WMF) Hey @Jgreen checking to see if the impression numbers should be sampled differentl... [20:20:28] yurik: what be the reason for a one off? [20:20:31] (03PS1) 10Hashar: contint: bring back libav-tools on slaves [puppet] - 10https://gerrit.wikimedia.org/r/240569 (https://phabricator.wikimedia.org/T113520) [20:20:55] ostriches: no, where is that described [20:21:04] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1002 for JUnikowski_WMF - https://phabricator.wikimedia.org/T113298#1667757 (10JUnikowski_WMF) For good measure, here is a fresh public key. Thanks all for your precious help! ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDRWU19N8QQcl7h... [20:21:38] greg-g, some changes due to an urgent need to get a zero partner on board, and correct a special request for them [20:21:49] * greg-g nods [20:22:04] mutante: Sorry, I had told chasemp about it originally. iridium:/home/demon/gerritbot, last line starting with api-* [20:22:43] yes I was going to move the private parts to secret() iirc [20:22:50] I failed you ostriches [20:23:10] wah wah :( [20:23:20] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1667761 (10Jgreen) >>! In T97676#1667748, @MeganHernandez_WMF wrote: > Hey @Jgreen checking to see if the impress... [20:23:48] chasemp: secrets does not have a phabricator directory yet? [20:24:11] not sure [20:24:17] is there a ticket for this? [20:24:36] the phabricator actions [20:24:38] (03CR) 10Hashar: [C: 04-1] "That class is not included from anywhere apparently :-(((((" [puppet] - 10https://gerrit.wikimedia.org/r/240569 (https://phabricator.wikimedia.org/T113520) (owner: 10Hashar) [20:24:53] sorry guys, but from that commit message i can't know any of this [20:25:16] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1667768 (10ellery) @jgreen do the banner impression numbers in the table pgheres.bannerimpressions correctly ref... 
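A quick sanity check of the api.log numbers quoted above, plus a sketch of test-producing one JSON event; the broker and topic names here are invented for illustration:

    # ~150 GB/day at ~6k messages/sec is roughly 290 bytes per message:
    echo $(( 150 * 1000**3 / 86400 / 6000 ))    # => 289
    # Shipping one JSON-encoded sample event with kafkacat:
    echo '{"ts":"2015-09-23T20:12:25Z","wiki":"enwiki","action":"query"}' \
      | kafkacat -P -b kafka1012.eqiad.wmnet:9092 -t mediawiki.api-request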
[20:25:24] starting parsoid deploy now [20:25:34] mutante: It's a WIP for a bot that will handle importing of repos from Gerrit to Phabricator. [20:25:44] Meh, WIP, perhaps. The bot part is WIP. [20:25:55] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1667770 (10Eevans) All 6 codfw nodes are now joined, and everything looks good. ``` Datacenter: eqiad ================= Status=Up/Down |/ State=Normal/Lea... [20:25:56] I just wanted to land the auth bits since it was trivial to do so. [20:26:18] (03CR) 10coren: [C: 031] "It's an interesting approach, and not an insane one. +2 for the concept, but with the caveat that while the Ruby seems coherent I don't r" [puppet] - 10https://gerrit.wikimedia.org/r/230376 (https://phabricator.wikimedia.org/T107821) (owner: 10Tim Landscheidt) [20:26:37] mutante: I meant I was going to stub out the WIP for storing them on palladium [20:26:49] I didn't realize you were in the middle of things there [20:27:21] (03PS6) 10coren: Tools: Only forward mail for project users [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526) (owner: 10Tim Landscheidt) [20:27:49] chasemp: it's not so much being in the middle of something, it's just trying to help with gerrit cleanup day [20:29:36] (03PS2) 10Hashar: contint: bring back libav-tools on slaves [puppet] - 10https://gerrit.wikimedia.org/r/240569 (https://phabricator.wikimedia.org/T113520) [20:29:45] ebernhardson: when do you think that patch will be merged? [20:29:51] or, when would you like it to be merged? [20:32:35] (03PS2) 10Yuvipanda: ssh: Disable root logins on prod [puppet] - 10https://gerrit.wikimedia.org/r/160628 (owner: 10Matanya) [20:32:53] (03CR) 10Hashar: [V: 032] "It is in the huge mess of contint::packages::labs . I wanted to add it to contint::browsertests but it is not included from anywhere." [puppet] - 10https://gerrit.wikimedia.org/r/240569 (https://phabricator.wikimedia.org/T113520) (owner: 10Hashar) [20:32:56] (03CR) 10Yuvipanda: "Updated to affect only prod." [puppet] - 10https://gerrit.wikimedia.org/r/160628 (owner: 10Matanya) [20:33:53] halfak: About? [20:34:04] ? [20:34:10] Coren: do you not feel comfortable merging the patches? [20:34:16] instead of just +1ing them [20:34:43] yuvipanda: Depends on which. The one with the ruby, not without more testing for sure. [20:34:45] Reedy: what's up? [20:34:47] halfak: I need to extract an old xml dump, need 3.5-4TB space... Wondering if you had the room on the stats host I could use temporarily? [20:34:47] (03PS1) 10Alex Monk: Raise account creation limit for WikiUNAM editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240575 (https://phabricator.wikimedia.org/T113519) [20:34:57] Oh! Sure. [20:34:59] * halfak thinks. [20:35:07] So stat1003 might be good. [20:35:14] Any particular reason you need to extract the whole thing? [20:35:19] Coren: ok. can you work with Tim to get them merged? don't want to have them linger forever. [20:35:25] U was aviyt ti nerge ibe [20:35:27] halfak: there seems to be no zgrep for 7z files :( [20:35:35] I was about to merge one* [20:35:43] Oh! There's a 7z utility that you can grep into. [20:35:48] orly? [20:35:50] (03CR) 10Zfilipin: [C: 031] contint: bring back libav-tools on slaves [puppet] - 10https://gerrit.wikimedia.org/r/240569 (https://phabricator.wikimedia.org/T113520) (owner: 10Hashar) [20:35:51] Do you need the contents of a particular page? 
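The truncated paste above ("Datacenter: eqiad ... Status=Up/Down |/ State=Normal/Lea...") is the standard nodetool status header; the multi-DC check Eevans describes boils down to:

    # Every node in the eqiad and codfw sections should report state "UN"
    # (Up/Normal) once the codfw nodes have joined:
    nodetool status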
[20:35:56] (03CR) 10coren: [C: 032] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526) (owner: 10Tim Landscheidt) [20:36:09] halfak: precisely. https://phabricator.wikimedia.org/T26675 [20:36:18] If I could get the whole xml block for that revision... [20:36:19] (03CR) 10Dzahn: [C: 032] Setup Gerrit role account for Phabricator actions [puppet] - 10https://gerrit.wikimedia.org/r/234332 (owner: 10Chad) [20:36:27] Oh cool. One of those old bugs. [20:36:28] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1667800 (10mobrovac) >>! In T108613#1667770, @Eevans wrote: > All 6 codfw nodes are now joined, and everything looks good. Yupiii [20:36:38] ostriches: i added the token to private [20:36:41] So. Yeah. If you can get me the datafile, I can get you that revision. [20:36:51] Or tell me where to find the datafile. [20:36:56] :) [20:37:36] Or... I can show you how to do these things. :) [20:37:38] halfak, I wrote it at the bottom of that thread [20:37:42] halfak: snapshot1001:/mnt/data/xmldatadumps/public/archive/enwiki/20100312/enwiki-20100312-pages-meta-history.xml.7z [20:37:44] ticket* [20:37:48] (03PS7) 10coren: Tools: Only forward mail for project users [puppet] - 10https://gerrit.wikimedia.org/r/203667 (https://phabricator.wikimedia.org/T93526) (owner: 10Tim Landscheidt) [20:38:11] cool. [20:38:16] (03PS3) 10Dzahn: contint: bring back libav-tools on slaves [puppet] - 10https://gerrit.wikimedia.org/r/240569 (https://phabricator.wikimedia.org/T113520) (owner: 10Hashar) [20:38:19] Thanks for helping with this Reedy [20:38:19] Coren: ok thanks [20:38:27] Krenair: np [20:38:42] Well... that's going to take a while to grep through. [20:38:45] (03CR) 10Dzahn: [C: 032] contint: bring back libav-tools on slaves [puppet] - 10https://gerrit.wikimedia.org/r/240569 (https://phabricator.wikimedia.org/T113520) (owner: 10Hashar) [20:38:50] Maybe a day or two since it's just one huge file. [20:38:55] If it was split up. [20:38:58] halfak: there's no rush [20:39:02] It's been open 5 years :P [20:39:07] kk. Will get back to you. [20:39:08] Yeah :D [20:39:09] Thanks halfak [20:39:15] Can we mark this as assigned to you? [20:39:17] halfak: perfect thanks [20:39:20] n/p. Thanks for picking up the bug. [20:39:22] Sure. [20:39:22] yuvipanda: cheers for the hookup ;) [20:39:37] mutante: Thx [20:39:49] (03PS4) 10Dzahn: contint: bring back libav-tools on slaves [puppet] - 10https://gerrit.wikimedia.org/r/240569 (https://phabricator.wikimedia.org/T113520) (owner: 10Hashar) [20:40:34] (03PS1) 10Eevans: WIP: configure RESTBase for codfw datacenter [puppet] - 10https://gerrit.wikimedia.org/r/240578 (https://phabricator.wikimedia.org/T108613) [20:40:40] Reedy, do I need to get all of the revision metadata or just the text? [20:40:46] just the text [20:40:49] kk [20:40:53] (03CR) 10Eevans: [C: 04-1] "Not yet." [puppet] - 10https://gerrit.wikimedia.org/r/240578 (https://phabricator.wikimedia.org/T108613) (owner: 10Eevans) [20:40:54] Reedy, there is this: enwiki-20100312-pages-meta-history.xml.bz2 - I wonder if that can go through zgrep actually? [20:41:11] It's 179G though, compressed [20:41:11] Hmm [20:41:18] Maybe, maybe not..
[20:41:27] I think zgrep will work [20:41:56] somebody modified zgrep to be "bgrep" for this [20:42:03] http://www.bzip.org/bzip2-howto/with-grep.html [20:42:05] 6operations, 10Wikimedia-General-or-Unknown, 7Database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1667830 (10Reedy) [20:42:08] mutante: where's 7grep [20:42:11] :D [20:42:18] :) [20:42:27] 7z e -so .7z | grep [20:42:29] 6operations, 10Wikimedia-General-or-Unknown, 7Database: Revision 186704908 on en.wikipedia.org, Fatal exception: unknown "cluster16" - https://phabricator.wikimedia.org/T26675#1667831 (10Krenair) a:5ArielGlenn>3Halfak @halfak has agreed to help find this in the dump. We don't think zgrep will work with 7z. [20:42:36] But it's XML, so you want more than just the line that matches. [20:42:42] yeah [20:43:15] We actually have no rev_sha1 for this revision [20:43:27] rev_len=78946 though [20:43:43] (03PS4) 10coren: Tools: Puppetize updatetools [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [20:43:59] mutante: thank you for the patch merge :-} [20:44:30] (03PS3) 10Yuvipanda: ssh: Disable root logins on prod [puppet] - 10https://gerrit.wikimedia.org/r/160628 (owner: 10Matanya) [20:45:05] halfak, you're taking another copy of that file to a stats host right? [20:45:08] so I can delete this from tin? [20:45:12] hashar: de rien [20:45:34] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1667863 (10Jgreen) >>! In T97676#1667768, @ellery wrote: > @jgreen do the banner impression numbers in the table... [20:45:49] Krenair, the data file exists on a mount I have available on stat1003. [20:45:57] So probably? [20:46:07] !log deployed parsoid 6619409e [20:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:46:28] Reedy, I know you were looking at the file on tin as well - any objections if I delete? [20:47:01] feel free [20:47:45] done [20:49:08] (03CR) 10Dzahn: [C: 031] Tools: Replace reference to tools. in toolschecker.upstart [puppet] - 10https://gerrit.wikimedia.org/r/239762 (https://phabricator.wikimedia.org/T87387) (owner: 10Tim Landscheidt) [20:49:16] (03PS2) 10Dzahn: Tools: Replace reference to tools. in toolschecker.upstart [puppet] - 10https://gerrit.wikimedia.org/r/239762 (https://phabricator.wikimedia.org/T87387) (owner: 10Tim Landscheidt) [20:50:58] (03CR) 10Dzahn: [C: 031] "@Yuvipanda after reading the newer comments i think Negative24 is right. That links says:" [puppet] - 10https://gerrit.wikimedia.org/r/226573 (owner: 10Negative24) [20:51:00] (03CR) 10Yuvipanda: "(Puppet-patch-triaging) Is this ready to go? I see a lot of +1s, across Ops, Ori and RelEng. Looks ok to me, but there might be outstandin" [puppet] - 10https://gerrit.wikimedia.org/r/232843 (https://phabricator.wikimedia.org/T109862) (owner: 10Thcipriani) [20:52:54] 6operations, 10Beta-Cluster, 10RESTBase, 6Services: Firewall rules too restrictive on deployment-restbase0x.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T113528#1667909 (10greg) [20:54:37] twentyafterfour: chasemp: is https://gerrit.wikimedia.org/r/#/c/226573/ useful to merge? [20:54:58] not fully baked since other changes I think [20:56:27] chasemp: think you've time to comment / -1 / +1? 
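Picking up the dump-grepping exchange above: a sketch of streaming the compressed history dump through grep. The -B/-A context counts are guesses, sized so the ~79 KB of revision text (rev_len=78946) fits in the window; -m1 stops after the first hit:

    7z e -so enwiki-20100312-pages-meta-history.xml.7z \
      | grep -m1 -B 5 -A 3000 '<id>186704908</id>'
    # same idea against the .bz2 copy:
    bzcat enwiki-20100312-pages-meta-history.xml.bz2 \
      | grep -m1 -B 5 -A 3000 '<id>186704908</id>'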
[20:58:40] ostriches is setting this up I think so he would be a better judge, not to pass the buck :) [20:58:45] I'll make a note and include the right ppl [20:59:57] (03CR) 10Rush: "Chad and Mukunda are doing this in parallel so I'm adding them to see if this conflicts with their plans." [puppet] - 10https://gerrit.wikimedia.org/r/226573 (owner: 10Negative24) [21:03:53] chasemp: ok! [21:04:13] (03PS6) 10Dzahn: add IPv6 for ytterbium (gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/214437 (https://phabricator.wikimedia.org/T37540) [21:05:09] (03PS1) 10BBlack: improved client.ip/XFP/XRIP in common VCL [puppet] - 10https://gerrit.wikimedia.org/r/240582 [21:05:27] (03PS7) 10Dzahn: add IPv6 for ytterbium (gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/214437 (https://phabricator.wikimedia.org/T37540) [21:06:30] (03CR) 10Paladox: [C: 031] add IPv6 for ytterbium (gerrit) [puppet] - 10https://gerrit.wikimedia.org/r/214437 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [21:06:40] (03CR) 10Dzahn: [C: 032] "note how this is just ytterbium, the server, not gerrit.wm itself which already has v6" [puppet] - 10https://gerrit.wikimedia.org/r/214437 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [21:08:33] (03PS1) 10Rush: Add git-ssh.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/240588 [21:09:06] (03CR) 10Dzahn: "up ip addr add 2620:0:861:3:208:80:154:81/128 dev eth0" [puppet] - 10https://gerrit.wikimedia.org/r/214437 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [21:10:39] mutante: I'm only looking at patches by non-opsen, btw :) [21:12:16] yuvipanda: i hope you don't mind i reduce my own queue too [21:12:28] (03PS2) 10Yuvipanda: haproxy: Move check_haproxy to module itself [puppet] - 10https://gerrit.wikimedia.org/r/228712 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:12:48] mutante: not at all! I'm just specifically targetting non-opsen with the hope that opsen target themselves :D [21:13:06] (03CR) 10Yuvipanda: [C: 032 V: 032] haproxy: Move check_haproxy to module itself [puppet] - 10https://gerrit.wikimedia.org/r/228712 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [21:13:09] !log ori@tin Synchronized php-1.26wmf24/includes/resourceloader/ResourceLoaderFileModule.php: 58bfb6f85b: Backport fix from PS9 of I1ff6115 (duration: 00m 17s) [21:13:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:14:46] yuvipanda: yes, i claim i do the same on non-special days, if it's by somebody who can't +2 themselves i'll look more than for other ops patches [21:16:27] (03PS3) 10Dzahn: add AAAA record for ytterbium (gerrit) [dns] - 10https://gerrit.wikimedia.org/r/214507 (https://phabricator.wikimedia.org/T37540) [21:17:12] mutante: :) ok [21:17:19] (03PS4) 10Dzahn: add AAAA record for ytterbium (gerrit) [dns] - 10https://gerrit.wikimedia.org/r/214507 (https://phabricator.wikimedia.org/T37540) [21:17:31] gwicke: any update on https://gerrit.wikimedia.org/r/#/c/204964/? [21:17:48] (03CR) 10Yuvipanda: "*poke*?" 
[dumps/html/deploy] - 10https://gerrit.wikimedia.org/r/204964 (https://phabricator.wikimedia.org/T94457) (owner: 10GWicke) [21:17:59] (03CR) 10Dzahn: [C: 032] add AAAA record for ytterbium (gerrit) [dns] - 10https://gerrit.wikimedia.org/r/214507 (https://phabricator.wikimedia.org/T37540) (owner: 10Dzahn) [21:20:10] bblack: yeah, reviewing [21:20:30] (03PS5) 10coren: Tools: Puppetize updatetools [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [21:20:53] The problem with gerrit day is how often we end up having to rebase. :-) [21:20:57] (03PS2) 10Rush: Add git-ssh.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/240588 [21:22:04] (03CR) 10coren: [C: 032] Tools: Puppetize updatetools [puppet] - 10https://gerrit.wikimedia.org/r/203808 (https://phabricator.wikimedia.org/T94858) (owner: 10Tim Landscheidt) [21:22:13] (03CR) 10Dzahn: "this has been uploaded over a year ago now, do you still want this?" [puppet] - 10https://gerrit.wikimedia.org/r/159167 (owner: 10RobH) [21:23:31] pedal faster, Jenkins. [21:23:57] 10Ops-Access-Requests, 6operations: Requesting access to stat1002 for Neil P. Quinn - https://phabricator.wikimedia.org/T113533#1668059 (10Neil_P._Quinn_WMF) 3NEW [21:25:27] (03CR) 10Dzahn: [C: 04-1] "that IP address looks like it's in the wrong network:" [dns] - 10https://gerrit.wikimedia.org/r/240588 (owner: 10Rush) [21:27:17] (03PS3) 10Rush: Add git-ssh.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/240588 [21:29:12] (03CR) 10Dzahn: [C: 031] "yes, gerrit is .81 and .82 is free" [dns] - 10https://gerrit.wikimedia.org/r/240588 (owner: 10Rush) [21:33:37] (03CR) 10Ori.livneh: improved client.ip/XFP/XRIP in common VCL (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/240582 (owner: 10BBlack) [21:34:19] (03PS4) 10Rush: Add git-ssh.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/240588 [21:34:42] (03CR) 10Rush: [C: 032] Add git-ssh.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/240588 (owner: 10Rush) [21:35:07] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1668100 (10chasemp) [21:36:41] (03PS1) 10Rush: phab: add realserver IP [puppet] - 10https://gerrit.wikimedia.org/r/240592 [21:36:57] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1668113 (10greg) a:3chasemp [21:38:03] (03PS2) 10Rush: phab: add realserver IP [puppet] - 10https://gerrit.wikimedia.org/r/240592 [21:41:35] (03PS3) 10Rush: phab: add realserver IP [puppet] - 10https://gerrit.wikimedia.org/r/240592 [21:43:40] (03CR) 10Dzahn: [C: 04-1] "what Alex said, the list is already outdated now" [puppet] - 10https://gerrit.wikimedia.org/r/207454 (https://phabricator.wikimedia.org/T97530) (owner: 10Mobrovac) [21:46:19] (03CR) 10Dzahn: "there seems to be no consensus here on how to proceed" [puppet] - 10https://gerrit.wikimedia.org/r/174896 (owner: 10Hoo man) [21:47:33] (03CR) 10Dzahn: [C: 031] icinga: unify swift alerts [puppet] - 10https://gerrit.wikimedia.org/r/209217 (https://phabricator.wikimedia.org/T88974) (owner: 10Filippo Giunchedi) [21:47:56] (03PS4) 10Dzahn: Create tc class analogous to ferm for traffic control [puppet] - 10https://gerrit.wikimedia.org/r/209558 (owner: 10coren) [21:49:39] (03CR) 10Dzahn: "the linked bug is called "Identify survey services compatible with our privacy policy".
Does that mean it's not sure we are using limesurv" [puppet] - 10https://gerrit.wikimedia.org/r/213579 (https://phabricator.wikimedia.org/T94807) (owner: 10Nemo bis) [21:50:49] (03PS3) 10coren: Tools: Replace references to tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/235941 (https://phabricator.wikimedia.org/T87387) (owner: 10Tim Landscheidt) [21:51:51] (03CR) 10Dzahn: "the DNS name would have to exist first." [puppet] - 10https://gerrit.wikimedia.org/r/213579 (https://phabricator.wikimedia.org/T94807) (owner: 10Nemo bis) [21:53:45] (03CR) 10Dzahn: [C: 04-1] [ssh, WIP] allow login from tools-login [puppet] - 10https://gerrit.wikimedia.org/r/220214 (https://phabricator.wikimedia.org/T103552) (owner: 10Merlijn van Deen) [21:54:56] (03CR) 10Dzahn: [C: 04-1] "needs manual rebase and has been a while. the linked bug is resolved. so i assume it's outdated" [puppet] - 10https://gerrit.wikimedia.org/r/217358 (https://phabricator.wikimedia.org/T99701) (owner: 10RobH) [21:55:43] (03PS2) 10Dzahn: Add openstack-pkg-tools to default packages [puppet] - 10https://gerrit.wikimedia.org/r/223033 (owner: 10Muehlenhoff) [21:56:35] (03PS3) 10Dzahn: packagebuilder: add openstack-pkg-tools package [puppet] - 10https://gerrit.wikimedia.org/r/223033 (owner: 10Muehlenhoff) [21:57:05] (03CR) 10Dzahn: [C: 032] "makes sense on the package builder host: Description-en: Tools and scripts for building Openstack packages in Debian" [puppet] - 10https://gerrit.wikimedia.org/r/223033 (owner: 10Muehlenhoff) [21:59:56] (03PS4) 10Rush: phab: add realserver IP [puppet] - 10https://gerrit.wikimedia.org/r/240592 [22:00:37] (03CR) 10Dzahn: [C: 04-1] "per inline comments from 20after4" [puppet] - 10https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: 10Ricordisamoa) [22:01:12] (03CR) 10Rush: [C: 032] phab: add realserver IP [puppet] - 10https://gerrit.wikimedia.org/r/240592 (owner: 10Rush) [22:01:49] 7Blocked-on-Operations, 6operations, 6Phabricator, 6Release-Engineering-Team, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1668227 (10chasemp) [22:02:31] (03PS2) 10Dzahn: Ignore warnings about URLs without modules for volatile directory [puppet] - 10https://gerrit.wikimedia.org/r/228682 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [22:03:30] (03CR) 10Dzahn: [C: 032] Ignore warnings about URLs without modules for volatile directory [puppet] - 10https://gerrit.wikimedia.org/r/228682 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [22:04:38] (03CR) 10Dzahn: "@ori new comments after Tim's reply?" 
[puppet] - 10https://gerrit.wikimedia.org/r/226463 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [22:06:49] (03CR) 10Ori.livneh: [C: 031] "yeah, that's fine" [puppet] - 10https://gerrit.wikimedia.org/r/226463 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [22:07:17] (03CR) 10Dzahn: "it's still included on bast1001 but i believe you remove that in the other patch that also removes the pmtpa home" [puppet] - 10https://gerrit.wikimedia.org/r/239126 (owner: 10Faidon Liambotis) [22:09:40] (03PS3) 10Dzahn: statsdlb: Fix strict puppet-lint check [puppet] - 10https://gerrit.wikimedia.org/r/226463 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [22:09:57] (03CR) 10Dzahn: [C: 032] statsdlb: Fix strict puppet-lint check [puppet] - 10https://gerrit.wikimedia.org/r/226463 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [22:10:28] (03PS5) 10Ori.livneh: Puppet compiler for Tim's redirects.dat DSL [puppet] - 10https://gerrit.wikimedia.org/r/138292 [22:11:16] (03Abandoned) 10Mobrovac: service::node: Add the list of domains for which not to use the proxy [puppet] - 10https://gerrit.wikimedia.org/r/207454 (https://phabricator.wikimedia.org/T97530) (owner: 10Mobrovac) [22:12:13] (03CR) 10Dzahn: "checked on graphite1001 - noop" [puppet] - 10https://gerrit.wikimedia.org/r/226463 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [22:12:19] (03PS1) 10Rush: LVS: git-ssh service for phab backend [puppet] - 10https://gerrit.wikimedia.org/r/240600 [22:12:48] (03PS2) 10Rush: LVS: git-ssh service for phab backend [puppet] - 10https://gerrit.wikimedia.org/r/240600 [22:14:27] (03PS2) 10Dzahn: labs_lvm: Require parted explicitly [puppet] - 10https://gerrit.wikimedia.org/r/240271 (https://phabricator.wikimedia.org/T112641) (owner: 10Tim Landscheidt) [22:15:36] (03CR) 10Dzahn: "user needs to be created first, doesn't exist yet" [puppet] - 10https://gerrit.wikimedia.org/r/240369 (https://phabricator.wikimedia.org/T113118) (owner: 10coren) [22:15:59] (03CR) 10Dzahn: [C: 031] "after https://gerrit.wikimedia.org/r/#/c/231142/3" [puppet] - 10https://gerrit.wikimedia.org/r/239126 (owner: 10Faidon Liambotis) [22:16:41] mutante: they are already stacked up after each other [22:17:08] (03CR) 10Dzahn: [C: 032] labs_lvm: Require parted explicitly [puppet] - 10https://gerrit.wikimedia.org/r/240271 (https://phabricator.wikimedia.org/T112641) (owner: 10Tim Landscheidt) [22:17:37] paravoid: ah:) so i checked if the backup was done earlier, but it wasn't yet [22:17:49] and that also made me wait to add more backup for sodium [22:18:07] (03PS4) 10coren: Tools: Replace references to tools.wmflabs.org [puppet] - 10https://gerrit.wikimedia.org/r/235941 (https://phabricator.wikimedia.org/T87387) (owner: 10Tim Landscheidt) [22:19:27] (03CR) 10coren: [C: 032] "Makes sense." 
[puppet] - 10https://gerrit.wikimedia.org/r/235941 (https://phabricator.wikimedia.org/T87387) (owner: 10Tim Landscheidt) [22:20:40] (03Abandoned) 10RobH: labnet1002 install params [puppet] - 10https://gerrit.wikimedia.org/r/217358 (https://phabricator.wikimedia.org/T99701) (owner: 10RobH) [22:22:03] (03PS3) 10Ori.livneh: Introduce apache::static_site [puppet] - 10https://gerrit.wikimedia.org/r/207338 [22:27:24] (03PS5) 10TTO: Allow import from any Labs/Beta Cluster project to any other [mediawiki-config] - 10https://gerrit.wikimedia.org/r/157338 (https://phabricator.wikimedia.org/T17583) [22:33:40] deploying ZeroBanner 24 [22:35:15] !log yurik@tin Synchronized php-1.26wmf24/extensions/ZeroBanner: Deploying ZeroBanner interstitial handling (duration: 00m 18s) [22:35:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:36:27] 6operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1668392 (10DStrine) [22:37:01] (03PS4) 10Ori.livneh: Introduce apache::static_site [puppet] - 10https://gerrit.wikimedia.org/r/207338 [22:37:17] (03Abandoned) 10Ori.livneh: Add Pyglet, a Trebuchet-deployed syntax-highlighting micro-service(!) [puppet] - 10https://gerrit.wikimedia.org/r/220641 (owner: 10Ori.livneh) [22:38:10] (03Abandoned) 10Ori.livneh: Remove obsolete VCL code for setting X-Analytics: https=1 [puppet] - 10https://gerrit.wikimedia.org/r/225280 (owner: 10Ori.livneh) [22:39:02] !log cr1-codfw RE switchover(s) [22:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:39:56] (03CR) 10Ori.livneh: [C: 032] Introduce apache::static_site [puppet] - 10https://gerrit.wikimedia.org/r/207338 (owner: 10Ori.livneh) [22:41:11] PROBLEM - Host cr1-codfw is DOWN: PING CRITICAL - Packet loss = 100% [22:42:06] has anyone seen the 40352 Unknown modifier '\': [([^\s,]+)\s*=\s*([^\s,]+)[\+\-]] in fatalmonitor? CC: greg-g [22:42:30] 6operations, 6Labs, 10wikitech.wikimedia.org: intermittent nutcracker failures - https://phabricator.wikimedia.org/T105131#1668438 (10chasemp) 5Open>3Resolved a:3chasemp It's been 3 weeks now without a repeat, I'm going to resolve this but will be the first to reopen if we see it again :) [22:43:30] RECOVERY - Host cr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.34 ms [22:50:17] switchover again [22:51:28] yurik: yes, there's a task for it [22:51:34] yurik: if you know anything about it, help appreciated :) [22:51:49] greg-g, i tried grepping tin for it, still searching. 
No idea who's doing that [22:52:01] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [22:52:02] yurik: ah, nvm, joe figured it out: https://phabricator.wikimedia.org/T112922 [22:53:10] PROBLEM - test icmp reachability to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 51 probes of 382 (alerts on 19) [22:53:10] PROBLEM - Host cr1-codfw is DOWN: CRITICAL - Network Unreachable (208.80.153.192) [22:53:13] ah, awesome, thx [22:53:25] * yurik was about to file another ticket [22:53:58] !log ori@tin Synchronized php-1.26wmf24/includes/resourceloader/ResourceLoaderFileModule.php: 14f46330d9: Backport fix from PS12 of I1ff6115 (duration: 00m 17s) [22:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:54:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235, 34ms) {#2648} [10Gbps DWDM]BR [22:54:40] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 112, down: 2, dormant: 0, excluded: 0, unused: 0BRxe-5/2/0: down - cr1-codfw:xe-5/2/0 {#10695} [10Gbps DF]BRae0: down - Core: cr1-codfw:ae0BR [22:56:30] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 116, down: 0, dormant: 0, excluded: 0, unused: 0 [22:56:30] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 [22:56:40] RECOVERY - Host cr1-codfw is UP: PING OK - Packet loss = 0%, RTA = 36.37 ms [22:56:59] yurik: thanks for trying to debug :) [22:57:29] greg-g, more like raising a stink about it ;) [22:59:02] potato, potahto [22:59:07] :) [23:00:00] PROBLEM - puppet last run on mw2018 is CRITICAL: CRITICAL: puppet fail [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150923T2300). [23:00:04] andrewbogott Krenair irc-nickname: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:00:09] ohi [23:00:30] 6operations, 10ops-ulsfo: Properly patch Telia @ ulsfo - https://phabricator.wikimedia.org/T112152#1668501 (10RobH) 5Open>3Resolved This was resolved via ul ticket 115156. Both the port description and the dc xconnect tracking gsheet have been updated accordingly. [23:01:00] PROBLEM - puppet last run on wtp2017 is CRITICAL: CRITICAL: Puppet has 2 failures [23:02:12] Who is irc-nickname [23:02:17] oh, hah [23:02:24] greg-g, now I know why that example wasn't using the template [23:02:54] Krenair: yup :) [23:03:06] 6operations, 10ops-ulsfo: Move NTT @ ulsfo to a different cross-connect - https://phabricator.wikimedia.org/T112154#1668511 (10RobH) I had to wait for T112152 resolution; now complete. We didn't want to mess with two of our transit connections at the same time in the same location. I've updated Kevin @ NTT v... [23:03:17] Krenair: unless we hack jouncebot to ignore it, which seems like more trouble than it is worth, but if someone's bored some night.... 
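[Context for the fatalmonitor errors yurik flagged above: PHP's PCRE raises "Unknown modifier" warnings when a pattern's delimiter character recurs unescaped inside the pattern, so the tail of the pattern gets parsed as modifier flags. A minimal sketch of that general failure mode follows; this is an illustration only, not the actual offending code, whose root cause is the one tracked in T112922.]

    <?php
    // PHP treats the pattern's first character as its delimiter, so an
    // unescaped second occurrence ends the pattern early and whatever
    // follows is read as modifier flags.
    @preg_match( '/etc/passwd/', 'x' );
    // Warning: preg_match(): Unknown modifier 'p'

    // A delimiter that never occurs inside the pattern avoids this:
    var_dump( preg_match( '~/etc/passwd~', '/etc/passwd' ) ); // int(1)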
[23:03:51] we could hack the template to not mark it for the bot to pick up [23:04:00] ah [23:04:00] RECOVERY - test icmp reachability to codfw on ripe-atlas-codfw is OK: OK - failed 3 probes of 382 (alerts on 19) [23:04:44] (03CR) 10Alex Monk: [C: 04-1] Install Extension:Translate on labswiki (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214893 (https://phabricator.wikimedia.org/T100313) (owner: 10Ladsgroup) [23:05:41] ostriches, I think we should try to fix getRealmSpecificFilename instead of doing https://gerrit.wikimedia.org/r/#/c/240378/2 actually [23:05:58] (03PS1) 10Dzahn: annualreport: puppetize git cloning [puppet] - 10https://gerrit.wikimedia.org/r/240606 [23:06:00] Krenair: Agree [23:06:07] (03Abandoned) 10Chad: Use /srv/mediawiki directly instead of $IP/../ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240378 (https://phabricator.wikimedia.org/T112006) (owner: 10Chad) [23:07:43] (03CR) 10Alex Monk: [C: 032] Raise account creation limit for WikiUNAM editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240575 (https://phabricator.wikimedia.org/T113519) (owner: 10Alex Monk) [23:08:13] (03Merged) 10jenkins-bot: Raise account creation limit for WikiUNAM editathon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240575 (https://phabricator.wikimedia.org/T113519) (owner: 10Alex Monk) [23:08:23] * Krenair looks at jenkins suspiciously [23:09:16] !log krenair@tin Synchronized wmf-config/throttle.php: https://gerrit.wikimedia.org/r/#/c/240575/ (duration: 00m 17s) [23:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:10:35] (03PS1) 10Dzahn: dbtree: ensure that git clone is latest [puppet] - 10https://gerrit.wikimedia.org/r/240608 [23:17:08] !log krenair@tin Synchronized php-1.26wmf24/includes/specials/SpecialSearch.php: https://gerrit.wikimedia.org/r/#/c/240596/ (duration: 00m 18s) [23:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:17:17] (03PS2) 10Dzahn: annualreport: puppetize git cloning [puppet] - 10https://gerrit.wikimedia.org/r/240606 [23:17:30] 6operations, 10MediaWiki-extensions-ZeroPortal, 10Traffic, 6Zero: zerofetcher in production is getting throttled for API logins - https://phabricator.wikimedia.org/T111045#1668628 (10BBlack) @Krenair is that just single IPs, or can we add networks to it like the wgSquidServers type of lists use? [23:23:51] 6operations, 10MediaWiki-extensions-ZeroPortal, 10Traffic, 6Zero: zerofetcher in production is getting throttled for API logins - https://phabricator.wikimedia.org/T111045#1668661 (10Krenair) Just single IPs. Maybe there's a better way somewhere... [23:31:41] ostriches, looks like we can have pathinfo( $filename ) do the work [23:35:16] 6operations: Change Google Webmaster password for noc@ - https://phabricator.wikimedia.org/T110951#1668743 (10Dzahn) It has been changed. I got it from James and updated the existing password file that ops uses. [23:35:40] 6operations: Change Google Webmaster password for noc@ - https://phabricator.wikimedia.org/T110951#1668747 (10Dzahn) 5Open>3Resolved [23:36:39] ostriches, take a look at https://phabricator.wikimedia.org/P2082 [23:37:05] we can default $ext to '', or $pathinfo['extension'] if it's set [23:37:14] and set $base to $pathinfo['dirname'] . DIRECTORY_SEPARATOR . 
$pathinfo['filename'] [23:37:45] * Krenair is going to test this on beta [23:39:58] (03PS1) 10Alex Monk: MWRealm::getRealmSpecificFilename: Fix support for filenames without an extension but with full stops in the full path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240612 (https://phabricator.wikimedia.org/T112006) [23:40:01] (03PS3) 10Yuvipanda: Tools: Replace reference to tools. in toolschecker.upstart [puppet] - 10https://gerrit.wikimedia.org/r/239762 (https://phabricator.wikimedia.org/T87387) (owner: 10Tim Landscheidt) [23:40:05] (03CR) 10jenkins-bot: [V: 04-1] MWRealm::getRealmSpecificFilename: Fix support for filenames without an extension but with full stops in the full path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240612 (https://phabricator.wikimedia.org/T112006) (owner: 10Alex Monk) [23:40:08] (03CR) 10Yuvipanda: [C: 032 V: 032] Tools: Replace reference to tools. in toolschecker.upstart [puppet] - 10https://gerrit.wikimedia.org/r/239762 (https://phabricator.wikimedia.org/T87387) (owner: 10Tim Landscheidt) [23:40:45] (03PS2) 10Alex Monk: Multiversion MWRealm getRealmSpecificFilename: Fix support for filenames without an extension but with full stops in the full path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240612 (https://phabricator.wikimedia.org/T112006) [23:40:51] (03CR) 10jenkins-bot: [V: 04-1] Multiversion MWRealm getRealmSpecificFilename: Fix support for filenames without an extension but with full stops in the full path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240612 (https://phabricator.wikimedia.org/T112006) (owner: 10Alex Monk) [23:41:11] oh dear [23:45:32] derp [23:47:08] (03PS3) 10Alex Monk: MWRealm::getRealmSpecificFilename: Fix support for filenames without an extension but with full stops in the full path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240612 (https://phabricator.wikimedia.org/T112006) [23:48:16] (03PS4) 10Alex Monk: Multiversion MWRealm getRealmSpecificFilename: Fix support for filenames without an extension but with full stops in the full path [mediawiki-config] - 10https://gerrit.wikimedia.org/r/240612 (https://phabricator.wikimedia.org/T112006) [23:48:21] ostriches, ^ [23:52:07] Some test cases with relative paths being resolved would make me sleep a little better. [23:55:13] Krenair: thanks for your review, and sorry I wasn’t here… we’ll have another go with that patch in a day or two. [23:55:39] RECOVERY - puppet last run on wtp2017 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [23:56:12] Krenair: what's the 'group in' again?! [23:56:14] * yuvipanda keeps forgetting [23:56:19] and the url was too long to keep around [23:56:22] yuvipanda, ownerin [23:56:28] ah yes [23:57:38] Krenair: do you have someone to help review and push https://gerrit.wikimedia.org/r/#/c/236500/ through? [23:58:20] RECOVERY - puppet last run on mw2018 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [23:58:50] gwicke: I suppose https://gerrit.wikimedia.org/r/#/c/229304/1 can be abandoned now that iojs isn't a thing anymore? [23:59:06] (03CR) 10Yuvipanda: "Abandoning since I don't think iojs is a thing anymore." [puppet] - 10https://gerrit.wikimedia.org/r/229304 (owner: 10GWicke) [23:59:14] (03Abandoned) 10Yuvipanda: Don't require nodejs for restbase [puppet] - 10https://gerrit.wikimedia.org/r/229304 (owner: 10GWicke) [23:59:42] AndyRussG: can you update or abandon https://gerrit.wikimedia.org/r/#/c/182141/
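[For reference, a rough sketch of the pathinfo() approach Krenair outlines above. The function name and the probing comment are illustrative assumptions, not the actual multiversion MWRealm code in P2082 / Gerrit change 240612.]

    <?php
    // Split a config path into base + extension with pathinfo(), so a dot
    // in a directory name (e.g. wmf.d) is not mistaken for an extension.
    function splitRealmFilename( $filename ) {
        $pathinfo = pathinfo( $filename );
        // Default $ext to '', or '.' . $pathinfo['extension'] if it's set.
        $ext = isset( $pathinfo['extension'] ) ? '.' . $pathinfo['extension'] : '';
        // 'filename' is the basename with any extension already stripped.
        $base = $pathinfo['dirname'] . DIRECTORY_SEPARATOR . $pathinfo['filename'];
        // Realm-specific variants would then be probed as "$base-$realm$ext", etc.
        return array( $base, $ext );
    }

    var_dump( splitRealmFilename( '/srv/mediawiki/wmf.d/throttle' ) );
    // array('/srv/mediawiki/wmf.d/throttle', '')  -- dot in dirname ignored
    var_dump( splitRealmFilename( '/srv/mediawiki/throttle.php' ) );
    // array('/srv/mediawiki/throttle', '.php')

[On ostriches' point about relative paths: pathinfo( 'throttle' ) returns '.' for dirname, so a relative input resolves to './throttle'; test cases should pin that behavior down.]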