[00:00:35] !log catrope Finished scap: SWAT (duration: 22m 15s) [00:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:08:31] RoanKattouw, was the Flow change included in that scap? [00:09:08] Yes [00:09:14] ebernhardson: And your search changes too [00:09:17] (PS15) Gergő Tisza: [WIP] Basic role for Sentry [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles) [00:10:50] RoanKattouw: yup, thanks. both appear working (and messages deployed). [00:14:15] operations, Traffic, HTTPS, Patch-For-Review: Insecure POST traffic - https://phabricator.wikimedia.org/T105794#1487038 (BBlack) >>! In T105794#1486712, @Merl wrote: >>>! In T105794#1484730, @BBlack wrote: >> ^ Added Merl (I'm guessing is the maintainer of MerlBot). > > Thx. But my bot send its fir... [00:30:45] operations, Wikimedia-Logstash: Setup rsyncable git fat store to host Logstash plugins - https://phabricator.wikimedia.org/T107121#1487109 (bd808) NEW a:bd808 [00:30:46] !log krenair Synchronized php-1.26wmf15/extensions/SiteMatrix/SiteMatrix_body.php: https://gerrit.wikimedia.org/r/#/c/227379/ (duration: 00m 12s) [00:30:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:31:02] (PS5) GWicke: Lower the InitiatingHeapOccupancyPercent from 45% to 35% [puppet] - https://gerrit.wikimedia.org/r/227335 (https://phabricator.wikimedia.org/T106619) [00:32:04] (PS1) Alex Monk: Revert "Re-add default=wikipedia lines to wgCanonicalServer and wgSitename" [mediawiki-config] - https://gerrit.wikimedia.org/r/227381 [00:33:23] (CR) Alex Monk: [C: 2] Revert "Re-add default=wikipedia lines to wgCanonicalServer and wgSitename" [mediawiki-config] - https://gerrit.wikimedia.org/r/227381 (owner: Alex Monk)
[00:33:45] (Merged) jenkins-bot: Revert "Re-add default=wikipedia lines to wgCanonicalServer and wgSitename" [mediawiki-config] - https://gerrit.wikimedia.org/r/227381 (owner: Alex Monk) [00:34:07] (CR) CSteipp: [C: 1] Prevent access to hidden directories [puppet] - https://gerrit.wikimedia.org/r/217794 (https://phabricator.wikimedia.org/T94570) (owner: Muehlenhoff) [00:35:39] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/227381/ (duration: 00m 13s) [00:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:37:38] (PS16) Gergő Tisza: [WIP] Basic role for Sentry [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles) [00:39:47] (PS17) Gergő Tisza: [WIP] Basic role for Sentry [puppet] - https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: Gilles) [01:01:36] (CR) GWicke: [C: 1] enabled GC logging [puppet] - https://gerrit.wikimedia.org/r/227355 (https://phabricator.wikimedia.org/T106619) (owner: Eevans) [01:02:24] operations: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1487164 (coren) While I feel no urgency to replace paramiko, I'd be of the opinion that we heed the security advice of our security dude as a matter of general procedure. :-... [01:03:39] operations: paramiko (python SSH implementation) needs older hashes for host authentication - https://phabricator.wikimedia.org/T106871#1487168 (yuvipanda) Fair enough.
[01:08:13] (PS1) Yuvipanda: labstore: Make sure that output of .run() is unicode [puppet] - https://gerrit.wikimedia.org/r/227386 [01:08:25] (CR) Yuvipanda: [C: 2 V: 2] labstore: Make sure that output of .run() is unicode [puppet] - https://gerrit.wikimedia.org/r/227386 (owner: Yuvipanda) [01:11:03] (CR) Alex Monk: [C: 2] Follow-up I6e77eb39: Actually configure new logo for suwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/227371 (https://phabricator.wikimedia.org/T106784) (owner: Alex Monk) [01:11:09] (Merged) jenkins-bot: Follow-up I6e77eb39: Actually configure new logo for suwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/227371 (https://phabricator.wikimedia.org/T106784) (owner: Alex Monk) [01:11:54] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/227371/ (duration: 00m 11s) [01:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:14:42] (PS1) Yuvipanda: labstore: Follow up to I82ac5bf45cd3c2e10df25f825bf423cb2f7d1de0 [puppet] - https://gerrit.wikimedia.org/r/227387 [01:14:54] (CR) Yuvipanda: [C: 2 V: 2] labstore: Follow up to I82ac5bf45cd3c2e10df25f825bf423cb2f7d1de0 [puppet] - https://gerrit.wikimedia.org/r/227387 (owner: Yuvipanda) [01:18:48] operations, Patch-For-Review, WMF-deploy-2015-07-21_(1.26wmf15): High number of (session) redis connection failures - https://phabricator.wikimedia.org/T106986#1487191 (chasemp) I have a vague idea that https://phabricator.wikimedia.org/rEABF9ffa4003226c46813dfd6616ce173252b1f258c2 was surfacing a probl... [01:25:12] operations, RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1487203 (GWicke) Interestingly, https://packages.debian.org/search?suite=default&section=all&arch=any&searchon=names&keywords=+openjdk-8-jre lists jessie-backports having 8u45. Is that what we have bee...
[01:32:35] operations, Patch-For-Review, WMF-deploy-2015-07-21_(1.26wmf15): High number of (session) redis connection failures - https://phabricator.wikimedia.org/T106986#1487206 (aaron) Is that graph supposed to be instantaneous_ops_per_sec? I would expect a total_commands_processed command to just be non-decreas... [01:48:43] (PS6) Yuvipanda: [WIP] labstore: Rewrite of replica-addusers.pl [puppet] - https://gerrit.wikimedia.org/r/223564 [02:03:42] !log LocalisationUpdate failed (1.26wmf15) at 2015-07-28 02:03:41+00:00 [02:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:07:52] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 28 02:07:52 UTC 2015 (duration 7m 51s) [02:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:22:49] (PS7) Yuvipanda: labstore: Rewrite of replica-addusers.pl [puppet] - https://gerrit.wikimedia.org/r/223564 (https://phabricator.wikimedia.org/T104453) [02:23:59] (PS8) Yuvipanda: labstore: Rewrite of replica-addusers.pl [puppet] - https://gerrit.wikimedia.org/r/223564 (https://phabricator.wikimedia.org/T104453) [02:24:59] (CR) Yuvipanda: [C: 2 V: 2] labstore: Rewrite of replica-addusers.pl [puppet] - https://gerrit.wikimedia.org/r/223564 (https://phabricator.wikimedia.org/T104453) (owner: Yuvipanda) [02:25:36] (PS1) Yuvipanda: labstore: Change permissions on db / ldap credentials [puppet] - https://gerrit.wikimedia.org/r/227395 [02:25:48] (CR) Yuvipanda: [C: 2 V: 2] labstore: Change permissions on db / ldap credentials [puppet] - https://gerrit.wikimedia.org/r/227395 (owner: Yuvipanda) [02:26:21] you also going to split up maintain-replicas and do part of it in python YuviPanda?
[02:26:25] !log l10nupdate Synchronized php-1.26wmf15/cache/l10n: (no message) (duration: 07m 29s) [02:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:27:23] Krenair: yes, there's a bug for that as well. [02:27:32] Krenair: there's a general de-perlizing and de-bashing going on [02:30:25] !log LocalisationUpdate completed (1.26wmf15) at 2015-07-28 02:30:24+00:00 [02:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:43:15] (PS14) BryanDavis: labs: new role::logstash::stashbot class [puppet] - https://gerrit.wikimedia.org/r/227175 [02:44:53] (CR) Mattflaschen: "Or enable VE extension, but only for Flow (no beta feature, not on by default, veaction=edit URL-hacking would probably work but that's ok" [mediawiki-config] - https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: Mattflaschen) [02:57:54] YuviPanda, what if the meta_p part was actually a mw maintenance script? [02:58:05] * Krenair ducks [03:02:38] (PS15) BryanDavis: labs: new role::logstash::stashbot class [puppet] - https://gerrit.wikimedia.org/r/227175 [03:07:11] (PS1) BBlack: set dynamic_directors to false for labs [puppet] - https://gerrit.wikimedia.org/r/227402 (https://phabricator.wikimedia.org/T106662) [03:07:53] (CR) BBlack: [C: 2 V: 2] set dynamic_directors to false for labs [puppet] - https://gerrit.wikimedia.org/r/227402 (https://phabricator.wikimedia.org/T106662) (owner: BBlack) [03:14:30] operations, Services, Traffic: Provide an API listing at /api/ - https://phabricator.wikimedia.org/T107086#1487321 (BBlack) it doesn't really matter where it's placed, but the frontend is probably more appropriate as it's where most mangling should occur. [03:21:31] bblack, do you know about all of the other puppet failures in deployment-prep?
[03:27:03] Krenair: yeah I know about most of them [03:27:19] a lot of it's blocked on "upgrade deployment-prep caches to jessie" [03:27:44] they're basically invalid as a testing environment until they're jessie. lots of cache-role puppetization assumes jessie at this point. [03:27:56] the rest of it's about lack of TLS certs to make nginx puppetization work, mostly. [03:28:34] https://phabricator.wikimedia.org/T98758 + https://phabricator.wikimedia.org/T97593 [03:34:10] bblack, interesting, might see what I can do about the jessie one tomorrow [03:43:36] Krenair: awesome :) [03:56:10] (PS16) BryanDavis: labs: new role::logstash::stashbot class [puppet] - https://gerrit.wikimedia.org/r/227175 [04:08:10] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 14.29% of data above the critical threshold [100000000.0] [04:19:26] (PS1) BBlack: tlsproxy: refactor/cleanup, beta work [puppet] - https://gerrit.wikimedia.org/r/227404 (https://phabricator.wikimedia.org/T97593) [04:32:57] (PS1) Ori.livneh: Add 'session-redis' nutcracker group [puppet] - https://gerrit.wikimedia.org/r/227406 [04:36:52] (PS17) BryanDavis: labs: new role::logstash::stashbot class [puppet] - https://gerrit.wikimedia.org/r/227175 [04:38:00] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [04:50:40] PROBLEM - DPKG on rcs1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [04:50:52] (CR) Aaron Schulz: Add 'session-redis' nutcracker group (2 comments) [puppet] - https://gerrit.wikimedia.org/r/227406 (owner: Ori.livneh) [04:51:50] PROBLEM - puppet last run on mw2172 is CRITICAL puppet fail [04:52:40] RECOVERY - DPKG on rcs1001 is OK: All packages OK
[04:58:50] operations, CirrusSearch, Discovery: [epic] Update Elasticsearch to 1.6.1 or 1.7.0 - https://phabricator.wikimedia.org/T106090#1487386 (MoritzMuehlenhoff) If Ubuntu has released their security updates by then (and I hope they will), I'll install the updates before you start, so that the cluster nodes... [05:04:28] (CR) Ori.livneh: Add 'session-redis' nutcracker group (2 comments) [puppet] - https://gerrit.wikimedia.org/r/227406 (owner: Ori.livneh) [05:05:09] (PS2) Ori.livneh: Add 'session-redis' nutcracker group [puppet] - https://gerrit.wikimedia.org/r/227406 [05:19:39] RECOVERY - puppet last run on mw2172 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [05:20:57] Krenair: then it'll need credentials provisioned to run on the labsdb hosts which is complicated [05:27:39] (CR) Ori.livneh: [C: 2] Add 'session-redis' nutcracker group [puppet] - https://gerrit.wikimedia.org/r/227406 (owner: Ori.livneh) [05:29:56] (PS1) Ori.livneh: Add redis_auth to redis nutcracker group [puppet] - https://gerrit.wikimedia.org/r/227408 [05:30:18] (CR) Ori.livneh: [C: 2 V: 2] Add redis_auth to redis nutcracker group [puppet] - https://gerrit.wikimedia.org/r/227408 (owner: Ori.livneh) [05:33:37] (PS1) Ori.livneh: Revert "Add redis_auth to redis nutcracker group" [puppet] - https://gerrit.wikimedia.org/r/227410 [05:34:02] (CR) Ori.livneh: [C: 2 V: 2] Revert "Add redis_auth to redis nutcracker group" [puppet] - https://gerrit.wikimedia.org/r/227410 (owner: Ori.livneh) [05:35:51] PROBLEM - nutcracker port on mw1028 is CRITICAL: Connection refused [05:36:00] PROBLEM - nutcracker process on mw2155 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:36:01] PROBLEM - nutcracker port on mw1005 is CRITICAL: Connection refused [05:36:10] PROBLEM - nutcracker process on mw2014 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:36:10] PROBLEM - nutcracker process on mw1028 is CRITICAL: PROCS
CRITICAL: 0 processes with UID = 110 (nutcracker), command name nutcracker [05:36:19] PROBLEM - nutcracker process on mw2028 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:36:20] PROBLEM - nutcracker process on mw1005 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:36:30] PROBLEM - nutcracker port on mw2155 is CRITICAL: Connection refused [05:36:30] PROBLEM - nutcracker port on mw2034 is CRITICAL: Connection refused [05:36:31] PROBLEM - nutcracker port on mw1105 is CRITICAL: Connection refused [05:36:39] PROBLEM - nutcracker process on mw1067 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:36:39] PROBLEM - nutcracker port on mw2124 is CRITICAL: Connection refused [05:36:40] PROBLEM - nutcracker process on mw1006 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:36:49] PROBLEM - nutcracker process on mw1239 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:36:50] PROBLEM - nutcracker port on mw1108 is CRITICAL: Connection refused [05:36:50] PROBLEM - nutcracker port on mw2028 is CRITICAL: Connection refused [05:36:51] PROBLEM - nutcracker port on mw1239 is CRITICAL: Connection refused [05:36:59] PROBLEM - nutcracker port on mw1186 is CRITICAL: Connection refused [05:37:00] PROBLEM - nutcracker process on mw2124 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:00] PROBLEM - nutcracker process on mw1077 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:01] PROBLEM - nutcracker port on mw1077 is CRITICAL: Connection refused [05:37:11] PROBLEM - nutcracker port on mw2175 is CRITICAL: Connection refused [05:37:11] PROBLEM - nutcracker process on mw2034 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), 
command name nutcracker [05:37:11] PROBLEM - nutcracker port on mw1121 is CRITICAL: Connection refused [05:37:19] PROBLEM - nutcracker port on mw2014 is CRITICAL: Connection refused [05:37:19] PROBLEM - nutcracker port on mw2187 is CRITICAL: Connection refused [05:37:19] PROBLEM - nutcracker port on mw1043 is CRITICAL: Connection refused [05:37:20] PROBLEM - nutcracker port on mw1012 is CRITICAL: Connection refused [05:37:20] PROBLEM - nutcracker port on mw1067 is CRITICAL: Connection refused [05:37:20] PROBLEM - nutcracker process on mw1043 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:20] PROBLEM - nutcracker process on mw1108 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:21] PROBLEM - nutcracker process on mw1186 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:21] PROBLEM - nutcracker process on mw1012 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:22] PROBLEM - nutcracker port on mw2205 is CRITICAL: Connection refused [05:37:29] PROBLEM - nutcracker process on mw2205 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:30] PROBLEM - nutcracker port on mw2071 is CRITICAL: Connection refused [05:37:30] PROBLEM - nutcracker process on mw2108 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:30] PROBLEM - nutcracker process on mw2159 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:30] PROBLEM - nutcracker port on mw2180 is CRITICAL: Connection refused [05:37:30] PROBLEM - nutcracker port on mw2210 is CRITICAL: Connection refused [05:37:30] PROBLEM - nutcracker process on mw2187 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:31] PROBLEM - nutcracker 
process on mw2210 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:31] PROBLEM - nutcracker port on mw1229 is CRITICAL: Connection refused [05:37:40] PROBLEM - nutcracker process on mw2213 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:40] PROBLEM - nutcracker process on mw2154 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:40] PROBLEM - nutcracker process on mw1105 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:41] PROBLEM - nutcracker process on mw2185 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:41] PROBLEM - nutcracker process on mw2180 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:41] PROBLEM - nutcracker process on mw2175 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:41] PROBLEM - nutcracker port on mw1006 is CRITICAL: Connection refused [05:37:42] PROBLEM - nutcracker port on mw1250 is CRITICAL: Connection refused [05:37:42] PROBLEM - nutcracker process on mw1223 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:49] PROBLEM - nutcracker process on mw1121 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:49] PROBLEM - nutcracker port on mw1187 is CRITICAL: Connection refused [05:37:49] PROBLEM - nutcracker process on mw1160 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (nutcracker), command name nutcracker [05:37:50] PROBLEM - nutcracker process on mw2148 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:51] PROBLEM - nutcracker process on mw1229 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command 
name nutcracker [05:37:51] PROBLEM - nutcracker process on mw1209 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:59] PROBLEM - nutcracker port on mw2213 is CRITICAL: Connection refused [05:37:59] PROBLEM - nutcracker process on mw2071 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:59] PROBLEM - nutcracker process on mw1187 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:59] PROBLEM - nutcracker process on mw2204 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:37:59] PROBLEM - nutcracker port on mw2121 is CRITICAL: Connection refused [05:38:00] PROBLEM - nutcracker port on mw2204 is CRITICAL: Connection refused [05:38:00] PROBLEM - nutcracker port on mw2159 is CRITICAL: Connection refused [05:38:01] PROBLEM - nutcracker port on mw2108 is CRITICAL: Connection refused [05:38:01] PROBLEM - nutcracker process on mw1250 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:38:02] PROBLEM - nutcracker process on mw2115 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:38:02] PROBLEM - nutcracker port on mw1167 is CRITICAL: Connection refused [05:38:03] PROBLEM - nutcracker process on mw1236 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:38:19] that's me [05:38:20] PROBLEM - nutcracker process on mw1167 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:38:20] PROBLEM - nutcracker process on mw2121 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:38:20] PROBLEM - nutcracker process on mw1016 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:38:23] it'll fix itself in a moment 
[05:38:29] PROBLEM - nutcracker port on mw2185 is CRITICAL: Connection refused [05:38:29] PROBLEM - nutcracker port on mw2040 is CRITICAL: Connection refused [05:38:30] PROBLEM - nutcracker port on mw1160 is CRITICAL: Connection refused [05:38:39] PROBLEM - nutcracker port on mw1224 is CRITICAL: Connection refused [05:38:40] PROBLEM - nutcracker port on mw1071 is CRITICAL: Connection refused [05:38:40] PROBLEM - nutcracker port on mw1223 is CRITICAL: Connection refused [05:38:49] PROBLEM - nutcracker process on mw1152 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 109 (nutcracker), command name nutcracker [05:38:50] PROBLEM - nutcracker port on mw1236 is CRITICAL: Connection refused [05:38:50] PROBLEM - nutcracker port on mw1064 is CRITICAL: Connection refused [05:38:50] PROBLEM - nutcracker port on mw2013 is CRITICAL: Connection refused [05:38:50] PROBLEM - nutcracker port on mw2154 is CRITICAL: Connection refused [05:38:59] PROBLEM - nutcracker port on mw2195 is CRITICAL: Connection refused [05:38:59] PROBLEM - nutcracker port on mw1164 is CRITICAL: Connection refused [05:39:00] PROBLEM - nutcracker process on mw1224 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:00] PROBLEM - nutcracker process on mw2060 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:00] PROBLEM - nutcracker process on mw2040 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:01] PROBLEM - nutcracker port on mw2104 is CRITICAL: Connection refused [05:39:10] PROBLEM - nutcracker port on mw2115 is CRITICAL: Connection refused [05:39:10] PROBLEM - nutcracker port on mw2001 is CRITICAL: Connection refused [05:39:10] PROBLEM - nutcracker process on mw1071 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:10] PROBLEM - nutcracker process on mw1037 is CRITICAL: PROCS CRITICAL: 0 processes 
with UID = 108 (nutcracker), command name nutcracker [05:39:10] PROBLEM - nutcracker process on mw2156 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:10] PROBLEM - nutcracker port on mw2171 is CRITICAL: Connection refused [05:39:11] PROBLEM - nutcracker port on mw1193 is CRITICAL: Connection refused [05:39:11] PROBLEM - nutcracker port on mw1176 is CRITICAL: Connection refused [05:39:19] PROBLEM - nutcracker port on mw1242 is CRITICAL: Connection refused [05:39:19] PROBLEM - nutcracker process on mw1099 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:20] PROBLEM - nutcracker process on mw2104 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:20] PROBLEM - nutcracker port on mw1016 is CRITICAL: Connection refused [05:39:20] PROBLEM - nutcracker process on mw1064 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:20] PROBLEM - nutcracker process on mw1164 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:20] PROBLEM - nutcracker port on mw2060 is CRITICAL: Connection refused [05:39:21] PROBLEM - nutcracker port on mw2148 is CRITICAL: Connection refused [05:39:21] PROBLEM - nutcracker port on mw2173 is CRITICAL: Connection refused [05:39:22] PROBLEM - nutcracker port on mw2128 is CRITICAL: Connection refused [05:39:30] PROBLEM - nutcracker port on mw2066 is CRITICAL: Connection refused [05:39:30] PROBLEM - nutcracker port on mw1152 is CRITICAL: Connection refused [05:39:30] PROBLEM - nutcracker port on mw2156 is CRITICAL: Connection refused [05:39:30] PROBLEM - nutcracker port on mw1065 is CRITICAL: Connection refused [05:39:30] PROBLEM - nutcracker port on mw1100 is CRITICAL: Connection refused [05:39:30] PROBLEM - nutcracker port on mw1209 is CRITICAL: Connection refused [05:39:31] PROBLEM - nutcracker process 
on mw2013 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:31] PROBLEM - nutcracker process on mw2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:32] PROBLEM - nutcracker process on mw2022 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:32] PROBLEM - nutcracker process on mw2128 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:33] PROBLEM - nutcracker process on mw2063 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:33] PROBLEM - nutcracker process on mw2173 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:34] PROBLEM - nutcracker port on mw1217 is CRITICAL: Connection refused [05:39:40] PROBLEM - nutcracker process on mw2171 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:41] RECOVERY - nutcracker process on mw1223 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:39:49] PROBLEM - nutcracker process on mw1242 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:49] PROBLEM - nutcracker process on mw1176 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:49] PROBLEM - nutcracker process on mw1193 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:49] PROBLEM - nutcracker process on mw1088 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:39:59] PROBLEM - nutcracker port on mw2007 is CRITICAL: Connection refused [05:39:59] PROBLEM - nutcracker process on mw1052 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:40:00] PROBLEM - 
nutcracker process on mw2195 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:40:00] PROBLEM - nutcracker port on mw2063 is CRITICAL: Connection refused [05:40:00] PROBLEM - nutcracker process on mw1100 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:40:00] PROBLEM - nutcracker process on mw2066 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:40:01] PROBLEM - nutcracker port on mw1117 is CRITICAL: Connection refused [05:40:01] PROBLEM - nutcracker process on mw1217 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:40:09] PROBLEM - nutcracker port on mw1052 is CRITICAL: Connection refused [05:40:10] PROBLEM - nutcracker port on mw1099 is CRITICAL: Connection refused [05:40:19] PROBLEM - nutcracker process on mw2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:40:20] PROBLEM - nutcracker port on mw1088 is CRITICAL: Connection refused [05:40:20] PROBLEM - nutcracker process on mw1117 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:40:39] PROBLEM - nutcracker process on mw1065 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [05:40:40] RECOVERY - nutcracker port on mw1223 is OK: TCP OK - 0.000 second response time on port 11212 [05:40:50] PROBLEM - nutcracker port on mw1037 is CRITICAL: Connection refused [05:40:50] PROBLEM - nutcracker port on mw2022 is CRITICAL: Connection refused [05:42:40] RECOVERY - nutcracker process on mw1239 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:42:40] RECOVERY - nutcracker process on mw1152 is OK: PROCS OK: 1 process with UID = 109 (nutcracker), command name nutcracker [05:42:49] RECOVERY - nutcracker port on mw1037 is OK: TCP OK - 0.000 second response time 
on port 11212 [05:42:49] RECOVERY - nutcracker port on mw1108 is OK: TCP OK - 0.000 second response time on port 11212 [05:42:49] RECOVERY - nutcracker port on mw1236 is OK: TCP OK - 0.000 second response time on port 11212 [05:42:49] RECOVERY - nutcracker port on mw1064 is OK: TCP OK - 0.000 second response time on port 11212 [05:42:49] RECOVERY - nutcracker port on mw2022 is OK: TCP OK - 0.000 second response time on port 11212 [05:42:50] RECOVERY - nutcracker port on mw2028 is OK: TCP OK - 0.000 second response time on port 11212 [05:42:50] RECOVERY - nutcracker port on mw2013 is OK: TCP OK - 0.000 second response time on port 11212 [05:42:51] RECOVERY - nutcracker port on mw1239 is OK: TCP OK - 0.000 second response time on port 11212 [05:42:51] RECOVERY - nutcracker port on mw2154 is OK: TCP OK - 0.000 second response time on port 11212 [05:42:52] RECOVERY - nutcracker port on mw2195 is OK: TCP OK - 0.000 second response time on port 11212 [05:42:52] RECOVERY - nutcracker port on mw1186 is OK: TCP OK - 0.000 second response time on port 11212 [05:42:53] RECOVERY - nutcracker port on mw1164 is OK: TCP OK - 0.000 second response time on port 11212 [05:42:53] RECOVERY - nutcracker process on mw1224 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:42:54] RECOVERY - nutcracker process on mw1077 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:09] RECOVERY - nutcracker process on mw1037 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:09] RECOVERY - nutcracker process on mw2156 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:09] RECOVERY - nutcracker port on mw2175 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:09] RECOVERY - nutcracker port on mw1121 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:09] RECOVERY - nutcracker port on mw2171 is OK: TCP OK - 0.000 second response time on 
port 11212 [05:43:10] RECOVERY - nutcracker process on mw2034 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:10] RECOVERY - nutcracker port on mw1193 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:11] RECOVERY - nutcracker port on mw1176 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:11] RECOVERY - nutcracker port on mw1242 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:19] RECOVERY - nutcracker port on mw2014 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:19] RECOVERY - nutcracker port on mw2187 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:19] RECOVERY - nutcracker port on mw1043 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:19] RECOVERY - nutcracker process on mw1099 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:19] RECOVERY - nutcracker port on mw1012 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:20] RECOVERY - nutcracker port on mw1067 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:20] RECOVERY - nutcracker port on mw1016 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:21] RECOVERY - nutcracker process on mw2104 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:21] RECOVERY - nutcracker process on mw1043 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:22] RECOVERY - nutcracker process on mw1108 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:22] RECOVERY - nutcracker process on mw1064 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:23] RECOVERY - nutcracker process on mw1164 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:23] RECOVERY - nutcracker process on mw1186 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name 
nutcracker [05:43:24] RECOVERY - nutcracker port on mw2060 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:35] RECOVERY - nutcracker process on mw2001 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:35] RECOVERY - nutcracker process on mw2022 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:36] RECOVERY - nutcracker process on mw2128 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:36] RECOVERY - nutcracker port on mw1217 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:37] RECOVERY - nutcracker process on mw2063 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:37] RECOVERY - nutcracker process on mw2173 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:38] RECOVERY - nutcracker process on mw2213 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:38] RECOVERY - nutcracker process on mw2154 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:39] RECOVERY - nutcracker process on mw2171 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:39] RECOVERY - nutcracker process on mw1105 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:40] RECOVERY - nutcracker process on mw2185 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:40] RECOVERY - nutcracker process on mw2180 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:41] RECOVERY - nutcracker process on mw2175 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:41] RECOVERY - nutcracker port on mw1006 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:42] RECOVERY - nutcracker port on mw1250 is OK: TCP OK - 0.000 second response time on port 
11212 [05:43:42] RECOVERY - nutcracker process on mw1242 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:43] RECOVERY - nutcracker process on mw1121 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:43:43] RECOVERY - nutcracker port on mw1187 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:55] RECOVERY - nutcracker port on mw2204 is OK: TCP OK - 0.000 second response time on port 11212 [05:43:59] RECOVERY - nutcracker process on mw2066 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:00] RECOVERY - nutcracker port on mw2159 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:00] RECOVERY - nutcracker port on mw1117 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:00] RECOVERY - nutcracker port on mw2108 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:00] RECOVERY - nutcracker port on mw1005 is OK: TCP OK - 0.002 second response time on port 11212 [05:44:00] RECOVERY - nutcracker process on mw1250 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:00] RECOVERY - nutcracker port on mw1167 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:01] RECOVERY - nutcracker process on mw1217 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:01] RECOVERY - nutcracker process on mw2115 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:02] RECOVERY - nutcracker process on mw1236 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:02] RECOVERY - nutcracker process on mw2014 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:09] RECOVERY - nutcracker port on mw1052 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:09] RECOVERY - nutcracker process on mw1028 is OK: PROCS OK: 1 process with UID = 110 
(nutcracker), command name nutcracker [05:44:09] RECOVERY - nutcracker port on mw1099 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:10] RECOVERY - nutcracker process on mw2007 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:10] RECOVERY - nutcracker process on mw2028 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:10] RECOVERY - nutcracker port on mw1088 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:10] RECOVERY - nutcracker process on mw1117 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:11] RECOVERY - nutcracker process on mw1167 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:20] RECOVERY - nutcracker process on mw2121 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:20] RECOVERY - nutcracker process on mw1016 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:20] RECOVERY - nutcracker process on mw1005 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:21] RECOVERY - nutcracker port on mw2185 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:21] RECOVERY - nutcracker port on mw2155 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:21] RECOVERY - nutcracker port on mw2040 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:21] RECOVERY - nutcracker port on mw2034 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:29] RECOVERY - nutcracker port on mw1105 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:29] RECOVERY - nutcracker port on mw1160 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:30] RECOVERY - nutcracker port on mw1224 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:30] RECOVERY - nutcracker process on mw1065 is OK: PROCS OK: 1 process with UID = 108 
(nutcracker), command name nutcracker [05:44:30] RECOVERY - nutcracker port on mw1071 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:30] RECOVERY - nutcracker process on mw1067 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:44:31] RECOVERY - nutcracker port on mw2124 is OK: TCP OK - 0.000 second response time on port 11212 [05:44:39] RECOVERY - nutcracker process on mw1006 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [05:46:30] (03PS1) 10Yuvipanda: labstore: Followup to I90dc98401b89e769fa058943e3714e383dfe25ea [puppet] - 10https://gerrit.wikimedia.org/r/227413 (https://phabricator.wikimedia.org/T104453) [05:46:50] (03PS2) 10Yuvipanda: labstore: Followup to I90dc98401b89e769fa058943e3714e383dfe25ea [puppet] - 10https://gerrit.wikimedia.org/r/227413 (https://phabricator.wikimedia.org/T104453) [05:48:11] (03CR) 10Yuvipanda: [C: 032] labstore: Followup to I90dc98401b89e769fa058943e3714e383dfe25ea [puppet] - 10https://gerrit.wikimedia.org/r/227413 (https://phabricator.wikimedia.org/T104453) (owner: 10Yuvipanda) [06:08:29] 6operations, 10RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1487469 (10MoritzMuehlenhoff) If we move to openjdk-8 at a later point we would likely make our own backport and work with the Debian Java maintainers towards providing our build in jessie-backport. We... 
[06:53:21] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 28 06:53:21 UTC 2015 (duration 53m 20s) [06:53:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:10:57] (03PS1) 10Muehlenhoff: Enable base::firewall [puppet] - 10https://gerrit.wikimedia.org/r/227416 [07:14:20] PROBLEM - RAID on analytics1004 is CRITICAL Active: 7, Working: 7, Failed: 1, Spare: 0 [07:18:24] (03PS1) 10Muehlenhoff: Enable ferm on mc1009 [puppet] - 10https://gerrit.wikimedia.org/r/227417 [07:18:26] (03PS1) 10Muehlenhoff: Enable ferm for remaining mc1* systems [puppet] - 10https://gerrit.wikimedia.org/r/227418 [07:30:13] !log dropped others20150724190859 on labstore1002 [07:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:37:51] (03CR) 10Giuseppe Lavagetto: "@ori: While I agree that we need an unified fix, I think this is pretty isolated for now and can allow us to move forward relatively safe." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227000 (owner: 10Giuseppe Lavagetto) [07:38:20] (03PS3) 10Giuseppe Lavagetto: mediawiki: catch thumb_handler.php to HHVM as well [puppet] - 10https://gerrit.wikimedia.org/r/227000 [07:43:52] 6operations, 6Discovery, 10Wikimedia-Logstash, 7Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch & Logstash on jessie - https://phabricator.wikimedia.org/T98042#1487523 (10MoritzMuehlenhoff) a:3MoritzMuehlenhoff [07:44:16] (03PS1) 10Yuvipanda: labstore: Daemonize create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/227420 [07:45:08] (03PS2) 10Yuvipanda: labstore: Daemonize create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/227420 (https://phabricator.wikimedia.org/T104453) [07:45:28] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Daemonize create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/227420 (https://phabricator.wikimedia.org/T104453) (owner: 10Yuvipanda) [07:50:18] (03PS1) 10Yuvipanda: labstore: 
Fix use-before-reference properly [puppet] - 10https://gerrit.wikimedia.org/r/227421 [07:51:18] (03CR) 10Yuvipanda: [C: 032] labstore: Fix use-before-reference properly [puppet] - 10https://gerrit.wikimedia.org/r/227421 (owner: 10Yuvipanda) [07:54:40] PROBLEM - puppet last run on mw2062 is CRITICAL Puppet has 1 failures [07:55:05] (03PS1) 10Yuvipanda: labstore: Don't fetch 'homedir' property from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/227422 [07:55:25] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Don't fetch 'homedir' property from LDAP [puppet] - 10https://gerrit.wikimedia.org/r/227422 (owner: 10Yuvipanda) [08:11:00] (03PS1) 10Yuvipanda: labstore: Do not fail if homedir does not exist [puppet] - 10https://gerrit.wikimedia.org/r/227423 [08:11:32] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Do not fail if homedir does not exist [puppet] - 10https://gerrit.wikimedia.org/r/227423 (owner: 10Yuvipanda) [08:12:39] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1487540 (10MoritzMuehlenhoff) [08:12:41] 6operations, 6Discovery, 10Wikimedia-Logstash, 7Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch & Logstash on jessie - https://phabricator.wikimedia.org/T98042#1487538 (10MoritzMuehlenhoff) 5Open>3Resolved elasticsearch-1.7.0 has been imported for jessie-wikimedia and trusty-... [08:13:27] !log added elasticsearch-1.7.0 to carbon for jessie and trusty [08:13:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:14:19] PROBLEM - puppet last run on cp3035 is CRITICAL puppet fail [08:15:01] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1487541 (10jcrespo) @Mattflaschen once new servers are available, data will be transferred without losing records or availability. Data will not be backuped, in the traditional sense, but it will be... 
[08:18:40] RECOVERY - puppet last run on mw2062 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [08:32:00] PROBLEM - puppet last run on sca1001 is CRITICAL Puppet has 1 failures [08:42:10] RECOVERY - puppet last run on cp3035 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:57:49] RECOVERY - puppet last run on sca1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [09:00:10] (03PS3) 10Filippo Giunchedi: enabled GC logging [puppet] - 10https://gerrit.wikimedia.org/r/227355 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [09:00:48] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] enabled GC logging [puppet] - 10https://gerrit.wikimedia.org/r/227355 (https://phabricator.wikimedia.org/T106619) (owner: 10Eevans) [09:12:57] 6operations, 10Wikimedia-Logstash: Setup rsyncable git fat store to host Logstash plugins - https://phabricator.wikimedia.org/T107121#1487639 (10hashar) ----- A gotcha is the gem binaries are compiled against a ruby version. If you deploy on different Linux distributions / vary ruby version, you end up with... 
[09:17:52] (03PS3) 10Faidon Liambotis: mail: rename role::mail::lists to role::lists [puppet] - 10https://gerrit.wikimedia.org/r/216652 [09:17:54] (03PS3) 10Faidon Liambotis: exim: fold exim::roled into role::mail::mx [puppet] - 10https://gerrit.wikimedia.org/r/216651 [09:17:56] (03PS2) 10Faidon Liambotis: exim: use exim4 directly from role::otrs [puppet] - 10https://gerrit.wikimedia.org/r/216650 [09:17:58] (03PS2) 10Faidon Liambotis: exim: use exim4 directly from role::mail::lists [puppet] - 10https://gerrit.wikimedia.org/r/216649 [09:18:00] (03PS2) 10Faidon Liambotis: exim: use exim4 directly from Phab/RT [puppet] - 10https://gerrit.wikimedia.org/r/216648 [09:18:02] (03PS2) 10Faidon Liambotis: exim: remove defer_domains for single-domain MXes [puppet] - 10https://gerrit.wikimedia.org/r/216647 [09:18:04] (03PS2) 10Faidon Liambotis: exim: kill all exim::* classes except for ::roled [puppet] - 10https://gerrit.wikimedia.org/r/216646 [09:18:06] (03PS2) 10Faidon Liambotis: exim: kill unused exim::roled parameters [puppet] - 10https://gerrit.wikimedia.org/r/216645 [09:18:08] (03PS2) 10Faidon Liambotis: exim: inline @local_domains [puppet] - 10https://gerrit.wikimedia.org/r/216644 [09:18:10] (03PS2) 10Faidon Liambotis: exim: remove $smart_route_list [puppet] - 10https://gerrit.wikimedia.org/r/216643 [09:18:12] (03PS2) 10Faidon Liambotis: mail: remove secondary MX role from sodium (2nd take) [puppet] - 10https://gerrit.wikimedia.org/r/216642 [09:18:14] (03PS2) 10Faidon Liambotis: exim: kill one-size-fits-all SMTP_IMAP_MM template [puppet] - 10https://gerrit.wikimedia.org/r/216641 [09:18:16] (03PS2) 10Faidon Liambotis: exim: split main MX config into a separate config erb [puppet] - 10https://gerrit.wikimedia.org/r/216640 [09:18:18] (03PS2) 10Faidon Liambotis: exim: untangle exim4.conf between roles & simplify [puppet] - 10https://gerrit.wikimedia.org/r/216635 [09:18:20] (03PS2) 10Faidon Liambotis: exim: split phab_relay into a separate config erb [puppet] - 
10https://gerrit.wikimedia.org/r/216636 [09:18:22] (03PS2) 10Faidon Liambotis: exim: split rt_relay into a separate config erb [puppet] - 10https://gerrit.wikimedia.org/r/216637 [09:18:24] (03PS2) 10Faidon Liambotis: exim: split OTRS config into a separate config erb [puppet] - 10https://gerrit.wikimedia.org/r/216638 [09:18:26] (03PS2) 10Faidon Liambotis: exim: split mailman config into a separate config erb [puppet] - 10https://gerrit.wikimedia.org/r/216639 [09:22:00] 6operations, 6Services: Find spares for SCA services - https://phabricator.wikimedia.org/T107137#1487649 (10mobrovac) 3NEW [09:22:36] wow paravoid [09:22:39] mass rebase? [09:22:39] :) [09:22:41] (03CR) 10Faidon Liambotis: [C: 032] exim: untangle exim4.conf between roles & simplify [puppet] - 10https://gerrit.wikimedia.org/r/216635 (owner: 10Faidon Liambotis) [09:23:12] 6operations, 6Services: Find spares for SCA services - https://phabricator.wikimedia.org/T107137#1487659 (10mobrovac) [09:23:15] 6operations, 6Mobile-Apps, 6Services, 3Mobile-Content-Service, 5Patch-For-Review: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1487658 (10mobrovac) [09:23:20] 6operations, 6Services: Find spares for SCA services - https://phabricator.wikimedia.org/T107137#1487649 (10mobrovac) [09:23:23] 7Blocked-on-Operations, 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1206310 (10mobrovac) [09:24:30] (03CR) 10Faidon Liambotis: [C: 032] exim: split phab_relay into a separate config erb [puppet] - 10https://gerrit.wikimedia.org/r/216636 (owner: 10Faidon Liambotis) [09:26:11] wth [09:26:29] phabricator's puppet is spewing tons of refreshes on each run :( [09:34:10] (03PS1) 10Jcrespo: Increasing db1035 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227428 [09:39:40] (03CR) 10Jcrespo: [C: 032] Increasing db1035 weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227428 (owner: 10Jcrespo) [09:40:15] 
(03CR) 10Faidon Liambotis: [C: 032] exim: split rt_relay into a separate config erb [puppet] - 10https://gerrit.wikimedia.org/r/216637 (owner: 10Faidon Liambotis) [09:41:24] !log jynus Synchronized wmf-config/db-eqiad.php: Increasing db1035 weight (duration: 00m 13s) [09:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:56:16] 6operations, 10Wikimedia-Logstash: Setup rsyncable git fat store to host Logstash plugins - https://phabricator.wikimedia.org/T107121#1487727 (10MoritzMuehlenhoff) Not sure how many Gems we're talking here and how often they change, if the numbers are low, it would also be an option to use gem2deb to create a... [10:03:36] !log citoid deploying d57ec96 [10:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:07:40] 6operations: tin doesn't have access to same memcached as terbium and app servers - https://phabricator.wikimedia.org/T103198#1487754 (10fgiunchedi) >>! In T103198#1485010, @bd808 wrote: >>>! In T103198#1484465, @fgiunchedi wrote: >> I agree that's confusing, though I'm not sure if `mwscript` (part of scap) is o... [10:24:30] 6operations, 10Wikimedia-Logstash: Setup rsyncable git fat store to host Logstash plugins - https://phabricator.wikimedia.org/T107121#1487777 (10hashar) One problem is that logstash moves quickly when it comes to dependency and there is a lot to package or the distro versions would not match the requirements :... [10:27:49] (03CR) 10Faidon Liambotis: [C: 032] exim: split OTRS config into a separate config erb [puppet] - 10https://gerrit.wikimedia.org/r/216638 (owner: 10Faidon Liambotis) [10:28:08] (03CR) 10Hashar: [C: 031] "That is great thank you! The other errors are bits that needs to be moved to modules and reference puppet:///files/...." 
[puppet] - 10https://gerrit.wikimedia.org/r/198116 (https://phabricator.wikimedia.org/T87132) (owner: 10Tim Landscheidt) [10:31:17] !log merging a series of mail-related patches; ping me personally if problems arise [10:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:32:17] (03CR) 10Faidon Liambotis: [C: 032] exim: split mailman config into a separate config erb [puppet] - 10https://gerrit.wikimedia.org/r/216639 (owner: 10Faidon Liambotis) [10:37:38] (03CR) 10Faidon Liambotis: [C: 032] exim: split main MX config into a separate config erb [puppet] - 10https://gerrit.wikimedia.org/r/216640 (owner: 10Faidon Liambotis) [10:39:47] 6operations, 7Database: Spikes of job runner new connection errors to mysql "Error connecting to 10.64.32.24: Can't connect to MySQL server on '10.64.32.24' (4)" - https://phabricator.wikimedia.org/T107072#1487800 (10jcrespo) The large spikes are no longer, but the issue persist almost over all shards, with mu... [10:39:48] (03CR) 10Faidon Liambotis: [C: 032] exim: kill one-size-fits-all SMTP_IMAP_MM template [puppet] - 10https://gerrit.wikimedia.org/r/216641 (owner: 10Faidon Liambotis) [10:42:00] PROBLEM - puppet last run on cp3043 is CRITICAL puppet fail [11:10:10] RECOVERY - puppet last run on cp3043 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [11:20:39] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [11:21:11] PROBLEM - Restbase root url on restbase1007 is CRITICAL: Connection refused [11:21:20] PROBLEM - Disk space on restbase1007 is CRITICAL: DISK CRITICAL - free space: /var 420 MB (0% inode=99%) [11:31:50] godog: mind acking these alerts for rb1007? 
[11:32:16] (could also put rb100[7-9] in maintenance mode or sth) [11:32:28] mobrovac: yup, I did that 10m ago [11:32:36] ah k thnx [11:37:30] PROBLEM - Host mc2004 is DOWN: PING CRITICAL - Packet loss = 100% [11:40:30] RECOVERY - Host mc2004 is UPING OK - Packet loss = 0%, RTA = 44.72 ms [11:44:56] ^ mc2004 was me, I scheduled a downtime in Icinga, but apparently it didn't kick it in time [12:22:42] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1487961 (10matthiasmullie) @jcrespo: Flow doesn't currently record it's entries in `text`: they're stored separately (extension1, `flow_revision.rev_content`) to be easily accessible cross-wiki. We'... [12:25:51] PROBLEM - puppet last run on mira is CRITICAL Puppet has 1 failures [12:29:14] !log reenable puppet on restbase1001 after merging https://gerrit.wikimedia.org/r/#/c/227355/ [12:29:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:40:47] 6operations, 10CirrusSearch, 6Discovery, 10hardware-requests, 3Discovery-Cirrus-Sprint: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1487996 (10fgiunchedi) to clarify, I think it makes sense to quote 800G SSD and also 300G SSD for price co... [12:44:56] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1488003 (10fgiunchedi) no error shown after 25h of testing (fill up to 100% and read back the contents). the bytes read/written per second figures are skewed by being calculated o... 
[12:46:31] RECOVERY - Disk space on restbase1009 is OK: DISK OK [12:47:11] RECOVERY - Disk space on restbase1007 is OK: DISK OK [12:50:11] RECOVERY - puppet last run on mira is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:54:47] 6operations, 10MediaWiki-Database: Compress data at external storage - https://phabricator.wikimedia.org/T106386#1488009 (10jcrespo) > Is that a sane idea? @matthiasmullie I do not have enough architecture and flow knowledge to agree or disagree with your suggestion (I only created this task because it was su... [12:57:26] (03Abandoned) 10DCausse: Upgrade swift repository [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/225483 (owner: 10Manybubbles) [13:00:05] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150728T1300). Please do the needful. [13:07:35] (03PS4) 10Giuseppe Lavagetto: mediawiki: catch thumb_handler.php to HHVM as well [puppet] - 10https://gerrit.wikimedia.org/r/227000 [13:13:08] 6operations: Update wikimedia apt repo to include debs for shiny-server - https://phabricator.wikimedia.org/T106435#1488018 (10fgiunchedi) what's the medium/long term idea? just labs and/or MWV? 
(see also @yuvipanda's thoughts on the code review, related to this) [13:13:08] !log temporarily changing master of db1069(s1) to db1051 in order to fix some labsdb inconsistencies on enwiki_p [13:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:18:36] (03CR) 1020after4: [C: 032] Check for l10n cache before sync-wikiversions [tools/scap] - 10https://gerrit.wikimedia.org/r/226353 (https://phabricator.wikimedia.org/T100573) (owner: 1020after4) [13:18:59] (03Merged) 10jenkins-bot: Check for l10n cache before sync-wikiversions [tools/scap] - 10https://gerrit.wikimedia.org/r/226353 (https://phabricator.wikimedia.org/T100573) (owner: 1020after4) [13:20:07] (03PS2) 10Chmarkine: Add "Secure" flag to GeoIP cookie [puppet] - 10https://gerrit.wikimedia.org/r/224029 (https://phabricator.wikimedia.org/T105451) [13:24:26] (03CR) 10Chmarkine: "How about making GeoIP cookie secure first?" [puppet] - 10https://gerrit.wikimedia.org/r/224029 (https://phabricator.wikimedia.org/T105451) (owner: 10Chmarkine) [13:24:57] 6operations, 10Wikimedia-Logstash: Setup rsyncable git fat store to host Logstash plugins - https://phabricator.wikimedia.org/T107121#1488031 (10Ottomata) Interesting! Rather than using git-fat + rsync, perhaps it would be better to do the deploy-repo module, where there is a git-submodule that includes commi... 
[13:25:27] (03PS2) 10Ottomata: adding user madhuvishy to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/227357 (owner: 10RobH) [13:25:42] (03CR) 10Ottomata: [C: 032 V: 032] adding user madhuvishy to analytics-admins [puppet] - 10https://gerrit.wikimedia.org/r/227357 (owner: 10RobH) [13:27:49] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 3585.32188316 [13:28:42] (03PS4) 10Faidon Liambotis: mail: rename role::mail::lists to role::lists [puppet] - 10https://gerrit.wikimedia.org/r/216652 [13:28:44] (03PS4) 10Faidon Liambotis: exim: fold exim::roled into role::mail::mx [puppet] - 10https://gerrit.wikimedia.org/r/216651 [13:28:46] (03PS3) 10Faidon Liambotis: exim: use exim4 directly from role::otrs [puppet] - 10https://gerrit.wikimedia.org/r/216650 [13:28:48] (03PS3) 10Faidon Liambotis: exim: use exim4 directly from role::mail::lists [puppet] - 10https://gerrit.wikimedia.org/r/216649 [13:28:50] (03PS3) 10Faidon Liambotis: exim: use exim4 directly from Phab/RT [puppet] - 10https://gerrit.wikimedia.org/r/216648 [13:28:52] (03PS3) 10Faidon Liambotis: exim: remove defer_domains for single-domain MXes [puppet] - 10https://gerrit.wikimedia.org/r/216647 [13:28:54] (03PS3) 10Faidon Liambotis: exim: kill all exim::* classes except for ::roled [puppet] - 10https://gerrit.wikimedia.org/r/216646 [13:28:56] (03PS3) 10Faidon Liambotis: exim: kill unused exim::roled parameters [puppet] - 10https://gerrit.wikimedia.org/r/216645 [13:28:58] (03PS3) 10Faidon Liambotis: exim: inline @local_domains [puppet] - 10https://gerrit.wikimedia.org/r/216644 [13:29:00] (03PS3) 10Faidon Liambotis: exim: remove $smart_route_list [puppet] - 10https://gerrit.wikimedia.org/r/216643 [13:29:01] ACKNOWLEDGEMENT - RAID on analytics1004 is CRITICAL Active: 7, Working: 7, Failed: 1, Spare: 0 ottomata This is a Cisco box, and not used in production. The drives are not even mounted. 
I do use this box for testing new technologies, but it will never be used for production, so there is likely not a need to fix this. [13:29:02] (03PS3) 10Faidon Liambotis: mail: remove secondary MX role from sodium (2nd take) [puppet] - 10https://gerrit.wikimedia.org/r/216642 [13:29:31] (03PS6) 10Filippo Giunchedi: git deploy: don't fetch/checkout/restart on the deployment server [puppet] - 10https://gerrit.wikimedia.org/r/212291 (https://phabricator.wikimedia.org/T67549) (owner: 10ArielGlenn) [13:36:45] twentyafterfour: ping [13:37:22] twentyafterfour: /srv/mediawiki-staging/ used to have a "readonly" remote, but not anymore [13:37:30] twentyafterfour: this broke the alert for "Unmerged changes on repository mediawiki_config" [13:39:00] PROBLEM - puppet last run on mw1011 is CRITICAL Puppet last ran 19 hours ago [13:39:14] just reenabled ^ [13:40:16] !log restarted zookeeper on conf1001 to effect OpenJDK security update [13:40:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:41:00] RECOVERY - puppet last run on mw1011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [13:41:23] <_joe_> !log disabled puppet on mw1152, thumb_handler testing [13:41:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:48:14] (03PS1) 10Muehlenhoff: Add Joel Krauska to the bastiononly group [puppet] - 10https://gerrit.wikimedia.org/r/227449 [13:48:59] (03PS7) 10Filippo Giunchedi: git deploy: don't fetch/checkout/restart on the deployment server [puppet] - 10https://gerrit.wikimedia.org/r/212291 (https://phabricator.wikimedia.org/T67549) (owner: 10ArielGlenn) [13:49:13] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] git deploy: don't fetch/checkout/restart on the deployment server [puppet] - 10https://gerrit.wikimedia.org/r/212291 (https://phabricator.wikimedia.org/T67549) (owner: 10ArielGlenn) [13:49:36] (03CR) 10Tim Landscheidt: labstore: Followup to 
I90dc98401b89e769fa058943e3714e383dfe25ea (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227413 (https://phabricator.wikimedia.org/T104453) (owner: 10Yuvipanda) [13:50:13] 6operations, 10Deployment-Systems, 5Patch-For-Review: Trebuchet doesn't like when a deployer server is also a minion, a edge case for scap - https://phabricator.wikimedia.org/T67549#1488080 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi >>! In T67549#1469568, @thcipriani wrote: > @fgiunchedi works as exp... [13:50:21] 6operations, 10Beta-Cluster, 10Traffic: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1488083 (10hashar) deployment-cache-text03 has been created with Jessie system. That is to prepare the migration of the Trusty cache deployment-cache-text02. [13:50:41] 6operations, 10Beta-Cluster, 10Traffic: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1488084 (10Krenair) [13:52:51] (03CR) 10Tim Landscheidt: labstore: Followup to I90dc98401b89e769fa058943e3714e383dfe25ea (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227413 (https://phabricator.wikimedia.org/T104453) (owner: 10Yuvipanda) [13:58:26] !log upgrading baham to gdnsd 2.2.0 [13:58:28] bblack: ^ [13:58:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:37] 6operations, 6Release-Engineering, 7Database: Audit all existing code to ensure that any extension currently or previously adding blobs to ES has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#1488102 (10matthiasmullie) AIUI: the immediate spa... 
[14:03:28] 6operations, 10ops-eqiad, 6Discovery, 10Wikidata, and 3 others: Change hardware RAID controller on wmf3543, wmf3544 - https://phabricator.wikimedia.org/T107152#1488108 (10Joe) 3NEW a:3Joe [14:03:37] 6operations, 10ops-eqiad, 6Discovery, 10Wikidata, and 2 others: Change hardware RAID controller on wmf3543, wmf3544 - https://phabricator.wikimedia.org/T107152#1488108 (10Joe) [14:03:49] 6operations, 10ops-eqiad, 6Discovery, 10Wikidata, and 2 others: Change hardware RAID controller on wmf3543, wmf3544 - https://phabricator.wikimedia.org/T107152#1488108 (10Joe) a:5Joe>3Cmjohnson [14:05:19] PROBLEM - check_puppetrun on betelgeuse is CRITICAL Puppet has 1 failures [14:05:38] (03PS2) 10Faidon Liambotis: Add Joel Krauska to the bastiononly group [puppet] - 10https://gerrit.wikimedia.org/r/227449 (owner: 10Muehlenhoff) [14:05:49] (03CR) 10Faidon Liambotis: [C: 032] Add Joel Krauska to the bastiononly group [puppet] - 10https://gerrit.wikimedia.org/r/227449 (owner: 10Muehlenhoff) [14:09:57] (03PS1) 10Giuseppe Lavagetto: imagescalers: convert one host to mpm worker [puppet] - 10https://gerrit.wikimedia.org/r/227450 [14:10:55] (03CR) 10Faidon Liambotis: "What's the benefit/point, though? There is nothing private about the GeoIP cookie. 
It's also a session cookie (no max-age), so there is re" [puppet] - 10https://gerrit.wikimedia.org/r/224029 (https://phabricator.wikimedia.org/T105451) (owner: 10Chmarkine) [14:11:59] (03CR) 10Giuseppe Lavagetto: [C: 032] imagescalers: convert one host to mpm worker [puppet] - 10https://gerrit.wikimedia.org/r/227450 (owner: 10Giuseppe Lavagetto) [14:15:09] RECOVERY - check_puppetrun on betelgeuse is OK Puppet is currently enabled, last run 141 seconds ago with 0 failures [14:16:04] !log restarted zookeeper on conf1002 to effect OpenJDK security update [14:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:16:41] <_joe_> !log re-enabled puppet on mw1152 for testing [14:16:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:25:34] (03PS3) 10Filippo Giunchedi: cassandra: restrict data directory permissions [puppet] - 10https://gerrit.wikimedia.org/r/225300 (https://phabricator.wikimedia.org/T106133) [14:25:49] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: restrict data directory permissions [puppet] - 10https://gerrit.wikimedia.org/r/225300 (https://phabricator.wikimedia.org/T106133) (owner: 10Filippo Giunchedi) [14:28:19] !log restarted zookeeper on conf1003 to effect OpenJDK security update [14:28:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:29:14] (03CR) 10Matthias Mullie: [C: 031] "I've tested most of our dependencies as I wasn't sure we didn't rely on some dependencies too heavily." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: 10Mattflaschen) [14:33:17] (03PS2) 10Filippo Giunchedi: ganglia: cleanup old temporary graphs [puppet] - 10https://gerrit.wikimedia.org/r/226087 (https://phabricator.wikimedia.org/T97637) [14:33:37] _joe_: ^ [14:40:10] PROBLEM - check_puppetrun on heka is CRITICAL Puppet has 1 failures [14:41:49] PROBLEM - check_puppetrun on bellatrix is CRITICAL Puppet has 47 failures [14:41:51] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100% [14:41:51] PROBLEM - Host virt1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:32] the virts are me [14:43:47] !log powering down logstash1002 to remove disk and install jessie [14:43:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:45:09] RECOVERY - check_puppetrun on heka is OK Puppet is currently enabled, last run 187 seconds ago with 0 failures [14:46:13] paravoid: \o/ [14:46:50] PROBLEM - check_puppetrun on bellatrix is CRITICAL Puppet has 47 failures [14:47:10] RECOVERY - Host virt1001 is UPING OK - Packet loss = 0%, RTA = 0.62 ms [14:47:20] RECOVERY - Host virt1002 is UPING OK - Packet loss = 0%, RTA = 1.09 ms [14:47:58] <_joe_> godog: will take a look [14:49:31] thanks [14:54:10] (03PS2) 10BBlack: Remove multi-level subdomains from wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/227214 (https://phabricator.wikimedia.org/T102814) [14:54:38] (03CR) 10BBlack: [C: 032] Remove multi-level subdomains from wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/227214 (https://phabricator.wikimedia.org/T102814) (owner: 10BBlack) [14:54:44] (03PS18) 10Filippo Giunchedi: Cassandra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [14:54:55] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Cassandra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 
(https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [14:55:50] RECOVERY - puppet last run on labvirt1005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:56:03] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1488203 (10BBlack) [14:56:04] 6operations, 10Traffic: Clean up DNS/redirects for TLS - https://phabricator.wikimedia.org/T102824#1488204 (10BBlack) [14:56:06] 6operations, 6Community-Advocacy, 10Traffic, 7HTTPS, 5Patch-For-Review: Decom old multiple-subdomain wikis in wikipedia.org - https://phabricator.wikimedia.org/T102814#1488201 (10BBlack) 5Open>3Resolved a:3BBlack [14:56:49] RECOVERY - check_puppetrun on bellatrix is OK Puppet is currently enabled, last run 147 seconds ago with 0 failures [14:58:45] jouncebot: next [14:58:45] In 0 hour(s) and 1 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150728T1500) [15:00:04] manybubbles anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150728T1500). Please do the needful. [15:00:05] James_F: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:11] * James_F waves. 
[15:00:17] * thcipriani waves back [15:00:41] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226338 (owner: 10Jforrester) [15:01:16] (03Merged) 10jenkins-bot: Enable VisualEditor for 5% of new accounts on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226338 (owner: 10Jforrester) [15:02:59] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable VisualEditor for 5% of new accounts on enwiki [[gerrit:226338]] (duration: 00m 12s) [15:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:03:08] ^ James_F synced! [15:03:21] thcipriani: Yay. Not really testable… will have a poke though. [15:03:30] kk, thanks [15:03:40] PROBLEM - puppet last run on praseodymium is CRITICAL Puppet has 1 failures [15:04:30] 6operations, 10Wikimedia-Logstash: Setup rsyncable git fat store to host Logstash plugins - https://phabricator.wikimedia.org/T107121#1488217 (10bd808) >>! In T107121#1487727, @MoritzMuehlenhoff wrote: > Not sure how many Gems we're talking here and how often they change, if the numbers are low, it would also... [15:05:25] James_F: godspeed [15:05:34] greg-g: Thanks. [15:08:09] PROBLEM - puppet last run on cerium is CRITICAL Puppet has 1 failures [15:08:14] that's me ^ [15:11:06] greg-g: sadly you should probably take the many.bubbles ping out of SWAT [15:11:13] oh man [15:11:27] * greg-g pours one out [15:12:00] PROBLEM - Apache HTTP on mw1160 is CRITICAL - Socket timeout after 10 seconds [15:12:10] PROBLEM - puppet last run on restbase1003 is CRITICAL Puppet has 1 failures [15:12:22] 6operations, 10ops-eqiad: logstash1003 - RAID failed - https://phabricator.wikimedia.org/T104592#1488228 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson Replaced the disk prior to new install [15:12:25] thcipriani: FWIW, nothing looks amiss. 
[15:12:25] bd808: heh, he just did https://wikitech.wikimedia.org/w/index.php?title=Deployments&oldid=172085 [15:12:47] er, whatever, wrong url, meant https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=next&oldid=172009 [15:13:17] James_F: fatalmonitor looks about the same, too. Seems like a success in my book. Thanks for checking! [15:13:25] thcipriani: Thank you! [15:13:33] (03PS1) 10BBlack: Add HSTS preload for wikipedia.org, refactor related regexes [puppet] - 10https://gerrit.wikimedia.org/r/227455 (https://phabricator.wikimedia.org/T104244) [15:13:50] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.103 second response time [15:14:10] PROBLEM - puppet last run on restbase1004 is CRITICAL Puppet has 1 failures [15:15:46] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1488235 (10EWilfong_WMF) I apologize as I don't know who the question above was directed towards, but I'll add some thoughts from our end. Since there is already a wildcard cert for the site, it w... [15:16:11] (03PS2) 10BBlack: Add HSTS preload for wikipedia.org, refactor related regexes [puppet] - 10https://gerrit.wikimedia.org/r/227455 (https://phabricator.wikimedia.org/T104244) [15:16:57] 6operations, 10ops-eqiad: install 10g NIC card to labnet1002 - https://phabricator.wikimedia.org/T103849#1488237 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson Andrew and Faidon got this working finally! [15:19:08] 6operations, 10ops-eqiad: What to do with decommissioned ciscos? - https://phabricator.wikimedia.org/T103374#1488254 (10Cmjohnson) p:5Triage>3Low Until I have them all ready lowering the priority [15:20:17] 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1488256 (10Cmjohnson) The thermal paste is on-site. @bblack let me know the first chunk of servers. 
The whole process is pretty quick. [15:20:47] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1488258 (10BBlack) We're definitely not handing a 3rd party the private key to a WMF wildcard cert. I don't think we'd authorize someone else to purchase certs in our name either. At best you're... [15:21:28] 6operations, 10ops-eqiad, 10Incident-20150401-LabsNFS-Overload: Verify visually that the labstore shelves' wiring is stable - https://phabricator.wikimedia.org/T94828#1488259 (10Cmjohnson) 5Open>3Resolved The wiring is stable. Verified all the cables were snug in their ports. Resolving this task. [15:22:10] RECOVERY - puppet last run on restbase1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:23:19] Thanks cmjohnson1 [15:23:38] anytime! [15:24:13] 6operations, 10Continuous-Integration-Infrastructure: Phase out lanthanum.eqiad.wmnet - https://phabricator.wikimedia.org/T86658#1488270 (10Cmjohnson) [15:24:15] 6operations, 10ops-eqiad: wipe disks for lanthanum - https://phabricator.wikimedia.org/T105901#1488268 (10Cmjohnson) 5Open>3Resolved This task has been completed [15:26:10] RECOVERY - puppet last run on restbase1004 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [15:28:07] 6operations, 10Continuous-Integration-Infrastructure: Upload new Zuul .deb package on apt.wikimedia.org for precise-wikimedia - https://phabricator.wikimedia.org/T106499#1488278 (10hashar) 5stalled>3Open [15:28:24] (03PS1) 10Giuseppe Lavagetto: imagescalers: fix mpm worker config [puppet] - 10https://gerrit.wikimedia.org/r/227460 [15:28:41] (03Abandoned) 10Giuseppe Lavagetto: mediawiki: catch thumb_handler.php to HHVM as well [puppet] - 10https://gerrit.wikimedia.org/r/227000 (owner: 10Giuseppe Lavagetto) [15:29:01] (03PS2) 10Giuseppe Lavagetto: imagescalers: fix mpm worker config [puppet] - 10https://gerrit.wikimedia.org/r/227460 [15:29:09] (03CR) 10Giuseppe Lavagetto: 
[C: 032 V: 032] imagescalers: fix mpm worker config [puppet] - 10https://gerrit.wikimedia.org/r/227460 (owner: 10Giuseppe Lavagetto) [15:29:13] 6operations, 10Continuous-Integration-Infrastructure: Upload new Zuul .deb package on apt.wikimedia.org for precise-wikimedia - https://phabricator.wikimedia.org/T106499#1470255 (10hashar) Bumped the package to wmf3: ``` zuul (2.0.0-327-g3ebedde-wmf3precise1) precise-wikimedia; urgency=medium * 0008-Revert-... [15:29:40] RECOVERY - puppet last run on praseodymium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:52] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1488283 (10CCogdill_WMF) @BBlack we're on a time crunch with this project. The first event is scheduled for 10/1, so we need this system to be live within the next 2 weeks. Who should we involve t... [15:32:19] RECOVERY - puppet last run on cerium is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:36:40] (03PS1) 10Hashar: nodepool: stop using diskimage [puppet] - 10https://gerrit.wikimedia.org/r/227461 [15:37:37] (03CR) 10Hashar: "Dan: that will cause Nodepool to no more automatically build image. We will have to ship an image named 'ci-jessie-wikimedia' in wmflabs." [puppet] - 10https://gerrit.wikimedia.org/r/227461 (owner: 10Hashar) [15:39:26] 6operations, 10Analytics-Cluster, 6Analytics-Kanban: Build new latest stable (0.8.2.1?) Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1488300 (10Milimetric) [15:40:36] 6operations, 5Continuous-Integration-Isolation: Reinstall labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T107158#1488303 (10hashar) 3NEW [15:40:51] 6operations, 5Continuous-Integration-Isolation: Reinstall labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T107158#1488310 (10hashar) 5Open>3stalled Stalled for now. 
[15:43:39] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Isolation: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1488315 (10hashar) [15:43:58] 6operations, 5Continuous-Integration-Isolation: Reinstall labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T107158#1488316 (10chasemp) >>! In T107158#1488310, @hashar wrote: > Stalled for now. is this a `hashar is going on vacation` stall? :) [15:44:39] (03PS1) 10coren: Add manage-snapshots script [puppet] - 10https://gerrit.wikimedia.org/r/227462 (https://phabricator.wikimedia.org/T106474) [15:44:55] 6operations, 5Continuous-Integration-Isolation: Reinstall labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T107158#1488319 (10hashar) Potentially vacations will be a blocker. I have too look at the Debian packages available in apt.wikimedia.org since I think I have manually installed some ;-( [15:45:33] 7Blocked-on-Operations, 6operations, 5Continuous-Integration-Isolation: Backport python-os-client-config 1.3.0-1 from Debian Sid to jessie-wikimedia - https://phabricator.wikimedia.org/T104967#1433420 (10hashar) [15:48:26] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1488328 (10BBlack) Operations, probably myself and @RobH at least on the certificate-purchasing front. I imagine @Mark and @Faidon may have input or want to follow along as well. Ideally, we sho... [15:48:50] 6operations, 10ops-eqiad: Decom and wipe cisco virt servers virt1001-1009 then remove from racks - https://phabricator.wikimedia.org/T107159#1488330 (10Cmjohnson) 3NEW [15:50:43] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1488340 (10Cmjohnson) a:5Cmjohnson>3bd808 The on-site portion of this task has been completed. Assigning to Bryan to complete and resolve. 
[15:52:39] !log installed logstash on logstash1002; forced puppet run [15:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:20] "CRITICAL] The Salt Master has rejected this minion's public key!" [15:54:34] can somebody accept the key for logstash1002 on the salt master? [15:54:54] <_joe_> bd808: on it [15:55:04] ori: my brine has been flavored again ;) [15:57:00] 6operations, 10ops-eqiad, 6Discovery, 10Wikidata, and 2 others: Change hardware RAID controller on wmf3543, wmf3544 - https://phabricator.wikimedia.org/T107152#1488353 (10Cmjohnson) i have 10 710 controllers on-site. I believe these came as spares during the c2100/R720 swap we had a few years ago. I will... [15:58:13] <_joe_> bd808: restart the salt minion please [15:58:29] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1488355 (10CCogdill_WMF) I apologize about the timeline; we weren't able to request the cert until T104357 was resolved. Originally we thought we were ahead of schedule. I'll set up a meeting thi... [15:58:50] _joe_: done, but still getting rejected [15:59:05] <_joe_> bd808: look now [15:59:19] _joe_: working! thanks [15:59:24] <_joe_> I had to remove the old key [15:59:30] ah [15:59:34] <_joe_> that wasn't removed before, apparently [15:59:55] <_joe_> ok, off for real now :) [15:59:58] o/ [16:00:04] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150728T1600). 
[16:00:21] 6operations, 7Database: prepare for mariadb 10.0 masters - https://phabricator.wikimedia.org/T105135#1488360 (10jcrespo) [16:00:30] (03PS1) 10WMDE-leszek: phragile: Add role class [puppet] - 10https://gerrit.wikimedia.org/r/227466 (https://phabricator.wikimedia.org/T101235) [16:01:48] !log bounce cassandra on xenon to test logstash logging [16:01:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:03:21] !log logstash1002 conversion to jessie done; log event volume returning to normal in index [16:03:23] mobrovac urandom gwicke looks like we can attempt a bootstrap today on a new machine, disk stress test didn't surface any errors https://phabricator.wikimedia.org/T102557#1488003 [16:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:03:48] (03CR) 10Alex Monk: "Tim: Your input here would be appreciated." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206480 (https://phabricator.wikimedia.org/T18655) (owner: 10Nemo bis) [16:05:28] Memcached error for key "enwiki:preprocess-hash:ce11530ca3ddcbf3602e9ccb8815f2e3:0" on server "/var/run/nutcracker/nutcracker.sock:0": A TIMEOUT OCCURRED [16:05:34] 5000 times in the last hour [16:05:58] across a lot of mw hosts so probably at the backing mc side [16:06:30] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1488397 (10CCogdill_WMF) [16:06:47] 6operations, 6Labs, 6Security: create-dbusers can be used to clobber existing files on the NFS server - https://phabricator.wikimedia.org/T107161#1488399 (10scfc) 3NEW [16:11:48] 6operations, 10ops-eqiad, 6Discovery, 10Wikidata, and 2 others: Change hardware RAID controller on wmf3543, wmf3544 - https://phabricator.wikimedia.org/T107152#1488430 (10Smalyshev) WDQS is very IO intensive, see https://wiki.blazegraph.com/wiki/index.php/Hardware_Configuration - so we do need the best IO... 
[16:11:59] PROBLEM - Hadoop NodeManager on analytics1030 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:12:48] (03Abandoned) 10Chmarkine: Add "Secure" flag to GeoIP cookie [puppet] - 10https://gerrit.wikimedia.org/r/224029 (https://phabricator.wikimedia.org/T105451) (owner: 10Chmarkine) [16:13:49] (03CR) 10Alex Monk: [C: 04-1] "Might break some code in ZeroBanner" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226971 (https://phabricator.wikimedia.org/T106206) (owner: 10Alex Monk) [16:14:00] RECOVERY - Hadoop NodeManager on analytics1030 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [16:15:20] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1488448 (10Cmjohnson) updating this to match what is on-site. We're going to need to order fibers 12 20M and 6 15M LC-LC lvs1007: eth0 -> asw2-a5:8 (home row) eth1 -> asw-c8:23 eth2 -> a... [16:15:47] (03PS1) 10Yuvipanda: labstore: Do not follow symlinks in create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/227467 [16:16:25] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Do not follow symlinks in create-dbusers [puppet] - 10https://gerrit.wikimedia.org/r/227467 (owner: 10Yuvipanda) [16:16:54] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1488450 (10RobH) Brandon already covered this, but I'll add some clarification. Right now we have at least three third party sites hosting SSL, all of which do NOT use wildcards: shop.wikimedia.o... [16:20:45] (03CR) 10BryanDavis: [C: 04-1] "This is currently blocked on distributing the one logstash non-core plugin gem we need. Tracked in T107121." 
[puppet] - 10https://gerrit.wikimedia.org/r/226991 (https://phabricator.wikimedia.org/T99735) (owner: 10BryanDavis) [16:21:39] (03PS1) 10Tim Landscheidt: labstore: Fix path in write_credentials_file() [puppet] - 10https://gerrit.wikimedia.org/r/227469 [16:23:14] (03PS2) 10Yuvipanda: labstore: Fix path in write_credentials_file() [puppet] - 10https://gerrit.wikimedia.org/r/227469 (owner: 10Tim Landscheidt) [16:23:23] (03CR) 10Yuvipanda: [C: 032 V: 032] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/227469 (owner: 10Tim Landscheidt) [16:23:41] (03CR) 10Alex Monk: "See Ibb81fee3" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226971 (https://phabricator.wikimedia.org/T106206) (owner: 10Alex Monk) [16:24:40] !log bounced create-dbusers on labstore1002 [16:24:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:14] (03CR) 10Alex Monk: [C: 031] No need for wgSecureLogin on our wikis, HTTPS is forced everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219265 (https://phabricator.wikimedia.org/T103021) (owner: 10BBlack) [16:38:12] (03PS1) 10Alex Monk: Fix typo in reverse DNS for ms-fe2003.codfw.wmnet [dns] - 10https://gerrit.wikimedia.org/r/227474 [16:42:24] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1488560 (10BBlack) >>! In T107059#1488355, @CCogdill_WMF wrote: > I'll set up a meeting this week. Can I get a sense of what the DNS/TLS issues are that concern you, so I know who I need to includ... [16:42:40] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [16:43:30] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[16:43:48] 6operations, 6Reading-Admin, 6Zero: Set Content-Type to application/x-web-app-manifest+json for Wikipedia for Firefox OS webapp.manifest - https://phabricator.wikimedia.org/T107165#1488569 (10dr0ptp4kt) [16:44:19] ^ bblack, i emailed you the thread, but also copied the details onto the task. cc yurik [16:45:06] 6operations, 5Continuous-Integration-Isolation: Reinstall labnodepool1001.eqiad.wmnet - https://phabricator.wikimedia.org/T107158#1488574 (10hashar) Will reinstall it with @andrew on Wednesday 29th. Gotta look at it tonight to figure out which .deb package might be missing, and prepare them for upload on apt.... [16:45:24] (03PS1) 10Filippo Giunchedi: cassandra: add restbase1007 [puppet] - 10https://gerrit.wikimedia.org/r/227475 (https://phabricator.wikimedia.org/T102015) [16:47:41] 6operations, 10Beta-Cluster, 10Traffic: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1488592 (10demon) a:3demon [16:48:15] we have some nasty write spikes on s3 from time to time [16:48:56] gwicke mobrovac urandom ^ [16:49:37] as in 100% to 200% the number of regular writes queries [16:49:46] more [16:50:19] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1488605 (10BBlack) Having now read some of the traffic in the other linked tickets: can we at least be sure we get HTTPS working on both new domains as well? We're trying to eliminate all endpoin... 
[16:51:10] (03CR) 10GWicke: [C: 031] cassandra: add restbase1007 [puppet] - 10https://gerrit.wikimedia.org/r/227475 (https://phabricator.wikimedia.org/T102015) (owner: 10Filippo Giunchedi) [16:51:17] which leads to lag, which leads to reads being affected to, which leads to having more writes [16:52:38] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1488612 (10bd808) 5Open>3Resolved All 3 hosts are up and running jessie with elasticsearch 1.7.0 and logstash 1.4.2. [16:52:58] (03PS3) 10BBlack: Add HSTS preload for wikipedia.org, refactor related regexes [puppet] - 10https://gerrit.wikimedia.org/r/227455 (https://phabricator.wikimedia.org/T104244) [16:53:06] (03PS2) 10Filippo Giunchedi: cassandra: add restbase1007 [puppet] - 10https://gerrit.wikimedia.org/r/227475 (https://phabricator.wikimedia.org/T102015) [16:53:10] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase1007 [puppet] - 10https://gerrit.wikimedia.org/r/227475 (https://phabricator.wikimedia.org/T102015) (owner: 10Filippo Giunchedi) [16:53:12] (03CR) 10BBlack: [C: 032 V: 032] Add HSTS preload for wikipedia.org, refactor related regexes [puppet] - 10https://gerrit.wikimedia.org/r/227455 (https://phabricator.wikimedia.org/T104244) (owner: 10BBlack) [16:53:54] YuviPanda: merging your create-dbusers patch [16:54:00] godog: bah, sorry. sure [16:54:04] where did mine go? 
I only saw you two guys :) [16:54:26] ah merge failed [16:54:30] (03PS4) 10BBlack: Add HSTS preload for wikipedia.org, refactor related regexes [puppet] - 10https://gerrit.wikimedia.org/r/227455 (https://phabricator.wikimedia.org/T104244) [16:54:38] YuviPanda: np, was just FYI :) [16:54:44] (03CR) 10BBlack: [V: 032] Add HSTS preload for wikipedia.org, refactor related regexes [puppet] - 10https://gerrit.wikimedia.org/r/227455 (https://phabricator.wikimedia.org/T104244) (owner: 10BBlack) [16:54:46] (03PS1) 10Ori.livneh: session_redis twemproxy pool: listen on TCP, not Unix domain socket [puppet] - 10https://gerrit.wikimedia.org/r/227482 [16:55:14] (03PS2) 10Ori.livneh: session_redis twemproxy pool: listen on TCP, not Unix domain socket [puppet] - 10https://gerrit.wikimedia.org/r/227482 [16:55:34] (03CR) 10Ori.livneh: [C: 032 V: 032] session_redis twemproxy pool: listen on TCP, not Unix domain socket [puppet] - 10https://gerrit.wikimedia.org/r/227482 (owner: 10Ori.livneh) [16:55:48] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [16:56:59] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [16:59:37] 6operations, 6Analytics-Engineering, 10Wikimedia-Logstash: Convert Hadoop-Logstash logging to use Redis to address failures - https://phabricator.wikimedia.org/T85015#1488627 (10bd808) Note: the redis connector has been removed from the logstash servers after its use caused problems with MediaWiki latency in... [17:00:39] RECOVERY - Restbase root url on restbase1007 is OK: HTTP OK: HTTP/1.1 200 - 15145 bytes in 0.018 second response time [17:01:05] (03CR) 10Rush: "is this related to https://phabricator.wikimedia.org/T106986?" [puppet] - 10https://gerrit.wikimedia.org/r/227482 (owner: 10Ori.livneh) [17:01:36] jynus, can you take a look at https://gerrit.wikimedia.org/r/#/c/225702/ please? 
[17:01:46] chasemp: not directly [17:01:57] ok [17:02:16] chasemp: we're not using that yet and may not use it, https://phabricator.wikimedia.org/T106986 needs to be resolved one way or another [17:03:06] yes agreed but I think it's passed the trivial resolution stage even if it's a small fix it's a problem we are only see at scale [17:03:14] next troubleshooting steps are all invasive-ish [17:03:18] or the ones I can think of [17:04:06] !log start cassandra on restbase1007, tentative bootstrap [17:04:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:04:49] RECOVERY - Cassandra database on restbase1007 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [17:05:25] 7Puppet, 6Labs: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#1488633 (10yuvipanda) 3NEW [17:05:36] MaxSem, have you tested that locally? [17:05:38] 6operations, 5Patch-For-Review, 5WMF-deploy-2015-07-21_(1.26wmf15): High number of (session) redis connection failures - https://phabricator.wikimedia.org/T106986#1488640 (10chasemp) paste I am using as a sounding board: https://phabricator.wikimedia.org/P1076 [17:05:47] chasemp: whip out gdb and set a breakpoint on the relevant failure in phpredis? [17:09:23] 7Puppet, 6Labs: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#1488644 (10demon) Some of them shouldn't be labs-only probably ;-) [17:12:50] 7Puppet, 6Labs: Move all labs-only puppet roles to manifests/role/labs - https://phabricator.wikimedia.org/T107167#1488657 (10yuvipanda) Ah, right. so if it is applied in both prod and labs it should *not* be a labs only role but use hiera. This is for the growing number of things that are 'deployed' to labs... 
[17:13:11] jynus, I tested the import command but not the manifest [17:14:39] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 3.39% of data above the critical threshold [1000.0] [17:14:58] problem is the first import could be done twice as a race condition [17:15:31] ^ expected, that's cassandra coming online -.- [17:16:42] jynus, e.g. if provisioned on Tuesday? I guess we can remove automatic initial import completely then... [17:17:30] I would create the cron [17:17:46] and run it once manually, document it [17:17:57] is that ok? [17:18:08] ^MaxSem [17:18:14] ok [17:22:32] 6operations, 6Reading-Admin, 6Zero: Set Content-Type to application/x-web-app-manifest+json for Wikipedia for Firefox OS webapp.manifest - https://phabricator.wikimedia.org/T107165#1488670 (10dr0ptp4kt) @yurik, it seems in the Mozilla Developers console for the app the manifest URL changed to meta.wikimedia.... [17:25:25] high 404, let me check what they are [17:25:52] (03PS1) 10BBlack: Fix content-type for new firefoxos manifest URL [puppet] - 10https://gerrit.wikimedia.org/r/227486 (https://phabricator.wikimedia.org/T107165) [17:26:57] 6operations, 6Reading-Admin, 6Zero, 5Patch-For-Review: Set Content-Type to application/x-web-app-manifest+json for Wikipedia for Firefox OS webapp.manifest - https://phabricator.wikimedia.org/T107165#1488684 (10BBlack) I copied the line for that patch from the existing one that was set up for bits.wikimedi... 
[17:29:28] (03CR) 10Dr0ptp4kt: [C: 031] Fix content-type for new firefoxos manifest URL [puppet] - 10https://gerrit.wikimedia.org/r/227486 (https://phabricator.wikimedia.org/T107165) (owner: 10BBlack) [17:31:05] (03PS1) 10Chad: Phabricator: Setup git config for all repositories [puppet] - 10https://gerrit.wikimedia.org/r/227488 [17:31:07] (03PS1) 10Chad: Phabricator: Fetch all references in Git [puppet] - 10https://gerrit.wikimedia.org/r/227489 [17:31:55] I do not see any pattern and it is not that high compared to other days, will ignore it for now [17:32:05] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1488714 (10CCogdill_WMF) We thought using the wildcard cert for donate.wikimedia.org already has was the easiest thing, but I can see your concerns so I've edited the task description to clarify t... [17:34:10] 6operations, 6Reading-Admin, 6Zero, 5Patch-For-Review: Set Content-Type to application/x-web-app-manifest+json for Wikipedia for Firefox OS webapp.manifest - https://phabricator.wikimedia.org/T107165#1488718 (10Yurik) @dr0ptp4kt, no idea, I haven't touched it in ages. Seems like it was [[ https://gerrit.wi... [17:34:26] 6operations, 10ops-eqiad, 10Traffic: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1488720 (10BBlack) The only changes I see from the previous lists' ports are the eth0 for the latter 3 (used to be asw-c8:26-28, now is 23-25, which is already taken by eth1 of the first 3?). I... [17:34:57] (03CR) 10BBlack: [C: 032] Fix content-type for new firefoxos manifest URL [puppet] - 10https://gerrit.wikimedia.org/r/227486 (https://phabricator.wikimedia.org/T107165) (owner: 10BBlack) [17:35:03] ugh. why do we still serve that FFOS thingie? [17:35:30] because it's still in the app store and apparently someone maintains it [17:35:54] Wreck it brion! 
[17:36:16] we should kill it, it's a bastardized version of the old phonegap app which I had the misfortune of maintaining back in the day... [17:36:18] really once you put something like that out there, it sucks to abandon it. you can't ever wipe out the old installs which will eventually break from lack of updates. All you can do is keep updating it or hope the whole platform dies :) [17:36:41] it's probably already broken and IMO the whole platform was stillborn :P [17:37:01] the whole thing with distributing "apps" from third-party domains is fucked [17:37:03] apparently it still has users though [17:39:15] 6operations, 6Release-Engineering, 7Mobile: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1488745 (10MaxSem) 3NEW [17:40:40] bblack: IE6 still has users too, doesn't mean we have to support them. [17:41:01] ostriches, excuse me? OMGTHNKOFTHECHILDREN [17:41:49] (03PS5) 10Jcrespo: role::maps::master: Import waterlines on init and then weekly [puppet] - 10https://gerrit.wikimedia.org/r/225702 (owner: 10MaxSem) [17:42:22] (03CR) 10Jcrespo: [C: 032] role::maps::master: Import waterlines on init and then weekly [puppet] - 10https://gerrit.wikimedia.org/r/225702 (owner: 10MaxSem) [17:43:05] on a second thought, the refresh only is on the script, not on the downloaded files [17:43:16] so i think it is good enough [17:43:23] MaxSem: It's already a submodule... [17:43:41] It clones mobile frontend [17:43:43] lolool [17:44:56] ostriches: we don't support IE6 at all, but we had to do some stats and soul-searching on that I guess :) [17:45:24] I'll bet you $20 we killed more IE6 users than WikipediaFirefoxOS users have ever existed. [17:45:30] We should find stats :) [17:45:35] IE8 is next on the chopping block at some point in the future. 
But we need to do more advocacy against it first, and more waiting for the stats to die off [17:45:44] (03CR) 10Jcrespo: [V: 032] role::maps::master: Import waterlines on init and then weekly [puppet] - 10https://gerrit.wikimedia.org/r/225702 (owner: 10MaxSem) [17:45:53] ostriches, it still gets deployed as part of mw-config [17:46:05] I'll clarify the bug [17:46:21] ballpark, IE8 is still about 0.6% of traffic, and probably skewed considerably higher in some countries [17:46:24] MaxSem: Doesn't that crud have to be in the docroot though? [17:46:36] 6operations: detail now many XFP/SFP+ tranceivers are needed per peering site - https://phabricator.wikimedia.org/T105827#1488772 (10RobH) a:5faidon>3RobH reclaiming as I chatted with mark about this in irc. each router can handle 8. for chicago, shipping with them populated is ideal, since we don't have a... [17:46:43] If we move it, we break existing installs. If we break existing installs, go all the way and kill the stupid thing. [17:46:53] (03PS1) 10Tim Landscheidt: Labs: Remove reboot-if-idmap [puppet] - 10https://gerrit.wikimedia.org/r/227492 (https://phabricator.wikimedia.org/T95555) [17:47:05] ostriches, it doesn't have to break stuff [17:47:32] 6operations: detail now many XFP/SFP+ tranceivers are needed per peering site - https://phabricator.wikimedia.org/T105827#1488776 (10RobH) p:5Normal>3High [17:47:43] 6operations, 6Release-Engineering, 7Mobile: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1488778 (10MaxSem) [17:49:05] anyway, this should have the lowest priority while killing the app altogether should probably be higher:P [17:49:57] lol [17:50:18] I can't find any Firefox under OS listings on https://stats.wikimedia.org/wikimedia/squids/SquidReportOperatingSystems.htm [17:50:56] this report is unmaintained and is scheduled for death [17:51:11] Yeah [17:51:40] jynus, thank you! 
[17:53:51] hmm [17:53:55] YuviPanda, urllib.error.HTTPError: HTTP Error 403: Bad Behavior [17:55:16] (03PS1) 10Tim Landscheidt: Labs: Remove various obsolete migration code [puppet] - 10https://gerrit.wikimedia.org/r/227493 [17:56:23] (03CR) 10Tim Landscheidt: [C: 04-1] "This should only be merged a day or so after I19f5734c6290ca3175f33f0a561ee48f7bcb9b06 has been merged." [puppet] - 10https://gerrit.wikimedia.org/r/227493 (owner: 10Tim Landscheidt) [17:57:31] 6operations, 10Traffic, 5HTTPS-by-default, 5Patch-For-Review: Preload HSTS - https://phabricator.wikimedia.org/T104244#1488805 (10BBlack) Status update: We've submitted all of the primary domains from our unified cert to the HSTS preload list, with the exception of wikimedia.org. This means the following... [18:00:04] twentyafterfour greg-g: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150728T1800). [18:02:21] 6operations, 10ops-codfw: check both mx80s for spare XFPs - https://phabricator.wikimedia.org/T107177#1488829 (10RobH) 3NEW a:3Papaul [18:04:15] 6operations: Update nutcracker/twemproxy package for 0.4.1 - https://phabricator.wikimedia.org/T107178#1488843 (10ori) 3NEW [18:07:43] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1488864 (10BBlack) [18:07:48] YuviPanda, ah, it was ukwikimedia [18:07:56] closed wiki, redirects to WMUK's external hosting [18:12:09] mobrovac: link to that ticket again? [18:12:49] PROBLEM - puppet last run on elastic1025 is CRITICAL puppet fail [18:13:05] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1488879 (10ori) > submitted to the Chromium HSTS Preload list The Chromium project indicates that they are willing to special-case certain requests, and that petitioners... 
[18:13:07] jzerebecki: https://gerrit.wikimedia.org/r/#/c/224374/ [18:13:37] YuviPanda: you mean https://phabricator.wikimedia.org/project/profile/1305/ ? [18:13:50] (03PS1) 10EBernhardson: Disable search feedback survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227496 (https://phabricator.wikimedia.org/T103131) [18:13:52] (03PS1) 10RobH: adding in split pd escalations [puppet] - 10https://gerrit.wikimedia.org/r/227497 [18:14:19] mobrovac: nice. none of that is node specific tho [18:14:34] nope [18:14:46] mobrovac: yeah, so that's great, etc. [18:14:51] (03CR) 10RobH: [C: 032] adding in split pd escalations [puppet] - 10https://gerrit.wikimedia.org/r/227497 (owner: 10RobH) [18:15:28] YuviPanda: yup, we now have to see everything behind that for python [18:15:37] running the services in prod, logging, metrics etc [18:15:41] all stuff we have for node [18:16:49] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0] [18:18:07] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1488890 (10Papaul) [18:18:09] 6operations, 10ops-codfw: check both mx80s for spare XFPs - https://phabricator.wikimedia.org/T107177#1488888 (10Papaul) 5Open>3Resolved cr2 pmtpa : 8 cr1 sdtpa : 5 [18:21:03] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1488907 (10BBlack) >>! In T107059#1488714, @CCogdill_WMF wrote: > We thought using the wildcard cert for donate.wikimedia.org already has was the easiest thing, but I can see your concerns so I've... [18:25:21] 6operations, 10ops-codfw: check both mx80s for spare XFPs - https://phabricator.wikimedia.org/T107177#1488936 (10Papaul) {F280193} [18:26:27] mobrovac: right. 
so currently everything is running as systemd units logging to syslog [18:27:10] mobrovac: no graphite support yet, going to get graphite added [18:28:12] cool [18:28:30] we'll probably want non-blocking logstash support too [18:29:06] mobrovac: can't we just tail that from syslog? [18:29:19] mobrovac: would be nice to not need deb packages :P but I guess that's not happening [18:29:46] that's an option too, YuviPanda, but only for the first stage, if you ask me [18:30:08] mobrovac: indeed, but I think I can churn out python packages fairly quickly now so nbd. [18:30:09] if we are going to have more python services, then setting tailing up on each box for each service might get messy [18:30:15] cool [18:30:28] mobrovac: sure. logging is configurable, so we could do that not too difficultly I guess [18:33:38] !log disabling puppet and nova-network on labnet1002 to avoid possible conflict between two different dhcp servers [18:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:35:51] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#1489013 (10BBlack) I think that's mostly about agl special-casing exemptions to the list of rules for automatic inclusion to the master list in git (the ones about restric... 
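Note on the logging exchange above (18:26 to 18:30): a minimal sketch of what "non-blocking" log shipping could look like for a Python service, using only the standard library. This is an assumption for illustration, not the setup discussed in the channel. `build_logger` and `ListHandler` are hypothetical names, and `ListHandler` merely stands in for whatever slow network handler would eventually ship records to Logstash.

```python
import logging
import logging.handlers
import queue


def build_logger(name, slow_handler):
    """Return (logger, listener).

    The logger only enqueues records; a background thread owned by the
    listener forwards them to slow_handler, so callers never block on it.
    """
    log_queue = queue.Queue(-1)  # unbounded queue
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    listener.start()
    return logger, listener


class ListHandler(logging.Handler):
    """Hypothetical stand-in for a network handler; keeps messages in memory."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())
```

Usage would be along the lines of `logger, listener = build_logger("demo-service", ListHandler())`, with `listener.stop()` called at shutdown so queued records are flushed. Tailing syslog, as suggested in the channel, remains the simpler first stage.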
[18:36:14] RECOVERY - puppet last run on elastic1025 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [18:37:33] PROBLEM - nova-network process on labnet1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-network [18:44:05] !log Twiddling with nutcracker on mw1041 [18:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:18] 6operations, 7Monitoring: Switch Icinga from smsglobal - https://phabricator.wikimedia.org/T106589#1489093 (10RobH) [18:44:55] 6operations, 7Monitoring: Switch Icinga from smsglobal - https://phabricator.wikimedia.org/T106589#1472363 (10RobH) I'm in the process of documenting our pagerduty setup and pushing it somewhere so others can modify it. Right now, I basically have everyone in a rotation for 8AM to 11PM in their local time zon... [18:47:33] PROBLEM - nutcracker process on mw1041 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 108 (nutcracker), command name nutcracker [18:48:24] PROBLEM - nutcracker port on mw1041 is CRITICAL: Connection refused [18:49:43] RECOVERY - nutcracker process on mw1041 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [18:50:33] RECOVERY - nutcracker port on mw1041 is OK: TCP OK - 0.000 second response time on port 11212 [18:53:01] (03PS7) 10BBlack: No need for wgSecureLogin on our wikis, HTTPS is forced everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219265 (https://phabricator.wikimedia.org/T103021) [18:55:08] (03CR) 10BBlack: [C: 032] No need for wgSecureLogin on our wikis, HTTPS is forced everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219265 (https://phabricator.wikimedia.org/T103021) (owner: 10BBlack) [18:56:11] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1489236 (10csteipp) @bblack, we should definitely **NOT** give them our wildcard key. 
We will need to work out a safe way to get them subdomain certificates-- I think we should probably generate t... [18:56:48] !log bblack Synchronized wmf-config/InitialiseSettings.php: remove wgSecureLogin (duration: 00m 12s) [18:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:57:08] !log bblack Synchronized wmf-config/InitialiseSettings-labs.php: remove wgSecureLogin (duration: 00m 12s) [18:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:01:51] (03PS1) 10Yuvipanda: toollabs: Add support for uwsgi-plain webservice type [puppet] - 10https://gerrit.wikimedia.org/r/227503 (https://phabricator.wikimedia.org/T104374) [19:02:03] valhallasw`cloud: ^ uwsgi-plain, then figure out python3 config, and then eventually make that as uwsgi-python3 :) [19:02:13] need to figure out routing in uwsgi to rewrite /toolname to / [19:03:03] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1489249 (10Cmjohnson) I found the XFP's from Tampa's RX-16. They made it to eqiad. Additionally I have 7 on-site that belong to eqiad. 6 Finisar FTLX1412M3BCL 10GBAE-LR/LW (Tampa) 1 Foundry FTLX1412D2BCL-F1... [19:08:05] YuviPanda: we should take a look at dh-virtualenv at some point [19:08:28] valhallasw`cloud: hmm for? [19:08:32] oh, the packaging? [19:08:34] yeah, perhaps. [19:08:34] (03CR) 10Merlijn van Deen: [C: 031] toollabs: Add support for uwsgi-plain webservice type [puppet] - 10https://gerrit.wikimedia.org/r/227503 (https://phabricator.wikimedia.org/T104374) (owner: 10Yuvipanda) [19:08:45] on the other hand: ugh, debuild [19:13:00] 6operations, 5Patch-For-Review, 5WMF-deploy-2015-07-21_(1.26wmf15): High number of (session) redis connection failures - https://phabricator.wikimedia.org/T106986#1489292 (10chasemp) Problem outline: On july 23rd redis connection failures in prod jumped by an order of magnitude Current thoughts: I'm at a... 
[19:16:51] (03CR) 10Hashar: "there is some diskimage related files left over." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227461 (owner: 10Hashar) [19:18:05] (03PS2) 10Yuvipanda: toollabs: Add support for uwsgi-plain webservice type [puppet] - 10https://gerrit.wikimedia.org/r/227503 (https://phabricator.wikimedia.org/T104374) [19:18:13] (03CR) 10Yuvipanda: [C: 032 V: 032] toollabs: Add support for uwsgi-plain webservice type [puppet] - 10https://gerrit.wikimedia.org/r/227503 (https://phabricator.wikimedia.org/T104374) (owner: 10Yuvipanda) [19:19:35] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1489330 (10RobH) [19:19:37] 6operations: detail now many XFP/SFP+ tranceivers are needed per peering site - https://phabricator.wikimedia.org/T105827#1489328 (10RobH) 5Open>3Resolved Papaul updated associate t107177 with the onsite xfp, so this is resolved. [19:20:37] (03CR) 10Yuvipanda: "This should also be called 'clean old volumes' or something. manage is just too generic." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227462 (https://phabricator.wikimedia.org/T106474) (owner: 10coren) [19:21:24] Coren: ^ some comments [19:21:28] Coren: lookinga t it more now. [19:21:38] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1489342 (10RobH) [19:22:14] Coren: can you also +1 https://gerrit.wikimedia.org/r/#/c/227492/1 and the subsequent patch? [19:23:07] (03CR) 10coren: [C: 031] "While it's conceivable that some instances remain (old self-hosted puppet master, etc) - we can fix those when the time comes." [puppet] - 10https://gerrit.wikimedia.org/r/227492 (https://phabricator.wikimedia.org/T95555) (owner: 10Tim Landscheidt) [19:24:14] (03CR) 10coren: [C: 031] "With the caveat that this should not be merged in yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/227493 (owner: 10Tim Landscheidt) [19:24:33] (03PS2) 10Yuvipanda: Labs: Remove reboot-if-idmap [puppet] - 10https://gerrit.wikimedia.org/r/227492 (https://phabricator.wikimedia.org/T95555) (owner: 10Tim Landscheidt) [19:24:49] (03CR) 10Yuvipanda: [C: 032 V: 032] Labs: Remove reboot-if-idmap [puppet] - 10https://gerrit.wikimedia.org/r/227492 (https://phabricator.wikimedia.org/T95555) (owner: 10Tim Landscheidt) [19:25:00] 6operations, 7Monitoring: Switch Icinga from smsglobal - https://phabricator.wikimedia.org/T106589#1489367 (10chasemp) thanks rob! [19:25:24] (03CR) 10coren: "It used to be 'manage' because the original script behind this also created the snapshots; I agree the name is overly generic for its new " (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227462 (https://phabricator.wikimedia.org/T106474) (owner: 10coren) [19:26:15] (03PS1) 10Alex Monk: Add python3 script to populate meta_p [software] - 10https://gerrit.wikimedia.org/r/227505 (https://phabricator.wikimedia.org/T107094) [19:26:19] (03PS2) 10coren: Add cleanup-snapshots script [puppet] - 10https://gerrit.wikimedia.org/r/227462 (https://phabricator.wikimedia.org/T106474) [19:28:02] (03CR) 10Yuvipanda: Add cleanup-snapshots script (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/227462 (https://phabricator.wikimedia.org/T106474) (owner: 10coren) [19:28:16] Coren: ^ that's all I got. let's merge after those nits and see what happens! [19:28:27] Coren: after that I think we'll need a systemd unit + timers [19:31:06] godog: hiya, got a weird debian packaging issue, maybe you can shed some light? [19:34:38] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1489416 (10EWilfong_WMF) Thanks, @bblack and @robh for the insight into current operating procedure for third-party sites hosted over HTTPS under the wikimedia.org domain. We were not given direc... 
[19:36:09] YuviPanda: I /really/ don't get that __name__ == '__main__' thing. It's completely pointless and just obscures things under a level of indirection for a standalone script. It's not like this can be imported as a module in any way that can make sense. [19:36:41] YuviPanda: It has the smell of cargo programming to me. [19:38:02] https://docs.python.org/2/library/__main__.html [19:39:11] You are of course correct that it is superfluous in a stand alone script, but it's a widely used idiom in python land [19:39:54] Coren: bd808 you can python -i and import a script for testing and debugging [19:39:56] bd808: I know what it /does/. But as far as I can tell it's an idiom that is used even when it is useless for the sole reason that others have used it even when it is useless. Hence "cargo programming" [19:39:59] and it doesn't execute it as __main__ [19:40:30] Coren: so that or def main():. def main() has the advantage of not leaking stuff into global scope [19:40:38] I've used it many times in this way for testing and debugging outside of the context of lib [19:41:05] so an 'ideal' would be to do the if and call main(), but just the if is ok too for the reasons others have mentioned [19:41:09] aside from having a main() as entry point for sanity the __main__ has a large use case [19:41:34] and then if you have a main(), you can actually pass sys.argv there, and suddenly the script can be used as library [19:41:57] not necessarily with the sanest interface, though [19:42:03] indeed :P [19:42:11] the everything is a library argument is kind of insanity [19:42:28] None of this sound even remotely compelling to me - but meh. I'll build your wooden plane. 
:-) [19:42:47] Coren: let's write it in perl and call out to sed every 5 lines with 4 types of quoting instead :) [19:42:52] but defining a common entry point and allowing a script to be loaded for namespace by the interpreter is good [19:43:27] YuviPanda: The rare cases where you see code of mine that does that is when a shell script was hastily converted to perl. :-) [19:43:42] YuviPanda: Which never ever does a good job regardless of the destination language. [19:43:42] 'rare' :P [19:44:13] Yes, rare. [19:44:32] indeed - so these are small things that we can do from the beginning that'll help, I think. 'code smells' [19:45:04] Down for anyone else? https://phabricator.wikimedia.org/project/sprint/burn/483/ "Request: GET http://phabricator.wikimedia.org/project/sprint/burn/483/, from 10.64.0.172 via cp1044 cp1044 ([10.64.0.172]:80), Varnish XID 771163268 - Forwarded for: 80.217.41.169, 10.64.0.172 - Error: 503, Service Unavailable at Tue, 28 Jul 2015 19:42:35 GMT" [19:46:17] chasemp: ^ [19:47:06] I don't know anything about that now :) ostriches^ or mukunda? [19:48:12] hmm... [19:48:22] 504 Gateway Time-out - nginx/1.9.3 ---urgh [19:48:30] (PS3) coren: Add cleanup-snapshots script [puppet] - https://gerrit.wikimedia.org/r/227462 (https://phabricator.wikimedia.org/T106474) [19:48:43] probably a really slow query? [19:49:10] could change the debug.limit and try to get a trace [19:50:44] max execution timeout ... [19:51:01] doesn't look like a mysql query, it's just a really intensive request [19:52:10] Should I create a Phabricator (irony) task for it? [19:53:08] (CR) Yuvipanda: [C: 1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/227462 (https://phabricator.wikimedia.org/T106474) (owner: coren) [19:53:11] Coren: ^ [19:54:52] https://phabricator.wikimedia.org/T107197 so it is documented at least...
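Note on the `__name__ == '__main__'` exchange above (19:36 to 19:44): a minimal sketch of the pattern the channel converged on, namely a `main()` that takes the argument list plus a guard that calls it. The greeting logic and the function body are hypothetical placeholders, not code from the cleanup-snapshots patch under review.

```python
import sys


def main(argv):
    """Hypothetical entry point.

    Taking argv as a parameter keeps working variables local to the
    function instead of leaking them into module scope, and lets other
    code or an interactive session call main() with its own arguments.
    """
    names = argv[1:] or ["world"]
    return ["hello, " + name for name in names]


if __name__ == "__main__":
    # Runs only when the file is executed as a script. Importing the
    # module (for example from an interactive session while debugging)
    # defines main() without running it.
    for line in main(sys.argv):
        print(line)
```

With this layout, importing the file and calling `main(["prog", "a"])` exercises the same code path as running the script, which is the testing and debugging use case mentioned in the channel. For a genuinely standalone script the guard is optional, as Coren argues.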
[19:56:34] PROBLEM - puppet last run on mw1245 is CRITICAL Puppet has 1 failures [20:00:34] 6operations, 10Continuous-Integration-Infrastructure: Upload new Zuul .deb package on apt.wikimedia.org for precise-wikimedia and trusty-wikimedia - https://phabricator.wikimedia.org/T106499#1489468 (10hashar) [20:01:00] 6operations, 10Continuous-Integration-Infrastructure: Upload new Zuul .deb package on apt.wikimedia.org for precise-wikimedia and trusty-wikimedia - https://phabricator.wikimedia.org/T106499#1470255 (10hashar) Finally rebuild the package for Trusty as zuul_2.0.0-327-g3ebedde-wmf3trusty1 . I have updated the ta... [20:05:32] (03CR) 10Yuvipanda: "Thank you for working on this!" (039 comments) [software] - 10https://gerrit.wikimedia.org/r/227505 (https://phabricator.wikimedia.org/T107094) (owner: 10Alex Monk) [20:15:13] twentyafterfour: Not sure if this helps at all but... https://phabricator.wikimedia.org/T107197#1489511 [20:17:16] (03PS18) 10Gergő Tisza: [WIP] Basic role for Sentry [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [20:18:21] 6operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, 10Unplanned-Sprint-Work: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1489518 (10DStrine) [20:21:04] RECOVERY - puppet last run on mw1245 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [20:23:57] 6operations, 7Mail: add aliases to catch two main corp mailing lists with specified without 'lists' - https://phabricator.wikimedia.org/T107079#1489562 (10JKrauska) Close task -- agreed name overload is too confusing, and not that important. 
Naming google groups like this invite-NAME --J [20:24:06] 6operations, 7Mail: add aliases to catch two main corp mailing lists with specified without 'lists' - https://phabricator.wikimedia.org/T107079#1489563 (10JKrauska) 5Open>3declined [20:24:41] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1489574 (10srijan) Yes I can ssh into stat1003 now! Thanks! [20:25:24] cajoel: enjoy the masses of 'wait, why is there an invite-wmfall group?' or 'where did my email to invite-wmfall' go or 'why did someone invite wmfall?' ;) [20:25:51] yes, but that's exactly what it's for [20:26:15] got a better name? [20:26:22] no :) [20:26:39] then again, I'm not a stakeholder because no employment so doesn't bother me :D [20:26:41] I can hide it from the directory too [20:26:51] so it should be only used sparingly [20:27:20] oh, right.. (stops listening to johnf on this one) [20:29:03] wfm [20:32:42] twentyafterfour: did we deploy yet? [20:33:08] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1489621 (10CCogdill_WMF) The concern for us lies in what @EWilfong_WMF just said — SNI incompatibility on XP applies to all versions of IE, not just IE 6. So for XP users with IE 7+, they would be... 
[20:35:59] * aude thinks not [20:38:10] 6operations, 10Deployment-Systems, 6Release-Engineering: Corrupt /srv/deployment/scap/scap checkouts on WMF prod cluster - https://phabricator.wikimedia.org/T103441#1489642 (10greg) [20:38:13] 6operations, 6Release-Engineering, 10Wikidata, 10Wikimedia-General-or-Unknown: Wikidata and Wikiversity logo 404ing on wikimedia.org - https://phabricator.wikimedia.org/T103296#1489644 (10greg) [20:38:18] 6operations, 6Labs, 6Release-Engineering, 10wikitech.wikimedia.org, 5Patch-For-Review: silver / scap - Could not get latest version: 403 Forbidden - https://phabricator.wikimedia.org/T103138#1489646 (10greg) [20:39:07] !log Upgraded nutcracker to 0.4.1-1+wm1 across fleet [20:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:39:25] since when did draggin between columnsmake wikibugs say things? [20:39:28] aude: no [20:39:29] 7Puppet, 6Labs: Could not find data item labs_recursor - https://phabricator.wikimedia.org/T107205#1489652 (10Tgr) 3NEW [20:39:39] (sorry for the spam) [20:39:40] 6operations: Update nutcracker/twemproxy package for 0.4.1 - https://phabricator.wikimedia.org/T107178#1489659 (10ori) 5Open>3Resolved a:3ori [20:39:53] greg-g: it shouldn't... [20:40:05] aude: I was trying to implement semver with this week's branch but that didn't work out. branch is just getting cut now [20:40:11] bd808: any tips about https://phabricator.wikimedia.org/T107205 ? [20:40:39] legoktm: maybe the combo of that and herald adding matan.ya ? 
[20:40:40] twentyafterfour: ok [20:41:05] we have a new branch for wikidata [20:41:14] aude: I saw that [20:41:27] it should pick up your new branch [20:41:30] unfortunately i might be out for a couple hours, but think it won't cause critical issues for the test wikis [20:41:41] (03PS1) 10Ori.livneh: Re-add redis_auth to redis nutcracker group [puppet] - 10https://gerrit.wikimedia.org/r/227572 [20:41:48] if so, then can be reverted and i look when i come back [20:42:00] aude: no problem, thanks for the heads up [20:42:05] ok [20:42:11] tgr: that's a new one to me. Looks like the sort of madness that is usually cleared up by restarting the puppetmaster and trying again though. [20:42:13] I think I can handle it if anything comes up [20:42:16] k [20:42:21] (03CR) 10Ori.livneh: [C: 032 V: 032] Re-add redis_auth to redis nutcracker group [puppet] - 10https://gerrit.wikimedia.org/r/227572 (owner: 10Ori.livneh) [20:42:50] legoktm: I saw wikibugs alert on column changes this morning too but hoped it was just a weird fluke [20:42:53] PROBLEM - puppet last run on tin is CRITICAL Puppet has 1 failures [20:44:06] legoktm: bd808 here ya go: https://phabricator.wikimedia.org/T107208 ;) [20:47:12] oh [20:47:15] this is a lame bug. [20:47:23] usually [20:48:40] Puppet tin failure is due to there already being an actual redis instance on 6379; fixing. 
[20:49:00] ori: that's probably trebuchet [20:49:09] (03CR) 10Alex Monk: Add python3 script to populate meta_p (039 comments) [software] - 10https://gerrit.wikimedia.org/r/227505 (https://phabricator.wikimedia.org/T107094) (owner: 10Alex Monk) [20:49:12] yes [20:49:23] (03PS2) 10Alex Monk: Add python3 script to populate meta_p [software] - 10https://gerrit.wikimedia.org/r/227505 (https://phabricator.wikimedia.org/T107094) [20:50:22] (03PS1) 10Ori.livneh: nutcracker redis pool: listen on 6380, not 6379 [puppet] - 10https://gerrit.wikimedia.org/r/227573 [20:52:26] 6operations, 7Mail: add aliases to catch two main corp mailing lists with specified without 'lists' - https://phabricator.wikimedia.org/T107079#1489740 (10Krenair) To me it sounded like you had a reasonable request, but your choice... [21:07:24] bblack: would it be possible to get a cache expiration of the object at https://meta.wikimedia.org/WikipediaMobileFirefoxOS/manifest.webapp ? [21:07:37] looks to be a stale object [21:08:23] bblack: ^ looks like your thing? [21:09:02] RECOVERY - puppet last run on tin is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:09:33] jynus, hi, around? [21:10:36] bd808: looks like the whole sentry project is screwed :( [21:10:54] puppet doesn't even get to the point of installing the SSH key [21:11:06] tgr: :(( for a new instance? [21:11:18] labs pukes like that sometimes [21:11:27] woah what happened? 
[21:11:36] yes [21:11:44] old instances still work [21:11:53] no idea what triggered [21:11:54] if you recreate an instance with the same name as an older instance that's probably still screwed for a bit [21:11:58] try a different instance name [21:14:32] YuviPanda: I see a slightly different set of errors on the console but still cannot log in [21:14:42] sentry-alpha3 login: 2015-07-28T21:13:39.260334+00:00 sentry-alpha3 nslcd[1105]: [334873] (re)loading /etc/nsswitch.conf [21:14:42] !log Deployed patch for T107170 to wmf/1.26wmf15 and wmf/1.26wmf16 [21:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:15:22] and SSH just gives permission denied [21:18:16] YuviPanda: it all started with https://phabricator.wikimedia.org/T107205 , don't know if that's related or just a one-time puppet failure [21:18:56] tgr: what project is this? [21:19:03] sentry [21:24:14] tgr: try creating a new one now? [21:25:23] tgr: I deleted the wikitech page instead of blanking it, wonder if that helps [21:28:04] (PS1) 20after4: 1.26wmf16 symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/227581 [21:28:25] (CR) 20after4: [C: 2] 1.26wmf16 symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/227581 (owner: 20after4) [21:28:31] (Merged) jenkins-bot: 1.26wmf16 symlinks [mediawiki-config] - https://gerrit.wikimedia.org/r/227581 (owner: 20after4) [21:29:00] YuviPanda: that fixed it, thanks! [21:29:13] tgr: need to file a bug now, I guess. I'll do it [21:32:32] legoktm: thank you very much for the code review... I tested PS 29 quite well. I converted the change of the hash key back... so that there is no additional overhead ... if you have nothing else to criticize I'll test the new PS30 again tomorrow [21:36:01] swat coming up in 30 minutes? should I scap or wait until after the swat? [21:36:31] I guess I'll go ahead with it? usually doesn't run much longer than 35 minutes or so [21:36:42] greg-g: what do you think?
[21:37:23] 6operations, 6Release-Engineering, 7Mobile: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1489889 (10hashar) So my questions are: what the hell is that code base for? was it a one off experiment? is that actually receiving traffic? can we shoot it? which tea... [21:39:02] ok gonna just doooo it [21:39:16] !log twentyafterfour Started scap: new branch: testwiki to 1.26wmf16 [21:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:39:27] twentyafterfour: as greg would likely say, JFDI ;) [21:40:35] JohnFLewis: twentyafterfour indeed, mostly because SWAT isn't for another 1.5 hours, not .5 hours :) [21:41:01] (03CR) 10Tim Landscheidt: Add python3 script to populate meta_p (033 comments) [software] - 10https://gerrit.wikimedia.org/r/227505 (https://phabricator.wikimedia.org/T107094) (owner: 10Alex Monk) [21:41:25] 6operations, 6Release-Engineering, 7Mobile: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1489893 (10demon) >>! In T107172#1489889, @hashar wrote: > So my questions are: > > what the hell is that code base for? Firefox OS, lol. > was it a one off experimen... [21:43:01] 6operations, 6Release-Engineering, 7Mobile: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1489895 (10greg) @brion tell us we can kill the WikipediaMobileFirefoxOS thingy, please? [21:47:33] 6operations, 6Release-Engineering, 7Mobile, 7Technical-Debt: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1489899 (10hashar) [21:48:32] PROBLEM - Apache HTTP on mw1159 is CRITICAL - Socket timeout after 10 seconds [21:48:34] 6operations, 6Release-Engineering, 7Mobile, 7Technical-Debt: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1489902 (10dr0ptp4kt) The app is part of the Partnerships portfolio. 
It's in maintenance / bugfix mode. [21:48:34] !log canary restbase ca30b69 deploy to restbase1001.eqiad [21:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:49:42] PROBLEM - Apache HTTP on mw1160 is CRITICAL - Socket timeout after 10 seconds [21:50:09] 6operations, 6Release-Engineering, 7Mobile, 7Technical-Debt: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1489909 (10greg) >>! In T107172#1489902, @dr0ptp4kt wrote: > The app is part of the Partnerships portfolio. It's in maintenance / bugfix mode. Adam:... [21:50:12] (03CR) 10Krinkle: Add python3 script to populate meta_p (031 comment) [software] - 10https://gerrit.wikimedia.org/r/227505 (https://phabricator.wikimedia.org/T107094) (owner: 10Alex Monk) [21:50:13] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.066 second response time [21:51:24] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.068 second response time [21:53:27] 6operations, 5Patch-For-Review, 5WMF-deploy-2015-07-21_(1.26wmf15): High number of (session) redis connection failures - https://phabricator.wikimedia.org/T106986#1489920 (10ori) >>! In T106986#1489292, @chasemp wrote: > explore nutcracker'ing redis traffic and hope the connection pooling helps this in the s... [21:53:52] PROBLEM - puppet last run on db2001 is CRITICAL puppet fail [21:54:41] 6operations, 6Release-Engineering, 7Mobile, 7Technical-Debt: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1489922 (10dr0ptp4kt) @greg, FFOS is more targeted at Global South regions, so probably the simplest would be #zero. 
[21:55:34] 6operations, 6Release-Engineering, 6Zero, 7Mobile, 7Technical-Debt: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1489923 (10greg) [22:01:00] !log restbase ca30b69 deployed to eqiad cluster [22:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:01:22] PROBLEM - HHVM rendering on mw1203 is CRITICAL - Socket timeout after 10 seconds [22:02:22] PROBLEM - Apache HTTP on mw1203 is CRITICAL - Socket timeout after 10 seconds [22:03:10] 6operations, 7Monitoring: Migrate monitoring alerts from watchmouse to catchpoint - https://phabricator.wikimedia.org/T107092#1489934 (10RobH) 5Open>3stalled I've set the two pagerduty email to sms gateway addresses into the alerts contact group in catchpoint. This should allow us to use the PD scheduling... [22:03:17] 6operations, 7Monitoring: Migrate monitoring alerts from watchmouse to catchpoint - https://phabricator.wikimedia.org/T107092#1489936 (10RobH) p:5Triage>3Low [22:04:01] bd808: btw, did you create the mw lxc container yourself/ [22:04:02] ? 
[22:04:38] * YuviPanda needs to create a jessie one at some point [22:05:42] !log twentyafterfour Finished scap: new branch: testwiki to 1.26wmf16 (duration: 26m 26s) [22:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:07:05] (03PS1) 10Ori.livneh: Add nutcracker-redis object cache instance, unused for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227586 [22:07:17] (03PS2) 10Ori.livneh: Add nutcracker-redis object cache instance, unused for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227586 [22:07:43] (03CR) 10Ori.livneh: [C: 032] Add nutcracker-redis object cache instance, unused for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227586 (owner: 10Ori.livneh) [22:07:50] (03Merged) 10jenkins-bot: Add nutcracker-redis object cache instance, unused for now [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227586 (owner: 10Ori.livneh) [22:08:24] PROBLEM - HHVM busy threads on mw1203 is CRITICAL 75.00% of data above the critical threshold [115.2] [22:08:25] !log ori Synchronized wmf-config/CommonSettings.php: Iecddb3bf24: Add nutcracker-redis object cache instance, unused for now (duration: 00m 11s) [22:08:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:08:52] PROBLEM - HHVM queue size on mw1203 is CRITICAL 100.00% of data above the critical threshold [80.0] [22:20:04] RECOVERY - puppet last run on db2001 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures [22:22:14] RECOVERY - Apache HTTP on mw1203 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [22:22:46] (03PS1) 10Gergő Tisza: Make the files relocatable [software/sentry] - 10https://gerrit.wikimedia.org/r/227597 [22:23:23] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 66093 bytes in 0.306 second response time [22:23:36] !log on mw1203 restarted hhvm due to StatCache lockup [22:23:41] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:24:31] (03PS2) 10Gergő Tisza: Make the files relocatable [software/sentry] - 10https://gerrit.wikimedia.org/r/227597 [22:26:37] YuviPanda: yes I did. It was based on an existing one that was just missing puppet [22:27:05] I don't remember how I did it but google will tell you I suppose [22:28:04] 7Puppet, 6Labs: Could not find data item labs_recursor - https://phabricator.wikimedia.org/T107205#1490025 (10scfc) The underlying problem (I think) is that `role::puppet::self` down the line includes `puppetmaster::hiera` which sets up: ``` file { '/etc/puppet/hiera.yaml': ensure => $ensure,... [22:30:33] RECOVERY - HHVM busy threads on mw1203 is OK Less than 30.00% above the threshold [76.8] [22:31:03] RECOVERY - HHVM queue size on mw1203 is OK Less than 30.00% above the threshold [10.0] [22:34:53] PROBLEM - Apache HTTP on mw1159 is CRITICAL - Socket timeout after 10 seconds [22:35:06] (03CR) 10Krinkle: [C: 032] Add temporary rl-test.php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227358 (https://phabricator.wikimedia.org/T105255) (owner: 10Krinkle) [22:35:38] (03Merged) 10jenkins-bot: Add temporary rl-test.php entry point [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227358 (https://phabricator.wikimedia.org/T105255) (owner: 10Krinkle) [22:36:20] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Sudo permissions for hdfs user madhuvishy on analytics-hadoop - https://phabricator.wikimedia.org/T104020#1490053 (10RobH) 5Open>3Resolved @madhuvishy, Your access has been granted and is now live. 
[22:36:32] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Sudo permissions for hdfs user madhuvishy on analytics-hadoop - https://phabricator.wikimedia.org/T104020#1490056 (10RobH) a:5Ottomata>3None [22:36:33] (03PS3) 10BBlack: enable ipsec for all codfw caches [puppet] - 10https://gerrit.wikimedia.org/r/219813 (https://phabricator.wikimedia.org/T81543) [22:36:43] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.107 second response time [22:40:26] !log krinkle Synchronized w/rl-test.php: T105255 (duration: 00m 12s) [22:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:43:44] (03PS1) 1020after4: group0 wikis to 1.26wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227599 [22:48:03] (03PS2) 1020after4: group0 wikis to 1.26wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227599 [22:48:29] (03CR) 1020after4: [C: 032] group0 wikis to 1.26wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227599 (owner: 1020after4) [22:48:37] (03Merged) 10jenkins-bot: group0 wikis to 1.26wmf16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227599 (owner: 1020after4) [22:49:15] (03PS1) 10RobH: adding mobrovac to graphoid-admins [puppet] - 10https://gerrit.wikimedia.org/r/227601 [22:49:41] 6operations, 10Datasets-Archiving: Import Wikimania 2015 Videos - https://phabricator.wikimedia.org/T106565#1490078 (10Hydriz) [22:50:01] (03CR) 10jenkins-bot: [V: 04-1] adding mobrovac to graphoid-admins [puppet] - 10https://gerrit.wikimedia.org/r/227601 (owner: 10RobH) [22:51:18] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group0 wikis to 1.26wmf16 [22:51:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:52:32] .... [22:52:39] wtf kind of error is that. 
[22:52:41] 6operations, 6Release-Engineering, 6Zero, 7Mobile, 7Technical-Debt: Pull WikipediaMobileFirefoxOS from mediawiki-config - https://phabricator.wikimedia.org/T107172#1490081 (10greg) Zero team: Can one of you please assist with this request to move the Fireofx OS App out of the mediawiki-config repository?... [22:53:07] 22:49:26 File "/mnt/jenkins-workspace/workspace/operations-puppet-tox-py27/modules/admin/data/data_test.py", line 41, in testDataDotYaml [22:53:07] 22:49:26 'Users assigned that do not exist' [22:53:10] that user totally exists. [22:55:09] robh: no it doesn't :) [22:55:47] oh, parsing the line above i suppse, new check or something [22:55:51] cuz i didnt touch that line =P [22:55:53] (03CR) 10John F. Lewis: [C: 04-1] adding mobrovac to graphoid-admins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227601 (owner: 10RobH) [22:56:09] bleh [22:56:19] freaking spelling... im obviously overdue for a break. [22:56:32] you wrote it right in the commit at least :) [22:57:11] (03PS2) 10RobH: adding mobrovac to graphoid-admins [puppet] - 10https://gerrit.wikimedia.org/r/227601 [22:57:17] that was the odd part =P [22:57:38] (03CR) 10John F. Lewis: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/227601 (owner: 10RobH) [22:57:51] (03CR) 10RobH: [C: 032] adding mobrovac to graphoid-admins [puppet] - 10https://gerrit.wikimedia.org/r/227601 (owner: 10RobH) [22:58:02] robh: linting ftw :D [22:58:29] admin module stops me from fucking it over [22:58:33] nice testing and module. [22:58:47] integration testing works! [22:58:57] you know I did that because kept doing the same damn thing [22:59:18] although hashar jenkins'd it up [23:00:04] RoanKattouw ostriches Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150728T2300). [23:00:04] ebernhardson James_F: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. 
Please be available during the process. [23:00:15] * James_F waves. [23:00:19] 10Ops-Access-Requests, 6operations, 10Graphoid: Allow mobrovac to restart Graphoid - https://phabricator.wikimedia.org/T106814#1490102 (10RobH) 5Open>3Resolved a:3RobH This has been pushed live, and @mobrovac is now in the graphoid-admins group. [23:00:19] * ebernhardson waves [23:00:19] hey [23:00:41] Krenair: Are you SWATing? [23:00:48] yea i like it. admins as a module is so much nicer. [23:00:56] okay [23:01:08] modular admins, I like that idea [23:01:22] robh: the temptation to go on that ticket and go "*@mobrovak" :P [23:01:31] * ToAruShiroiNeko imagines a crane hand lifting admins in server room from task to task [23:01:41] i admit i checked to ensure it linked correctly in the preview pane ;D [23:01:47] (03CR) 10Alex Monk: [C: 032] Disable search feedback survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227496 (https://phabricator.wikimedia.org/T103131) (owner: 10EBernhardson) [23:02:12] (03Merged) 10jenkins-bot: Disable search feedback survey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/227496 (https://phabricator.wikimedia.org/T103131) (owner: 10EBernhardson) [23:02:41] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/227496/ (duration: 00m 12s) [23:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:08:00] Krenair: looks to have worked, survey is gone from en.wiki, i think might have to wait for some cache busting to remove it from en.m.wiki [23:08:16] I admit I only checked desktop [23:08:20] :) [23:15:54] 6operations, 6Phabricator, 10VisualEditor: Unable to load https://phabricator.wikimedia.org/tag/visualeditor/ - https://phabricator.wikimedia.org/T107229#1490160 (10Josve05a) 3NEW [23:17:14] it disappeared from en.m. 
now too, all good [23:17:38] great [23:20:01] !log krenair Synchronized php-1.26wmf16/extensions/VisualEditor/modules/ve-mw/init/ve.init.mw.ApiResponseCache.js: https://gerrit.wikimedia.org/r/#/c/227607/ (duration: 00m 12s) [23:20:02] James_F, ^ [23:20:24] Krenair: let me know when you're done, please; I have something to test on tin. [23:20:37] yeah [23:20:38] I'm done [23:20:44] cool, thanks [23:20:56] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1490181 (10BBlack) For reference, IE7 and 8 on WinXP account for at most 0.6% of total traffic on a global weekly average basis, and the bulk of that tends to look like it comes from office hours... [23:23:08] Krenair: Testing. [23:23:36] Krenair: Yup. [23:32:18] 6operations, 6Reading-Admin, 6Zero, 5Patch-For-Review: Set Content-Type to application/x-web-app-manifest+json for Wikipedia for Firefox OS webapp.manifest - https://phabricator.wikimedia.org/T107165#1490219 (10dr0ptp4kt) @bblack, would it be possible to get https://meta.wikimedia.org/WikipediaMobileFirefo... [23:33:04] 6operations, 6Reading-Admin, 6Zero, 5Patch-For-Review: Set Content-Type to application/x-web-app-manifest+json for Wikipedia for Firefox OS webapp.manifest - https://phabricator.wikimedia.org/T107165#1490225 (10dr0ptp4kt) See Mozilla bug at https://bugzilla.mozilla.org/show_bug.cgi?id=1188593 [23:36:40] !log rebooting cp20xx.codfw.wmnet for kernel updates (downtimed) [23:36:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:39:49] 6operations, 7HTTPS: SSL cert needed for new fundraising events domain - https://phabricator.wikimedia.org/T107059#1490234 (10CCogdill_WMF) > We can delay adding HTTPS for the events site, but we'd still like to commit to eventually correcting that (and the other cases that are outstanding, all of which are bl... 
[23:41:38] 6operations, 6Reading-Admin, 6Zero, 5Patch-For-Review: Set Content-Type to application/x-web-app-manifest+json for Wikipedia for Firefox OS webapp.manifest - https://phabricator.wikimedia.org/T107165#1490235 (10BBlack) Done [23:43:26] (03CR) 10Alex Monk: Add python3 script to populate meta_p (033 comments) [software] - 10https://gerrit.wikimedia.org/r/227505 (https://phabricator.wikimedia.org/T107094) (owner: 10Alex Monk) [23:43:41] (03PS3) 10Alex Monk: Add python3 script to populate meta_p [software] - 10https://gerrit.wikimedia.org/r/227505 (https://phabricator.wikimedia.org/T107094) [23:50:00] !log ori Synchronized php-1.26wmf16/includes/objectcache/RedisBagOStuff.php: I3812ec5a0b: RedisBagOStuff: if no alternatives, skip master link status check (duration: 00m 12s) [23:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:50:24] !log ori Synchronized php-1.26wmf15/includes/objectcache/RedisBagOStuff.php: I3812ec5a0b: RedisBagOStuff: if no alternatives, skip master link status check (duration: 00m 12s) [23:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master