[00:06:07] Krenair: Jenkins has been merging my patch for 51 minutes now. I think it’s broken.
[00:06:30] looking..
[00:06:38] https://integration.wikimedia.org/zuul/
[00:07:43] {{fixed}}
[00:08:01] Krenair: :)
[00:12:01] operations, SUL-Finalization: centralauth database on dbstore1002 is out of date, replication stuck? - https://phabricator.wikimedia.org/T95927#1205389 (Springle) Resync is done and replication rules corrected.
[00:15:24] !log kaldari Synchronized php-1.26wmf1/extensions/MobileFrontend: syncing MobileFrontend for 1.26wmf1 (duration: 00m 14s)
[00:16:13] all done
[00:16:59] (PS1) Ori.livneh: Enable SVG GZip compression on cp1064 [puppet] - https://gerrit.wikimedia.org/r/204000
[00:17:18] bblack: ^
[00:17:45] (CR) BBlack: [C: 1] Enable SVG GZip compression on cp1064 [puppet] - https://gerrit.wikimedia.org/r/204000 (owner: Ori.livneh)
[00:18:13] (PS2) Ori.livneh: Enable SVG GZip compression on cp1064 [puppet] - https://gerrit.wikimedia.org/r/204000
[00:18:24] (CR) Ori.livneh: [C: 2 V: 2] Enable SVG GZip compression on cp1064 [puppet] - https://gerrit.wikimedia.org/r/204000 (owner: Ori.livneh)
[00:19:17] !log enabling gzip compression for SVGs on cp1064 upload varnish; coordinated with bblack
[00:19:31] morebots is dead?
[00:19:35] yeah
[00:20:20] don't worry, it will be back in 216 days when some raid6 volume finishes syncing :)
[00:25:07] PROBLEM - puppet last run on mw2102 is CRITICAL Puppet has 1 failures
[00:25:17] PROBLEM - puppet last run on ytterbium is CRITICAL Puppet has 1 failures
[00:25:47] PROBLEM - puppet last run on mw2099 is CRITICAL Puppet has 1 failures
[00:26:27] PROBLEM - puppet last run on mw2106 is CRITICAL Puppet has 1 failures
[00:26:27] PROBLEM - puppet last run on mw1048 is CRITICAL Puppet has 2 failures
[00:26:27] PROBLEM - puppet last run on db1007 is CRITICAL Puppet has 1 failures
[00:26:36] PROBLEM - puppet last run on mw1240 is CRITICAL Puppet has 1 failures
[00:26:37] PROBLEM - puppet last run on mw2202 is CRITICAL Puppet has 1 failures
[00:26:37] PROBLEM - puppet last run on mw1013 is CRITICAL Puppet has 1 failures
[00:26:46] PROBLEM - puppet last run on mw2179 is CRITICAL Puppet has 1 failures
[00:26:47] PROBLEM - puppet last run on mw1109 is CRITICAL Puppet has 1 failures
[00:26:48] PROBLEM - puppet last run on mw1138 is CRITICAL Puppet has 2 failures
[00:27:07] PROBLEM - puppet last run on mw1245 is CRITICAL Puppet has 1 failures
[00:27:09] ^ the mediawiki appservers are revolting against your gzip patch :P
[00:27:16] PROBLEM - puppet last run on mw1096 is CRITICAL Puppet has 1 failures
[00:31:07] (PS4) Yuvipanda: Tools: Make list of proxies for portgrabber configurable [puppet] - https://gerrit.wikimedia.org/r/201991 (https://phabricator.wikimedia.org/T91954) (owner: Tim Landscheidt)
[00:31:17] (CR) Yuvipanda: [C: 2 V: 2] Tools: Make list of proxies for portgrabber configurable [puppet] - https://gerrit.wikimedia.org/r/201991 (https://phabricator.wikimedia.org/T91954) (owner: Tim Landscheidt)
[00:31:45] (PS2) Yuvipanda: Tools: Make proxylistener project-independent [puppet] - https://gerrit.wikimedia.org/r/202322 (https://phabricator.wikimedia.org/T87387) (owner: Tim Landscheidt)
[00:32:00] * ori boggles
[00:32:00] Could not evaluate: getaddrinfo: Name or service not known Could not retrieve file metadata for puppet:///modules/hhvm/debug/install-pkg-src: getaddrinfo: Name or service not known
[00:32:20] (CR) Yuvipanda: [C: 2 V: 2] Tools: Make proxylistener project-independent [puppet] - https://gerrit.wikimedia.org/r/202322 (https://phabricator.wikimedia.org/T87387) (owner: Tim Landscheidt)
[00:33:24] (PS2) Yuvipanda: gridengine: Remove unused file [puppet] - https://gerrit.wikimedia.org/r/201878 (owner: Tim Landscheidt)
[00:33:32] (CR) Yuvipanda: [C: 2 V: 2] gridengine: Remove unused file [puppet] - https://gerrit.wikimedia.org/r/201878 (owner: Tim Landscheidt)
[00:35:46] (PS3) Yuvipanda: Tools: Allow proxy certificate to be manually managed [puppet] - https://gerrit.wikimedia.org/r/198665 (owner: Tim Landscheidt)
[00:35:56] (CR) Yuvipanda: [C: 2 V: 2] Tools: Allow proxy certificate to be manually managed [puppet] - https://gerrit.wikimedia.org/r/198665 (owner: Tim Landscheidt)
[00:38:38] (PS2) Yuvipanda: dynamicproxy: Provide list of active proxy entries for urlproxy [puppet] - https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: Tim Landscheidt)
[00:39:30] (CR) Yuvipanda: "Should we backport the package ourselves for now?" [puppet] - https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: Tim Landscheidt)
[00:39:46] RECOVERY - puppet last run on mw1013 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures
[00:39:57] RECOVERY - puppet last run on ytterbium is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[00:40:18] RECOVERY - puppet last run on mw1096 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[00:40:38] (CR) Yuvipanda: "Can someone with more knowledge of exim than me +1?" [puppet] - https://gerrit.wikimedia.org/r/148917 (owner: Tim Landscheidt)
[00:41:16] RECOVERY - puppet last run on db1007 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures
[00:41:26] RECOVERY - puppet last run on mw2202 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures
[00:41:36] RECOVERY - puppet last run on mw2179 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[00:41:37] RECOVERY - puppet last run on mw1109 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures
[00:41:37] RECOVERY - puppet last run on mw1138 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[00:41:56] RECOVERY - puppet last run on mw1245 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:42:08] RECOVERY - puppet last run on mw2099 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures
[00:42:49] RECOVERY - puppet last run on mw1048 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[00:42:56] RECOVERY - puppet last run on mw2106 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures
[00:42:57] RECOVERY - puppet last run on mw1240 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:43:16] RECOVERY - puppet last run on mw2102 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[00:43:56] (CR) Yuvipanda: [C: -1] "Also I'm not too much a fan of one liners, I think having them be in separate lines with comments if needed is much better." [puppet] - https://gerrit.wikimedia.org/r/148917 (owner: Tim Landscheidt)
[01:23:17] (PS3) Yuvipanda: dynamicproxy: Provide list of active proxy entries for urlproxy [puppet] - https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: Tim Landscheidt)
[01:32:47] (PS4) Yuvipanda: dynamicproxy: Provide list of active proxy entries for urlproxy [puppet] - https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: Tim Landscheidt)
[01:33:42] (CR) Yuvipanda: [C: 2 V: 2] "Backported :D" [puppet] - https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: Tim Landscheidt)
[01:39:56] (CR) Yuvipanda: "Works great :D You're awesome, etc :)" [puppet] - https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: Tim Landscheidt)
[02:03:03] (PS8) GWicke: Set up /api/v1/ entry point for restbase [puppet] - https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229)
[02:13:01] (CR) GWicke: "Patch 8:" [puppet] - https://gerrit.wikimedia.org/r/203871 (https://phabricator.wikimedia.org/T95229) (owner: GWicke)
[02:28:56] !log l10nupdate Synchronized php-1.25wmf24/cache/l10n: (no message) (duration: 06m 37s)
[02:33:49] !log LocalisationUpdate completed (1.25wmf24) at 2015-04-14 02:32:46+00:00
[02:54:26] !log l10nupdate Synchronized php-1.26wmf1/cache/l10n: (no message) (duration: 05m 40s)
[02:58:52] !log LocalisationUpdate completed (1.26wmf1) at 2015-04-14 02:57:49+00:00
[03:00:17] PROBLEM - puppet last run on mw2021 is CRITICAL Puppet has 1 failures
[03:01:37] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[03:02:07] (PS1) Yuvipanda: tools: Have uwsgi and nodejs attempt to pick a random port [puppet] - https://gerrit.wikimedia.org/r/204009 (https://phabricator.wikimedia.org/T93046)
[03:03:37] (CR) Yuvipanda: [C: 2 V: 2] tools: Have uwsgi and nodejs attempt to pick a random port [puppet] - https://gerrit.wikimedia.org/r/204009 (https://phabricator.wikimedia.org/T93046) (owner: Yuvipanda)
[03:12:57] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[03:16:18] RECOVERY - puppet last run on mw2021 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures
[03:18:40] Puppet, operations: Allow defining hiera data for all of labs - https://phabricator.wikimedia.org/T1291#1205668 (yuvipanda) Open>Resolved hieradata/labs.yaml :)
[03:24:42] (PS1) Yuvipanda: tools: Don't use portgranter anywhere [puppet] - https://gerrit.wikimedia.org/r/204012 (https://phabricator.wikimedia.org/T93046)
[03:28:47] (CR) Yuvipanda: [C: 2 V: 2] tools: Don't use portgranter anywhere [puppet] - https://gerrit.wikimedia.org/r/204012 (https://phabricator.wikimedia.org/T93046) (owner: Yuvipanda)
[03:48:17] Puppet, Tool-Labs: Develop and publish a gridengine provider for Puppet - https://phabricator.wikimedia.org/T95525#1205726 (yuvipanda) p:Low>Lowest I personally still think that a proper solution for the gridengine puppetization problem is to not use gridengine :)
[04:24:59] (PS1) Yuvipanda: tools: Remove remnants of portgranter code [puppet] - https://gerrit.wikimedia.org/r/204014 (https://phabricator.wikimedia.org/T93046)
[05:28:58] operations, SUL-Finalization: centralauth database on dbstore1002 is out of date, replication stuck? - https://phabricator.wikimedia.org/T95927#1205830 (Legoktm) Open>Resolved a:Legoktm thanks, looks good to me!
[05:29:12] operations, SUL-Finalization: centralauth database on dbstore1002 is out of date, replication stuck? - https://phabricator.wikimedia.org/T95927#1205833 (Legoktm) a:Legoktm>Springle
[05:29:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[05:32:32] (CR) Santhosh: [C: 1] Beta: Add requested languages for ContentTranslation [puppet] - https://gerrit.wikimedia.org/r/203812 (owner: KartikMistry)
[05:41:26] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[06:03:47] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Apr 14 06:02:44 UTC 2015 (duration 2m 43s)
[06:29:57] PROBLEM - puppet last run on cp1061 is CRITICAL Puppet has 1 failures
[06:30:07] PROBLEM - puppet last run on mw2136 is CRITICAL puppet fail
[06:30:28] PROBLEM - puppet last run on mw2104 is CRITICAL Puppet has 1 failures
[06:30:58] PROBLEM - puppet last run on mw2013 is CRITICAL Puppet has 1 failures
[06:32:07] PROBLEM - puppet last run on mw2127 is CRITICAL Puppet has 2 failures
[06:32:17] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 1 failures
[06:33:07] PROBLEM - puppet last run on mw2093 is CRITICAL Puppet has 1 failures
[06:34:37] PROBLEM - puppet last run on mw2212 is CRITICAL Puppet has 2 failures
[06:34:37] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 3 failures
[06:34:47] PROBLEM - puppet last run on mw1118 is CRITICAL Puppet has 1 failures
[06:35:18] PROBLEM - puppet last run on mw1144 is CRITICAL Puppet has 1 failures
[06:35:46] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures
[06:36:37] PROBLEM - puppet last run on wtp2006 is CRITICAL puppet fail
[06:45:47] RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures
[06:46:16] RECOVERY - puppet last run on mw2093 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:46:17] RECOVERY - puppet last run on cp1061 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:46:46] RECOVERY - puppet last run on mw1144 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:46:56] RECOVERY - puppet last run on mw2127 is OK Puppet is currently enabled, last run 36 seconds ago with 0 failures
[06:46:57] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:47:16] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:46] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures
[06:47:47] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures
[06:47:56] RECOVERY - puppet last run on mw1118 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:57] RECOVERY - puppet last run on mw2136 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures
[06:48:27] RECOVERY - puppet last run on mw2104 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:54:36] RECOVERY - puppet last run on wtp2006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:04] <_joe_> we bow before you, mighty mod_passenger
[07:37:57] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[07:49:27] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60610 bytes in 0.686 second response time
[08:01:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[08:09:17] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60631 bytes in 1.164 second response time
[08:13:00] (CR) Hashar: "vim --version comparison between the two packages https://phabricator.wikimedia.org/M52" [puppet] - https://gerrit.wikimedia.org/r/203342 (owner: Hashar)
[08:14:26] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[08:20:42] (CR) Hashar: "The Ubuntu package has:" [puppet] - https://gerrit.wikimedia.org/r/203342 (owner: Hashar)
[08:24:34] !log Attached Manfred Strumpf@enwiki to the global account of the same name
[08:26:01] Is the bot logging stuff down still?
[08:26:12] My deploys yesterday also didn't make it into the SAL
[08:26:19] that's awry :(
[08:27:27] hoo: there was a labs outage yesterday :/
[08:27:41] Can someone restart that bot?
[08:28:10] I prefer to have these things logged
[08:29:51] hashar: ^
[08:30:57] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 870.724782885
[08:31:41] (Abandoned) Faidon Liambotis: Clean up the mess that is SSL certificate installation [puppet] - https://gerrit.wikimedia.org/r/15561 (owner: Catrope)
[08:37:13] (CR) Faidon Liambotis: [C: -1] "Drop it from the other two as well :)" [dns] - https://gerrit.wikimedia.org/r/196605 (https://phabricator.wikimedia.org/T92438) (owner: Dzahn)
[08:38:34] paravoid: Can you bring morebots back?
[08:45:36] * hoo looks for someone to walk through https://wikitech.wikimedia.org/wiki/Morebots#Example:_restart_the_ops_channel_morebot
[08:45:50] ah yeah stupid morebot
[08:46:01] will do
[08:46:11] the morebot are horrible :(
[08:47:32] oh, I filed https://phabricator.wikimedia.org/T96002 before noticing that icinga already noticed
[08:48:10] $ jstart -N production /usr/lib/adminbot/adminlogbot.py --config ./confs/production-logbot.py
[08:48:11] Your job 9928708 ("production") has been submitted
[08:48:27] !log Testing log bot
[08:48:32] Gitblit needs to be kicked, btw
[08:48:37] Logged the message, Master
[08:48:43] hoo: seems !log is back
[08:49:33] !log Attached Manfred Strumpf@enwiki to the global account of the same name
[08:49:36] Logged the message, Master
[08:49:41] Thanks :)
[08:55:21] <_joe_> hoo: I'll kick it, FWIW
[08:55:27] :)
[08:57:14] <_joe_> uuuh it's in a bad shape
[08:59:23] <_joe_> INFO Found more URL path parts then expected, these will be ignored. Url: 'http://git.wikimedia.org/tree/mediawiki/extensions/Math.git/0aa8d8ccd40ad146678de7291e959d77eec52c55', mountpath: 'tree', urlPath: 'mediawiki/extensions/Math.git/0aa8d8ccd40ad146678de7291e959d77eec52c55', expected 3 parameters was 4
[08:59:37] <_joe_> ERROR Failed to find commit "extensions" in mediawiki! for anonymous
[08:59:40] <_joe_> sigh
[08:59:57] operations, RESTBase, VisualEditor, Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1205990 (faidon) >>! In T95229#1192818, @GWicke wrote: > - on high-latency connections, setting up a brand new TLS connection to a relatively obscure host nam...
[09:00:36] <_joe_> !log restarting gitblit
[09:00:39] Logged the message, Master
[09:04:07] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60610 bytes in 0.885 second response time
[09:05:43] (CR) Filippo Giunchedi: ipsec-global: fix bug in non-verbose mode, exit if not root (8 comments) [puppet] - https://gerrit.wikimedia.org/r/202975 (https://phabricator.wikimedia.org/T88536) (owner: Gage)
[09:07:09] operations, MediaWiki-General-or-Unknown, MediaWiki-JobRunner, Graphite: jobrunner metrics audit - https://phabricator.wikimedia.org/T95913#1205997 (fgiunchedi) a:fgiunchedi>None
[09:07:38] (CR) Muehlenhoff: "The use of eval() seems ok here, anyone with the privileges to alter the strongswan configuration file might just as well execute arbitrar" [puppet] - https://gerrit.wikimedia.org/r/202975 (https://phabricator.wikimedia.org/T88536) (owner: Gage)
[09:08:15] YuviPanda: did we figure anything out with this? https://phabricator.wikimedia.org/T92273 (wildcard domain for reflex)
[09:30:48] operations: Convert ircecho init script to an upstart job - https://phabricator.wikimedia.org/T95055#1206042 (MoritzMuehlenhoff) a:MoritzMuehlenhoff I'd like to use this ticket as an exercise to go through the internal package build/reprepro process by building it for jessie-wikimedia (and providing a syst...
[09:33:56] (CR) Hashar: "recheck" [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/203984 (https://phabricator.wikimedia.org/T95894) (owner: BryanDavis)
[09:34:16] (CR) jenkins-bot: [V: -1] Use tox lint yaml and run flake8 [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/203984 (https://phabricator.wikimedia.org/T95894) (owner: BryanDavis)
[09:46:02] (PS1) Hashar: contint: +libxml2-dev +libxslt1-dev [puppet] - https://gerrit.wikimedia.org/r/204031 (https://phabricator.wikimedia.org/T95894)
[09:52:02] (CR) Hashar: [C: 1 V: 2] "Cherry picked on integration puppet master." [puppet] - https://gerrit.wikimedia.org/r/204031 (https://phabricator.wikimedia.org/T95894) (owner: Hashar)
[09:52:26] (CR) Hashar: "recheck" [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/203984 (https://phabricator.wikimedia.org/T95894) (owner: BryanDavis)
[09:54:33] (PS1) Nikerabbit: Re-enable Special:SupportedLanguages [mediawiki-config] - https://gerrit.wikimedia.org/r/204032 (https://phabricator.wikimedia.org/T54728)
[09:55:39] (CR) Hashar: Use tox lint yaml and run flake8 (1 comment) [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/203984 (https://phabricator.wikimedia.org/T95894) (owner: BryanDavis)
[09:55:47] (PS4) Hashar: Use tox lint yaml and run flake8 [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/203984 (https://phabricator.wikimedia.org/T95894) (owner: BryanDavis)
[09:56:10] (CR) Hashar: [C: 2] "Excellent, thanks a ton to have added tests!" [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/203984 (https://phabricator.wikimedia.org/T95894) (owner: BryanDavis)
[09:58:09] (Merged) jenkins-bot: Use tox lint yaml and run flake8 [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/203984 (https://phabricator.wikimedia.org/T95894) (owner: BryanDavis)
[10:01:37] (PS2) Hashar: Fix PEP-8 style [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/203985 (owner: BryanDavis)
[10:04:59] (CR) Hashar: [C: 1] Fix PEP-8 style [wikimedia/bots/jouncebot] - https://gerrit.wikimedia.org/r/203985 (owner: BryanDavis)
[10:37:12] (PS1) Giuseppe Lavagetto: runcommand: Twisted 13.x compatibility [debs/pybal] - https://gerrit.wikimedia.org/r/204035
[10:38:53] (PS2) Giuseppe Lavagetto: runcommand: Twisted 13.x compatibility [debs/pybal] - https://gerrit.wikimedia.org/r/204035
[10:41:47] <_joe_> mmh a lot of automatic whitespace reaping, sorry
[10:45:26] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[10:47:50] :)
[10:48:03] i used eclipse at the time, it's horrible like that
[10:48:10] no gerrit to highlight it then ;)
[10:48:38] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60611 bytes in 5.572 second response time
[10:56:57] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[10:58:27] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60611 bytes in 1.018 second response time
[11:03:17] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0]
[11:03:47] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[11:16:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[11:23:27] PROBLEM - puppet last run on mw2099 is CRITICAL puppet fail
[11:35:36] PROBLEM - puppet last run on mw2153 is CRITICAL puppet fail
[11:35:47] _joe_: when you have a minute - https://gerrit.wikimedia.org/r/#/c/203886/
[11:36:47] <_joe_> mobrovac: oh, yes, looks like it's needed
[11:38:47] _joe_: an alternative would be to change ordered_yaml to emit nil for undef, but i'm not sure whether that's desirable in the general case
[11:39:04] <_joe_> yeah let's think about that for a second :)
[11:39:09] (PS2) Giuseppe Lavagetto: service::node: fix the look-up of undefined variables [puppet] - https://gerrit.wikimedia.org/r/203886 (https://phabricator.wikimedia.org/T95533) (owner: Mobrovac)
[11:39:09] :)
[11:39:25] <_joe_> I'm merging it
[11:40:18] <_joe_> mobrovac: on second thoughts, what if cvars[k] is false?
[11:40:33] Hej hej, anyone around to restart git.wikimedia.org (or already on that)? See https://phabricator.wikimedia.org/T96002 and icinga's line above:
[11:40:34] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[11:40:37] _joe_: re ordered_yaml, https://github.com/wikimedia/operations-puppet/blob/production/modules/wmflib/lib/puppet/parser/functions/ordered_yaml.rb#L30 already does that for :undef, it would suffice to add 'undef'
[11:40:56] _joe_: ups, right
[11:40:58] <_joe_> andre__: I already restarted it once this morning
[11:41:07] will fix it
[11:41:09] to cover false
[11:41:18] _joe_, :(
[11:41:20] thanks
[11:41:29] (CR) Giuseppe Lavagetto: [C: -1] service::node: fix the look-up of undefined variables (1 comment) [puppet] - https://gerrit.wikimedia.org/r/203886 (https://phabricator.wikimedia.org/T95533) (owner: Mobrovac)
[11:41:37] RECOVERY - puppet last run on mw2099 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[11:43:07] <_joe_> mobrovac: yeah, it's probably a bug there, but let's do that later maybe
[11:43:15] kk
[11:43:17] <_joe_> for now let's hotfix this
[11:43:34] <_joe_> (re: ordered_yaml)
[11:44:57] (PS3) Mobrovac: service::node: fix the look-up of undefined variables [puppet] - https://gerrit.wikimedia.org/r/203886 (https://phabricator.wikimedia.org/T95533)
[11:45:05] _joe_: there ^^
[11:45:48] it's rather dumb though that lookupvar returns undef and not nil
[11:47:46] (CR) Giuseppe Lavagetto: [C: -1] "typo to fix, then we're GTG" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/203886 (https://phabricator.wikimedia.org/T95533) (owner: Mobrovac)
[11:49:00] <_joe_> mobrovac: puppet is rather dumb, I agree
[11:49:39] oh _joe_, apart from the typo you found, i also need to surround keys in quotes ;P
[11:49:42] fixing now
[11:50:21] <_joe_> oh lol, right, that is ruby
[11:50:38] (PS4) Mobrovac: service::node: fix the look-up of undefined variables [puppet] - https://gerrit.wikimedia.org/r/203886 (https://phabricator.wikimedia.org/T95533)
[11:51:34] (CR) Giuseppe Lavagetto: [C: 2] service::node: fix the look-up of undefined variables [puppet] - https://gerrit.wikimedia.org/r/203886 (https://phabricator.wikimedia.org/T95533) (owner: Mobrovac)
[11:55:27] RECOVERY - puppet last run on mw2153 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[11:58:13] <_joe_> mobrovac: merged, a noop in production
[11:58:20] <_joe_> that's expected, right?
[11:58:25] yup
[11:58:34] the change should be visible in beta only
[11:58:36] * mobrovac checks ...
[11:58:38] <_joe_> ok I got it right then
[11:58:47] <_joe_> mobrovac: well beta will get merged "eventually"
[11:58:54] <_joe_> in the next hour I guess
[11:59:43] <_joe_> andre__: I see git.wm.o is down again, restarting it again FWIW
[12:01:11] <_joe_> looks like something is asking gitblit some very heavy processing, repeatedly, and that seems to starve it.
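The undef-versus-nil exchange between 11:38 and 11:50 above can be sketched in plain Ruby. This is only an illustration, not the code from the patch or from ordered_yaml.rb: the helper name normalize_undef and the sample cvars hash are hypothetical, and it assumes undefined Puppet values show up either as the symbol :undef or as the string 'undef'. Both are mapped to nil, while a legitimate false (the case _joe_ raised) is kept.

```ruby
require 'yaml'

# Hypothetical helper (not from the Puppet tree): recursively replace the
# markers assumed to stand for "undefined" with nil, so YAML emits a null.
UNDEF_MARKERS = [:undef, 'undef'].freeze

def normalize_undef(value)
  case value
  when Hash
    # Rebuild the hash, normalizing each value; keys are left untouched.
    value.each_with_object({}) { |(k, v), out| out[k] = normalize_undef(v) }
  when Array
    value.map { |v| normalize_undef(v) }
  else
    # Compare against the markers explicitly rather than testing truthiness,
    # so that false survives.
    UNDEF_MARKERS.include?(value) ? nil : value
  end
end

# Sample data, made up for illustration.
cvars = {
  'port'    => 8080,
  'debug'   => false,
  'proxy'   => :undef,
  'backend' => 'undef',
  'hosts'   => ['a', :undef],
}

puts normalize_undef(cvars).to_yaml
```

A truthiness test such as `value || default` would discard false along with the undefined markers, which is why the sketch checks for the markers directly.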
[12:01:31] <_joe_> I don't have time to investigate this properly right now though, sorry
[12:01:39] it's been broken forever
[12:01:47] <_joe_> I know
[12:01:55] <_joe_> today is worse than usual though
[12:02:04] has no owner, no one in the org really cares about it anymore
[12:02:13] <_joe_> !log restarting gitblit
[12:02:39] <_joe_> if that's the case we should just turn it off?
[12:03:26] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60610 bytes in 0.257 second response time
[12:05:57] users use it
[12:14:50] akosiaris: around?
[12:26:23] operations: merge cassandra submodule into operations/puppet - https://phabricator.wikimedia.org/T96016#1206302 (fgiunchedi) NEW a:fgiunchedi
[12:26:38] operations, Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1206310 (mobrovac) NEW
[12:28:57] re: https://phabricator.wikimedia.org/T96016 and merging the cassandra submodule, I'm skipping pushing every single commit to review
[12:29:48] operations: merge cassandra submodule into operations/puppet - https://phabricator.wikimedia.org/T96016#1206318 (mobrovac) Will this process ensure `modules/cassandra` to be correctly updated on the puppetmaster as well ? See https://gerrit.wikimedia.org/r/#/c/196335/ for a discussion of problems we ran into...
[12:30:13] godog: ^^
[12:31:48] operations: merge cassandra submodule into operations/puppet - https://phabricator.wikimedia.org/T96016#1206319 (fgiunchedi)
[12:31:49] Blocked-on-Operations, operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: move cassandra submodule into puppet repo - https://phabricator.wikimedia.org/T92560#1206320 (fgiunchedi)
[12:31:55] mobrovac: hah, thanks completely forgot about that one
[12:34:58] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[12:37:52] operations, MediaWiki-extensions-Graph, Services, service-template-node, service-runner: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1206326 (mobrovac)
[12:37:56] operations, Services, service-template-node, Patch-For-Review: Unify SCA Service Puppet Modules / Roles - https://phabricator.wikimedia.org/T95533#1206322 (mobrovac) Open>Resolved The module has been merged into ops/puppet. Additionally, citoid has been switched to use it and has been confirme...
[12:37:59] operations, Mobile-Apps, Services: Deployment of Mobile App's service on the SCA cluster - https://phabricator.wikimedia.org/T92627#1206325 (mobrovac)
[12:38:03] operations, Deployment-Systems, Release-Engineering, Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1206324 (mobrovac)
[12:38:07] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60631 bytes in 1.203 second response time
[12:43:05] (PS1) Muehlenhoff: * Simplify package build, also the stepping stone for adding a systemd unit file (T95055) [debs/ircecho] - https://gerrit.wikimedia.org/r/204045
[12:43:46] \o/
[12:45:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[12:55:31] (PS11) BBlack: cache config: remove decommed nodes list [puppet] - https://gerrit.wikimedia.org/r/203557
[12:55:34] (PS1) BBlack: move parsoid applayer backends from $active_nodes to $backends [puppet] - https://gerrit.wikimedia.org/r/204046
[12:56:38] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60631 bytes in 6.819 second response time
[12:59:04] operations, Services, Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1206346 (mobrovac) >>! In T94821#1174154, @GWicke wrote: > We could consider directly using the test/example request/response pairs from the swagger spec. An attribute could m...
[13:01:26] (PS1) BBlack: parsoid applayer backends: reuse $lvs::config for prod addrs [puppet] - https://gerrit.wikimedia.org/r/204047
[13:05:07] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[13:06:33] operations, ops-codfw: rack/wire/initial setup of db2043-db2070 - https://phabricator.wikimedia.org/T89368#1206351 (mark) @RobH @papaul: yes, let's move ahead and use D6 at this point.
[13:06:44] operations, ops-codfw: rack/wire/initial setup of db2043-db2070 - https://phabricator.wikimedia.org/T89368#1206352 (mark) a:mark>None
[13:07:08] Blocked-on-Operations, operations, RESTBase, RESTBase-Cassandra, Patch-For-Review: move cassandra submodule into puppet repo - https://phabricator.wikimedia.org/T92560#1206356 (fgiunchedi) I've attempted this too though via a merge commit as described in https://phabricator.wikimedia.org/T96016...
[13:08:06] (CR) BBlack: [C: -1] "IMHO, the bash scripts are too complex to be bash scripts. I'm not sure if we ever defined a line in the sand on this, but the emerging c" [puppet] - https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: coren)
[13:11:25] operations, Services, Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1206363 (mobrovac) Regarding the monitoring check responses, I think that for the first iteration we should make it as flexible as possible. For example, ``` response: { st...
[13:16:48] (CR) coren: "FWIW, I chose to do shell scripts because of the requirement to invoke a number of system utilities in a chain (mountpoint, lvs, ionice, r" [puppet] - https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: coren)
[13:18:16] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60610 bytes in 2.173 second response time
[13:26:18] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2887.9529178
[13:29:37] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=64.80 Read Requests/Sec=46.50 Write Requests/Sec=18.80 KBytes Read/Sec=676.00 KBytes_Written/Sec=191.90
[13:30:00] operations, MediaWiki-Debug-Logging, Release-Engineering, Security-Team, Patch-For-Review: Store unsampled API and XFF logs - https://phabricator.wikimedia.org/T88393#1206390 (fgiunchedi) just for reference, sampling is defined in $wgDebugLogGroups in InitialiseSettings.php and currently at 1000...
[13:30:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[13:31:16] (CR) Mark Bergsma: [C: -1] "That would make a lot of sense to do in Python, yes..." [puppet] - https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: coren)
[13:33:16] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60632 bytes in 3.220 second response time
[13:34:28] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=252.30 Read Requests/Sec=140.90 Write Requests/Sec=26.30 KBytes Read/Sec=17955.20 KBytes_Written/Sec=527.90
[13:36:18] (CR) Alexandros Kosiaris: [C: 2] Beta: Add requested languages for ContentTranslation [puppet] - https://gerrit.wikimedia.org/r/203812 (owner: KartikMistry)
[13:36:26] (PS4) Alexandros Kosiaris: Beta: Add requested languages for ContentTranslation [puppet] - https://gerrit.wikimedia.org/r/203812 (owner: KartikMistry)
[13:42:52] (CR) Alexandros Kosiaris: [C: 2] Beta: Add requested languages for ContentTranslation [puppet] - https://gerrit.wikimedia.org/r/203812 (owner: KartikMistry)
[13:44:37] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=128.30 Read Requests/Sec=75.70 Write Requests/Sec=22.80 KBytes Read/Sec=553.60 KBytes_Written/Sec=704.30
[13:45:06] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds
[13:46:36] PROBLEM - nutcracker port on silver is CRITICAL - Socket timeout after 2 seconds
[13:46:37] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60632 bytes in 0.833 second response time
[13:48:08] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[13:51:43] (PS12) BBlack: cache config: remove decommed nodes list [puppet] - https://gerrit.wikimedia.org/r/203557
[13:51:45] (PS2) BBlack: move parsoid applayer backends from $active_nodes to $backends [puppet] - https://gerrit.wikimedia.org/r/204046
[13:51:47] (PS2) BBlack: parsoid applayer backends: reuse $lvs::config for prod addrs [puppet] - https://gerrit.wikimedia.org/r/204047
[13:51:49] (PS1) BBlack: get rid of $active_nodes[api] (unused) [puppet] - https://gerrit.wikimedia.org/r/204052
[13:53:39] operations, Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1206426 (akosiaris) I'd be happy if SCA moves on to Jessie relatively soon. So +1 from me. That being said, we first need to make sure SCA services actually work on Jessie. So in alphabetical order we need to ma...
[13:59:40] !log barium down for disk swap
[13:59:50] Logged the message, Master
[14:04:23] operations, Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1206446 (Joe) One thing that we'll need is to make all those services to use base::service_unit to manage their services, and write systemd unit files as well as the current upstart jobs.
[14:08:46] (Abandoned) Andrew Bogott: Explicitly set default_schedule_zone to 'nova' [puppet] - https://gerrit.wikimedia.org/r/203872 (owner: Andrew Bogott)
[14:09:22] (Abandoned) Andrew Bogott: Add labvirt1001 to the compute pool [puppet] - https://gerrit.wikimedia.org/r/203666 (owner: Andrew Bogott)
[14:09:45] (PS13) BBlack: cache config: remove decommed nodes list [puppet] - https://gerrit.wikimedia.org/r/203557
[14:09:47] (PS2) BBlack: get rid of $active_nodes[api] (unused) [puppet] - https://gerrit.wikimedia.org/r/204052
[14:10:28] (Abandoned) BBlack: parsoid applayer backends: reuse $lvs::config for prod addrs [puppet] - https://gerrit.wikimedia.org/r/204047 (owner: BBlack)
[14:10:56] Krinkle or Krenair can I get a +2 for https://gerrit.wikimedia.org/r/#/c/202160/?
[14:15:18] (03PS1) 10Muehlenhoff: Add a systemd unit file (T95055) [debs/ircecho] - 10https://gerrit.wikimedia.org/r/204054 [14:18:53] (03CR) 10BBlack: [C: 032] "no-op in puppet-compiler" [puppet] - 10https://gerrit.wikimedia.org/r/204046 (owner: 10BBlack) [14:19:01] (03PS1) 10Jgreen: disable customer account creation in OTRS for phab T96023 [puppet] - 10https://gerrit.wikimedia.org/r/204055 [14:19:48] (03PS2) 10Jgreen: disable customer account creation in OTRS for phab T96023 [puppet] - 10https://gerrit.wikimedia.org/r/204055 [14:20:24] (03CR) 10Jgreen: [C: 032 V: 031] disable customer account creation in OTRS for phab T96023 [puppet] - 10https://gerrit.wikimedia.org/r/204055 (owner: 10Jgreen) [14:23:47] PROBLEM - Host barium is DOWN: PING CRITICAL - Packet loss = 100% [14:24:37] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=148.40 Read Requests/Sec=130.20 Write Requests/Sec=12.20 KBytes Read/Sec=4028.40 KBytes_Written/Sec=156.25 [14:26:14] Jeff_Green: that's you right? [14:26:19] barium I mean [14:27:27] it's chris, replacing a failed HDD [14:28:32] ha, barium is a noisy bugger in terms of monitoring, got watchmoused too [14:28:59] 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1206502 (10GWicke) Alternatively, we could set up two Jessie boxes as `scb` and then use that for new services / migrate over existing services one by one. > One thing that we'll need is to make all those services... 
[14:29:01] I silenced icinga for 15mins but I ran overtime adding the new disk back to the raid cfg [14:29:01] okey dokey [14:29:05] its booting now [14:31:16] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=4.40 Read Requests/Sec=0.00 Write Requests/Sec=4.80 KBytes Read/Sec=0.00 KBytes_Written/Sec=36.85 [14:31:33] 6operations, 5Patch-For-Review: Convert ircecho init script to a systemd unit - https://phabricator.wikimedia.org/T95055#1206509 (10MoritzMuehlenhoff) [14:31:45] (03CR) 10Giuseppe Lavagetto: [C: 031] cache config: remove decommed nodes list [puppet] - 10https://gerrit.wikimedia.org/r/203557 (owner: 10BBlack) [14:35:17] RECOVERY - Host barium is UPING OK - Packet loss = 0%, RTA = 0.74 ms [14:35:42] Krinkle, shall we just go ahead and merge andrewbogott's commit? [14:35:53] ci meeting in #wikimedia-office [14:39:28] oh, ok [14:42:05] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1206538 (10GWicke) > How sure are you about this? While a performance impact sounds plausible, 1.5s feels like too much. NavTiming should have the data to back... [14:43:18] 6operations: deploy db2043-2066 - https://phabricator.wikimedia.org/T89365#1206542 (10Papaul) [14:43:20] 6operations, 10ops-codfw: rack/wire/initial setup of db2043-db2070 - https://phabricator.wikimedia.org/T89368#1206540 (10Papaul) 5Open>3stalled [14:46:24] (03CR) 10Alexandros Kosiaris: [C: 032] contint: +libxml2-dev +libxslt1-dev [puppet] - 10https://gerrit.wikimedia.org/r/204031 (https://phabricator.wikimedia.org/T95894) (owner: 10Hashar) [14:46:41] 6operations, 10RESTBase, 10RESTBase-Cassandra: Set up multiple Cassandra instances per hardware node - https://phabricator.wikimedia.org/T95253#1206549 (10fgiunchedi) a:5Joe>3fgiunchedi I'm looking at hw provisioning as well and this is related, I'll pick it up [14:51:07] I suppose I'm SWATting my own patch this morning. 
Maybe I'll be lucky and no one else will turn up last-minute ;) [14:52:54] akosiaris: If you have a moment, I have a puppet question. In short, I need labs puppet certs to get cleaned up when the instances are deleted. [14:53:10] It’s easy enough to hook instance deletion, but the hooks will most likely not execute on the puppet master. [14:53:19] Would you use the puppet rest api for that, or something else? [14:56:10] andrewbogott: could the hook be to run the command via ssh on the master? [14:56:27] mutante: yes, that’s an option if there’s a keypair set up [14:56:38] that would make the salt case trivial as well. [14:56:51] But, I wonder if a more correct solution is something like this: https://ask.puppetlabs.com/question/3347/revoke-and-delete-cert-via-the-rest-api/ [14:56:56] 6operations, 6Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1206571 (10mobrovac) >>! In T96017#1206446, @Joe wrote: > One thing that we'll need is to make all those services to use base::service_unit to manage their services, and write systemd unit files as well as the curr... [14:57:27] hm, of course doing puppet the right way doesn’t help with salt [14:58:56] 6operations: Allow access to https://archiva.wikimedia.org from analytics nodes. - https://phabricator.wikimedia.org/T95712#1206572 (10akosiaris) a:3akosiaris [14:59:45] mutante: can you point me at an existing example of a remote-execution keypair like that? [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, anomie: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150414T1500). Please do the needful. [15:00:16] * anomie does the needful. [15:00:22] andrewbogott: maybe the hook could just be adding it to a list of deleted instances, and then separately you run a script on the master that gets the list and revokes puppet and salt keys/certs [15:00:56] mutante: polling on the master? Could be a bit racy. 
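[annotation] The "list of deleted instances, cleaned up by a script on the master" idea floated above could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names and the example hostname are hypothetical, and it assumes the Puppet 3.x `puppet cert clean` command and `salt-key -d` are the cleanup tools on the master. It only builds the command list and does not run anything.

```python
from typing import Iterable, List


def cleanup_commands(fqdn: str) -> List[List[str]]:
    """Commands that would drop one instance's puppet cert and salt key.

    Assumption: Puppet 3.x CA tooling ("puppet cert clean") and salt-key
    are both available on the master.
    """
    return [
        ["puppet", "cert", "clean", fqdn],  # revoke and remove the signed cert
        ["salt-key", "-y", "-d", fqdn],     # delete the minion key, no prompt
    ]


def plan_cleanup(deleted: Iterable[str]) -> List[List[str]]:
    """Turn a list of deleted instance names into an ordered command plan.

    Blank entries are skipped and duplicates are collapsed, so a hook that
    appends the same name twice does not trigger a second cleanup.
    """
    seen = set()
    plan: List[List[str]] = []
    for raw in deleted:
        fqdn = raw.strip()
        if not fqdn or fqdn in seen:
            continue
        seen.add(fqdn)
        plan.extend(cleanup_commands(fqdn))
    return plan


if __name__ == "__main__":
    # Hypothetical instance name, for illustration only.
    for cmd in plan_cleanup(["instance-x.eqiad.wmflabs\n", "instance-x.eqiad.wmflabs"]):
        print(" ".join(cmd))
```

This does nothing about the race raised in the channel: if an instance is deleted and recreated under the same name before the list is processed, the stale cert is still in place when the new instance first checks in.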
[15:01:37] it's not time critical, is it. it's just to clean up the list every once in a while? [15:01:50] Seriously, Jenkins? Estimated time 20 minutes? [15:02:18] mutante: if I delete instance x and then recreate it with the same name, I’ll get cert errors if the old cert isn’t cleaned up. [15:02:28] 6operations, 6Services, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1206591 (10GWicke) @mobrovac, the service template does not need to be fully spec-driven to be able to expose a simple spec for monitoring purposes. A minimal spec looks like th... [15:02:45] anomie: just waiting on mediawiki-phpunit-zend ;) [15:02:49] Krenair: So I don't know what that commit is about [15:02:51] s/;)/:(/ [15:03:11] Krenair: I'm also not on the related Task [15:03:17] ah. [15:03:20] ok, never mind [15:03:28] greg-g: Did that test get so much worse suddenly? It didn't used to be *that* slow. [15:04:23] Krenair: Oh, it's that security bug [15:04:31] mutante: I guess I should really use rabbitmq for this. I’ll just have to re-learn how to do that [15:04:34] * Krinkle looks [15:05:23] Krinkle: thanks [15:05:33] Shoot, missed it. [15:05:45] I'm on SF time this week, so I'm still useless. [15:06:04] anomie: the switch from pg to mysql slowed it down, especially given that it's a "unit test" job that runs a lot of non-unit tests [15:06:26] but, since we don't care about pg, and only about mysql, well... it only makes sense :) [15:06:38] greg-g: It used to use pg? [15:06:43] yeah [15:06:48] I thought it used sqlite [15:06:54] er, yeah, that's it [15:06:59] sorry, still waking up [15:07:21] * anomie likes PG, but lacks time to be responsible for it in MediaWiki. 
[15:09:45] https://phabricator.wikimedia.org/T2384 PostgreSQL/pgsql support (tracking) [15:10:18] Sep 6 2004, domas.mituzas wrote: PostgreSQL support is experimental yet, but is getting better and better :) [15:11:12] catrope said : It's a tracking bug, so it's not supposed to be closed at all, ever. [15:12:58] greg-g: If merging a single patch now takes 20 minutes, we should probably adjust SWAT from "max 8 patches" to some sort of point system where core and extension changes cost more than config changes (such that there's only enough points for about 3 core or extension patches). :( [15:13:33] greg-g: And a full-scap would probably cost all the points. [15:13:46] 6operations, 10ops-eqiad: dysprosium memory failure - https://phabricator.wikimedia.org/T95423#1206671 (10Cmjohnson) 5Open>3Resolved replaced the DIMM and the error has gone away. Sending DIMM back Track number for return 9611918 2393026 47987929 [15:14:52] 6operations, 3Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1206674 (10Cmjohnson) [15:14:54] 6operations, 10ops-eqiad: labnodepool1001 setup tasks: labels/ports/racktables - https://phabricator.wikimedia.org/T95048#1206673 (10Cmjohnson) 5Open>3Resolved [15:16:02] anomie: agree. Scap yesterday took 41m 12s in my naiveté I started it at 8:34 [15:17:03] * anomie is glad that Jenkins was overestimating, but 15 minutes is still too long. [15:18:08] !log anomie Synchronized php-1.26wmf1/includes/page/Article.php: SWAT: Continued debugging of [[phab:T92046]] ([[gerrit:204050]]) (duration: 00m 11s) [15:18:09] anomie: ^ Check please [15:18:13] Logged the message, Master [15:18:33] anomie: Editing isn't broken, anwyay. [15:18:44] * anomie is done with the needful [15:20:10] mutante: hasharConfcall: DNS has regressed again. " Couldn't resolve host name saucelabs.com" T92351 [15:20:32] 6operations, 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) 
names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1206701 (10Krinkle) 5declined>3Open https://integration.wikimedia.org/ci/job/npm/2590/console ``` 00... [15:20:43] Krinkle: is that on labs or prod? [15:20:51] andrewbogott: CI [15:20:56] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=105.30 Read Requests/Sec=21.80 Write Requests/Sec=81.90 KBytes Read/Sec=414.00 KBytes_Written/Sec=888.35 [15:21:01] ok, I will ignore for now then :) [15:21:12] andrewbogott: Which means projects can't merge any changes. [15:21:17] Again. [15:21:24] Krinkle: you mean the CI project in labs? [15:21:27] yes [15:21:29] Ah! [15:21:37] ok, can you direct me to a commandline example? [15:21:47] (i have a meeting in 5, so am only barely useful for this) [15:21:47] It hasn't changed since last month. [15:22:01] ... [15:22:03] It was fixed with a cherrypicked patch but seems to have regressed [15:22:11] https://phabricator.wikimedia.org/T92351#1206701 [15:22:26] "The issue has been worked around on the CI side" "This has been worked around in beta" unfortunately it doesn't say _how_ [15:22:36] mutante: With your patch [15:22:38] https://gerrit.wikimedia.org/r/#/c/196731/ [15:22:42] no, i abandoned that [15:22:58] Yeah, that doesn't change our puppetmaster [15:23:00] it wasn't the right solution [15:23:04] It works [15:23:08] per coren [15:23:17] CI doesn't care about labsdb domains [15:23:46] 6operations, 10ops-eqiad: ms-be1005.eqiad.wmnet: slot=5 dev=sdf failed - https://phabricator.wikimedia.org/T95268#1206722 (10Cmjohnson) return disk info fedex tracking9611918 2393026 47981446 [15:23:52] We can't block projects for months from making a simple TCP connection due to some some hypothetical solution... [15:23:52] That's going to be made moot as Andrew finishes testing the real dns server for labs. [15:24:07] The issue is that dnsmasq is broken. 
[15:24:47] It returns a bad answer instead of NXDOMAIN in many cases, and that breaks search. [15:24:59] I don't know anything about dns and don't want to :P [15:25:08] actually, I do want to, but not when I'm unbreaking CI. [15:25:09] 6operations, 6Services, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1206724 (10mobrovac) @GWicke I am not sure we are talking about the same thing any more. This ticket is about the contract between the monitoring utility and the service, while... [15:25:21] So I'll re-apply mutante's patch unless we have other solutions [15:25:46] Krinkle: It'll work around the issue until we have a real DNS server. [15:26:15] Assuming that real DNS server is introduced without breaking our current setup or with notice to migrate, that sounds great. [15:27:23] Krinkle: oh my god [15:27:57] Krinkle: hold on how the hell are karma jobs using saucelabs? That is neat but I had no idea about that usage :D [15:28:09] 6operations, 6Services, 7Service-Architecture: Set up monitoring automation for services - https://phabricator.wikimedia.org/T94821#1206726 (10GWicke) @mobrovac, I think we are talking about the same thing. An API spec documents (relevant properties of) an API. In this case we are interested primarily in end... [15:28:26] hashar: That's been that way for over a year. [15:28:40] * hashar feels old [15:28:41] Projects decide in their configuration which browsers they use [15:28:58] Most use Chrome, some Firefox, some cross-browser IE/Opera windows et. [15:29:07] oojs uses saucelabs after testing in local Chrome [15:29:09] Krinkle: https://phabricator.wikimedia.org/T89342 :D [15:29:22] None of my concern. [15:29:55] I know about that, I dealt with it 5 years ago at jQuery and with Travis CI. [15:30:01] Nobody cares, including Saucelabs themselves. [15:30:09] They give accounts away for free. Fair usage and everybody's happy. [15:30:13] Put them in public for all we care. 
[15:30:50] Of course, we shouldnt. And that;s why for oojs I created a separate account. Inside Jenkins it uses ours, and in the repo for local testing, people add their own. [15:30:52] would you mind giving a short note on the task ? [15:31:07] would be interesting to see how you pass the credentials [15:31:15] They're in Gruntfile.js [15:31:18] oh [15:31:21] Like very other GitHub project on earh does [15:31:21] Aka, badly. :-) [15:31:47] I'ts fine until someone abuses it, and then we'll figure something out given appropriate resources. [15:32:04] can you state some of your repos have saucelabs credentials publicly available in their source repo so ? [15:32:25] I'm not cc-ed on that task, I wasnt' aware of that initiaitive [15:32:25] * James_F nods. [15:32:26] I am pairing with Zeljkof tomorrow morning [15:32:28] Will do later. [15:32:32] okkk [15:32:40] comment whenever you can :) [15:32:50] I have cced you [15:33:05] hashar: Just note, avoid investing in a complicated solution with proxies or encryption. The solution can and should be simple and yet secure. [15:33:13] anyway for the DNS resolution issue, that would be a task for labs infra [15:33:19] the resolver might have issues from time to time [15:33:21] I already fixed it 10 minutes ago [15:33:28] We don't care about labs db weird hostnames in CI [15:33:49] We care about being able to resolve actual hostnames that eg. end in .com [15:34:05] so the puppet patch we had until last week that we agreed on last year, is now re-applied. [15:34:06] what do you mean by fixed it ? [15:34:13] I gues s it got lost when we re0created the instance [15:34:18] See SAL [15:34:21] RelEng [15:34:25] come on timo [15:34:27] RECOVERY - configured eth on labvirt1002 is OK - interfaces up [15:34:31] I am old! 
I need extra care :) [15:34:37] RECOVERY - configured eth on labvirt1006 is OK - interfaces up [15:34:38] RECOVERY - configured eth on labvirt1004 is OK - interfaces up [15:34:56] RECOVERY - configured eth on labvirt1005 is OK - interfaces up [15:35:03] Krinkle: !log puppetmaster: Re-apply I05c49e5248cb operations/puppet patch to re-fix T91524. Somehow the patch got lost. [15:35:34] so you are reapplying a patch that has been abandonned [15:35:36] RECOVERY - configured eth on labvirt1001 is OK - interfaces up [15:35:39] and adds in resolv.conf ndots:2 [15:35:43] no clue what it is doing [15:35:43] (03CR) 10Faidon Liambotis: [C: 04-1] "${prefix}-${user} is a bad idea, as a "foo-bar" user is valid and sshd won't be able to differentiate between user "bar" with prefix "foo"" [puppet] - 10https://gerrit.wikimedia.org/r/202731 (owner: 10Alexandros Kosiaris) [15:35:55] but if it fix the issues, that should land / be merged in operations/puppet.git [15:35:57] RECOVERY - configured eth on labvirt1003 is OK - interfaces up [15:35:57] hashar: No,it removes it. 
[15:36:00] ah [15:36:14] (03Restored) 10Hashar: don't use 'ndots: 2' in labs resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [15:36:18] hashar: It makes anything foo.bar NOT resolve to the outside world but intead thin it is foo.bar.eqiad.wmflabs [15:36:30] Which is a hack for labsdb I think [15:36:42] but we don't need that and it breaks normal hostnames like saucelabs.com and google.com and what not [15:37:04] I don't know why this doesn't break other things, but I don't have time for that right now [15:37:49] (03CR) 10Hashar: "This has been reapplied on the integration puppetmaster because ndots: 2 prevent us from resolving public dns records such as www.saucelab" [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [15:39:28] 6operations, 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1206779 (10hashar) On CI we do DNS requests for pubic DNS entry so we apparently need to remove the `ndo... [15:39:34] Krinkle: agreed [15:39:43] Krinkle: I have reopened the change and commented on the task [15:39:53] no clue what that ndots:2 is for really [15:40:11] Krinkle: the patch probaqbly went away when I recreated the puppetmaster to migrate it from Trusty back to Precise [15:41:01] hashar: It's so that "s4.labsdb" resolves to wmflabs [15:48:15] (03PS2) 10Krinkle: base: Don't use 'ndots: 2' in labs resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [15:48:34] (03CR) 10Krinkle: "Resolved merge conflict. Deployed on integration-puppetmaster." 
[puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [15:48:38] !log Restarted logstash on logstash1003.eqiad.wmnet; subbu reported missing parsoid log events [15:48:41] Logged the message, Master [15:48:50] bd808, thanks. [15:48:55] yw [15:49:00] (03PS3) 10Krinkle: base: Don't use 'ndots: 2' in labs resolv.conf [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [15:52:07] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [15:52:22] Krenair: We've got a few newFromText warnings on fluorine [15:56:45] 6operations, 10Continuous-Integration, 6Labs, 10Wikimedia-Labs-Infrastructure: dnsmasq returns SERVFAIL for (some?) names that do not exist instead of NXDOMAIN - https://phabricator.wikimedia.org/T92351#1206842 (10Krinkle) 5Open>3Resolved [15:56:46] yeah, haven't had time to look through the ones we've collected so far [15:58:35] Krenair: Hm.. they're quite a few. Maybe we should file them as #wikimedia-log-errors [15:58:54] UploadFromChunks::stashFile/UploadFromChunks::verifyChunk/UploadBase::verifyPartialFile/UploadBase::getTitle/Title::newFromText [15:59:11] I knew about this one but hadn't had a chance to deal with it yet [15:59:52] that makes up at least 40% of the problem [16:00:04] legoktm: Dear anthropoid, the time has come. Please deploy SUL/CentralAuth backports (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150414T1600). 
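[annotation] For readers unsure what `ndots: 2` does in the discussion above, here is a minimal sketch of the lookup order described in resolv.conf(5). The function name is hypothetical and this is a simplified model, not the resolver's actual code. The search domain `eqiad.wmflabs` is taken from the conversation.

```python
from typing import List


def lookup_order(name: str, search: List[str], ndots: int = 1) -> List[str]:
    """Return the candidate names a stub resolver would try, in order.

    Simplified model of resolv.conf(5): a name with at least `ndots` dots is
    tried as-is first and then with each search domain appended; a name with
    fewer dots goes through the search list first and is tried as-is last.
    A trailing dot marks the name as absolute, so no search is applied.
    """
    if name.endswith("."):
        return [name]
    searched = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + searched
    return searched + [name]


if __name__ == "__main__":
    search = ["eqiad.wmflabs"]
    # With ndots:2 a one-dot public name is sent through the search list first.
    print(lookup_order("saucelabs.com", search, ndots=2))
    # With the default ndots:1 the public name is tried as-is first.
    print(lookup_order("saucelabs.com", search, ndots=1))
    # The labsdb alias case that ndots:2 was meant to serve.
    print(lookup_order("s4.labsdb", search, ndots=2))
```

Under this model, `saucelabs.com` with `ndots: 2` is first asked for as `saucelabs.com.eqiad.wmflabs`. As noted in the channel, dnsmasq answering that query with something other than NXDOMAIN is what breaks the search.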
[16:00:09] o/ [16:00:26] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=170.10 Read Requests/Sec=48.20 Write Requests/Sec=9.20 KBytes Read/Sec=1612.40 KBytes_Written/Sec=117.00 [16:02:07] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=106.20 Read Requests/Sec=5.50 Write Requests/Sec=100.80 KBytes Read/Sec=23.20 KBytes_Written/Sec=1101.70 [16:02:31] krenair@fluorine:/a/mw-log$ grep -v T92046 AdHocDebug.log | grep UploadFromChunks -v | cut -d ' ' -f 8- | wc -l [16:02:31] 575 [16:02:31] krenair@fluorine:/a/mw-log$ grep -v T92046 AdHocDebug.log | grep UploadFromChunks -v | cut -d ' ' -f 8- | sort -u | uniq | wc -l [16:02:31] 23 [16:02:34] Krinkle, ^ [16:04:22] 6operations, 10Continuous-Integration, 3Continuous-Integration-Isolation: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1206883 (10chasemp) a:5chasemp>3mark [16:04:56] _joe_: hey, could you please review and (hopefully) merge https://gerrit.wikimedia.org/r/#/c/203379/ ? [16:05:11] <_joe_> sorry, interview [16:05:49] (it’s a jobrunner patch) [16:06:30] (03PS1) 10Andrew Bogott: Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067 [16:06:42] (03CR) 10Andrew Bogott: [C: 04-2] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/204067 (owner: 10Andrew Bogott) [16:07:07] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=140.00 Read Requests/Sec=66.70 Write Requests/Sec=57.30 KBytes Read/Sec=2461.20 KBytes_Written/Sec=2585.05 [16:07:27] (03CR) 10jenkins-bot: [V: 04-1] Set up ssh keys so that designate can clear salt and puppet certs. 
[puppet] - 10https://gerrit.wikimedia.org/r/204067 (owner: 10Andrew Bogott) [16:07:56] (03PS14) 10BBlack: cache config: remove decommed nodes list [puppet] - 10https://gerrit.wikimedia.org/r/203557 [16:07:58] (03PS3) 10BBlack: get rid of $active_nodes[api] (unused) [puppet] - 10https://gerrit.wikimedia.org/r/204052 [16:08:00] (03PS1) 10BBlack: r::c::config::active_nodes -> hiera cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [16:09:47] (03CR) 10jenkins-bot: [V: 04-1] r::c::config::active_nodes -> hiera cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 (owner: 10BBlack) [16:12:49] (03PS2) 10BBlack: r::c::config::active_nodes -> hiera cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [16:13:36] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=27.80 Read Requests/Sec=7.30 Write Requests/Sec=38.90 KBytes Read/Sec=67.60 KBytes_Written/Sec=739.15 [16:13:47] (03CR) 10jenkins-bot: [V: 04-1] r::c::config::active_nodes -> hiera cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 (owner: 10BBlack) [16:16:58] (03Abandoned) 10Yuvipanda: Revert "Tool Labs: forget puppet for now" [puppet] - 10https://gerrit.wikimedia.org/r/187006 (owner: 10Alexandros Kosiaris) [16:17:01] (03PS3) 10BBlack: r::c::config::active_nodes -> hiera cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [16:22:21] !log legoktm Started scap: Updating CentralAuth [16:22:27] Logged the message, Master [16:22:41] 6operations, 10Continuous-Integration, 3Continuous-Integration-Isolation: install/setup/deploy cobalt as replacement for gallium - https://phabricator.wikimedia.org/T95959#1206932 (10chasemp) I talked to @mark about this and how it relates to the labnodepool box placement. mark wants to go over things a bit... [16:27:05] what project do i give to a request to change my ssh pubkey? it would have thought RT before, but now i dont know [16:27:54] ebernhardson: Just make a changeset in gerrit? 
[16:28:10] Reedy: oh is that all? i thought i had to go through ops for it. i'll grep the puppet repo [16:28:41] Admins yaml file IIRC [16:28:52] ebernhardson: yeah, data.yaml in admin module [16:29:21] Permission denied (publickey). [16:29:24] ebernhardson, anyone can submit a commit for review to ops/puppet [16:29:32] i'm not allowed to checkout the operations/admin repo :P [16:29:36] you don't need ops' permission to do it [16:29:43] ebernhardson, operations/puppet.git [16:29:51] oh silly me, yea :) [16:30:11] modules/admin/data/data.yaml [16:35:31] (03PS1) 10EBernhardson: Update SSH pubkey for ebernhardson@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/204072 [16:36:29] i know things are only merged to puppet when deployed, do i have to do anything in particular or will ops just merge and deploy that sometime today/this week? [16:37:21] ebernhardson: you basically poke someone from ops and they do it [16:37:27] YuviPanda: poke :) [16:37:34] (03PS2) 10Yuvipanda: Update SSH pubkey for ebernhardson@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/204072 (owner: 10EBernhardson) [16:37:41] (03CR) 10Yuvipanda: [C: 032 V: 032] Update SSH pubkey for ebernhardson@wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/204072 (owner: 10EBernhardson) [16:39:01] ebernhardson: done :) want me to force a run on bastion now? or content to wait 20min? [16:39:09] YuviPanda: 20min is fine, thanks! [16:39:26] ebernhardson: yw. can you verify after 20min? 
[16:40:01] YuviPanda: sure [16:40:07] cool [16:50:23] (03PS15) 10BBlack: cache config: remove decommed nodes list [puppet] - 10https://gerrit.wikimedia.org/r/203557 [16:50:25] (03PS4) 10BBlack: get rid of $active_nodes[api] (unused) [puppet] - 10https://gerrit.wikimedia.org/r/204052 [16:50:27] (03PS4) 10BBlack: r::c::config::active_nodes -> hiera cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [16:53:56] !log Deleting oaiaudit entries that are pre 2015 [16:54:01] Logged the message, Master [16:56:23] 7Puppet, 10Deployment-Systems: Trebuchet master should be separate from scap - https://phabricator.wikimedia.org/T96042#1207023 (10Tgr) 3NEW [16:57:35] * Reedy checks he didn't cause any db lag [16:58:17] there's some on s3 [16:58:19] * Reedy watches [17:01:09] * greg-g waves to Reedy :) [17:01:15] ohai [17:01:18] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=146.70 Read Requests/Sec=115.80 Write Requests/Sec=20.70 KBytes Read/Sec=3121.20 KBytes_Written/Sec=192.60 [17:01:34] seems the hotel wifi has decided it's not going to block various things for no apparent reason [17:01:46] improvement :) [17:03:37] yeah, an improvement unlike the florida weather [17:03:39] And my laptop [17:04:03] Reedy: :( [17:04:25] * YuviPanda remembers Snowden’s response to John Oliver askign him if he missed Florida [17:04:27] have to use an external battery charger for my x220, and the nipple/touchpad cause cursor drift [17:04:34] (03CR) 10Gergő Tisza: "I wasted a fair amount of time trying to test the trebuchet part but couldn't figure out any way to do it. 
I tried to set up a VM which is" [puppet] - 10https://gerrit.wikimedia.org/r/199598 (https://phabricator.wikimedia.org/T84956) (owner: 10Gilles) [17:04:50] !log legoktm Finished scap: Updating CentralAuth (duration: 42m 28s) [17:04:53] yay [17:04:54] Logged the message, Master [17:04:57] (03CR) 10Legoktm: [C: 032] Set $wgCentralAuthCheckSULMigration = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203381 (https://phabricator.wikimedia.org/T95735) (owner: 10Legoktm) [17:04:59] (03CR) 10Legoktm: [C: 032] Add 'CentralAuthSULRename' log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203992 (owner: 10Legoktm) [17:05:07] (03Merged) 10jenkins-bot: Set $wgCentralAuthCheckSULMigration = true [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203381 (https://phabricator.wikimedia.org/T95735) (owner: 10Legoktm) [17:05:10] (03Merged) 10jenkins-bot: Add 'CentralAuthSULRename' log group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203992 (owner: 10Legoktm) [17:05:17] PROBLEM - puppet last run on mw2074 is CRITICAL puppet fail [17:05:31] ok, good, s3 replag has finished [17:06:03] Reedy: what was that? (paged for replag) [17:06:10] springle: ahh, sorry [17:06:17] springle: Was clearing out crap from the oaiaudit table [17:06:25] :) [17:06:30] !log legoktm Synchronized wmf-config/CommonSettings.php: Set = true (duration: 00m 13s) [17:06:33] Logged the message, Master [17:06:38] err, oops. 
[17:06:51] !log ^ was Set $wgCentralAuthCheckSULMigration = true [17:06:52] * Reedy hands legoktm a \ [17:06:54] Logged the message, Master [17:06:55] deleted maybe 3M rows [17:07:31] !log legoktm Synchronized wmf-config/InitialiseSettings.php: Add 'CentralAuthSULRename' log group (duration: 00m 14s) [17:07:34] Logged the message, Master [17:08:11] ok, all done :) [17:09:04] (03PS3) 10Cenarium: Give patrol to reviewers for testwiki/enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 (https://phabricator.wikimedia.org/T93798) [17:09:06] 6operations, 10ops-eqiad, 6Labs: labvirt100x boxes 'no carrier' on eth1 - https://phabricator.wikimedia.org/T95973#1207090 (10Cmjohnson) Added second eth cable ge-3/0/8 up up labvirt1001 eth0 ge-3/0/20 up up labvirt1002 eth0 ge-3/0/21 up up labvirt1003 eth0 ge-3/0/37... [17:12:37] (03CR) 10Dereckson: "Second hunk looks good to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 (https://phabricator.wikimedia.org/T93798) (owner: 10Cenarium) [17:12:46] (03CR) 10Dereckson: [C: 031] Give patrol to reviewers for testwiki/enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/199321 (https://phabricator.wikimedia.org/T93798) (owner: 10Cenarium) [17:18:13] (03CR) 10Tim Landscheidt: [C: 04-1] "I still think this change is fundamentally flawed as it will block *.labsdb aliases from working. 
In addition, T92351 has suggested that " [puppet] - 10https://gerrit.wikimedia.org/r/196731 (https://phabricator.wikimedia.org/T92351) (owner: 10Dzahn) [17:22:48] 6operations: setup/install/deploy labcontrol1001 - https://phabricator.wikimedia.org/T96048#1207123 (10RobH) 3NEW a:3RobH [17:23:06] 6operations, 6Labs, 10hardware-requests: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1207139 (10RobH) [17:23:08] 6operations: setup/install/deploy labcontrol1001 - https://phabricator.wikimedia.org/T96048#1207138 (10RobH) [17:23:27] RECOVERY - puppet last run on mw2074 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [17:23:27] Uhhh train deploys today at all? [17:23:50] Oh, hm. Something's funky. [17:24:00] 6operations, 6Labs, 10hardware-requests: Replace virt1000 with a newer warrantied server - https://phabricator.wikimedia.org/T90626#1207140 (10RobH) 5Open>3Resolved I've created the setup task T96048 for setting up system wmf3152 as labcontrol1001. I'm resolving this task, as hardware has been allocated. [17:24:02] 6operations: setup/install/deploy labcontrol1001 - https://phabricator.wikimedia.org/T96048#1207123 (10RobH) [17:24:27] (03PS18) 10Ori.livneh: Gzip SVGs on back upload varnishes. [puppet] - 10https://gerrit.wikimedia.org/r/108484 (https://bugzilla.wikimedia.org/54291) [17:25:05] 6operations: setup/install/deploy labcontrol1001 - https://phabricator.wikimedia.org/T96048#1207123 (10RobH) [17:28:36] (03CR) 10BBlack: [C: 031] Gzip SVGs on back upload varnishes. [puppet] - 10https://gerrit.wikimedia.org/r/108484 (https://bugzilla.wikimedia.org/54291) (owner: 10Ori.livneh) [17:28:39] 6operations, 6Phabricator: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#1207169 (10mmodell) 5Open>3stalled @chasemp: any further info would be welcome [17:28:52] Son of a BITCH. It's UploadWizard. 
[17:29:10] Everything OK, marktraceur? [17:29:10] http://commons.wikimedia.beta.wmflabs.org/wiki/Special:SpecialPages [17:29:22] Whoops. [17:29:25] That's broken. [17:29:34] Don't deploy wmf1 to commons [17:29:41] 6operations, 10ops-fundraising, 7network: network setup for beryllium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T95893#1207182 (10Jgreen) [17:29:56] twentyafterfour, ^ [17:30:17] I mean [17:30:22] It's not *super* critical [17:30:26] But it'll break specialpages [17:30:30] let's not :) [17:30:52] I'll file some bugs and deal with it this afternoon, SWAT it out [17:30:57] Will this only affect commons? [17:30:58] (03PS1) 10RobH: setting up labcontrol1001 mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/204078 (https://phabricator.wikimedia.org/T96048) [17:31:05] Yeah, it's only on wikis with UW [17:31:17] what's the UW issue? [17:31:20] But I don't think we can selectively do all group1 but commons [17:31:24] bblack: http://commons.wikimedia.beta.wmflabs.org/wiki/Special:SpecialPages [17:31:32] Something about ContextSource, haven't figured it out yet [17:31:33] I clicked the link and Firefox crashed :< [17:31:34] marktraceur broke special pages and wants to avoid the breakage hitting production [17:31:36] ok I was just about to start prepping the deployment so good catch [17:31:37] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 3.02 ms [17:31:40] It's not my fauuuuuult [17:31:49] commonswiki is not the only wiki running UW, marktraceur [17:31:52] * Reedy finds a dunce hat for marktraceur [17:31:58] Prrrrobably [17:32:18] rowiki, donatewiki, foundationwiki [17:32:24] marktraceur: I think I can deploy all except commons - it's just a manual editing of the wikiversions.json [17:32:28] and test/test2 [17:32:34] marktraceur: could it be https://gerrit.wikimedia.org/r/#/c/203001/ ? [17:32:41] twentyafterfour: Sounds like fun. I can SWAT the fix and the change to wv.json tonight. [17:32:52] legoktm: That's what I think, yes. 
[17:33:08] But I'm not sure how really [17:33:17] (03CR) 10RobH: [C: 032] setting up labcontrol1001 mgmt dns [dns] - 10https://gerrit.wikimedia.org/r/204078 (https://phabricator.wikimedia.org/T96048) (owner: 10RobH) [17:33:37] 2015-04-14 17:31:30 deployment-mediawiki02 commonswiki: [92726ea9] /wiki/Special:SpecialPages ErrorException from line 264 of /srv/mediawiki/php-master/includes/exception/MWExceptionHandler.php: Fatal Error: Argument 1 passed to ContextSource::setContext() must implement interface IContextSource, null given [17:33:38] hehe [17:33:41] https://test.wikipedia.org/wiki/Special:SpecialPages works though [17:34:07] marktraceur: that change isn't in wmf1 [17:34:14] 10Ops-Access-Requests, 6operations: Access for new Analytics Dev: Madhu Viswanathan - https://phabricator.wikimedia.org/T96053#1207190 (10Ottomata) 3NEW [17:34:16] I think it's just master [17:34:24] legoktm: Oh, well then, ignore me [17:34:32] We'll fix it [17:34:34] 10Ops-Access-Requests, 6operations: Access for new Analytics Dev: Madhu Viswanathan - https://phabricator.wikimedia.org/T96053#1207198 (10Ottomata) [17:34:40] I thought for sure it had made it into the cut. [17:34:43] 10Ops-Access-Requests, 6operations: Access for new Analytics Dev: Madhu Viswanathan - https://phabricator.wikimedia.org/T96053#1207190 (10Ottomata) @tnegrin please approve. [17:35:06] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [17:36:07] (03PS1) 10Thcipriani: Add submodules to master checkoutMediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204080 (https://phabricator.wikimedia.org/T88442) [17:36:15] twentyafterfour: OK, sorry, go ahead with wmf1 to commons [17:36:31] Whew. 
[17:40:56] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=161.40 Read Requests/Sec=72.00 Write Requests/Sec=26.70 KBytes Read/Sec=2507.20 KBytes_Written/Sec=139.45 [17:41:05] !log running removeHHVMTag on testwiki [17:41:09] Logged the message, Master [17:41:47] 7Puppet: Puppet resource for creating a postgresql database - https://phabricator.wikimedia.org/T96054#1207226 (10Tgr) 3NEW [17:41:50] 7Puppet, 6Multimedia, 6Release-Engineering, 6Scrum-of-Scrums, and 3 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1207233 (10Tgr) [17:41:53] 7Puppet: Puppet resource for creating a postgresql database - https://phabricator.wikimedia.org/T96054#1207234 (10Tgr) [17:45:47] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=26.10 Read Requests/Sec=0.60 Write Requests/Sec=28.30 KBytes Read/Sec=4.80 KBytes_Written/Sec=147.40 [17:46:56] legoktm: thanks very much for that [17:47:13] 10Ops-Access-Requests, 6operations, 10Analytics: Grant Sati access to geowiki - https://phabricator.wikimedia.org/T95494#1207253 (10Ottomata) Sati, this is done. I need a way to get the password to you securely. Unless you know how to use gpg, probably the simplest thing to do would be for us to hop into a... [17:47:58] np [17:53:13] 6operations, 10ops-eqiad: relabel/add ssd to labcontrol1001/wmf3152 - https://phabricator.wikimedia.org/T96056#1207287 (10RobH) 3NEW a:3Cmjohnson [17:53:53] 6operations, 10Deployment-Systems, 6Release-Engineering, 6Services: Streamline our service development and deployment process - https://phabricator.wikimedia.org/T93428#1207306 (10mmodell) [17:54:16] 6operations, 5Patch-For-Review: setup/install/deploy labcontrol1001 - https://phabricator.wikimedia.org/T96048#1207308 (10RobH) [17:54:25] 6operations: setup/install/deploy labcontrol1001 - https://phabricator.wikimedia.org/T96048#1207123 (10RobH) [17:56:44] guess when legoktm started his script.. 
https://tendril.wikimedia.org/host/view/db1038.eqiad.wmnet/3306 [17:57:08] uhh [17:57:11] should I pause? [17:57:24] Reedy: ? [17:57:37] I don't think so [17:57:42] Not causing replags [17:57:46] Just a lot of queries [17:57:56] RECOVERY - Host analytics1020 is UP: PING OK - Packet loss = 0%, RTA = 2.35 ms [17:58:17] RECOVERY - salt-minion processes on analytics1020 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [17:58:17] RECOVERY - DPKG on analytics1020 is OK: All packages OK [17:59:27] RECOVERY - Disk space on analytics1020 is OK: DISK OK [17:59:27] RECOVERY - Hadoop DataNode on analytics1020 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [17:59:27] RECOVERY - Hadoop NodeManager on analytics1020 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [17:59:27] RECOVERY - dhclient process on analytics1020 is OK: PROCS OK: 0 processes with command name dhclient [17:59:27] RECOVERY - configured eth on analytics1020 is OK: NRPE: Unable to read output [17:59:27] RECOVERY - RAID on analytics1020 is OK no disks configured for RAID [17:59:57] ok I'm deploying 1.25wmf1 as soon as the window opens... [18:00:04] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150414T1800). Please do the needful. [18:00:07] (03CR) 10Tim Landscheidt: "Was this necessary? I explicitly asked this change not to be merged because neither the lua-json issue was sorted out nor the list.php re" [puppet] - 10https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: 10Tim Landscheidt) [18:00:29] oooo cmjohnson1 it is coming back? 
[18:00:31] (03PS1) 1020after4: Group1 wikis to 1.26wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204085 [18:01:21] ottomata: yep...all fixed..need to update MAC address in dhcp file [18:01:21] (03CR) 1020after4: [C: 032] Group1 wikis to 1.26wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204085 (owner: 1020after4) [18:01:26] (03Merged) 10jenkins-bot: Group1 wikis to 1.26wmf1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204085 (owner: 1020after4) [18:02:48] RECOVERY - puppet last run on analytics1020 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:04:13] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 to 1.26wmf1 [18:04:21] Logged the message, Master [18:07:50] 6operations: setup/install/deploy labcontrol1001 - https://phabricator.wikimedia.org/T96048#1207353 (10Cmjohnson) [18:07:51] 6operations, 10ops-eqiad: relabel/add ssd to labcontrol1001/wmf3152 - https://phabricator.wikimedia.org/T96056#1207351 (10Cmjohnson) 5Open>3Resolved Added the disks, updated racktables and update the switch ge-5/0/32 up up labcontrol1001 private1-c-eqiad [18:07:55] (03PS2) 10Ori.livneh: Set $wgStatsdServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201751 [18:09:32] ottomata: i need to take an1020 down ...idrac cfg is wrong [18:09:54] go ahead [18:09:55] its fine [18:10:04] (03CR) 10Yuvipanda: "Hmm, I wasn't expecting it to cause unnecessary stress, sorry if it ended up doing that." [puppet] - 10https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: 10Tim Landscheidt) [18:10:35] thx [18:11:01] 10Ops-Access-Requests, 6operations: Access for new Analytics Dev: Madhu Viswanathan - https://phabricator.wikimedia.org/T96053#1207369 (10madhuvishy) Thanks @Ottomata. I've signed the Acknowledgement and here's the [[ https://office.wikimedia.org/wiki/User:MViswanathan_(WMF) | link to my public key on Office W... 
[18:11:55] (03CR) 10Yuvipanda: "To clarify," [puppet] - 10https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: 10Tim Landscheidt) [18:13:32] (03CR) 10Yuvipanda: "(jlocal was totally mea-culpa for not grepping docs but only grepping puppet before acting, and then acting poorly)" [puppet] - 10https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: 10Tim Landscheidt) [18:13:40] (03CR) 10Ori.livneh: [V: 032] Set $wgStatsdServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201751 (owner: 10Ori.livneh) [18:13:47] (03CR) 10Ori.livneh: [C: 032] Set $wgStatsdServer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/201751 (owner: 10Ori.livneh) [18:14:19] (03CR) 10Yuvipanda: ">We have also agreed that we try to avoid deploying local packages because that creates maintenance cost." [puppet] - 10https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: 10Tim Landscheidt) [18:14:37] PROBLEM - Host analytics1020 is DOWN: PING CRITICAL - Packet loss = 100% [18:14:42] !log ori Synchronized wmf-config/CommonSettings.php: If7f77996b: Set $wgStatsdServer (duration: 00m 15s) [18:14:49] Logged the message, Master [18:15:39] 6operations: setup/install/deploy labcontrol1001 - https://phabricator.wikimedia.org/T96048#1207374 (10RobH) [18:15:42] (03CR) 10Yuvipanda: "Is this just puppetizing status quo?" [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt) [18:16:04] (03CR) 10Yuvipanda: "(oh, it is)" [puppet] - 10https://gerrit.wikimedia.org/r/203656 (https://phabricator.wikimedia.org/T63160) (owner: 10Tim Landscheidt) [18:18:08] (03CR) 10Yuvipanda: "Have you tested this on toolsbeta?" 
[puppet] - 10https://gerrit.wikimedia.org/r/196125 (https://phabricator.wikimedia.org/T92379) (owner: 10Tim Landscheidt) [18:19:42] (03PS1) 10RobH: setting labcontrol1001.wikimedia.org dns [dns] - 10https://gerrit.wikimedia.org/r/204087 (https://phabricator.wikimedia.org/T96048) [18:19:51] (03CR) 10jenkins-bot: [V: 04-1] setting labcontrol1001.wikimedia.org dns [dns] - 10https://gerrit.wikimedia.org/r/204087 (https://phabricator.wikimedia.org/T96048) (owner: 10RobH) [18:19:55] (03PS1) 10Cmjohnson: Updating new mac address for analytics1020 https://phabricator.wikimedia.org/T95263 [puppet] - 10https://gerrit.wikimedia.org/r/204088 [18:20:50] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1207392 (10csteipp) >>! In T95229#1206538, @GWicke wrote: > As for the client side: Our cookies are [HTTPOnly](https://www.owasp.org/index.php/HttpOnly), and we... [18:21:15] (03CR) 10Cmjohnson: [C: 032] Updating new mac address for analytics1020 https://phabricator.wikimedia.org/T95263 [puppet] - 10https://gerrit.wikimedia.org/r/204088 (owner: 10Cmjohnson) [18:22:25] 6operations, 10ops-eqiad, 10Analytics-Cluster, 5Patch-For-Review: analytics1020 hardware failure - https://phabricator.wikimedia.org/T95263#1207402 (10Cmjohnson) 5Open>3Resolved 2 mother boards later and this is fixed. Resolving [18:24:56] (03CR) 10Yuvipanda: "You are however right that I could have totally just waited another day for review, and was impatient. 
I shall think about this and change" [puppet] - 10https://gerrit.wikimedia.org/r/203313 (https://phabricator.wikimedia.org/T88216) (owner: 10Tim Landscheidt) [18:25:51] (03PS2) 10Thcipriani: Add submodules to master checkoutMediaWiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204080 (https://phabricator.wikimedia.org/T88442) [18:27:56] PROBLEM - Host mw1031 is DOWN: PING CRITICAL - Packet loss = 100% [18:30:24] (03PS2) 10RobH: setting labcontrol1001.wikimedia.org dns [dns] - 10https://gerrit.wikimedia.org/r/204087 (https://phabricator.wikimedia.org/T96048) [18:32:08] (03CR) 10RobH: [C: 032] setting labcontrol1001.wikimedia.org dns [dns] - 10https://gerrit.wikimedia.org/r/204087 (https://phabricator.wikimedia.org/T96048) (owner: 10RobH) [18:32:27] RECOVERY - Host mw1031 is UP: PING OK - Packet loss = 0%, RTA = 1.00 ms [18:36:56] (03PS1) 10RobH: labcontrol1001 install-server updates [puppet] - 10https://gerrit.wikimedia.org/r/204091 [18:38:05] (03PS2) 10RobH: labcontrol1001 install-server updates [puppet] - 10https://gerrit.wikimedia.org/r/204091 [18:38:45] (03CR) 10RobH: [C: 032] "[" [puppet] - 10https://gerrit.wikimedia.org/r/204091 (owner: 10RobH) [18:39:09] 6operations, 6Labs: Investigate ways of getting off raid6 for labs store - https://phabricator.wikimedia.org/T96063#1207452 (10yuvipanda) 3NEW [18:39:38] 6operations, 10ops-eqiad: mw1031 has a bad uplink - https://phabricator.wikimedia.org/T95896#1207462 (10Cmjohnson) replaced the cable and eth speed did not change, checked with another new cable and still no changes. Power off, drained flea power and still nothing. May need to try an f/w update. 
[18:41:46] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=97.30 Read Requests/Sec=37.50 Write Requests/Sec=29.60 KBytes Read/Sec=408.40 KBytes_Written/Sec=445.90 [18:47:11] 6operations, 6Labs: Investigate heavy NFS users and see if they can move IO to local storage - https://phabricator.wikimedia.org/T96065#1207477 (10yuvipanda) 3NEW [18:50:57] PROBLEM - puppet last run on install2001 is CRITICAL puppet fail [18:52:36] 10Ops-Access-Requests, 6operations: Access for new Analytics Dev: Madhu Viswanathan - https://phabricator.wikimedia.org/T96053#1207497 (10Tnegrin) approved [19:08:56] RECOVERY - puppet last run on install2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:17:26] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=2.70 Read Requests/Sec=8.10 Write Requests/Sec=4.10 KBytes Read/Sec=55.20 KBytes_Written/Sec=39.25 [19:18:54] hey, ops, i don't get js/css [19:19:24] where? [19:19:37] meta.wikimedia.org [19:19:56] now it is intermediate [19:20:27] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 35.71% of data above the critical threshold [500.0] [19:20:45] andre__: ^ :) [19:21:37] and now totally down [19:21:38] wfm it seems [19:22:02] Request: GET http://meta.wikimedia.org/wiki/Special:GlobalRenameQueue/open, from 10.20.0.165 via cp3014 cp3014 ([10.20.0.114]:3128), Varnish XID 1312017507 [19:22:02] Forwarded for: MYIP, 10.20.0.165, 10.20.0.165 [19:22:02] Error: 503, Service Unavailable at Tue, 14 Apr 2015 19:21:27 GMT [19:22:07] and I get a ping back (but via esams) [19:22:25] i'm via esams as well [19:22:40] * matanya pokes akosiaris _joe_ bblack [19:23:20] https://meta.wikimedia.org/wiki/Main_Page still loads for me as usual :-/ Weird [19:23:31] actually, it is US TZ, should poke us based ops [19:23:38] maybe just cp3014? [19:24:09] jgage: can you please give a hand here ? 
[19:24:16] PROBLEM - puppet last run on cp3032 is CRITICAL puppet fail [19:25:26] bblack: ^ [19:25:51] Anybody seeing issues with Jenkins postmerge and beta code update aside from me? [19:27:19] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1207557 (10GWicke) @csteipp: I think that a big rethink of how we do authentication in general in order to protect old browsers is big enough to deserve its own... [19:27:26] https://integration.wikimedia.org/ci/view/Beta/ [19:28:06] !log running updateUsersToRename.php (CentralAuth) [19:28:12] Logged the message, Master [19:38:57] 6operations, 6Labs: Investigate ways of getting off raid6 for labs store - https://phabricator.wikimedia.org/T96063#1207581 (10coren) p:5Triage>3Low Raid 6 is a performance bottleneck but gives us 66% more effective storage than raid10 would in the current configuration. It doesn't mean that moving away f... [19:40:26] RECOVERY - puppet last run on cp3032 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [19:45:27] (03PS1) 10Merlijn van Deen: Include -exec role on redis hosts [puppet] - 10https://gerrit.wikimedia.org/r/204100 [19:48:17] PROBLEM - puppet last run on mw2066 is CRITICAL puppet fail [19:51:07] PROBLEM - puppet last run on multatuli is CRITICAL puppet fail [19:53:05] git.wikimedia.org is down at this moment [19:56:07] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [20:03:54] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1207649 (10csteipp) >>! In T95229#1207557, @GWicke wrote: > In the meantime, this patch does not change the status quo of content being primarily served from th... [20:04:57] (03CR) 10Tim Landscheidt: [C: 04-1] "sudo complains about it not being 440." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/203876 (https://phabricator.wikimedia.org/T95882) (owner: 10Merlijn van Deen) [20:06:10] RECOVERY - puppet last run on mw2066 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:06:53] (03PS2) 10Andrew Bogott: Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067 [20:08:59] RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [20:09:29] (03PS3) 10Andrew Bogott: Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067 [20:18:10] hm, cmjohnson1 is analytics1020 back online? [20:18:20] i can't reach it [20:18:29] yes, it should be [20:20:04] hm, i can log in via console root [20:20:09] but networking seems funky [20:20:13] i'm going to restart networking [20:20:27] okay [20:21:19] I see the spike from the 5xx issue earlier, seems relatively-brief but significant. Did anyone get anywhere on it? [20:21:32] hm, cmjohnson1 no good... [20:21:37] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1207703 (10akosiaris) > If we need restbase on the primary domain for VE performance, can we put it there without proxying other services? I would be ok with t... 
[20:22:01] well shit..it worked earlier and I didn't do anything to it [20:22:17] okay to try and powercycle it [20:22:24] sure, i'm there now if you want me to [20:22:40] yeah go for it [20:23:29] (03PS4) 10Hashar: Initial Debian packaging [debs/nodepool] (debian) - 10https://gerrit.wikimedia.org/r/203961 (https://phabricator.wikimedia.org/T89142) [20:23:54] !log powercycled analytics1020 [20:24:03] Logged the message, Master [20:28:46] PROBLEM - Host labvirt1001 is DOWN: PING CRITICAL - Packet loss = 100% [20:28:52] hmm, no good cmjohnson1 [20:28:52] From 10.64.53.12 icmp_seq=1 Destination Host Unreachable [20:28:57] tried to ping something [20:29:03] that's weird [20:29:08] dns not getting out [20:29:16] nuttin [20:29:47] okay, i left the data center for the day, okay to get it in the morning? [20:30:50] labvirt1001 errors are mine, nothing to worry about [20:31:26] RECOVERY - Host labvirt1001 is UP: PING OK - Packet loss = 0%, RTA = 1.29 ms [20:32:45] 6operations, 6Labs: Investigate heavy NFS users and see if they can move IO to local storage - https://phabricator.wikimedia.org/T96065#1207759 (10BBlack) I think we've gone around looking for these at various times in the past and always been able to cull a few of the worst offenders. I was thinking more alo... [20:33:06] cmjohnson1: ja [20:33:07] thanks [20:33:29] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1207763 (10GWicke) > Graphoid is 530 kloc's of javascript. If the codebase is too large to review, then why don't we focus on sanitizing the output? It produce... 
[20:33:42] oh, i know what I did ....i disconnected the network cable while i was messing with it and forgot to put it back ottomata [20:33:48] haha :) [20:35:37] (03PS4) 10Merlijn van Deen: Tool labs: silence sudo security e-mails [puppet] - 10https://gerrit.wikimedia.org/r/203876 (https://phabricator.wikimedia.org/T95882) [20:35:49] (03CR) 10Merlijn van Deen: "Derp. I knew it had to be 0440, but clearly forgot to set it correctly. Should be OK now..." [puppet] - 10https://gerrit.wikimedia.org/r/203876 (https://phabricator.wikimedia.org/T95882) (owner: 10Merlijn van Deen) [20:37:20] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1207773 (10Anomie) >>! In T95229#1207763, @GWicke wrote: > then you probably want to extract actions that really need authentication & stop setting cookies on t... [20:40:44] bblack: not that i know off [20:40:49] *of [20:40:54] but it looks better now [20:43:40] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1207819 (10GWicke) @akosiaris: Regarding RB as SPOF: We have that risk wherever we unify the URL space, be that Varnish or RB. Your likely preferred solution of... [20:44:39] (03CR) 10Alexandros Kosiaris: [C: 032] cache config: remove decommed nodes list [puppet] - 10https://gerrit.wikimedia.org/r/203557 (owner: 10BBlack) [20:45:11] chasemp: I need to create a system user with some sudo rights. Would you do that using user {} or via something in the admins yaml data? [20:49:02] (03CR) 10Tim Landscheidt: [C: 031] "Tested to work." [puppet] - 10https://gerrit.wikimedia.org/r/203876 (https://phabricator.wikimedia.org/T95882) (owner: 10Merlijn van Deen) [20:52:20] andrewbogott: meeting give me a few? 
[20:52:59] andrewbogott: I don't think the admin module can deal with system users [20:53:03] chasemp: sure, whenever [21:00:04] rmoen, kaldari: Respected human, time to deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150414T2100). Please do the needful. [21:00:33] rmoen: You ready? [21:00:37] ya [21:01:47] PROBLEM - puppet last run on mw2142 is CRITICAL Puppet has 1 failures [21:01:57] PROBLEM - puppet last run on mw1047 is CRITICAL Puppet has 2 failures [21:02:17] PROBLEM - puppet last run on cp4013 is CRITICAL puppet fail [21:02:33] bblack: now getting 400 [21:07:17] (03CR) 10Robmoen: [C: 032] Enable CirrusSearch event logging in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203250 (owner: 10Bmansurov) [21:07:30] (03Merged) 10jenkins-bot: Enable CirrusSearch event logging in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203250 (owner: 10Bmansurov) [21:07:59] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1207909 (10akosiaris) >>! In T95229#1207819, @GWicke wrote: > @akosiaris: Regarding RB as SPOF: We have that risk wherever we unify the URL space, be that Varni... 
[21:09:02] (03PS1) 10Dzahn: create user for Madhu Viswanathan [puppet] - 10https://gerrit.wikimedia.org/r/204151 (https://phabricator.wikimedia.org/T96053) [21:11:16] (03PS2) 10Dzahn: create user for Madhu Viswanathan [puppet] - 10https://gerrit.wikimedia.org/r/204151 (https://phabricator.wikimedia.org/T96053) [21:12:17] (03CR) 10Alexandros Kosiaris: [C: 032] ssh: remove lucid special casing for authorized_keys_file [puppet] - 10https://gerrit.wikimedia.org/r/202394 (owner: 10Alexandros Kosiaris) [21:12:31] (03CR) 10Alexandros Kosiaris: [C: 032] sodium: specify the position of authorized_keys_file [puppet] - 10https://gerrit.wikimedia.org/r/202393 (owner: 10Alexandros Kosiaris) [21:12:42] (03CR) 10Alexandros Kosiaris: [C: 032] ssh: allow parameterization of authorized_keys [puppet] - 10https://gerrit.wikimedia.org/r/202392 (owner: 10Alexandros Kosiaris) [21:13:12] 6operations, 10Continuous-Integration, 5Continuous-Integration-Isolation, 5Patch-For-Review, 7Upstream: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1207917 (10hashar) Some upstream requirements are not matched by Jessie: | Upstream | Jessie | python-d... [21:16:26] !log rmoen Synchronized wmf-config/CirrusSearch-production.php: enable cirrus search eventlogging in production (duration: 00m 13s) [21:16:34] Logged the message, Master [21:17:36] matanya: 400 tiny fishes? [21:17:57] RECOVERY - puppet last run on mw1047 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [21:18:09] sorta bblack :) bad request 400 [21:18:24] is this specific to one host, one special API call, one cache node? anything? 
[21:18:26] RECOVERY - puppet last run on cp4013 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [21:18:52] bblack: one link to one image [21:18:57] (03PS1) 10Dzahn: access: add madhuvishy to various analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/204152 (https://phabricator.wikimedia.org/T96053) [21:19:10] https://gdash.wikimedia.org/dashboards/reqerror/ <- this spike is around when you guys first discussed issues [21:19:26] what link to what image? [21:19:27] RECOVERY - puppet last run on mw2142 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:19:35] https://commons.wikimedia.org/wiki/File:Messerschmitt%20Bf109E-4-B%20%u20184101%20-%20Black%2012%u2019%20%28DG200%29%20%2816527677294%29.jpg [21:19:35] (03PS2) 10Dzahn: access: add madhuvishy to various analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/204152 (https://phabricator.wikimedia.org/T96053) [21:19:37] PROBLEM - RAID on eventlog1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:19:47] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:19:54] PROBLEM - SSH on eventlog1001 is CRITICAL - Socket timeout after 10 seconds [21:19:54] PROBLEM - puppet last run on eventlog1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:20:10] https://commons.wikimedia.org/wiki/File:Messerschmitt_Bf109E-4-B_%E2%80%984101_-_Black_12%E2%80%99_%28DG200%29_%2816527677294%29.jpg [21:20:16] PROBLEM - DPKG on eventlog1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:20:18] PROBLEM - configured eth on eventlog1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:20:18] is that in fact a bad request? 
[21:20:39] (03Abandoned) 10Alexandros Kosiaris: Allow hiera role_backend to be debuggable via hiera CLI [puppet] - 10https://gerrit.wikimedia.org/r/202756 (owner: 10Alexandros Kosiaris) [21:20:48] i guess, seems like i should blame magnus in this case [21:20:55] ok! [21:21:18] thanks, and sorry for the noise in this case, the 5xx is real though :) [21:21:20] 6operations, 10RESTBase, 10VisualEditor, 7Performance: Set up an API base path for REST and action APIs - https://phabricator.wikimedia.org/T95229#1207937 (10GWicke) >>! In T95229#1207909, @akosiaris wrote: >>>! In T95229#1207819, @GWicke wrote: >>Your likely preferred solution of separate domains for each... [21:22:13] (03CR) 10Dzahn: [C: 031] create user for Madhu Viswanathan [puppet] - 10https://gerrit.wikimedia.org/r/204151 (https://phabricator.wikimedia.org/T96053) (owner: 10Dzahn) [21:22:17] (03CR) 10Madhuvishy: [C: 031] access: add madhuvishy to various analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/204152 (https://phabricator.wikimedia.org/T96053) (owner: 10Dzahn) [21:24:17] 6operations, 6Labs: Investigate ways of getting off raid6 for labs store - https://phabricator.wikimedia.org/T96063#1207942 (10BBlack) As we've seen, RAID6 has serious performance issues under write heavy load, and isn't especially great at recovering from unclean shutdown (but we could address the latter to... [21:25:05] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access for new Analytics Dev: Madhu Viswanathan - https://phabricator.wikimedia.org/T96053#1207948 (10Dzahn) uploaded patches, one to create the user and one to add it to the groups. the second one should be considered the actual access request. added @j... [21:25:11] (03PS4) 10Andrew Bogott: Set up ssh keys so that designate can clear salt and puppet certs. 
[puppet] - 10https://gerrit.wikimedia.org/r/204067 [21:27:46] RECOVERY - RAID on eventlog1001 is OK no disks configured for RAID [21:27:56] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK All defined EventLogging jobs are runnning. [21:27:59] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 24 minutes ago with 0 failures [21:27:59] RECOVERY - SSH on eventlog1001 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [21:28:07] RECOVERY - DPKG on eventlog1001 is OK: All packages OK [21:28:17] RECOVERY - configured eth on eventlog1001 is OK - interfaces up [21:28:39] (03PS5) 10Andrew Bogott: Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067 [21:29:35] (03PS1) 10Alexandros Kosiaris: Enable debug output in hiera_lookup [puppet] - 10https://gerrit.wikimedia.org/r/204155 [21:31:29] (03CR) 10Andrew Bogott: Set up ssh keys so that designate can clear salt and puppet certs. [puppet] - 10https://gerrit.wikimedia.org/r/204067 (owner: 10Andrew Bogott) [21:35:53] mutante: could you review https://gerrit.wikimedia.org/r/#/c/203379/ please? it's a job runner change for SULF [21:39:14] legoktm: cannot merge [21:39:31] (03PS3) 10Legoktm: Set dedicated SUL rename runner loop [puppet] - 10https://gerrit.wikimedia.org/r/203379 (https://phabricator.wikimedia.org/T87397) (owner: 10Aaron Schulz) [21:39:39] Nemo_bis: weird, the rebase button worked [21:39:50] Funny [21:40:18] (03CR) 10Madhuvishy: [C: 031] create user for Madhu Viswanathan [puppet] - 10https://gerrit.wikimedia.org/r/204151 (https://phabricator.wikimedia.org/T96053) (owner: 10Dzahn) [21:40:28] https://gdash.wikimedia.org/dashboards/reqerror/ + [21:40:34] oops [21:44:04] (03PS6) 10Andrew Bogott: Set up ssh keys so that designate can clear salt and puppet certs. 
[puppet] - 10https://gerrit.wikimedia.org/r/204067 [21:45:28] 6operations, 5Patch-For-Review: setup/install/deploy labcontrol1001 - https://phabricator.wikimedia.org/T96048#1208028 (10RobH) [21:49:46] 6operations, 5Patch-For-Review: setup/install/deploy labcontrol1001 - https://phabricator.wikimedia.org/T96048#1208064 (10RobH) [21:49:48] (03PS19) 10Ori.livneh: Gzip SVGs on back upload varnishes. [puppet] - 10https://gerrit.wikimedia.org/r/108484 (https://bugzilla.wikimedia.org/54291) [21:50:06] (03CR) 10Ori.livneh: [C: 032 V: 032] Gzip SVGs on back upload varnishes. [puppet] - 10https://gerrit.wikimedia.org/r/108484 (https://bugzilla.wikimedia.org/54291) (owner: 10Ori.livneh) [21:50:27] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60542 bytes in 1.385 second response time [21:53:03] (03PS1) 10MaxSem: WIP: OSM: rename osm and similar classes to osm::db [puppet] - 10https://gerrit.wikimedia.org/r/204161 [21:53:11] (03PS2) 10Dzahn: releases: move zip install out of node into role [puppet] - 10https://gerrit.wikimedia.org/r/203501 [21:55:09] (03PS3) 10Dzahn: releases: move zip install out of node into role [puppet] - 10https://gerrit.wikimedia.org/r/203501 [21:59:09] (03PS4) 10Dzahn: releases: move zip install out of node into role [puppet] - 10https://gerrit.wikimedia.org/r/203501 [21:59:16] (03PS6) 10Alexandros Kosiaris: ssh::userkey: Allow a prefix to be specified for a key [puppet] - 10https://gerrit.wikimedia.org/r/202731 [21:59:18] (03PS6) 10Alexandros Kosiaris: Specify ssh userkey policy for ganeti clusters [puppet] - 10https://gerrit.wikimedia.org/r/202730 [21:59:51] how do I easily OOM PHP? [22:00:03] like, 4Gigs of it... [22:00:06] * YuviPanda ponders [22:00:12] inside 30s [22:00:22] and yet toollabs user end up doing it all the time... [22:00:44] recursion + giant wasteful structures copied onto the stack at each layer? 
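(Editor's sketch, not part of the log.) The question above, about using up roughly 4 GB of PHP memory inside 30 seconds, comes down to simple arithmetic if a buffer doubles on every step. Below is a minimal illustration in Python rather than PHP. The function name `doublings_to_exceed`, the 1-byte starting size, and reading "4Gigs" as 4 GiB are all assumptions made for the example. The code only counts doublings and allocates nothing.

```python
def doublings_to_exceed(target_bytes: int, start_bytes: int = 1) -> int:
    """Count how many doublings a buffer of start_bytes needs to exceed target_bytes."""
    if start_bytes <= 0:
        raise ValueError("start_bytes must be positive")
    size = start_bytes
    count = 0
    # Keep doubling until the hypothetical buffer is strictly larger than the target.
    while size <= target_bytes:
        size *= 2
        count += 1
    return count


# Assumption: the "4Gigs" figure from the chat is read as 4 GiB.
FOUR_GIB = 4 * 1024 ** 3

if __name__ == "__main__":
    print(doublings_to_exceed(FOUR_GIB))
```

On paper a few dozen doublings are enough. Whether a real PHP process gets that far depends on its configured memory limit, which this sketch does not model.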
[22:00:52] (03PS5) 10Dzahn: releases: install unzip in module, not on node level [puppet] - 10https://gerrit.wikimedia.org/r/203501 [22:01:22] (03CR) 10Alexandros Kosiaris: "After going down the rabbit hole of which characters are valid in a username, I opted for /etc/ssh/userkeys/${user.d}/${skey} where skey s" [puppet] - 10https://gerrit.wikimedia.org/r/202731 (owner: 10Alexandros Kosiaris) [22:01:34] bblack: yeah, I had an exponentially growing string in a while (true) [22:01:37] apparently not enough [22:01:39] oooh but stacks [22:02:19] yeah, like parse some huge data structure with tons of references, and keep passing *copying* it as arguments into recursion until you reach the recursion limit? [22:02:35] yeah but that all sounds like ‘work’ :D [22:02:35] I don't know but I would expect PHP has a fixed recursion limit like Perl does (or did?). Something like 100? [22:02:40] yeah, probably [22:03:11] oh, the default is 100,000 lol [22:03:15] (I have https://phabricator.wikimedia.org/P519 now) [22:03:21] (03PS1) 10RobH: labcontrol1001 needs to be trusty [puppet] - 10https://gerrit.wikimedia.org/r/204166 [22:03:23] aaarg [22:03:24] I’m an idiot [22:03:27] of course that doesn’t work [22:03:30] because it is ‘.’ in PHP [22:03:31] and not + [22:03:31] so yeah, recursion + copying big stuff through stack arguments would do it [22:03:40] jdlrobson: your fix is deployed to test.wiki. Please test. [22:03:56] on the plus side, it means I’ve been away from PHP enough to forget shit like that [22:04:39] works fine on test wiki kaldari [22:04:48] (03CR) 10Alexandros Kosiaris: [C: 032] ganeti: Reference correctly the ganeti cluster nodes [puppet] - 10https://gerrit.wikimedia.org/r/203035 (owner: 10Alexandros Kosiaris) [22:05:18] try something like "function foo($a) { foo($a . 
$a) }" -ish [22:05:41] !log rebooting all labvirt100x hosts to enable virtualization in the bios [22:05:49] Logged the message, Master [22:06:07] but really it's not *that* much better than a 100000-iteration loop. A little, though! [22:06:32] !log rmoen Synchronized php-1.26wmf1/extensions/Gather/: Update gather with cherry picks (duration: 00m 11s) [22:06:38] Logged the message, Master [22:06:40] but I think with big data structures instead of a slowly-growing string, recursion is the more likely candidate for how users accidentally do it in the real world, especially with a default 100K limit [22:07:09] maybe reduce the default recursion limit in labs ? [22:07:46] PROBLEM - Host labvirt1001 is DOWN: PING CRITICAL - Packet loss = 100% [22:08:16] bblack: hmm, so even with that, I get PHP terminating the request for having too much memory [22:08:16] oh I read the wrong docs lol, PHP's limit is 100 not 100K [22:08:25] > 2015-04-14 22:05:20: (mod_fastcgi.c.2673) FastCGI-stderr: PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate [22:08:33] and the grid engine does *not* kill them [22:08:37] (03CR) 10John F. 
Lewis: [C: 031] releases: install unzip in module, not on node level [puppet] - 10https://gerrit.wikimedia.org/r/203501 (owner: 10Dzahn) [22:08:43] I googled the wrong thing and got PHP PCRE recursion instead, but it all looked the same when the page was open for 0.5s :) [22:08:47] which makes me question why these are dying otherwise [22:09:01] (03PS6) 10Dzahn: releases: install unzip in module, not on node level [puppet] - 10https://gerrit.wikimedia.org/r/203501 (https://phabricator.wikimedia.org/T83213) [22:09:09] (03CR) 10RobH: [C: 032] labcontrol1001 needs to be trusty [puppet] - 10https://gerrit.wikimedia.org/r/204166 (owner: 10RobH) [22:11:27] robla: merge conflict, i'm doing both on the master [22:11:38] robla: sorry, i meant robh, tab complete [22:11:43] robh: ^ [22:11:55] mutante: k [22:15:58] (03PS6) 10Alexandros Kosiaris: Provision the ssh key added in 3c8c524 [puppet] - 10https://gerrit.wikimedia.org/r/201462 [22:17:00] andrewbogott: ^ look, akosiaris provisions keys as well, and just uses file [22:17:14] well, both [22:17:39] (03Abandoned) 10Ori.livneh: Gzip .svg and .ico files on bits.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/113687 (https://bugzilla.wikimedia.org/61442) (owner: 10Brion VIBBER) [22:17:39] ssh::userkey and file for the private part and then another file for completeness [22:17:58] in another case for CI i just used 2 file{}s [22:22:23] (03PS5) 10BBlack: get rid of $active_nodes[api] (unused) [puppet] - 10https://gerrit.wikimedia.org/r/204052 [22:22:25] (03PS5) 10BBlack: r::c::config::active_nodes -> hiera cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [22:22:44] (03CR) 10BBlack: [C: 032 V: 032] get rid of $active_nodes[api] (unused) [puppet] - 10https://gerrit.wikimedia.org/r/204052 (owner: 10BBlack) [22:28:07] (03PS3) 10Gage: create user for Madhu Viswanathan [puppet] - 10https://gerrit.wikimedia.org/r/204151 (https://phabricator.wikimedia.org/T96053) (owner: 10Dzahn) [22:29:22] (03PS3) 10Gage: 
access: add madhuvishy to various analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/204152 (https://phabricator.wikimedia.org/T96053) (owner: 10Dzahn) [22:29:41] (03CR) 10Gage: [C: 032] create user for Madhu Viswanathan [puppet] - 10https://gerrit.wikimedia.org/r/204151 (https://phabricator.wikimedia.org/T96053) (owner: 10Dzahn) [22:31:08] (03PS2) 10MaxSem: WIP: OSM: rename osm and similar classes to osm::db [puppet] - 10https://gerrit.wikimedia.org/r/204161 [22:31:37] (03CR) 10Gage: [C: 032] access: add madhuvishy to various analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/204152 (https://phabricator.wikimedia.org/T96053) (owner: 10Dzahn) [22:33:28] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Access for new Analytics Dev: Madhu Viswanathan - https://phabricator.wikimedia.org/T96053#1208187 (10Gage) 5Open>3Resolved a:3Gage Patches merged. User account is created and access is granted. [22:33:37] RECOVERY - Host labvirt1001 is UPING OK - Packet loss = 0%, RTA = 0.62 ms [22:35:44] (03PS6) 10BBlack: r::c::config::active_nodes -> hiera cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [22:37:46] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=259.40 Read Requests/Sec=54.30 Write Requests/Sec=2.50 KBytes Read/Sec=2400.80 KBytes_Written/Sec=59.60 [22:44:27] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=96.30 Read Requests/Sec=70.30 Write Requests/Sec=79.20 KBytes Read/Sec=865.60 KBytes_Written/Sec=631.75 [22:46:47] PROBLEM - puppet last run on netmon1001 is CRITICAL puppet fail [22:46:53] (03PS1) 10Dzahn: mailman: adjust io stat monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/204179 [22:47:28] RoanKattouw: can I add a CentralAuth patch to SWAT? 
you've filled up the patch limit [22:47:38] Sure [22:47:49] Some of my patches are in the same extension [22:48:27] thanks [22:49:14] I'm starting early so we don't end up waiting past 5pm for everything to clear Jenkins [22:49:18] Although we're still gonna be late [22:49:29] I'm gonna do SWAT today obviously, since most of the patches are mine [22:53:47] 6operations, 5Patch-For-Review: setup/install/deploy labcontrol1001 - https://phabricator.wikimedia.org/T96048#1208237 (10RobH) a:5RobH>3Andrew [22:54:08] 6operations, 5Patch-For-Review: setup/install/deploy labcontrol1001 - https://phabricator.wikimedia.org/T96048#1207123 (10RobH) ready for puppet key acceptance and service implementation, handing off to @andrew [22:56:47] (03PS2) 10Dzahn: mailman: adjust io stat monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/204179 [22:58:11] (03PS3) 10Dzahn: mailman: adjust io stat monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/204179 [22:58:50] (03CR) 10Dzahn: [C: 032] mailman: adjust io stat monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/204179 (owner: 10Dzahn) [22:58:52] (03PS7) 10BBlack: r::c::config::active_nodes -> hiera cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [22:59:25] (03PS4) 10Dzahn: mailman: adjust io stat monitoring threshold [puppet] - 10https://gerrit.wikimedia.org/r/204179 [23:00:05] RoanKattouw, ^d, Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150414T2300). Please do the needful. [23:00:21] * RoanKattouw takes it [23:08:48] PROBLEM - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=130.10 Read Requests/Sec=37.76 Write Requests/Sec=56.54 KBytes Read/Sec=2296.10 KBytes_Written/Sec=1248.00 [23:11:25] sodium, you've made your point. Now be quiet. 
[23:12:11] yea, it doesnt know until next run on neon [23:12:22] didnt even force it [23:14:32] icinga-wm: ack please [23:14:34] OK mutante [23:14:44] ACKNOWLEDGEMENT - mailman I/O stats on sodium is CRITICAL - I/O stats: Transfers/Sec=151.10 Read Requests/Sec=50.50 Write Requests/Sec=27.80 KBytes Read/Sec=489.20 KBytes_Written/Sec=444.20 daniel_zahn raising threshold [23:15:18] RECOVERY - mailman I/O stats on sodium is OK - I/O stats: Transfers/Sec=29.00 Read Requests/Sec=0.20 Write Requests/Sec=24.20 KBytes Read/Sec=0.80 KBytes_Written/Sec=219.05 [23:15:22] icinga-wm: woah, when did *that* happen [23:15:32] mutante: ^ [23:15:32] lol, nice timing @ recovery [23:15:45] or is that just timing? [23:15:48] icinga-wm: ack please [23:15:51] noooo [23:15:57] YuviPanda: unfortunately that isn't real :/ [23:16:00] hehe [23:16:05] where did ‘OK mutante’ come from? [23:16:08] i was demonstrating what i want [23:16:12] echo :p [23:16:16] :D [23:16:21] great trolling vector however [23:16:44] i have the shell command to create the ACK though [23:17:00] well almost, a scheduled downtime but very similar [23:18:47] YuviPanda: i'd be mainly unsure about how to authenticate users on IRC to make that happen [23:19:01] cloaks :) [23:19:18] if the bot was also in production? [23:19:29] but otherwise it would be labs bot talking to prod monitoring server.. meh [23:19:59] mutante: icinga-wm is in prod. 
[23:20:14] of course, true [23:20:25] RoanKattouw: https://gerrit.wikimedia.org/r/204191 [23:20:29] adding it to wikitech as well, thats the core bump [23:20:36] Cool [23:24:29] (03PS7) 10Alex Monk: Upgrade dbtree to jquery 2.1.1 and jquery-ui 1.11.2 [software] - 10https://gerrit.wikimedia.org/r/125883 (owner: 10Reedy) [23:24:45] (03PS4) 10Alex Monk: Rename chapcomwiki to affcomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169939 (https://bugzilla.wikimedia.org/39482) (owner: 10Reedy) [23:30:34] (03PS3) 10MaxSem: OSM: rename osm and similar classes to osm::db [puppet] - 10https://gerrit.wikimedia.org/r/204161 [23:32:34] (03PS4) 10MaxSem: OSM: rename osm and similar classes to osm::db [puppet] - 10https://gerrit.wikimedia.org/r/204161 [23:32:39] (03CR) 10Chad: "I don't like this approach. Having dbname not match the wiki name is just asking for trouble." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/134962 (owner: 10Reedy) [23:35:03] can tell Chad reviewed the "faux-renaming" patch without clicking [23:35:08] (03PS1) 10Yuvipanda: tools: Separate registration / unregistreation for proxylistener [puppet] - 10https://gerrit.wikimedia.org/r/204193 (https://phabricator.wikimedia.org/T96059) [23:35:30] (03CR) 10Yuvipanda: [C: 04-2] "WIP" [puppet] - 10https://gerrit.wikimedia.org/r/204193 (https://phabricator.wikimedia.org/T96059) (owner: 10Yuvipanda) [23:41:21] Oh dammit I forgot WikiEditor [23:42:46] (03PS8) 10BBlack: r::c::config::active_nodes -> hiera cache::nodes [puppet] - 10https://gerrit.wikimedia.org/r/204068 [23:46:41] RoanKattouw: Are you going to deploy the Gather change as well? [23:47:00] Yes [23:47:10] Sorry I just gave up on waiting for Jenkins and overrode it for the remaining changes [23:47:19] RoanKattouw: Cool. 
Jon is leaving in a bit, so if he’s not around I can help test it [23:47:21] It rejected the CentralAuth change twice with a bogus qunit error [23:47:37] Going to deploy now [23:47:49] RoanKattouw: Yeah, I had to do the same yesterday (after waiting on Jenkins for 50 minutes) [23:48:04] I started prepping and merging things hours ago [23:48:07] It's ridiculous [23:48:40] RoanKattouw: We went ahead and merged the Gather change a couple hours ago so you wouldn’t have to wait for it to build. [23:48:54] (03PS2) 10Yuvipanda: tools: Separate registration / unregistreation for proxylistener [puppet] - 10https://gerrit.wikimedia.org/r/204193 (https://phabricator.wikimedia.org/T96059) [23:49:39] Yeah, thanks [23:49:47] Unfortunately the MW core change is the one that takes the longest [23:50:24] 6operations, 10Wikimedia-Mailing-lists: Update mailman listinfo.txt template - https://phabricator.wikimedia.org/T96108#1208416 (10JohnLewis) 3NEW [23:50:40] (03CR) 10Aaron Schulz: [C: 031] Set dedicated SUL rename runner loop [puppet] - 10https://gerrit.wikimedia.org/r/203379 (https://phabricator.wikimedia.org/T87397) (owner: 10Aaron Schulz) [23:51:32] !log catrope Synchronized php-1.26wmf1/extensions/WikiEditor: SWAT (duration: 00m 11s) [23:51:41] Logged the message, Master [23:51:46] !log catrope Synchronized php-1.26wmf1/extensions/Flow: SWAT (duration: 00m 13s) [23:51:52] Logged the message, Master [23:51:58] !log catrope Synchronized php-1.26wmf1/extensions/CentralAuth: SWAT (duration: 00m 12s) [23:52:03] Logged the message, Master [23:52:11] !log catrope Synchronized php-1.26wmf1/extensions/VisualEditor: SWAT (duration: 00m 12s) [23:52:16] Logged the message, Master [23:52:36] (03PS4) 10Ori.livneh: Set dedicated SUL rename runner loop [puppet] - 10https://gerrit.wikimedia.org/r/203379 (https://phabricator.wikimedia.org/T87397) (owner: 10Aaron Schulz) [23:52:48] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [23:53:47] (03PS1) 10BBlack: 
remove squid references from torrus cdn stuff [puppet] - 10https://gerrit.wikimedia.org/r/204196 [23:54:12] !log catrope Synchronized php-1.25wmf24/extensions/WikiEditor: SWAT (duration: 00m 11s) [23:54:17] Logged the message, Master [23:54:25] !log catrope Synchronized php-1.25wmf24/extensions/Gather: SWAT (duration: 00m 12s) [23:54:30] Logged the message, Master [23:54:32] kaldari: ---^^ [23:54:37] (03CR) 10Dzahn: [C: 031] "no more squid and puppet complains it can't parse parts of it" [puppet] - 10https://gerrit.wikimedia.org/r/204196 (owner: 10BBlack) [23:54:38] !log catrope Synchronized php-1.25wmf24/extensions/CentralAuth: SWAT (duration: 00m 13s) [23:54:43] Logged the message, Master [23:54:50] ebernhardson: legoktm: Your things just went out too [23:54:50] !log catrope Synchronized php-1.25wmf24/extensions/VisualEditor: SWAT (duration: 00m 12s) [23:54:51] RoanKattouw: looking [23:54:55] Logged the message, Master [23:55:02] RoanKattouw: thanks, checking [23:55:28] (03CR) 10BBlack: [C: 032 V: 032] remove squid references from torrus cdn stuff [puppet] - 10https://gerrit.wikimedia.org/r/204196 (owner: 10BBlack) [23:55:44] RoanKattouw: works a charm. thanks [23:55:50] RoanKattouw: thanks, lgtm! [23:56:38] (03CR) 10Ori.livneh: [C: 032] Set dedicated SUL rename runner loop [puppet] - 10https://gerrit.wikimedia.org/r/203379 (https://phabricator.wikimedia.org/T87397) (owner: 10Aaron Schulz) [23:56:47] ori: thanks :) [23:57:01] no problem [23:58:30] (03Abandoned) 10Reedy: Upgrade dbtree to jquery 2.1.1 and jquery-ui 1.11.2 [software] - 10https://gerrit.wikimedia.org/r/125883 (owner: 10Reedy)