[00:01:03] (PS1) Dzahn: mira, codfw deploy server, set $cluster [puppet] - https://gerrit.wikimedia.org/r/210836
[00:02:53] (Abandoned) BBlack: Enable Varnish caching for graphoid [puppet] - https://gerrit.wikimedia.org/r/210747 (https://phabricator.wikimedia.org/T98803) (owner: Yurik)
[00:03:46] !log ebernhardson Synchronized php-1.26wmf5/extensions/Gather/: SWAT Submodule bump for Gather extension (duration: 00m 12s)
[00:03:51] Logged the message, Master
[00:03:53] rmoen: ^^ please check
[00:04:05] ls
[00:04:35] ebernhardson, all good. thanks
[00:04:49] (PS1) Dzahn: add role::deployment::server to mira.codfw.wmnet [puppet] - https://gerrit.wikimedia.org/r/210837 (https://phabricator.wikimedia.org/T95436)
[00:05:22] ok, and with that swat is done
[00:06:13] (CR) Dzahn: [C: -1] "setting the $cluster variable that way is outdated, either we don't need it at all or we set it in hiera. it's already in ganglia in "misc" [puppet] - https://gerrit.wikimedia.org/r/209874 (https://phabricator.wikimedia.org/T95436) (owner: John F. Lewis)
[00:08:29] i wonder if including "scap::master" on a second node would influence the existing scap in any way
[00:14:25] (PS6) Yuvipanda: [WIP] Add lighttpd server types [software/tools-webservice] - https://gerrit.wikimedia.org/r/210833
[00:14:57] (CR) Gage: [C: 2] puppetmaster: Do not manage certmanager's home [puppet] - https://gerrit.wikimedia.org/r/209261 (owner: Alexandros Kosiaris)
[00:16:43] (PS7) Yuvipanda: [WIP] Add lighttpd server types [software/tools-webservice] - https://gerrit.wikimedia.org/r/210833
[00:19:27] (PS8) Yuvipanda: [WIP] Add lighttpd server types [software/tools-webservice] - https://gerrit.wikimedia.org/r/210833
[00:21:30] operations, database: Database connection failure issues on s7 shard - https://phabricator.wikimedia.org/T98998#1283975 (Springle) Not an overreaction.
But logical backups run on dbstore1001, not on the production shard itself, and the "dumps"[1], if that's what you saw, should only affect the s7 "vslow" s...
[00:21:37] (PS9) Yuvipanda: [WIP] Add lighttpd server types [software/tools-webservice] - https://gerrit.wikimedia.org/r/210833
[00:22:48] (CR) Dzahn: [C: -1] "role deployment::server definitely needs work for this to be possible. for example tin is set to default rsync_host in scap::master which " [puppet] - https://gerrit.wikimedia.org/r/210837 (https://phabricator.wikimedia.org/T95436) (owner: Dzahn)
[00:22:52] (PS1) Dzahn: WIP: deployment: make rsync_host configurable [puppet] - https://gerrit.wikimedia.org/r/210838 (https://phabricator.wikimedia.org/T95436)
[00:27:00] (PS10) Yuvipanda: [WIP] Add lighttpd server types [software/tools-webservice] - https://gerrit.wikimedia.org/r/210833
[00:33:41] (PS11) Yuvipanda: Add lighttpd server types [software/tools-webservice] - https://gerrit.wikimedia.org/r/210833 (https://phabricator.wikimedia.org/T98818)
[00:39:39] (PS1) Dominics: Fix invalid hash in example documentation [puppet/jmxtrans] - https://gerrit.wikimedia.org/r/210840
[00:49:26] !log restarting elasticsearch on elastic1006
[00:49:33] Logged the message, Master
[00:49:59] ori: is there a puppet change corresponding to https://gerrit.wikimedia.org/r/#/c/175007/ ? Looking at, eg, foreachwiki, which passes full paths to getRealmSpecificFilename
[00:51:09] ori: also some hardcoded paths like maintenance.pp:119 we need to trawl for, and fix
[00:58:14] springle: it hadn't occurred to me to look in operations/puppet. i'll submit a companion change.
[01:03:28] http://justdelete.me/#wikipedia
[01:05:43] (CR) Springle: "Update:" [mediawiki-config] - https://gerrit.wikimedia.org/r/175007 (owner: Ori.livneh)
[01:08:32] operations, Analytics-Engineering: Install enchant on stat1002 and stat1003 - https://phabricator.wikimedia.org/T99030#1284064 (Dzahn) We would install packages via puppet, not manually. So this needs a puppet code change with a list of all the package names.
[01:11:46] operations, Analytics-Engineering: Install enchant on stat1002 and stat1003 - https://phabricator.wikimedia.org/T99030#1284065 (Dzahn) There is no common role that is used on both stat1002 and stat1003. So i'm wondering where this should be added. Is it definitely needed on both of them? stat1003 is only...
[01:12:53] mutante, yeah, info queues would get quite a few requests about 'deleting accounts'
[01:14:11] It can't be allowed for accounts with edits of course
[01:26:57] (PS1) Springle: repool db1060 [mediawiki-config] - https://gerrit.wikimedia.org/r/210844
[01:27:34] (CR) Springle: [C: 2] repool db1060 [mediawiki-config] - https://gerrit.wikimedia.org/r/210844 (owner: Springle)
[01:27:40] (Merged) jenkins-bot: repool db1060 [mediawiki-config] - https://gerrit.wikimedia.org/r/210844 (owner: Springle)
[01:28:53] !log springle Synchronized wmf-config/db-eqiad.php: repool db1060, warm up (duration: 00m 14s)
[01:29:00] Logged the message, Master
[01:30:12] (PS3) Springle: expand list of CODFW slaves ready for traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/210004
[01:32:14] (CR) Springle: [C: 2] expand list of CODFW slaves ready for traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/210004 (owner: Springle)
[01:32:20] (Merged) jenkins-bot: expand list of CODFW slaves ready for traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/210004 (owner: Springle)
[01:33:52] !log springle Synchronized wmf-config/db-codfw.php: pool new codfw slaves
(duration: 00m 11s)
[01:33:59] Logged the message, Master
[01:40:44] (PS1) Dzahn: add class to install enchant and myspell packages [puppet] - https://gerrit.wikimedia.org/r/210846 (https://phabricator.wikimedia.org/T99030)
[01:44:07] (PS1) Dzahn: stats: incl enchant role in statistics::cruncher [puppet] - https://gerrit.wikimedia.org/r/210847 (https://phabricator.wikimedia.org/T99030)
[01:46:20] operations, Analytics-Engineering, Patch-For-Review: Install enchant on stat1002 and stat1003 - https://phabricator.wikimedia.org/T99030#1284074 (Dzahn) @halfak here's the list i get on stat1002: https://gerrit.wikimedia.org/r/#/c/210846/1/modules/statistics/manifests/enchant.pp including that in the...
[01:46:47] operations, Analytics-Engineering, Patch-For-Review: Install enchant on stat1002 and stat1003 - https://phabricator.wikimedia.org/T99030#1284075 (Dzahn) a:Dzahn
[01:47:57] !log restarting elastic1007
[01:48:04] Logged the message, Master
[01:48:06] !log sorry - restarting elasticsearch on elastic1007
[01:48:12] Logged the message, Master
[02:23:42] (PS7) Ori.livneh: Move *.dblist to dblists/ [mediawiki-config] - https://gerrit.wikimedia.org/r/175007
[02:23:45] (CR) jenkins-bot: [V: -1] Move *.dblist to dblists/ [mediawiki-config] - https://gerrit.wikimedia.org/r/175007 (owner: Ori.livneh)
[02:24:03] (PS8) Ori.livneh: Move *.dblist to dblists/ [mediawiki-config] - https://gerrit.wikimedia.org/r/175007
[02:24:20] (CR) Ori.livneh: "PS7: Add symlinks at the old locations, using: for f in dblists/*.dblist; do ln -s $f "$(basename $f)"; done" [mediawiki-config] - https://gerrit.wikimedia.org/r/175007 (owner: Ori.livneh)
[02:24:36] !log l10nupdate Synchronized php-1.26wmf5/cache/l10n: (no message) (duration: 05m 51s)
[02:24:46] Logged the message, Master
[02:25:18] (PS1) Springle: move db1019 to s4 [puppet] - https://gerrit.wikimedia.org/r/210848
[02:26:34] (CR) Springle: [C: 2] move db1019 to s4 [puppet]
- https://gerrit.wikimedia.org/r/210848 (owner: Springle)
[02:29:05] !log LocalisationUpdate completed (1.26wmf5) at 2015-05-14 02:28:02+00:00
[02:29:12] Logged the message, Master
[02:37:33] (PS1) Jalexander: Open external links on voteWiki in new tab/window [mediawiki-config] - https://gerrit.wikimedia.org/r/210849 (https://phabricator.wikimedia.org/T98013)
[02:40:52] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[02:43:33] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[02:44:34] !log xtrabackup clone db1056 to db1019
[02:44:42] Logged the message, Master
[02:47:31] !log l10nupdate Synchronized php-1.26wmf6/cache/l10n: (no message) (duration: 04m 16s)
[02:47:38] Logged the message, Master
[02:48:13] PROBLEM - are wikitech and wt-static in sync on silver is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (91069s 90000s)
[02:50:56] !log LocalisationUpdate completed (1.26wmf6) at 2015-05-14 02:49:53+00:00
[02:51:05] Logged the message, Master
[02:55:01] !log restarting elasticsearch on elastic1008
[02:55:07] Logged the message, Master
[02:57:43] (PS1) Yuvipanda: Add simple debian packaging [software/tools-webservice] - https://gerrit.wikimedia.org/r/210850
[02:58:44] Ops-Access-Requests, operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1284109 (Wwes) While we wait for a simple non tedious manual solve, how about we start with these for wikipedias: en, de, zh, ru, it, es, fr, ja, pt, tr, nl, pl, ar, ko, hi and wikipedia.org The...
[03:00:50] operations, Analytics-Engineering, Patch-For-Review: Install enchant on stat1002 and stat1003 - https://phabricator.wikimedia.org/T99030#1284110 (Halfak) I was planning to use them on stat1003, but stat1002 fills a similar use-case for me.
It's hard for me to answer this question, but if I were to choo...
[03:02:31] (PS12) Yuvipanda: Add lighttpd server types [software/tools-webservice] - https://gerrit.wikimedia.org/r/210833 (https://phabricator.wikimedia.org/T98818)
[03:02:33] (PS2) Yuvipanda: Add simple debian packaging [software/tools-webservice] - https://gerrit.wikimedia.org/r/210850
[03:18:03] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 883.913629338
[03:24:09] (Abandoned) Yuvipanda: [WIP] Add lighttpd webservice type [software/tools-webservice] - https://gerrit.wikimedia.org/r/210266 (owner: Yuvipanda)
[04:30:03] (PS1) Tim Landscheidt: Labs: Include public IPs in ferm's $INTERNAL [puppet] - https://gerrit.wikimedia.org/r/210853 (https://phabricator.wikimedia.org/T96924)
[04:44:18] operations, Parsoid, Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1284201 (GWicke) Today iojs joined the node foundation, and it sounds like the iojs branch will be the basis for ongoing node.js development: - https://github.com/iojs/io.js/issues/1664#issuecomment-1018283...
[04:44:55] (CR) Tim Landscheidt: "I tested that this indeed adds the public IPs to iptables rules.
The IP range is somewhat redundant with the various constants in manifes" [puppet] - https://gerrit.wikimedia.org/r/210853 (https://phabricator.wikimedia.org/T96924) (owner: Tim Landscheidt)
[05:04:12] RECOVERY - are wikitech and wt-static in sync on silver is OK: wikitech-static OK - wikitech and wikitech-static in sync (10733 90000s)
[05:07:12] !log LocalisationUpdate ResourceLoader cache refresh completed at Thu May 14 05:06:09 UTC 2015 (duration 6m 8s)
[05:07:17] Logged the message, Master
[05:25:42] PROBLEM - puppet last run on mw1031 is CRITICAL Puppet has 1 failures
[05:26:43] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[05:31:33] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0]
[05:41:52] RECOVERY - puppet last run on mw1031 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:01:45] (CR) John F. Lewis: [C: -1] "Transferring comments from my patch I made :" [puppet] - https://gerrit.wikimedia.org/r/210838 (https://phabricator.wikimedia.org/T95436) (owner: Dzahn)
[06:06:03] operations, Hackathon-Lyon-2015, Wikimedia-Site-requests, I18n, Tracking: Wikis waiting to be renamed (tracking) - https://phabricator.wikimedia.org/T21986#1284225 (Verdy_p) It would be much simpler to just add within the MediaWiki software only the simple support of database name aliases. And s...
[06:17:49] good morning
[06:18:20] morning jynus
[06:29:13] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures
[06:31:33] PROBLEM - puppet last run on db2036 is CRITICAL Puppet has 1 failures
[06:32:46] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 1 failures
[06:33:13] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures
[06:33:32] PROBLEM - puppet last run on cp4004 is CRITICAL Puppet has 1 failures
[06:33:43] PROBLEM - puppet last run on mw1235 is CRITICAL Puppet has 1 failures
[06:34:03] PROBLEM - puppet last run on db1034 is CRITICAL Puppet has 1 failures
[06:34:13] PROBLEM - puppet last run on mw2206 is CRITICAL Puppet has 1 failures
[06:34:22] PROBLEM - puppet last run on db2040 is CRITICAL Puppet has 1 failures
[06:34:54] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures
[06:35:03] PROBLEM - puppet last run on mw2123 is CRITICAL Puppet has 1 failures
[06:35:13] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 2 failures
[06:35:14] PROBLEM - puppet last run on mw1213 is CRITICAL Puppet has 1 failures
[06:35:23] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures
[06:35:24] PROBLEM - puppet last run on mw2092 is CRITICAL Puppet has 1 failures
[06:35:24] PROBLEM - puppet last run on mw2096 is CRITICAL Puppet has 1 failures
[06:35:24] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures
[06:35:32] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:35:54] PROBLEM - puppet last run on mw2059 is CRITICAL Puppet has 1 failures
[06:45:52] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[06:46:13] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:46:33] RECOVERY - puppet last run on mw2123 is OK Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:46:34] RECOVERY - puppet last run on
mw1061 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures
[06:46:43] RECOVERY - puppet last run on mw1213 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:46:43] RECOVERY - puppet last run on mw1235 is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:46:53] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:46:53] RECOVERY - puppet last run on mw2096 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures
[06:46:53] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[06:46:54] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:47:03] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:47:03] RECOVERY - puppet last run on db1034 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:22] RECOVERY - puppet last run on mw2206 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:23] RECOVERY - puppet last run on db2040 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:47:23] RECOVERY - puppet last run on mw2059 is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures
[06:47:53] RECOVERY - puppet last run on db2036 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:53] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:12] RECOVERY - puppet last run on cp4004 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:33] RECOVERY - puppet last run on mw2092 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:50:18] (PS1) Aaron Schulz: Removed "refreshLinks" from $wgJobBackoffThrottling
[mediawiki-config] - https://gerrit.wikimedia.org/r/210857
[07:21:56] (PS1) Jcrespo: Adding master host CNAME for new mariadb shard on labs (m5) This is required for MySQL scripts to work properly. [dns] - https://gerrit.wikimedia.org/r/210859 (https://phabricator.wikimedia.org/T92693)
[07:44:17] (PS2) Filippo Giunchedi: mediawiki: adjust jobq alarms [puppet] - https://gerrit.wikimedia.org/r/207785 (https://phabricator.wikimedia.org/T87594)
[07:44:24] (CR) Filippo Giunchedi: [C: 2 V: 2] mediawiki: adjust jobq alarms [puppet] - https://gerrit.wikimedia.org/r/207785 (https://phabricator.wikimedia.org/T87594) (owner: Filippo Giunchedi)
[07:57:03] (PS3) Giuseppe Lavagetto: hiera: use the proxy backend, rationalize the hierarchy [puppet] - https://gerrit.wikimedia.org/r/207129
[07:59:21] (CR) Filippo Giunchedi: [C: -1] etcd: create puppet module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) (owner: Giuseppe Lavagetto)
[07:59:42] (CR) Filippo Giunchedi: etcd: create puppet module (1 comment) [puppet] - https://gerrit.wikimedia.org/r/208928 (https://phabricator.wikimedia.org/T97973) (owner: Giuseppe Lavagetto)
[08:05:00] operations, MediaWiki-JobQueue, MediaWiki-JobRunner, Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1284369 (ArielGlenn) 1.4 million. Looking good!
[08:07:01] (CR) Filippo Giunchedi: [C: 1] "thanks Bryan! how much WIP is WIP now?
:)" [puppet] - https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) (owner: BryanDavis)
[08:08:51] (CR) ArielGlenn: [C: 2 V: 2] "this was +1ed by ryan on github: https://github.com/trebuchet-deploy/trigger/pull/29 and the package resulting from it has been used in p" [software/deployment/trebuchet-trigger] - https://gerrit.wikimedia.org/r/209775 (owner: ArielGlenn)
[08:10:17] (CR) Filippo Giunchedi: "looks like we could still benefit from this (?)" [puppet] - https://gerrit.wikimedia.org/r/190501 (owner: Ori.livneh)
[08:25:43] operations: fix trebuchet-trigger (git deploy) publish.runner arguments - https://phabricator.wikimedia.org/T98581#1284374 (ArielGlenn) Open>Resolved This is merged, package has been installed and used for deploys without incident.
[08:25:45] operations: upgrade salt on production cluster - https://phabricator.wikimedia.org/T98580#1284376 (ArielGlenn)
[08:26:40] operations: Upgrade salt to 2014.7 (investigating) - https://phabricator.wikimedia.org/T88971#1284379 (ArielGlenn)
[08:26:41] operations: upgrade salt on production cluster - https://phabricator.wikimedia.org/T98580#1284377 (ArielGlenn) Open>Resolved Package in repo, deploys work so this is done.
[08:27:00] operations, Deployment-Systems: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#1284382 (ArielGlenn)
[08:27:02] operations: Cannot use dsh-based restart of parsoid from tin anymore - https://phabricator.wikimedia.org/T87803#1284383 (ArielGlenn)
[08:27:04] operations: Upgrade salt to 2014.7 (investigating) - https://phabricator.wikimedia.org/T88971#1284380 (ArielGlenn) Open>Resolved Done!
[08:47:10] (PS1) Jcrespo: Adding --no-version-check on pt-online-schema-change [software] - https://gerrit.wikimedia.org/r/210863
[08:49:16] * jynus will remember to limit commit messages width to 76 chars
[08:54:02] hehe
[08:54:04] yeah :)
[08:54:17] also see http://www.mediawiki.org/wiki/Gerrit/Commit_message_guidelines
[08:57:05] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 6.67% of data above the critical threshold [500.0]
[08:57:25] Ops-Access-Requests, operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1284403 (ArielGlenn) I've added you both to all of these where you weren't on them already, but only for http. Google now has a cap on the number of domains (1000) and we are over that, the cap m...
[08:59:07] (PS2) Jcrespo: --no-version-check to be used by default on pt-online-schema-change [software] - https://gerrit.wikimedia.org/r/210863
[09:00:21] ^now it was gerrit who broke it
[09:01:18] Ops-Access-Requests, operations, Patch-For-Review: Requesting access to contint-admins for Jan Zerebecki - https://phabricator.wikimedia.org/T98961#1284405 (ArielGlenn) Krenair, we will want some sort of vetting for wmde and for volunteers; we just have to figure out what that looks like. I'm taking gr...
[09:08:36] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[09:13:44] (PS2) ArielGlenn: admin: add jdouglas to researchers group [puppet] - https://gerrit.wikimedia.org/r/210772 (https://phabricator.wikimedia.org/T98536) (owner: Dzahn)
[09:14:57] (CR) ArielGlenn: [C: 2] admin: add jdouglas to researchers group [puppet] - https://gerrit.wikimedia.org/r/210772 (https://phabricator.wikimedia.org/T98536) (owner: Dzahn)
[09:22:48] Ops-Access-Requests, operations, Patch-For-Review: add jdouglas to researchers admin group - https://phabricator.wikimedia.org/T98536#1284419 (ArielGlenn) Open>Resolved a:ArielGlenn @Manybubbles: Correlation I think but it wouldn't be hard at all. I've gone ahead with the researchers group p...
[09:30:00] operations, Datasets-General-or-Unknown: snaphot1004 running dumps very slowly, investigate - https://phabricator.wikimedia.org/T98585#1284441 (ArielGlenn) Well I now know what this is. It's not a new leak, it's just that the largest single stubs file in our dumps runs is now produced by wikidata! And g...
[09:37:40] operations, Wikimedia-DNS: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1284443 (ArielGlenn) http://www-01.sil.org/iso639-3/documentation.asp?id=nsa Is this going to bite us if we use nsa.wikimedia.org?
[09:45:46] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 77.78% of data above the critical threshold [35.0]
[09:46:30] (CR) Jcrespo: [C: 2] Adding a new MariaDB master node, part of a new shard (m5) for miscelaneous services in labs Bug: T92693 [puppet] - https://gerrit.wikimedia.org/r/210660 (https://phabricator.wikimedia.org/T92693) (owner: Jcrespo)
[09:47:46] operations, Deployment-Systems: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#1284459 (ArielGlenn) Now that salt is upgraded, we need to fix up /usr/local/bin/service-restart so that it takes a timeout and the format of the args to publish.runner is updated.
[09:49:52] (PS2) Jcrespo: Adding a new MariaDB master node, part of a new shard (m5) for miscelaneous services in labs Bug: T92693 [puppet] - https://gerrit.wikimedia.org/r/210660 (https://phabricator.wikimedia.org/T92693)
[09:51:49] (CR) Jcrespo: [C: 2] Adding a new MariaDB master node, part of a new shard (m5) for miscelaneous services in labs Bug: T92693 [puppet] - https://gerrit.wikimedia.org/r/210660 (https://phabricator.wikimedia.org/T92693) (owner: Jcrespo)
[09:52:01] (PS1) Joal: Modify EventLogging forwarder configuration [puppet] - https://gerrit.wikimedia.org/r/210869
[09:54:35] PROBLEM - High load average on labstore1001 is CRITICAL 71.43% of data above the critical threshold [24.0]
[09:54:58] (PS1) Merlijn van Deen: shinken: add valhallasw to tools contactgroup [puppet] - https://gerrit.wikimedia.org/r/210870
[09:55:33] (PS2) Yuvipanda: shinken: add valhallasw to tools contactgroup [puppet] - https://gerrit.wikimedia.org/r/210870 (owner: Merlijn van Deen)
[09:55:46] (CR) Yuvipanda: [C: 2 V: 2] shinken: add valhallasw to tools contactgroup [puppet] - https://gerrit.wikimedia.org/r/210870 (owner: Merlijn van Deen)
[09:58:03] (PS1) Jcrespo: New grants for new mariadb shard (m5) [puppet] -
https://gerrit.wikimedia.org/r/210871 (https://phabricator.wikimedia.org/T98958)
[09:59:46] PROBLEM - puppet last run on db1009 is CRITICAL puppet fail
[10:00:07] (CR) Springle: [C: 1] New grants for new mariadb shard (m5) [puppet] - https://gerrit.wikimedia.org/r/210871 (https://phabricator.wikimedia.org/T98958) (owner: Jcrespo)
[10:00:55] (CR) Jcrespo: [C: 2] New grants for new mariadb shard (m5) [puppet] - https://gerrit.wikimedia.org/r/210871 (https://phabricator.wikimedia.org/T98958) (owner: Jcrespo)
[10:02:03] operations, Graphoid, RESTBase, Traffic, Patch-For-Review: Varnish does not honor Cache-Control for Graphoid - https://phabricator.wikimedia.org/T98803#1284497 (faidon) >>! In T98803#1278739, @Yurik wrote: > * I would recommend introducing at least a 15 minute caching to reduce graphoid's load...
[10:04:52] operations, Graphoid, RESTBase, Traffic, Patch-For-Review: Varnish does not honor Cache-Control for Graphoid - https://phabricator.wikimedia.org/T98803#1284521 (Yurik) @faidon, thanks for the comment. Agree that we should apply more scientific method to it - but at present we have very little d...
[10:05:36] "optimizing"?
[10:06:15] we generally cache content for 30 days and purge on change, I'd say that you'd have to explain why you deviate from that rather than call this an optimization
[10:06:20] yurik: ^
[10:07:46] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0]
[10:08:21] paravoid, i do plan to have it much longer - 30 days sounds fine - but per a long discussion with anomie, i plan to change the design of the graphs to allow for such long time. For now, graphs don't invalidate too well, and their hash depends on external data sources
[10:08:35] operations, Graphoid, RESTBase, Traffic, Patch-For-Review: Varnish does not honor Cache-Control for Graphoid - https://phabricator.wikimedia.org/T98803#1284527 (faidon) >>!
In T98803#1284521, @Yurik wrote: > @faidon, thanks for the comment. Agree that we should apply more scientific method to i...
[10:08:40] s/depends/does not depend
[10:08:48] oh sorry just saw it
[10:09:02] then let's call it what it is, a broken design (at least for our scale), not an underoptimized one
[10:11:53] operations, Graphoid, RESTBase, Traffic, Patch-For-Review: Varnish does not honor Cache-Control for Graphoid - https://phabricator.wikimedia.org/T98803#1284536 (mobrovac) Let me state that I don't agree at all with this approach. As specified in the initial deployment task T90487 : //Expected l...
[10:13:00] operations, Graphoid, RESTBase, Traffic, Patch-For-Review: Varnish does not honor Cache-Control for Graphoid - https://phabricator.wikimedia.org/T98803#1284545 (Yurik) @faidon, per my discussion with @anomie, T98940, I do plan to introduce a different storage system for graphs, after which it s...
[10:16:18] (PS1) ArielGlenn: service-restart: fix up for new salt version [puppet] - https://gerrit.wikimedia.org/r/210877
[10:16:58] mobrovac: agree with what?
[10:17:11] (PS1) Faidon Liambotis: Add multatuli as temporary esams authdns [puppet] - https://gerrit.wikimedia.org/r/210878
[10:17:46] RECOVERY - puppet last run on db1009 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures
[10:17:48] (CR) ArielGlenn: "Completely untested, syntax could even be wrong. So don't blindly commit please."
[puppet] - https://gerrit.wikimedia.org/r/210877 (owner: ArielGlenn)
[10:18:00] (or, disagree really :)
[10:18:05] paravoid: with the task in general, and the committed patch for caching graphoid
[10:18:13] operations, Deployment-Systems: [Trebuchet] Salt times out on parsoid restarts - https://phabricator.wikimedia.org/T63882#1284563 (ArielGlenn) Something like this: https://gerrit.wikimedia.org/r/#/c/210877/ (UNTESTED)
[10:19:15] paravoid: from my POV, there was no need to jump at it, since we agreed to put it behind restbase
[10:19:41] paravoid: also, using parsoidcache for one thing for one service, and another for another one seems wildly wrong to me
[10:19:48] I have no comments on that specifically, as long as it's going to get delivered by edge caches in the end (and restbase is, at least with the per-project domains)
[10:19:59] not to mention the fact that afaik, parsoidcache will be gone soon
[10:20:02] and yes, I think others have pointed out how parsoidcache shouldn't have been used in the first place :)
[10:20:11] :)
[10:20:32] but all that said, saying "15 minutes of cache should be enough to reduce load" and not thinking about cache times and invalidation from the beginning is just...
unacceptable
[10:21:24] oh, i didn't even want to get into that :)
[10:21:37] heh
[10:21:40] (hence the disagreement with the task in general)
[10:22:25] nod
[10:22:48] it wasn't clear to me if the "don't agree with this approach" was referring to my comment or the task or both :)
[10:23:35] (CR) Springle: [C: 1] --no-version-check to be used by default on pt-online-schema-change [software] - https://gerrit.wikimedia.org/r/210863 (owner: Jcrespo)
[10:23:59] paravoid: ah, ok :)
[10:24:22] should have written "with this task and everything done here"
[10:24:22] :D
[10:24:49] yeah, it might be worth elaborating, regarding parsoidcache usage as well
[10:25:33] PROBLEM - mysqld processes on db1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[10:31:16] ACKNOWLEDGEMENT - mysqld processes on db1009 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld Jcrespo MySQL server not started after reinstall yet.
[10:33:31] (PS1) Springle: table partitioning for s4 logpager slave [software] - https://gerrit.wikimedia.org/r/210880
[10:37:28] operations, Wikimedia-Logstash, Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch on jessie - https://phabricator.wikimedia.org/T98042#1284616 (ArielGlenn) Having read the comments on T97645, do you want us to just reuse the elasticsearch_1.3.6_all.deb that you manually install...
[10:40:57] operations, Citoid, Services: Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#1284621 (mobrovac) Open>Invalid a:mobrovac `citoid.wmflabs.org` runs off of master, while `citoid-beta.wmflabs.org` (from deployment-prep) runs off o...
[11:00:09] (CR) Jcrespo: [C: 1] table partitioning for s4 logpager slave [software] - https://gerrit.wikimedia.org/r/210880 (owner: Springle)
[11:00:25] (CR) Jcrespo: [V: 1] table partitioning for s4 logpager slave [software] - https://gerrit.wikimedia.org/r/210880 (owner: Springle)
[11:23:54] (CR) Springle: [C: 2 V: 2] table partitioning for s4 logpager slave [software] - https://gerrit.wikimedia.org/r/210880 (owner: Springle)
[11:34:15] (CR) Joal: [C: -1] "I'll remove the patch set I have submitted for forwarder only :)" (6 comments) [puppet] - https://gerrit.wikimedia.org/r/210765 (https://phabricator.wikimedia.org/T98779) (owner: Milimetric)
[11:34:55] (Abandoned) Joal: Modify EventLogging forwarder configuration [puppet] - https://gerrit.wikimedia.org/r/210869 (owner: Joal)
[11:40:31] !log restarting elasticsearch on elastic1009
[11:40:42] Logged the message, Master
[11:43:53] manybubbles: how's the rolling restart so far?
[11:44:19] godog: easy but slow because I haven't scripted it - I'm worried about timeouts in es-tool
[11:44:40] I've seen a few which throw the whole process off. best case a script would just pause
[11:44:47] worst case it'd plow forward and break things
[11:45:33] you wouldn't happen to have the time to do a few in the next few hours? I'm supposed to go to my kid's preschool for a while
[11:45:53] manybubbles: sure I can continue where you left off
[11:45:56] they tell me they are working on faster rolling restarts....
[11:46:47] godog: I just restarted 1009 as you see in the sal - it's recovering. you can pick up on 1010 when the cluster is green again.
[11:46:57] thanks so much
[11:46:59] heh if we can get enough confidence to go unattended it wouldn't be a huge problem
[11:47:11] manybubbles: ok! I'll restart from 1010
[11:47:28] yeah - but if they get the time down to 5 minutes a server then it wouldn't be a big problem either.
[11:47:35] k. once it's green.
[11:47:40] thanks so much!
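Editor's sketch, not part of the channel log: the concern raised above is that a scripted rolling restart might "plow forward" after an es-tool timeout instead of pausing. A minimal Python sketch of a loop that halts on the first failure and waits for the cluster to go green between nodes could look like the following. It assumes, without confirmation from the log, that es-tool signals a timeout through a non-zero exit status. The names `rolling_restart`, `ssh_restart`, and the `cluster_is_green` callable are hypothetical; only `sudo es-tool restart-fast` is taken from the conversation.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def rolling_restart(
    hosts: Iterable[str],
    restart: Callable[[str], int],
    cluster_is_green: Callable[[], bool],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Restart hosts one at a time; stop at the first failure or stall.

    Returns the hosts that were restarted and after which the cluster
    reported green again.
    """
    done: List[str] = []
    for host in hosts:
        # A non-zero exit status (for example a timeout) halts the loop
        # instead of moving on to the next node.
        if restart(host) != 0:
            break
        # Wait for green before touching the next node.
        for _ in range(max_polls):
            if cluster_is_green():
                break
            sleep(poll_seconds)
        else:
            # Never went green within the allowed polls: stop here.
            break
        done.append(host)
    return done


def ssh_restart(host: str) -> int:
    """Hypothetical helper: run the restart command from the log over ssh."""
    return subprocess.call(["ssh", host, "sudo", "es-tool", "restart-fast"])
```

Passing `restart` and `cluster_is_green` in as callables keeps the stop-on-failure logic testable without touching a real cluster; an operator would supply `ssh_restart` and a health check of their own choosing.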
[11:47:50] yup, no problem manybubbles ! [11:48:24] manybubbles: just to be sure, you're following wikitech docs for rolling restart (?) [11:49:29] godog: I'm just shelling into every server and running ```sudo es-tool restart-fast``` [11:49:58] not using the for loop because of the chance it'll plow forward. I ihaven't had time to verify that it won't. I was going to this morning. [11:51:04] yeah if we can rely on es-tool exit status (e.g. brutally timeout = exception) a loop should be fine [11:52:17] 6operations, 10Wikimedia-Logstash, 7Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch on jessie - https://phabricator.wikimedia.org/T98042#1284730 (10Manybubbles) @ArielGlenn, If we locked the version of Elasticsearch that cirrus used then it'd be fine to just include them all. Right... [12:11:42] 6operations, 10Graphoid, 10RESTBase, 10Traffic, 5Patch-For-Review: Varnish does not honor Cache-Control for Graphoid - https://phabricator.wikimedia.org/T98803#1284749 (10BBlack) I expect everything about this will change when this moves to operating through the restbase entry point on the text balancers... [12:12:11] PROBLEM - High load average on labstore1001 is CRITICAL 62.50% of data above the critical threshold [24.0] [12:13:42] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [12:17:40] anyone knows what it is that allows servers like tin, terbium, etc. to connect to normal wiki database servers but not silver's? [12:20:54] what error do you get back? 
could be either acl or db grants off top of my head [12:21:16] krenair@terbium:~$ sql labswiki [12:21:16] ERROR 2003 (HY000): Can't connect to MySQL server on 'silver' (110) [12:22:21] I have no idea really, except the general notion of separation between labs and everything else for good reasons [12:22:27] terbium's address should a valid grant for wikiadmin according to templates/mariadb/production-grants-core.sql.erb [12:23:33] 6operations, 6Labs, 10wikitech.wikimedia.org: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1284755 (10Krenair) Probably caused by https://gerrit.wikimedia.org/r/#/c/196961/ ? [12:23:52] (03PS2) 10Faidon Liambotis: Add multatuli as temporary esams authdns [puppet] - 10https://gerrit.wikimedia.org/r/210878 [12:24:02] (03PS1) 10KartikMistry: CX: Enable English-Spanish dictionary [puppet] - 10https://gerrit.wikimedia.org/r/210887 [12:24:04] (03CR) 10Faidon Liambotis: [C: 032] Add multatuli as temporary esams authdns [puppet] - 10https://gerrit.wikimedia.org/r/210878 (owner: 10Faidon Liambotis) [12:24:39] Krenair: it is iptables on silver [12:24:45] silver is a production host we deploy to [12:25:20] godog, would https://gerrit.wikimedia.org/r/#/c/196961/ have caused this? [12:25:37] it added class { 'base::firewall': } [12:25:48] (03CR) 10KartikMistry: "Alex, Ping! 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/207378 (owner: 10KartikMistry) [12:26:14] Krenair: likely [12:31:21] PROBLEM - puppet last run on multatuli is CRITICAL Puppet has 1 failures [12:32:11] (03CR) 10Alexandros Kosiaris: [C: 032] CX: Enable English-Spanish dictionary [puppet] - 10https://gerrit.wikimedia.org/r/210887 (owner: 10KartikMistry) [12:32:38] (03CR) 10Alexandros Kosiaris: [C: 032] CX: Use RESTBase API for page fetch [puppet] - 10https://gerrit.wikimedia.org/r/207378 (owner: 10KartikMistry) [12:33:01] RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [12:33:34] (03CR) 10Alex Monk: "Is there any point keeping that wikitech SSH rule in modules/openstack/manifests/firewall.pp then? What about MySQL access from deployment" [puppet] - 10https://gerrit.wikimedia.org/r/196961 (owner: 10Hoo man) [12:36:11] !log es-tool restart-fast on elastic1010 [12:36:17] Logged the message, Master [12:36:32] godog: but that loop has to call ssh - I'm just worried about stuff like ssh not pushing the return code back. [12:37:44] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] New upstream release, fixed encoding issue [debs/contenttranslation/apertium-pt-gl] - 10https://gerrit.wikimedia.org/r/210333 (owner: 10KartikMistry) [12:37:45] manybubbles: nope I think that part works, i.e. ssh host will trickle the exit status up [12:38:48] but I see where you are coming from, OTOH it is fairly easy to check for cluster status and don't do anything if != green [12:40:04] !log uploaded to apt.wikimedia.org jessie-wikimedia: apertium-pt-gl_0.9.2~r60358-1 [12:40:09] Logged the message, Master [12:40:31] (03PS1) 10Faidon Liambotis: authdns: remove 1.x check from local-update/lint [puppet] - 10https://gerrit.wikimedia.org/r/210889 [12:40:34] bblack: ^ [12:41:50] bblack: so, I'm about to switch ns2 to multatuli, want to sanity check it? 
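Note on the rolling-restart thread above (godog / manybubbles): the worry is that a scripted loop could "plow forward" after a timeout. Below is a minimal sketch of a loop that stops instead, assuming that `es-tool restart-fast` exits non-zero on failure (not verified in the discussion) and relying on ssh passing the remote command's exit status back, as godog says. The names `rolling_restart`, `wait_for_green`, and `ssh_restart` are hypothetical, and the cluster-status check is injected rather than implemented here.

```python
import subprocess
import time
from typing import Callable, Iterable, List, Optional, Tuple


def wait_for_green(cluster_status: Callable[[], str], poll_seconds: float,
                   max_polls: int, sleep: Callable[[float], None]) -> bool:
    """Poll until the cluster reports "green"; give up after max_polls checks."""
    for _ in range(max_polls):
        if cluster_status() == "green":
            return True
        sleep(poll_seconds)
    return False


def ssh_restart(host: str) -> int:
    """Hypothetical helper: run the restart over ssh and return its exit status.

    ssh exits with the remote command's status, or 255 on its own errors.
    """
    cmd = ["ssh", host, "sudo", "es-tool", "restart-fast"]
    return subprocess.run(cmd).returncode


def rolling_restart(hosts: Iterable[str],
                    restart: Callable[[str], int],
                    cluster_status: Callable[[], str],
                    poll_seconds: float = 30.0,
                    max_polls: int = 120,
                    sleep: Callable[[float], None] = time.sleep,
                    ) -> Tuple[List[str], Optional[str]]:
    """Restart hosts one at a time; return (restarted hosts, problem or None).

    The loop stops at the first sign of trouble instead of moving on:
    either the cluster never went green, or a restart exited non-zero.
    """
    done: List[str] = []
    for host in hosts:
        if not wait_for_green(cluster_status, poll_seconds, max_polls, sleep):
            return done, f"cluster not green before {host}"
        status = restart(host)
        if status != 0:
            return done, f"restart on {host} exited with status {status}"
        done.append(host)
    return done, None
```

In real use one would pass `restart=ssh_restart` and a `cluster_status` function that reads the cluster health; both are left out of the sketch so it can be exercised without touching any server.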
[12:42:43] (03PS1) 10KartikMistry: CX: Enable MT for newly updated Apertium packages [puppet] - 10https://gerrit.wikimedia.org/r/210890 [12:42:45] akosiaris: thanks! [12:43:04] "Dutch writer famous for his satirical novel, Max Havelaar (1860), which denounced the abuses of colonialism in the Dutch East Indies (today's Indonesia)." [12:43:06] (03CR) 10BBlack: [C: 031] authdns: remove 1.x check from local-update/lint [puppet] - 10https://gerrit.wikimedia.org/r/210889 (owner: 10Faidon Liambotis) [12:43:29] (03CR) 10Alexandros Kosiaris: [C: 032] CX: Enable MT for newly updated Apertium packages [puppet] - 10https://gerrit.wikimedia.org/r/210890 (owner: 10KartikMistry) [12:43:41] (03PS2) 10Alexandros Kosiaris: CX: Enable MT for newly updated Apertium packages [puppet] - 10https://gerrit.wikimedia.org/r/210890 (owner: 10KartikMistry) [12:43:51] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] CX: Enable MT for newly updated Apertium packages [puppet] - 10https://gerrit.wikimedia.org/r/210890 (owner: 10KartikMistry) [12:44:05] paravoid: can I switch it to 3.19 and reboot first? would be nice to confirm authdns service on that, too [12:44:05] (03PS2) 10Faidon Liambotis: authdns: remove 1.x check from local-update/lint [puppet] - 10https://gerrit.wikimedia.org/r/210889 [12:44:13] (03CR) 10Faidon Liambotis: [C: 032] authdns: remove 1.x check from local-update/lint [puppet] - 10https://gerrit.wikimedia.org/r/210889 (owner: 10Faidon Liambotis) [12:44:37] bblack: ok :) [12:44:49] that's why I asked about 3.19 btw :) [12:44:53] I figured [12:44:56] I already rebooted before with a newer 3.16 [12:45:03] or you can, just install the linux-meta package and that should take care of everything [12:45:14] this is a temporary switch anyway, I want to reinstall eeden and place traffic to it again [12:46:16] I wonder why linux-meta and not linux-image-amd64, moritzm? 
[12:46:32] it's covering things like bringing in our custom bnx2x firmware package, too [12:47:21] rebooting [12:47:36] (03PS1) 10KartikMistry: CX: Add missed pt-gl MT pair [puppet] - 10https://gerrit.wikimedia.org/r/210892 [12:48:40] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] CX: Add missed pt-gl MT pair [puppet] - 10https://gerrit.wikimedia.org/r/210892 (owner: 10KartikMistry) [12:49:12] oh which moritzm pointed out, the upstream update for that firmware has hit sid now [12:49:31] we could pull their 0.44 in from their instead of my hack 0.43-1~wmf or whatever [12:49:39] s/from their/from there/ :) [12:51:34] host key verification failed, hmm [12:52:24] ? [12:52:27] didn't for me [12:52:43] oh for update you mean? [12:53:17] yeah the old mechanism was broken for moving those around faster, and it's not really fixed, just gutted, so the whole phase-of-the-moon thing still breaks it [12:53:26] !log disabling temporarily Ichinga check for MySQL running on db1009 until data is migrated from virt1000 and host sent to production [12:53:27] I think I just copied hostkeys between the nodes last time I messed with it [12:53:34] Logged the message, Master [12:53:39] right I just figured the potm thing out (again) [12:54:48] our ssh_known_hosts handling is so broken [12:54:51] it's full of junk right now [12:57:30] we should add strace to our base pkgs [12:57:57] yes [12:58:18] ok, I'm going for it, right? 
[12:59:07] yup [12:59:28] !log switching ns2 to multatuli [12:59:32] done [12:59:35] PROBLEM - puppet last run on multatuli is CRITICAL puppet fail [12:59:37] Logged the message, Master [13:01:23] RECOVERY - puppet last run on multatuli is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [13:04:54] yuvipanda or other: pls approve: 12:29, 14 May 2015 Zhuyifei1999 (talk | contribs | block) proposed an OAuth consumer (consumer key ff6b05e3444853dac5c7f4b68c497bc7) (Login to Commons Archive, an unofficial repository for source materials of works on Wikimedia Commons.) [13:06:36] twentyafterfour^^ [13:07:41] (03PS1) 10Faidon Liambotis: Remove eeden from the set of active NS [puppet] - 10https://gerrit.wikimedia.org/r/210896 [13:07:42] (03PS1) 10Faidon Liambotis: Rename eeden.esams -> eeden, switch to jessie [puppet] - 10https://gerrit.wikimedia.org/r/210897 [13:07:45] (03PS1) 10Faidon Liambotis: Re-add eeden to the active set of nameservers [puppet] - 10https://gerrit.wikimedia.org/r/210898 [13:07:47] (03PS1) 10Faidon Liambotis: Remove authdns role from multatuli [puppet] - 10https://gerrit.wikimedia.org/r/210899 [13:07:50] (03PS1) 10Faidon Liambotis: eeden.esams -> eeden, add AAAA [dns] - 10https://gerrit.wikimedia.org/r/210900 [13:10:08] (03CR) 10Faidon Liambotis: [C: 032] Remove eeden from the set of active NS [puppet] - 10https://gerrit.wikimedia.org/r/210896 (owner: 10Faidon Liambotis) [13:11:22] (03CR) 10BBlack: [C: 031] Rename eeden.esams -> eeden, switch to jessie [puppet] - 10https://gerrit.wikimedia.org/r/210897 (owner: 10Faidon Liambotis) [13:11:46] (03CR) 10BBlack: [C: 031] Re-add eeden to the active set of nameservers [puppet] - 10https://gerrit.wikimedia.org/r/210898 (owner: 10Faidon Liambotis) [13:12:05] (03CR) 10BBlack: [C: 031] Remove authdns role from multatuli [puppet] - 10https://gerrit.wikimedia.org/r/210899 (owner: 10Faidon Liambotis) [13:12:42] (03CR) 10Merlijn van Deen: [C: 04-1] "Sorry for -1'ing again." 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [13:13:21] (03CR) 10Faidon Liambotis: [C: 032] eeden.esams -> eeden, add AAAA [dns] - 10https://gerrit.wikimedia.org/r/210900 (owner: 10Faidon Liambotis) [13:14:54] PROBLEM - Auth DNS on eeden is CRITICAL - Plugin timed out while executing system call [13:15:01] (that's me) [13:15:24] (03CR) 10Faidon Liambotis: [C: 032] Rename eeden.esams -> eeden, switch to jessie [puppet] - 10https://gerrit.wikimedia.org/r/210897 (owner: 10Faidon Liambotis) [13:20:47] (03CR) 10Merlijn van Deen: [C: 04-1] "looks good overall; some minor things" (034 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 (https://phabricator.wikimedia.org/T98818) (owner: 10Yuvipanda) [13:21:18] !log reimaging eeden with jessie [13:21:25] Logged the message, Master [13:22:43] PROBLEM - Host eeden is DOWN: PING CRITICAL - Packet loss = 100% [13:25:14] RECOVERY - Host eeden is UPING OK - Packet loss = 0%, RTA = 89.37 ms [13:26:25] bblack: I'll install 2.1.2 to eeden, probably manually :) [13:27:47] cool [13:28:13] by manually do you mean a rebuild of the upstream pkg? [13:28:30] no, I mean dpkg -i [13:28:35] PROBLEM - dhclient process on eeden is CRITICAL: Connection refused by host [13:28:54] PROBLEM - puppet last run on eeden is CRITICAL: Connection refused by host [13:29:05] it doesn't have dep issues from sid -> jessie yet, I guess [13:29:09] ? [13:29:13] PROBLEM - salt-minion processes on eeden is CRITICAL: Connection refused by host [13:29:14] PROBLEM - DPKG on eeden is CRITICAL: Connection refused by host [13:29:23] PROBLEM - Disk space on eeden is CRITICAL: Connection refused by host [13:29:34] PROBLEM - configured eth on eeden is CRITICAL: Connection refused by host [13:29:43] PROBLEM - RAID on eeden is CRITICAL: Connection refused by host [13:30:11] no, it must, so I'm lost. 
I'll assume you know whatever you're doing :) [13:31:09] no, I mean 2.1.2-1~deb8u1, which is built for jessie [13:31:21] sorry, I wasn't clear [13:31:23] ah ok, it's built, just not in the actual jessie repo [13:31:27] yeah [13:31:37] because I haven't gotten a release manager approval for that yet [13:31:40] right [13:32:01] so I have it built & ready to go [13:32:10] I'm just wondering whether to include it to jessie-wikimedia or not [13:32:13] I guess I can... [13:32:27] anyway :) [13:32:38] I guess it depends whether release mgr approves, or timeline for that [13:32:47] would be nice to run here if that doesn't go through or takes forever [13:32:58] yeah [13:40:03] !log es-root restart-fast on elastic1011 [13:40:10] Logged the message, Master [13:41:16] !log power cycling barium [13:41:22] Logged the message, Master [13:49:57] 6operations, 10Citoid, 6Services: Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#1284821 (10Mvolz) What I was hoping for is a way to catch bugs on http://en.wikipedia.beta.wmflabs.org/ for versions of citoid that are further ahead, not what'... 
[13:50:49] 6operations, 10Citoid, 6Services: Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#1284825 (10Mvolz) 5Invalid>3Open [13:52:27] 6operations, 10Citoid, 6Services: Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#1105803 (10Mvolz) a:5mobrovac>3None [13:52:39] 6operations, 10Citoid, 6Services: Separate citoid service for beta that runs off master instead of deploy - https://phabricator.wikimedia.org/T92304#1105803 (10Mvolz) p:5High>3Normal [13:54:49] (03CR) 10Andrew Bogott: [C: 031] Labs: Include public IPs in ferm's $INTERNAL [puppet] - 10https://gerrit.wikimedia.org/r/210853 (https://phabricator.wikimedia.org/T96924) (owner: 10Tim Landscheidt) [13:56:05] !log upgrading tellurium to trusty [13:56:10] Logged the message, Master [13:58:28] folks, I'm just getting ready to do a CentralNotice deploy. I updated all the submodules (I know I only need to update the CN one, but anyway...) At the end of the submodule update I'm getting this funny message: "Unable to checkout 'dc9e64733b3158bdc5818dc83d44cda8da3adf78' in submodule path 'extensions/ContentTranslation'" [14:00:04] AndyRussG: Dear anthropoid, the time has come. Please deploy CentralNotice deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150514T1400). [14:00:11] <^d> AndyRussG: You really shouldn't update all submodules :\ [14:00:59] ^d: Hmmm well nothing odd will get into the core patches... Here's the one for wmf5 https://gerrit.wikimedia.org/r/#/c/210909/ [14:01:20] The funny message I mentioned above is from wmf6 [14:01:54] ^d: is there something bad that will happen from me updating all the submodules? [14:02:19] (I like to have 'em in there so I can grep about extensions and skins) [14:02:36] <^d> If someone was naughty and merged code without deploying you could end up staging it on tin. 
[14:02:46] jynus: I’m here now… do you have enough awake time left to work on the labs dbs now? [14:02:51] <^d> But otherwise mostly no, it's just a waste of time and likely to confuse yourself. [14:02:52] (03CR) 10Dereckson: [C: 031] Prevent indexing of User: namespace on ukwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210680 (https://phabricator.wikimedia.org/T98926) (owner: 10Glaisher) [14:03:08] ^d: ah no, I meant on my local machine :) [14:03:19] ^d: no, I don't update them all on tin!! [14:03:21] <^d> Oh that, oh do what you want :p [14:03:24] andrewbogott, yes, at least the first part (migration), switchover will depend [14:03:25] <^d> I thought you'd done it on tin! [14:03:35] <^d> AndyRussG: Ignore everything I said. I'm going to get my coffee. [14:03:37] <^d> :) [14:03:39] AndyRussG: if you have local changes in any submodules those changes will be silently clobbered by ‘git submodule update’. That’s the only danger. [14:04:09] <^d> andrewbogott: Which is why we set submodules to implicitly rebase so it'll at least keep any committed changes about on tin [14:04:16] <^d> (so people don't accidentally toss security patches) [14:04:20] ^d: mm np, I wasn't clear (coffee is good tho) [14:04:25] ^d: ok? i didn’t know that was possible even. Cool. [14:05:25] andrewbogott: ah OK... thanks :) yeah the local-box repo I use for preparing deploy patches is clean in that sense :) [14:06:00] <^d> andrewbogott: branch.autosetuprebase for the win :) [14:06:59] WRT to the issue I was actually getting at, is there anything bad about the ContentTranslation submodule bork? 
[14:08:41] (03CR) 10Faidon Liambotis: [C: 032] Re-add eeden to the active set of nameservers [puppet] - 10https://gerrit.wikimedia.org/r/210898 (owner: 10Faidon Liambotis) [14:09:31] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0] [14:12:48] !log switching ns2 back to eeden [14:12:56] Logged the message, Master [14:14:22] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [14:14:55] (03PS2) 10Faidon Liambotis: Remove authdns role from multatuli [puppet] - 10https://gerrit.wikimedia.org/r/210899 [14:14:57] (03PS1) 10Faidon Liambotis: authdns: fix user creation on pristine systems [puppet] - 10https://gerrit.wikimedia.org/r/210911 [14:15:06] akosiaris: are you still working tonight, or done for the day? [14:15:09] bblack: all done besides cleaning up & reimaging multatuli, which I'll do later [14:15:19] bblack: 2.1.2 is on apt and eeden as well [14:15:44] all: brb, if anything seems funky with DNS call/page [14:16:38] !log es-tool restart-fast on elastic1012 [14:16:44] Logged the message, Master [14:17:49] paravoid: awesome [14:21:31] !log andyrussg Synchronized php-1.26wmf5/extensions/CentralNotice/: Update CentralNotice (duration: 00m 12s) [14:21:36] Logged the message, Master [14:23:44] !log restarted ganglia-monitor on eeden [14:23:50] Logged the message, Master [14:25:33] still no eeden stats in ganglia, hmmm [14:25:55] I'll be gone for a little while, back later until late tonight again (for anyone needing clinic duty services) [14:25:59] andrewbogott: deep in boolean algebra right now [14:26:24] oh it's still coming in as eeden.esams, somehow, in ganglia web view anyways [14:26:24] grumble ganglia [14:26:24] akosiaris: ok, ping me if/when you have some time :) [14:26:53] probably puppet hasn't run on the ganglia box yet [14:26:56] right [14:31:14] (03CR) 10Faidon Liambotis: [C: 032] authdns: fix user creation on pristine systems [puppet] - 
10https://gerrit.wikimedia.org/r/210911 (owner: 10Faidon Liambotis) [14:31:43] (03PS3) 10Faidon Liambotis: Remove authdns role from multatuli [puppet] - 10https://gerrit.wikimedia.org/r/210899 [14:31:54] (03CR) 10Faidon Liambotis: [C: 032] Remove authdns role from multatuli [puppet] - 10https://gerrit.wikimedia.org/r/210899 (owner: 10Faidon Liambotis) [14:34:00] !log migrating data db from virt1000 to db1009 [14:34:05] Logged the message, Master [14:34:44] !log reimaging multatuli [14:34:49] Logged the message, Master [14:39:52] bblack: so now it's the first time we're using SO_REUSEPORT, right? [14:40:22] PROBLEM - DPKG on multatuli is CRITICAL: Connection refused by host [14:40:32] PROBLEM - salt-minion processes on multatuli is CRITICAL: Connection refused by host [14:40:57] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: add jdouglas to researchers admin group - https://phabricator.wikimedia.org/T98536#1284899 (10Milimetric) @Manybubbles: this all makes sense. Just reach out on IRC if you have questions about this kind of stuff, especially when you start analyzing / gra... [14:41:13] raid is still resyncing, so it's hard to really say if things are better or worse [14:41:58] CPU seems higher [14:42:02] but again, raid [14:42:55] paravoid: yeah it's higher in general on the boxes that let it use SO_REUSEPORT [14:43:25] because the packets aren't confined to one socket/thread, it's less efficient at lower traffic levels relative to machine capacity [14:43:37] aha [14:44:22] 6operations, 7Icinga, 7Monitoring: remove (or fix) passive checks for removed hosts - https://phabricator.wikimedia.org/T99012#1284916 (10Jgreen) This is caused by hosts aluminium/beryllium submitting passive checks with icinga not configured to receive them. These hosts in particular are oddballs--aluminium... 
[14:44:29] the upside of that trade is it should be better at handling reqrate spikes without a corresponding minor latency spike or any dropped packets [14:44:47] (03PS1) 10KartikMistry: Beta: CX: Add Hebrew (he) as target language [puppet] - 10https://gerrit.wikimedia.org/r/210914 (https://phabricator.wikimedia.org/T99082) [14:45:02] (well and better at using up the full capacity of the machine, if constant load ever got that high) [14:46:21] (03PS2) 10KartikMistry: Beta: CX: Add Hebrew (he) as target language [puppet] - 10https://gerrit.wikimedia.org/r/210914 (https://phabricator.wikimedia.org/T99082) [14:46:32] at least, those are my theories and I'm sticking to them until someone shows me they're stupid :) [14:48:43] !log andyrussg Synchronized php-1.26wmf6/extensions/CentralNotice/: Update CentralNotice (duration: 00m 13s) [14:48:48] Logged the message, Master [14:50:20] I'm kind of tempted to play with the udp_recv_width parameter and see how it affects CPU load there actually, since it's such a modern kernel + reuseport and all that. [14:50:29] but maybe later when I have time! [14:50:55] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [14:51:18] ^d, thcipriani, marktraceur: Who wants to SWAT this morning? Only one config change on the list at the moment. 
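Note on the SO_REUSEPORT exchange above (paravoid / bblack): the point is that incoming packets are no longer confined to one socket/thread. The sketch below only illustrates the socket option itself, assuming a Linux host where Python exposes `socket.SO_REUSEPORT`; it is not the authdns server's code, and `open_reuseport_udp_sockets` is a hypothetical name. Each worker thread would normally own one of the returned sockets.

```python
import socket
from typing import List


def open_reuseport_udp_sockets(host: str, port: int, count: int) -> List[socket.socket]:
    """Bind `count` UDP sockets to the same (host, port) using SO_REUSEPORT."""
    if not hasattr(socket, "SO_REUSEPORT"):
        raise OSError("SO_REUSEPORT is not available on this platform")
    socks: List[socket.socket] = []
    for _ in range(count):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            # The option has to be set on every socket before bind().
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
            s.bind((host, port))
        except OSError:
            s.close()
            for other in socks:
                other.close()
            raise
        # If port 0 was requested, reuse whichever port the kernel picked first.
        port = s.getsockname()[1]
        socks.append(s)
    return socks
```

Without the option, the second `bind()` to the same address and port would fail with "Address already in use"; with it, all sockets share the port and the kernel spreads incoming datagrams across them.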
[14:51:32] Jamesofur|cloud: Ping for SWAT in about 8.5 minutes [14:51:47] anomie: I'd propose one more [14:51:47] Yup [14:52:20] I gotta deal with Jenkins, sorry [14:53:04] hi Jamesofur|cloud [14:53:29] * Jamesofur|cloud nods morning morning [14:53:40] anomie: I've simple patch for SWAT [14:53:42] Hmm that was supposed to be a wave [14:54:01] :) [14:54:04] (03CR) 10Amire80: [C: 031] Beta: CX: Add Hebrew (he) as target language [puppet] - 10https://gerrit.wikimedia.org/r/210914 (https://phabricator.wikimedia.org/T99082) (owner: 10KartikMistry) [14:54:06] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [14:55:15] (03PS1) 10KartikMistry: CX: Enable newarticle campaign except bawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210916 [14:57:03] anomie: can SWAT this morning. [14:57:07] thcipriani: ok! [15:00:05] manybubbles, anomie, ^d, thcipriani, marktraceur, jamesofur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150514T1500). Please do the needful. [15:00:32] !log es-tool restart-fast on elastic1013 [15:00:39] Logged the message, Master [15:01:08] (03CR) 10Thcipriani: [C: 032] Open external links on voteWiki in new tab/window [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210849 (https://phabricator.wikimedia.org/T98013) (owner: 10Jalexander) [15:01:14] (03Merged) 10jenkins-bot: Open external links on voteWiki in new tab/window [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210849 (https://phabricator.wikimedia.org/T98013) (owner: 10Jalexander) [15:03:48] andrewbogott, the read-only phase took about 1 second, so all good [15:04:03] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: Open external links on votewiki in new tab [[gerrit:210849]] (duration: 00m 12s) [15:04:09] Logged the message, Master [15:04:32] Jamesofur|cloud: test please [15:05:28] jynus: cool! What’s next? 
[15:05:36] Thank you thank you, /me goes to verify [15:06:05] I guess for you, maybe dinner [15:06:33] Yup, perfect [15:06:38] neat. [15:06:40] andrewbogott, I will set up replication, hopefully no intervention needed from you today [15:06:43] Thank you thcipriani [15:06:48] jynus: ok :) [15:06:50] Jamesofur|cloud: yw! [15:07:04] as I said on the ticket, th db may need some cleanup before puting it on production [15:07:22] Nikerabbit: what's the story with the workflow selector fix for SWAT? [15:07:28] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0] [15:07:38] thcipriani: I'm adding submodule patch [15:07:46] thcipriani: you can go ahead with my patch first [15:08:09] kart_: okie doke. [15:09:33] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210916 (owner: 10KartikMistry) [15:09:40] (03Merged) 10jenkins-bot: CX: Enable newarticle campaign except bawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210916 (owner: 10KartikMistry) [15:10:07] (03PS1) 10Merlijn van Deen: tools: make seperate /tmp hiera-configurable [puppet] - 10https://gerrit.wikimedia.org/r/210918 (https://phabricator.wikimedia.org/T99069) [15:10:51] (03CR) 10jenkins-bot: [V: 04-1] tools: make seperate /tmp hiera-configurable [puppet] - 10https://gerrit.wikimedia.org/r/210918 (https://phabricator.wikimedia.org/T99069) (owner: 10Merlijn van Deen) [15:12:12] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT enable new article campaign except bawiki [[gerrit:210916]] (duration: 00m 12s) [15:12:17] Logged the message, Master [15:13:50] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [15:14:04] thcipriani: we've patch now. [15:14:33] kart_: cool, wanna check your other patch for me? 
[15:15:26] thcipriani: doing testing, but that's fine :) [15:16:04] (03PS2) 10Merlijn van Deen: tools: make seperate /tmp hiera-configurable [puppet] - 10https://gerrit.wikimedia.org/r/210918 (https://phabricator.wikimedia.org/T99069) [15:21:25] plop [15:22:10] (03PS2) 10Greg Grossmeier: service-restart: fix up for new salt version [puppet] - 10https://gerrit.wikimedia.org/r/210877 (https://phabricator.wikimedia.org/T63882) (owner: 10ArielGlenn) [15:23:38] plop [15:23:42] (03PS3) 10Merlijn van Deen: tools: make seperate /tmp hiera-configurable [puppet] - 10https://gerrit.wikimedia.org/r/210918 (https://phabricator.wikimedia.org/T99069) [15:24:17] akosiaris: can you restart apertium-apy on sca1001/sca1002 please? [15:27:46] anomie: ^d: thcipriani: marktraceur: something's gone wrong when I call wikis with debug=true [15:28:16] AndyRussG: Can you be more specific? [15:28:25] kart_: I already have [15:28:48] anomie: it makes a request to http://en.wikipedia.org/wiki/function%20()%20%7Breturn%20this.protocol%20+%20'://'%20+%20this.getAuthority()%20+%20this.getRelativePath();%7D which returns 400 [15:29:09] _and_ I'm not getting the new JS code that I just deployed in the previous slot [15:29:21] however without debug=true, that new code _is_ getting served [15:31:07] looks like a RL issue [15:31:23] akosiaris: thanks. Bug from my side. [15:31:32] AndyRussG: I'm not seeing that when I try it. But I see your deploy was CentralNotice stuff, which might explain that. [15:32:13] when you say debug=true, do you mean the X-Wikimedia-Debug header? [15:32:26] anomie: which might explain what? the new CentralNotice code is there without debug=true, but the old code is there with debug=true? [15:32:41] bblack: I mean, when I add "debug=true" as a URL param [15:32:53] <^d> bblack: That ^ for resourceloader [15:33:10] <^d> Keeps code unminimized, etc etc. 
[15:33:11] AndyRussG: Might explain me not seeing your problem, if I'm not being served the same CentralNotice junk [15:33:13] andrewbogott: I am free now [15:33:26] (03PS1) 10Dereckson: Add *.bl.uk to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210922 [15:33:33] anomie: the problem isn't with CN [15:33:43] so, bits.wm.o has this block in vcl_fetch: [15:33:45] // Do not serialize calls for non-cachable objects. [15:33:45] // Removing this would break debug=true in mediawiki. [15:33:45] if (beresp.ttl <= 0s || req.http.X-Wikimedia-Debug == "1") { [15:33:47] AndyRussG: And you know this how? [15:33:48] set beresp.ttl = 120s; [15:33:51] return (hit_for_pass); [15:33:53] } [15:34:01] and we moved RL off of that bits cache code, and probably didn't port over whatever hack that is [15:34:02] anomie: CN makes not cals even vaguely similar to that [15:34:06] no calls [15:34:21] (03PS2) 10Dereckson: Add *.bl.uk to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210922 (https://phabricator.wikimedia.org/T98734) [15:34:44] anomie: also, there's no reason for it to work minified but not work un-minified, other than a resourceloader error [15:34:54] or some bits config error [15:35:27] yeah I think we've probably just got broken cache behavior for debug=true here, but I'm not exactly sure of the fix yet [15:36:28] <^d> Can that vcl bit be copied over to text-lb's vcl? [15:36:53] <^d> Or would that be undesirable? [15:37:27] thanks, lmk if I can do anything :) [15:37:52] I'm going to finish up swat for kart_ 's patch now [15:38:12] ^d: well, just blindly copying things around can have all kinds of unintended consequences... [15:38:31] I need to understand what was going on and how it's really broken first. 
right now I'm just digging through git blame [15:38:38] * ^d nods [15:39:07] https://gerrit.wikimedia.org/r/#/c/159294/4/templates/varnish/bits.inc.vcl.erb <- relevant, same code before X-WM-Debug header stuff, from j.oe investigating this stuff before [15:41:16] eh maybe I'm looking at the wrong thing here, we *do* actually have the same basic beresp.ttl <= 0s behavior in text-lb's vcl_fetch as well, in the general case. [15:43:20] kart_: this is just a sync-dir, is that correct? Didn't see updates to any localization anything. [15:44:27] AndyRussG: can you summarize a basic way to reproduce? [15:44:39] maybe I don't really comprehend what's happening here [15:45:32] bblack: sure! go to chrome and open the dev console [15:45:50] thcipriani: Nikerabbit's patch? [15:45:51] thcipriani: yeah sync-dir is okay [15:45:55] then go to this URL: http://en.wikipedia.org/wiki/Boris_Nemtsov?uselang=fr&country=FR&debug=true [15:45:57] as long as RL modules update [15:46:26] !log thcipriani Synchronized php-1.26wmf5/extensions/Translate: SWAT update translate to a6f0a63 [[gerrit:210919]] (duration: 00m 15s) [15:46:31] Logged the message, Master [15:46:34] AndyRussG: ok I'm with you so far :) [15:46:35] bblack: you'll see the 400 in the console and no centralnotice banner [15:46:44] kart_: heh, right, thank you for the submodule bump. Nikerabbit: test please :) [15:46:54] thcipriani: thanks! 
[15:47:04] * bblack cries at the neverending list of network resources loaded for that one page [15:47:21] bblack: leave off the debug=true and there's no error and you see the banner [15:47:26] http://en.wikipedia.org/wiki/Boris_Nemtsov?uselang=fr&country=FR [15:48:10] bblack: mm remember tho debug=true is turning off RL's consolidation of stuff [15:48:20] well [15:48:22] thcipriani: works (with no-cache) [15:48:40] bblack: try this one then :P [15:48:41] http://www.webpagetest.org/result/150325_B4_19KG/1/details/ [15:48:45] I do see your 400, and when I paste that back into the URL bar as a separate request, I get a pretty legit-looking 400 error that says: [15:48:48] The requested page title contains invalid characters: "{". [15:49:14] bblack: hmmm... I mean, I have absolutely no idea where that request is coming from [15:49:21] it's coming from RL [15:49:27] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 6Scrum-of-Scrums: Jenkins is using php-luasandbox 1.9-1 for zend unit tests; precise should be upgraded to 2.0-8 or equivalent - https://phabricator.wikimedia.org/T88798#1285023 (10akosiaris) https://gerrit.wikimedia.org/r/#/c/21... [15:50:28] confirmed that the same error about invalid page title char { is what the server is reponding with for the real request as well [15:50:39] so it's not like, an error in me pasting the req back to the URL bar [15:51:17] 6operations, 10ops-codfw: degraded RAID / disk fail on es2010 - https://phabricator.wikimedia.org/T98982#1285026 (10Papaul) a:3Dzahn @Dzahn Disk replacement complete [15:51:34] thcipriani: SWAT over? then I can proceed. 
[15:51:46] ah, yes, SWAT complete [15:51:53] thcipriani: thanks [15:52:23] AndyRussG: that bad request is coming from CN code within RL [15:52:40] mw.centralNotice.loadRandomBanner [15:53:03] (03PS1) 10Faidon Liambotis: Fix completely broken SSH host key collection [puppet] - 10https://gerrit.wikimedia.org/r/210926 [15:53:06] (03CR) 10Alexandros Kosiaris: [C: 04-1] Labs: Include public IPs in ferm's $INTERNAL (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/210853 (https://phabricator.wikimedia.org/T96924) (owner: 10Tim Landscheidt) [15:53:09] wtf of the day [15:53:52] AndyRussG: it seems like somewhere along the line, some code is supposed to get eval'd to generate the request URL, and instead the code is being used as the literal text of the URL, IMHO [15:54:02] (03CR) 10Faidon Liambotis: "Since this has been broken for so long, another option would be to ensure known_hosts to absent, and collect manually only where it makes " [puppet] - 10https://gerrit.wikimedia.org/r/210926 (owner: 10Faidon Liambotis) [15:54:09] reviews anyone? [15:54:15] akosiaris / bblack? [15:54:39] paravoid: I think we had them disabled a while ago for some reason [15:54:59] they were disabled by accident in a commit that made ssh a module [15:55:08] bblack, AndyRussG: Specifically, in there it's doing "url: scriptUrl.toString," instead of "url: scriptUrl.toString(),". [15:55:11] well, I'm guessing it's by accident, I don't know [15:55:23] but it happened silently, as far as commit log goes [15:55:34] https://gerrit.wikimedia.org/r/#/c/90098/ [15:56:43] anomie: that seems to jive, yeah [15:57:10] anomie: bblack: there is a call in there but I don't see any toString (without the ()) [15:57:10] paravoid: naggen2 delete intentional there? [15:57:25] akosiaris: sorry for the delay, I’ve got electricians working in the house today. Still around? 
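Editor's note on the 15:55:08 diagnosis ("url: scriptUrl.toString," instead of "url: scriptUrl.toString(),"): the sketch below is a minimal illustration of why a missing pair of parentheses could put a "{" into a request URL. It is not the CentralNotice code; the FakeUri class, the example.org host, and the query string are assumptions made up for the example.

```javascript
// Hypothetical stand-in for the URL object used in bannerController.js.
// The class name, host, and query string are illustrative only.
class FakeUri {
  constructor(base) {
    this.base = base;
  }
  toString() {
    return this.base + "?title=Example";
  }
}

const scriptUrl = new FakeUri("//example.org/w/index.php");

// Buggy form: the method is referenced, not called, so the value is a function.
const buggy = { url: scriptUrl.toString };
// Fixed form: the method is called, so the value is a string.
const fixed = { url: scriptUrl.toString() };

// If the buggy value is later coerced to a string (for example when a request
// URL is assembled), the result is the function's source text, braces included.
const buggyAsText = String(buggy.url);

console.log(typeof buggy.url);          // function
console.log(typeof fixed.url);          // string
console.log(buggyAsText.includes("{")); // true
```

This would be consistent with the 400 response quoted at 15:48:48 ("The requested page title contains invalid characters"), assuming the function source ended up in the URL.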
[15:57:38] bblack, AndyRussG: I don't know what exactly was deployed earlier, but apparently code is still being served that doesn't include https://gerrit.wikimedia.org/r/#/c/190239/ [15:57:48] it's in the loadtestingbanner copy of the same code [15:57:51] I’m hoping you will +1 https://gerrit.wikimedia.org/r/#/c/210853/ and then switch that bridge to promiscuous so we can get the rest of these pieces assembled. [15:57:55] there's different paths for debug/nondebug [15:58:00] From https://en.wikipedia.org/static/1.26wmf5/extensions/CentralNotice/modules/ext.centralNotice.bannerController/bannerController.js anyway [15:58:08] ^d: around? [15:58:13] or akosiaris [15:58:20] oh, you already reviewed that patch :) [15:58:41] anomie: bblack: https://git.wikimedia.org/blob/mediawiki%2Fextensions%2FCentralNotice/bab5f9ac1a2b6bd07ed49fac0cb460dd8992c1a6/modules%2Fext.centralNotice.bannerController%2FbannerController.js#L189 [15:58:54] andrewbogott: I was thinking again about all this routing via public ips thing [15:58:58] ^d: I deployed cxserver, but code isn't updated in sca1001 [15:58:59] why do we want it ? [15:59:10] (03CR) 10BryanDavis: "There are a few things that could use work here still:" [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T90892) (owner: 10BryanDavis) [15:59:36] <^d> I don't know anything about sca*... [15:59:50] akosiaris: ^^ [16:00:00] anomie: bblack: this is what I deployed: https://gerrit.wikimedia.org/r/#/c/210909/ https://gerrit.wikimedia.org/r/#/c/210910/ using sync-dir [16:00:05] kart_: Dear anthropoid, the time has come. Please deploy Content Translation deployment (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150514T1600). [16:00:18] akosiaris: I can think of a few reasons. We want beta to be able to talk to itself without having to use a bunch of ‘if labs’ switches. [16:00:30] kart_: paste with error ? 
[16:00:33] Likewise, tools that interact with other tools shouldn’t necessarily have to know if a given tool is hosted in labs or prod. [16:00:37] akosiaris: no error [16:00:45] AndyRussG: in the version of the code my browser gets, the copy of that down on line 257 lacks the () [16:00:49] akosiaris: should do update again? [16:00:55] kart_: fetch/checkout completed successfully ? [16:01:17] akosiaris: here is some explanation: https://phabricator.wikimedia.org/T95288 [16:01:45] akosiaris: yes. no error there. [16:01:54] hmm resolving public DNS records such as en.wikipedia.beta.wmflabs.org [16:02:11] bblack: So https://bits.wikimedia.org/static/1.26wmf5/extensions/CentralNotice/modules/ext.centralNotice.bannerController/bannerController.js seems to be serving the correct code, while the same path on en.wikipedia.org, de.wikipedia.org, commons.wikimedia.org, and probably the rest are serving the old code. [16:02:40] akosiaris: any remedy? [16:03:33] kart_: is it ok now ? [16:03:37] anomie: bblack: K, I see that line without the () in my browser when I use debug=true. I don't recall us every having a bug in which we forgot the () there [16:03:56] I did one more git-deploy cycle [16:04:11] akosiaris: yes, and I’d rather have routing work than a million special cases in DNS. [16:04:20] AndyRussG: In Gerrit 190239, your commit summary says "This has been deprecated and unused for two months"... [16:04:31] akosiaris: thanks. [16:04:34] akosiaris: fixed. [16:04:39] PROBLEM - High load average on labstore1001 is CRITICAL 55.56% of data above the critical threshold [24.0] [16:04:43] andrewbogott: well, my point is it will not always work [16:04:55] anomie: yes, that seems to be true. I'm kinda wondering how/why we have ever had any backend serve two different contents for the 1.26wmf5 variant of that file, though.... [16:05:06] those URLs should be content-unique, right? [16:05:34] akosiaris: you mean, depending on w/not a given instance has a public IP? 
[16:05:43] yup [16:06:07] if it's normal for the contents of a file under a specific version of static/ to change, then there are probably a lot of things about static/ that need rethinking :P [16:06:16] (03PS2) 10Faidon Liambotis: Fix completely broken SSH host key collection [puppet] - 10https://gerrit.wikimedia.org/r/210926 [16:06:28] bblack: My guess is en.wikipedia.org etc are still serving the version of the file from before AndyRussG's update this morning. [16:06:33] akosiaris: I thought that that that exception was fixed in newer versions. I can dig through the code again and check... [16:06:40] there should only be one version of that file [16:06:50] otherwise, how are we invalidating these on deploy? [16:07:01] bits never had a PURGE mechanism... [16:07:03] bblack: Well, it changes whenever someone does a backport to the deployment version (that touches something under there that is actually loaded anywhere)... [16:07:29] so I don't see how we wouldn't have had the same problem on bits in the general case, is what I'm saying, I guess [16:07:36] bblack: TL;DR is that the authdns ssh problem wouldn't have fixed itself given time -- it's all very broken across the fleet [16:07:50] the reason bits doesn't have the problem right now is because it wasn't cached there before you checked or something, since we're not generally using it for this traffic now [16:08:16] bblack: anomie: K I see. The code with the toString (without the ()) was dead code. This update eliminated it, but it also eliminated the config variable that made it not be invoked [16:08:39] !log kartik Started scap: Update ContentTranslation [16:08:46] Logged the message, Master [16:08:47] 6operations, 10Wikimedia-Logstash, 7Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch on jessie - https://phabricator.wikimedia.org/T98042#1285055 (10bd808) >>! In T98042#1284730, @Manybubbles wrote: > @ArielGlenn, If we locked the version of Elasticsearch that cirrus used then it'd... 
[16:08:56] bblack: anomie: so what's happening with debug=true is that the old code is being served, but it doesn't receive the configuration that turns off that bit of unused code [16:09:18] I think debug=true is a red herring here, it just happens to be something that's helping to trigger the problem [16:09:25] anomie: bblack: excellent catch BTW! (how did you track it down?) [16:09:26] or make it more-visible [16:09:48] bblack: well it's a problem in code that shouldn't be served, no? I mean, it's not deployed anywhere... [16:09:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [16:10:18] bblack: Without debug=true, wouldn't RL do its load and compress via load.php instead of hitting the /static/ paths at all? [16:10:27] yeah I guess [16:10:43] akosiaris: and, in general, it seems silly that something running on labs should have to know whether or not a given fqdn is a labs host or not. [16:10:44] but still, we shouldn't be relying on that to make static/ changes work correctly. [16:10:55] so, fixing it sometimes seems like a step in the right direction :/ [16:11:33] 6operations, 10Wikimedia-DNS: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1285070 (10Heather) Sorry I've been slow, it's not going to be nsa.wikimedia.org, you're right it would send the wrong message. [16:11:34] bblack: It's not making /static work correctly, it's just not using /static at all unless debug=true. [16:11:49] akosiaris: we didn’t test to see if our changes allowed routing to work from non-floating-ip instances, did we? 
[16:12:04] no we did not [16:12:11] well yeah, but at the end of the day, from a caching/purging perspective, we're expecting static/$version/ things to be constants and not need purges for updates, and we're serving up changing data that violates that assumption [16:12:43] it just so happens that load.php without debug=true saves you from the effect in this case [16:13:56] !log es-tool restart-fast on elastic1014 [16:14:01] Logged the message, Master [16:14:41] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [16:15:42] bd808: Know anything about scap and the "http://$domain/static/" (/srv/mediawiki-staging/w/static/ symlink farm) stuff? [16:15:47] (03PS1) 10BBlack: depool ulsfo due to traffic issues [dns] - 10https://gerrit.wikimedia.org/r/210927 [16:16:51] anomie: hmmm nothing particular. If the content is under /srv/mediawiki-staging then scap will sync it everywhere [16:17:19] 6operations, 10Wikimedia-DNS: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1285086 (10JohnLewis) >>! In T97341#1284443, @ArielGlenn wrote: > http://www-01.sil.org/iso639-3/documentation.asp?id=nsa Is this going to bite us if we use nsa.wikimedia.org? wikimedia.org domains are only... [16:17:34] anomie: my basis for saying that about our assumptions, is that we've never had any infrastructure for forwarding PURGE requests to the bits caches in the past [16:17:58] bd808: We're running into an issue where content there is being updated by SWAT and other backports, which bblack tells me violates the caching assumptions behind /static. [16:18:07] so there is no mechanism for actually telling the bits servers that the content of something in /static/$version/ has changed and to invalidate the old copy [16:18:08] the dirs there are created as part of the new branch creation process [16:18:23] ah... well... 
[16:18:33] (historically I mean, orthogonal to the whole move of bits assets to text-lb in the present) [16:18:46] that sounds like bad assumptions on both sides honestly [16:19:02] "static" really means "not php" rather than "immutable" [16:19:18] right, it's less about the static part than the version number I guess [16:19:25] but a version is mutable too heh [16:19:37] akosiaris: unrelated distraction: any ideas about https://phabricator.wikimedia.org/T99085 ? [16:19:47] our asset management is a mess really [16:20:07] anyways, the text-lb cluster where this stuff lives in practice now *does* have the infrastructure to accept invalidation requests for this stuff [16:20:22] but, I don't think there's anything on the mw side sending those invalidations on deploy [16:20:32] no, there isn't for sure [16:20:49] we could I guess with a bit of work [16:21:04] it's the multicast stuff, like mw uses for articles, basicallu [16:21:06] *basically [16:21:21] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1285090 (10RobH) RT ticket for HP quote: https://rt.wikimedia.org/Ticket/Display.html?id=9349 [16:21:22] would we have to permute the purges on wiki name? Or just asset base path? [16:21:26] (03CR) 10BBlack: [C: 032] depool ulsfo due to traffic issues [dns] - 10https://gerrit.wikimedia.org/r/210927 (owner: 10BBlack) [16:22:01] I'm actually not sure how that works in this case, because we're explicitly ignoring the wiki hostname for cache purposes for these URLs.... 
[16:22:12] I think purging any one name would get them all, but would need testing [16:22:19] *nod* [16:22:20] andrewbogott: ACL probably [16:23:30] anomie: sounds like we need a phab ticket to track this [16:23:30] I agree [16:23:36] even if the decision is to ignore the whole mess ;) [16:24:23] I'm guessing that things like Gather which have been jumping the train with weekly submodule bumps would be the most effected [16:24:50] *jet another reason not to jump the train) [16:25:13] (off by one finger error x 2) [16:25:36] bd808: bblack: didn't debug=true used to always serve the most recent code? [16:26:04] yeah that should bust cache [16:26:06] you mean by referencing /static/current/ or /static-current/ ? [16:26:16] (because that has the same cache issues heh) [16:26:58] What we need is an audit of cache actualities [16:27:06] akosiaris: OK, so, labs routing-wise… shall we run another test to see what actually works and doesn’t work? [16:27:09] so we can say "this is what happens" with confidence [16:27:20] so, debug=true (among other things I'm sure) is going to set no-cache headers on the response, which gets you the latest/current load.php output, but if the URLs inside ref normal static asset paths, those are subject to caching still [16:28:24] andrewbogott: sure [16:28:36] generally speaking, though, we don't want to have new deploys invalidating very hot URLs, either, which some static assets might be [16:28:57] !log disabling puppet on labnet1001 for testing [16:29:01] it would make more sense to make a rule that you don't break static/ things within a version. 
if you update the code, the old code better still work too [16:29:06] Logged the message, Master [16:29:23] bblack: I completely agree with that [16:29:37] bd808: I get lost at " the URLs inside ref normal static asset paths" [16:29:50] bblack: except in the "oh crap" case where you pushed something horrible and need to back it out [16:30:12] AndyRussG: the URLs that the debug=true ResourceLoader emits are not themselves debug URLs [16:30:13] when can we expect debug=true to start serving the code I just deployed? and what should I have doe differently? [16:30:51] * bd808 reads more backscroll for context [16:30:51] bblack: ah I see... hmm... [16:31:04] AndyRussG: just going off of "how things work normally" rather than perhaps whatever was intended, there is no expectation there. If you update things under static/ mid-wmf-version, there's no gaurantee of those ever seeing the light of day. [16:31:43] (except that in the non-debug case, load.php grabs them directly anyways, which bypasses any caching concerns for its own purposes) [16:31:58] (03CR) 10Merlijn van Deen: Initial commit (032 comments) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [16:33:05] bblack: maybe I need to read some more doc (don't want to waste your time...)... don't know what you mean by "static"? I guess you mean, no guarntees when deploying stuff outside the MW train? [16:33:55] huh... wikis higly slow just for me? :/ [16:33:56] by static I mean, anything that lives, URL-wise, within the /static/ path space, which includes your javascript, which lives at /static/1.26wmf5/extensions/CentralNotice/.... [16:34:17] ^d: I have to go in ~40m FYI elastic1014 has been fast-restart'd and is recovering [16:34:38] <^d> okie dokie [16:35:09] Steinsplitter: I'm seeing normal feel on anonymous readonly hits to random things.... 
[16:35:35] we did just shift some major traffic from mostly asian users from the west coast to the east, though [16:35:37] akosiaris: ok, I’m ready to restart nova-network with the new setting. Can you switch the bridge? [16:35:39] bblack: OK I see [16:35:44] * andrewbogott has no idea where the bridge is or how to do that [16:36:18] bblack: bd808: I'll try to learn more details about this before I deploy again. [16:37:22] bblack: so just trying to connect the dots here... do you feel that moving bits from it's own varnish cluster is likely to have changed basic behavior here or not? [16:37:28] AndyRussG: really that's probably the wrong distinction for me to be using. It's anything that's assets scap sync around, rather than code running in HHVM/Zend, basically [16:37:30] andrewbogott: done [16:37:58] bd808: no, I think what we're seeing here is the way things have always been, and orthogonal to the s/bits-lb/text-lb/ stuff. [16:38:08] akosiaris: everything works [16:38:24] bblack: ok. that makes me feel better actually [16:38:27] I’m going to wait a minute in case iptables isn’t fully populated yet [16:38:28] bblack: OK, right, as in, a static resource on a server rather than one generated by a script [16:38:37] andrewbogott: wait it a bit ... [16:38:42] (such as a php app...) [16:39:16] AndyRussG: right. in general, we have no infrastructure for purging those URLs from the caches on update, and they can last up to a month in cache. Mostly we rely on the 1.26wmf5 part of the URL to help with that. [16:39:26] bblack: it back now ;) [16:39:34] but if you update a file in the midst of such a version, rather than as part of a new version, then... yeah. 
[16:40:27] (03CR) 10Merlijn van Deen: [C: 04-1] "lintian doesn't like it too much:" (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210850 (owner: 10Yuvipanda) [16:40:44] AndyRussG: The thing bblack was saying about debug=true before is really that if using that flag caused RL to emit script, style and image tags in the page HTML which in turn reference URLs that don't invoke PHP/HHVM then the output of those URLs from the backend servers will be cached by Varnish [16:40:55] bblack: I see... So debug=true is pulling in cached versions of static resources, and it won't work until a version bump causes a different url (say wmf7) to be used, right? [16:41:04] right [16:41:33] there are other cases within the /static/$version/ tree that don't go through RL though, I think, where this is a more-direct issue [16:41:34] 6operations, 6Labs, 10wikitech.wikimedia.org: labswiki DB is inaccessible from tin, terbium, etc. - https://phabricator.wikimedia.org/T98682#1285179 (10hoo) We have two problems here: First of all (as noted above) it's not possible to open a tcp connection to silver on that port (mysql -h silver is failing a... [16:41:39] akosiaris: pings still work. Lemme try an scp to make sure I’m not just pinging the bridge [16:41:40] That caching has no purge mechanism. we rely on the release branch version number incrementing to cause clients to ask for new URLs [16:42:16] So the cache state of a given static asset is indeterminate basically [16:42:17] for instance, a non-debug load of a random wiki page just now, hits: [16:42:20] http://en.wikipedia.org/static/1.26wmf5/resources/assets/file-type-icons/fileicon-ogg.png [16:42:30] so some /static/ is direct, outside of RL [16:42:44] it could come from front end varnish, back end varnish or direct from the web server tier [16:43:21] hoo, how are you using root@localhost there? I thought only ops could do that? [16:43:36] bblack: bd808: OK I think I understand... 
so in summary that means that debug=true won't work, say, for another 2 weeks? that could be bad... [16:43:50] Krenair: By using the evil mysql root password [16:43:58] well, we can manually purge the affected files, but that's not part of any normal process [16:43:59] I guess it's not broken for everyone, just CN [16:44:04] you have the mysql root password..? [16:44:05] debug=true is always a hack as well [16:44:14] Everyone has [16:44:24] bblack: bd808: hmmm well it's used for debuging javascript on production [16:44:32] unless there's a better way... [16:44:38] right, and debugging is a hack [16:44:59] e.g. not normal or scalable operation [16:45:02] rrrg... [16:45:04] :) [16:45:37] akosiaris: well… I can’t scp from an instance w/out a floating IP. I’m not sure what that tells us [16:46:09] Like I said before I think we need a phab task, some discussion and the probably some good documentation made and maybe some new tools [16:46:38] akosiaris: oh, but I can wget beta.wmflabs.org from both. [16:46:39] +1 [16:46:43] bblack: bd808: I think it might be important to try to purge those CN js files sometime in the next few days, if possible. The folks designing banners do use debug=true, I'm pretty sure, when they work on in-banner JS [16:46:44] So I’m back to thinking that everything works. [16:46:52] want to double check? [16:46:53] <_joe_> papaul: [16:47:08] andrewbogott: the first once was expected. the second one... why does it work ? [16:47:16] the DNS split horizon ? [16:47:19] bd808: bblack: so it's just my imagination that something changed? I'm pretty sure I've seen off-train updates where debug=true does bring in the new code [16:47:20] <_joe_> sorry, hitted enter again. mw2128 is ok and back in the pool [16:47:28] AndyRussG: if you can send me a list of the affected URL paths (in terms of /static/wmf-whatever/.../something.js, etc) I can do that [16:47:28] oh… [16:47:34] yes, probably. 
My mistake, should be doing this by ip [16:47:38] (03CR) 10coren: [C: 032] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/202363 (https://phabricator.wikimedia.org/T94990) (owner: 10Tim Landscheidt) [16:47:43] andrewbogott: yup, that's it [16:47:44] but it's not about to become any part of any normal deploy process for me to do that all the time :) [16:47:48] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [16:47:55] ok, you’re right. [16:47:56] bblack: OK will do :) much appreciated! [16:48:05] andrewbogott: and here's your problem. If you undo the split horizon, you lose functionality [16:48:06] So, it works from an instance w/a floating IP [16:48:08] and not from one w/out [16:48:14] AndyRussG: re: past behavior, it's basically not gauranteed what will happen [16:48:19] sometimes it will work :) [16:48:30] oh [16:48:31] OK [16:48:34] andrewbogott: yup. That is what I meant by "it will not always work" [16:48:36] iiiiiiiiinteresting [16:48:40] Yep, confirmed. [16:48:47] probably this is where the s/bits/text/ change comes in: the old bits caches had very small caches, so things were likely to fall off randomly and refresh faster [16:48:56] AndyRussG: there may be some regression here, yes. bblack and o.ri have been working to eliminate the bits vhost to make spdy work better and reduce TLS handshake counts [16:49:00] andrewbogott: so ? what are we going to do about DNS ? [16:49:11] bblack: yeah that makes sense [16:49:19] bblack: K, I'll send an e-mail with the paths to purge in a little while [16:49:30] akosiaris: I don’t know. 
One option is to just document this as ‘you need a floating IP if you want XXX to work' [16:49:33] because there was always this "wait a bit and bits will start doing what you expect" assumption [16:49:47] I really don’t want to fix this in DNS because that will be EXTREMELY messy [16:49:51] grrr [16:49:58] what I want is for routing to just work the way it should :( [16:50:08] andrewbogott: hehe, so everyone starts requests floating IPs just to get basic things done ? [16:50:15] bblack: bd808: mmm K... thanks so much for ur help with this BTW!! [16:50:18] bd808: we could sort of emulate that by directly limiting the maximum TTL of /static/ stuff in the text/mobile caches for now, but honestly I'd rather fix it right [16:50:19] we are going to run out of IPs :P [16:50:20] well, those basic things are fairly unusual things [16:50:26] It’s mostly just within beta that it would be an issue. [16:50:28] RECOVERY - RAID on es2010 is OK optimal, 1 logical, 2 physical [16:50:28] PROBLEM - puppet last run on mw2190 is CRITICAL puppet fail [16:50:44] <^d> Ugh, why are we limited to 4 recovering indices? [16:50:50] bblack: yeah. I'd love to see us handle static assets more cleanly [16:51:02] andrewbogott: well, any VM that wants to access beta, or CI, or tools, or something else hosted in labs [16:51:03] * ^d keeps having to bump that during rolling restarts. [16:51:08] yeah [16:51:15] ^d: ah? [16:51:35] it starts to get a big "if then else" flowchart for ppl to mentally follow [16:51:43] Yeah [16:51:50] <^d> godog: relo: 1, init: 4, unassigned: 102 [16:52:07] <^d> At that rate it'll take quite some time, we should probably have those limits higher in elasticsearch.yml [16:52:10] akosiaris: if we were to hotfix nova-network to make this work properly, would that be possible? Would you know what that fix would look like? [16:52:20] bblack: ideally every static asset would have a distinctly versioned URL. Update the file and the the URL magically changes. 
Then you can do far futures expires and edge caching on everything [16:52:40] but that takes design :) [16:52:41] andrewbogott: define properly [16:52:47] !log kartik Finished scap: Update ContentTranslation (duration: 44m 07s) [16:52:55] Logged the message, Master [16:53:03] bd808: we could start putting the git commit hash in the URL :) [16:53:18] I've done that before in other systems [16:53:31] andrewbogott: because I doubt the openstack ppl have not thought of the "VM w/o floating IP accessing VM w/ floating IP" scenario [16:53:38] <^d> godog: I've bumped it as high as 30 before. Can usually run it at 10ish without noticing much of a change from what we have now [16:53:48] they probably have and have not come up with a solution [16:53:50] <^d> Anything above 20 gets sketchy :p [16:53:57] akosiaris: there are facilities to automatically assign a floating IP to every new instance. [16:54:12] So if folks are using that they might regard our use case as a dusty corner [16:54:44] ^d: sounds good to me, haven't timed how long it is taking for a single node to recover but I'd say in ~30m order [16:54:44] I guess I can just email vish, maybe he’ll have time to respond. [16:54:58] andrewbogott: well, we dont have enough IPs for the entire of labs [16:55:03] <^d> godog: We could probably shave close to ~10m off if we tune these a little finer. [16:55:04] nope [16:55:40] bd808: I'm imagining whatever populates the tree on scap, making a separate little tree where 1.25wmf6 is replaced with a sha1 fragment, and each file has all of its past versions living in that tree, under the hash of the last commit that changed that particular file [16:55:49] don't ask me how you'd manage the refs to them, though. [16:56:04] that's the tirck [16:56:07] akosiaris: join me in #openstack-nova? I’m going to try to get vishy to address this directly, since he seems to have coded much of that. 
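Editor's note on the "distinctly versioned URL" idea (16:52:20) and the git-hash suggestion (16:53:03): a minimal sketch follows, assuming a hash of the file contents is used in place of the release branch name. The discussion proposes a git commit hash; a content hash is swapped in here only to keep the example self-contained. The helper name, the 12-character prefix, and the path layout are hypothetical.

```javascript
const crypto = require("crypto");

// Hypothetical helper: derive the asset URL from the file's bytes, so that
// changing the file changes the URL and an old cached copy is never reused.
function versionedAssetUrl(path, content) {
  const digest = crypto.createHash("sha1").update(content).digest("hex");
  return "/static/" + digest.slice(0, 12) + "/" + path;
}

const before = versionedAssetUrl("ext/banner.js", "url: scriptUrl.toString,");
const after = versionedAssetUrl("ext/banner.js", "url: scriptUrl.toString(),");
const again = versionedAssetUrl("ext/banner.js", "url: scriptUrl.toString(),");

console.log(before !== after); // true: edited file, new URL
console.log(after === again);  // true: same bytes, same URL
```

With URLs like these, far-future expiry headers would be safe in principle, since no URL ever maps to two different contents. As noted at 16:55:49, the open question is how references to those URLs get updated.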
[16:56:07] *trick [16:56:53] <^d> godog: It's cluster.routing.allocation.node_concurrent_recoveries I'm looking at [16:56:54] I don't think that MediaWiki has any sort of facility for that [16:57:21] or alternatively, we can speed up train version bumps so that people don't jump it very often, and/or make train jumps a special case that uses an invalidator script that targets specific changed files [16:57:36] those are more likely [16:58:15] the purge part shouldn't be too hard really if we don't need to enumerate all of the fqdns that may have asked for the resource [16:58:33] GWT had something like that, every version of a file had an identifier and could be cached infinitely [16:59:02] <^d> !log elasticsearch: set transient cluster.routing.allocation.node_concurrent_recoveries on prod cluster to 8 (default: 2) to speed up recoveries. [16:59:07] Logged the message, Master [16:59:09] <^d> godog: ^ [16:59:18] the "easy" start would be to make "purge-file" and "purge-dir" commands that people could run manually [16:59:49] but we'd want to make sure you could not run "purge-dir w/static" to purge all the things [17:00:01] we have to be careful with those, though. they could easily cause huge overloads if they were applied very broadly [17:00:06] yeah [17:00:07] yeah [17:00:11] (03PS1) 10Hoo man: Add grants on centralauth.* via production-grants-core [puppet] - 10https://gerrit.wikimedia.org/r/210932 [17:00:44] <^d> godog: `curl localhost:9200/_cluster/settings?pretty` will give you some interesting stuff. Most of the non-elasticsearch.yml config we have in prod right now is related to recovery stuff. [17:00:52] bblack: we could rate limit sending the purges as one level of protection [17:00:58] only N per unit time [17:00:58] I think we can work around the hostname part pretty easily. worst-case, I'll have to add a conditional in VCL for PURGE on /static/ stuff to separate ignore the hostname part there as well. 
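Editor's note on rate limiting purges ("only N per unit time", 17:00:58): the sketch below plans send times for a list of purge paths so that no window exceeds a fixed count. It only computes a schedule and sends nothing; the function name, its parameters, and the example paths are assumptions, not part of any existing scap or Varnish tooling.

```javascript
// Hypothetical planner: assign each purge a send time so that at most
// maxPerWindow purges fall into any one window of windowMs milliseconds.
function planPurges(paths, maxPerWindow, windowMs) {
  if (!Number.isInteger(maxPerWindow) || maxPerWindow < 1) {
    throw new RangeError("maxPerWindow must be a positive integer");
  }
  return paths.map((path, index) => ({
    path,
    // Purges are grouped in arrival order; each full group moves to the next window.
    sendAtMs: Math.floor(index / maxPerWindow) * windowMs,
  }));
}

const plan = planPurges(
  ["/static/a.js", "/static/b.js", "/static/c.js", "/static/d.js", "/static/e.js"],
  2,
  1000
);

// Send offsets in milliseconds: 0, 0, 1000, 1000, 2000
console.log(plan.map((entry) => entry.sendAtMs).join(", "));
```

A real tool would presumably also need the guard mentioned at 16:59:49 against purging an entire tree in one command; that part is left out here.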
[17:01:32] or something [17:01:36] it can be done one way or another [17:03:36] ^d: nice! have to run now but I'll take a look later, we should set a goal for full-cluster restarts timing [17:03:42] so step 1 in hyptothetical phab task is probably for me: research/implement whatever the HTCP invalidation requests need to look like to work. [17:03:51] *hypothetical! [17:03:58] <^d> godog: Agreed. I think we can get it down some :) [17:05:38] yup [17:05:40] * godog waves [17:07:19] PROBLEM - Host tellurium is DOWN: PING CRITICAL - Packet loss = 100% [17:07:58] working on a phab task [17:08:02] ^ that's me....thought I was still within scheduled maint [17:08:09] cmjohnson1: ok [17:08:39] RECOVERY - puppet last run on mw2190 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:08:49] ok [17:10:19] RECOVERY - Host tellurium is UP: PING OK - Packet loss = 0%, RTA = 3.02 ms [17:10:47] 6operations, 10Deployment-Systems, 10Traffic: Fix static asset varnish cache invalidation issues - https://phabricator.wikimedia.org/T99094#1285268 (10BBlack) 3NEW [17:11:04] 6operations, 10Deployment-Systems, 10Traffic: Fix static asset varnish cache invalidation issues - https://phabricator.wikimedia.org/T99094#1285275 (10BBlack) [17:15:39] PROBLEM - High load average on labstore1001 is CRITICAL 50.00% of data above the critical threshold [24.0] [17:19:17] 6operations: Upgrade sodium to jessie - https://phabricator.wikimedia.org/T82698#1285294 (10RobH) Do we have any specific requirements for a new system for this, other than 'similar to sodium'? This seems to have stalled out, due to other items taking precedence, but seems the last update suggests this is good... [17:19:33] 6operations, 10Wikimedia-Mailing-lists: close and delete the flowfunding mailing list - https://phabricator.wikimedia.org/T97328#1285300 (10JohnLewis) a:3RobH [17:21:30] akosiaris: not much to do now but wait and hope for a response on IRC or mailing list. 
If nothing turns up I guess I’ll start building a second DNS server :( [17:21:33] Thanks for your help with this. [17:22:16] 6operations, 10Deployment-Systems, 10Traffic: Fix static asset varnish cache invalidation issues - https://phabricator.wikimedia.org/T99094#1285306 (10BBlack) [17:22:38] akosiaris, bblack, paravoid, mark, I’d love it if someone could follow up with (and implement a solution ) for https://phabricator.wikimedia.org/T99085 so that jaime can get unblocked. [17:25:24] andrewbogott: the quick hack workaround would be to set up netcat or something to bounce the connection through e.g. labcontrol1001, if it's just for a one-off thing. [17:25:42] what are we looking for here, a temporary exception to a firewall rule on the router? [17:26:21] 6operations, 10Deployment-Systems, 7Varnish: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1285315 (10bd808) 3NEW [17:26:23] bblack: A temporary work-around would suit my purposes. Jaime was asking partly because I think he suspected that some kind of proper redesign was in order. “something tells me that they may want to change the vlan for security reasons” [17:27:16] yeah that's kind of a tricky line there.... [17:27:55] AndyRussG, bblack: I opened a ticket at least -- https://phabricator.wikimedia.org/T99096 [17:28:00] I mean, we want labs/prod separated in the general case, and they are. and then we have this prod database in the prod db cluster, which needs to act more like it's in the labs side of the network? but actually putting it there would probably cause more problems. 
[17:28:02] The goal is to get the db off of virt1000 entirely, so we wouldn’t need bi-directional traffic in the long run [17:28:30] bd808: so did I lol :) [17:28:31] https://phabricator.wikimedia.org/T99094 [17:28:47] we can dupe-close mine, yours looks more-details [17:29:40] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1285322 (10BBlack) [17:29:47] 6operations, 10Deployment-Systems, 10Traffic: Fix static asset varnish cache invalidation issues - https://phabricator.wikimedia.org/T99094#1285324 (10bd808) [17:29:49] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1285325 (10bd808) [17:30:10] 6operations, 10Deployment-Systems, 10Traffic: Fix static asset varnish cache invalidation issues - https://phabricator.wikimedia.org/T99094#1285327 (10BBlack) 5duplicate>3Invalid a:3BBlack Duplicate of T99096 [17:30:35] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1285337 (10bd808) From @bblack's description at T99094: > Currently, when static assets are deployed... [17:30:48] oh there's a real merge operation that's separate from resolving with a link? [17:31:20] duh, Merge Duplicates In at the top [17:31:31] I'm just used to older systems where resolved->duplicate is where you go [17:32:05] pahb is goofy [17:32:50] when you merge it just closes the merged task and leaves a tiny breadcrumb that says something was merged in the timeline [17:33:18] and the non-existent conflict resolution... 
don't get me started [17:33:23] well still, probably provides some better metadata somewhere than the resolved->invalid + link text I've been using as a hacky merge so far [17:33:55] that's the thing i dislike the most about phab vs. RT , honestly, that "merge" is not an actual merge anymore [17:34:12] godog: back [17:34:19] yeah we used that feature a lot for maint emails from vendors [17:34:25] and quotes [17:34:27] and things like that [17:34:37] yeah, the merge thing in phabricator is weird [17:34:46] do we have a ticket open for it? [17:34:55] it got rejected [17:34:58] hold on [17:34:59] oh right [17:35:37] 6operations: mailing list server maintainance window - https://phabricator.wikimedia.org/T99098#1285379 (10RobH) 3NEW a:3RobH [17:35:40] https://phabricator.wikimedia.org/T96424 [17:35:43] make 7 tickets with one word of "Make merging tickets in phabricator work properly" in each title, then merge them :P [17:35:49] haha [17:36:41] bd808: there's at least some conflict resolution these days. 
I got an error recently 'can't apply your changes, do you just want to place the comment?', like we had in BZ [17:36:55] oh that's nice [17:37:07] there _was_ a duplicate of that ticket [17:37:13] 6operations, 10Wikimedia-Mailing-lists: mailing list server maintainance window - https://phabricator.wikimedia.org/T99098#1285389 (10Krenair) [17:37:57] robh: never seen someone file a ticket for a few lists job before ;) [17:37:59] 6operations, 10Wikimedia-Mailing-lists: mailing list server maintainance window - https://phabricator.wikimedia.org/T99098#1285393 (10RobH) [17:38:09] 6operations, 10Wikimedia-Mailing-lists: mailing list server maintainance window - https://phabricator.wikimedia.org/T99098#1285379 (10RobH) [17:38:20] valhallasw: you bring good news :) [17:38:31] 6operations, 10Wikimedia-Mailing-lists: mailing list server maintainance window - https://phabricator.wikimedia.org/T99098#1285379 (10RobH) [17:40:54] (03Abandoned) 10Dereckson: Logo configuration on ur.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207491 (https://phabricator.wikimedia.org/T97510) (owner: 10Dereckson) [17:42:44] 6operations, 10Wikimedia-Mailing-lists: mailing list server maintainance window - https://phabricator.wikimedia.org/T99098#1285407 (10RobH) [17:42:51] (03CR) 10Gage: [C: 031] add_ip6_mapped: enable token-based SLAAC for all jessie/trusty [puppet] - 10https://gerrit.wikimedia.org/r/202725 (https://phabricator.wikimedia.org/T94417) (owner: 10BBlack) [17:43:45] 6operations, 3Roadmap, 10Wikimedia-Mailing-lists, 7notice: mailing list server maintainance window - https://phabricator.wikimedia.org/T99098#1285415 (10RobH) [17:44:05] 6operations, 3Roadmap, 10Wikimedia-Mailing-lists, 7notice: Mailing list maintenance window - 2015-05-19 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T99098#1285416 (10RobH) p:5Triage>3Normal [17:45:09] (03CR) 10BBlack: [C: 031] "I think this looks sane, I'd rather have it working everywhere than just special cases. 
I haven't really tested any of it obviously." [puppet] - 10https://gerrit.wikimedia.org/r/210926 (owner: 10Faidon Liambotis) [17:46:10] 6operations, 6Analytics-Engineering, 5Patch-For-Review: Install enchant on stat1002 and stat1003 - https://phabricator.wikimedia.org/T99030#1285421 (10Dzahn) >>! In T99030#1284110, @Halfak wrote: > Might it make sense to create a task for a role common to stat machines? It seems that this is a common proble... [17:47:46] 6operations, 10Wikimedia-Mailing-lists: close and delete the flowfunding mailing list - https://phabricator.wikimedia.org/T97328#1285426 (10RobH) This task is planned to be accomplished during the upcoming maintenance window on sodium, via T99098. [17:47:58] 6operations, 3Roadmap, 10Wikimedia-Mailing-lists, 7notice: Mailing list maintenance window - 2015-05-19 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T99098#1285379 (10RobH) This task is planned to be accomplished during the upcoming maintenance window on sodium, via T99098. [17:48:00] bd808: cool, thx! [17:48:22] AndyRussG: I'll expect your patches to fix this next week ;) [17:48:33] 6operations, 6Analytics-Engineering, 5Patch-For-Review: Install enchant on stat1002 and stat1003 - https://phabricator.wikimedia.org/T99030#1285430 (10Halfak) I wonder if statistics::cruncher would be a good common role between machines. To me, it sounds like an appropriate name for use-cases common to both... [17:49:04] bd808: for the varnish ticket? [17:49:19] yeah and scap [17:49:33] (note the wink) [17:50:16] (03PS1) 10Dereckson: Logo configuration on ur.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210934 (https://phabricator.wikimedia.org/T97510) [17:50:26] (03CR) 10Dereckson: "Superseded by I01e82115d8a0a57b7d0cb5a235fda6bd75240345." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/207491 (https://phabricator.wikimedia.org/T97510) (owner: 10Dereckson) [17:50:26] hmmmmm yeah I've a long way to go before submitting any varnish stuff 8p [17:50:29] bblack: is ‘set up netcat or something’ easy? Because even if that’s not the long-term solution I’d like to get jaime unblocked (since that task is blocking me too) [17:53:39] (03PS2) 10BBlack: add_ip6_mapped: enable token-based SLAAC for all jessie/trusty [puppet] - 10https://gerrit.wikimedia.org/r/202725 (https://phabricator.wikimedia.org/T94417) [17:54:16] andrewbogott: something like http://29a.ch/2009/5/10/forwarding-ports-using-netcat [17:54:33] basically you tell netcat to listen on some port and forward traffic to another netcat command that connects to the remote port [17:54:37] (03CR) 10Ottomata: Add fluorine rsync connector (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/209684 (owner: 10OliverKeyes) [17:54:50] 6operations, 6Analytics-Engineering, 5Patch-For-Review: Install enchant on stat1002 and stat1003 - https://phabricator.wikimedia.org/T99030#1285446 (10Dzahn) If we would put the existing statistics::cruncher role on stat1002 now we would effectively change that server and install all the things that role ins... [17:55:42] * andrewbogott pastes in ticket [17:55:45] assuming there's a common host the two disconnected hosts can both see [17:56:20] it's kinda hacky, I'm just offering what might be an easy workaround because I don't really have time to look at this deeper right now :/ [17:56:42] I'm not really sure why whatever's blocked from whatever or how the world should look and what all the implications are [17:57:06] godog: want me to take over? 
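(editor's note) The netcat workaround described at 17:54:33 (listen on a port on a host that both sides can reach, and forward each connection to the remote port) can be sketched roughly as below. This is only an illustration of the idea in Python, not what was actually run: the function names (pipe, handle, start_relay) are made up for this sketch, and the hosts and ports are whatever the caller passes in.

```python
import socket
import threading


def pipe(src, dst):
    """Copy bytes from src to dst until src closes, then half-close dst."""
    try:
        while True:
            data = src.recv(4096)
            if not data:
                break
            dst.sendall(data)
    except OSError:
        pass  # either side went away; stop copying
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass  # dst is already closed


def handle(client, target_host, target_port):
    """Open the onward connection and shuttle bytes in both directions."""
    try:
        upstream = socket.create_connection((target_host, target_port))
    except OSError:
        client.close()
        return
    back = threading.Thread(target=pipe, args=(upstream, client), daemon=True)
    back.start()
    pipe(client, upstream)
    back.join()
    client.close()
    upstream.close()


def start_relay(listen_host, listen_port, target_host, target_port):
    """Listen locally and forward every accepted connection to the target.

    Returns the listening socket so the caller can read the bound port
    and close it to stop accepting new connections.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((listen_host, listen_port))
    server.listen()

    def accept_loop():
        while True:
            try:
                client, _ = server.accept()
            except OSError:
                return  # accept failed, e.g. the listening socket was closed
            threading.Thread(
                target=handle,
                args=(client, target_host, target_port),
                daemon=True,
            ).start()

    threading.Thread(target=accept_loop, daemon=True).start()
    return server
```

(editor's note) As with the netcat version, this assumes the relay host is allowed to talk to both ends, and it is a stopgap for a one-off task rather than a replacement for fixing the firewall rule.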
[17:57:59] ^d: speeding up recoveries seems to have increased the load a bit [17:58:11] (03CR) 10BBlack: "PS2 just rebased and fixed the code comments to match reality better" [puppet] - 10https://gerrit.wikimedia.org/r/202725 (https://phabricator.wikimedia.org/T94417) (owner: 10BBlack) [17:58:14] <^d> I WENT FROM 2->8 [17:58:17] <^d> fuck, capslock [17:59:04] Reedy: AWB is completely dead based on the reports on enwp. Any chance of a rollback until it's fixed? I'd like to do some data crunching. [18:00:09] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [18:00:53] (03PS2) 10Thcipriani: Deployment group for trebuchet [puppet] - 10https://gerrit.wikimedia.org/r/209045 (https://phabricator.wikimedia.org/T97775) [18:02:40] a large 503 spike is starting up... [18:03:59] 7Puppet, 6operations, 10Beta-Cluster, 5Patch-For-Review: Trebuchet on deployment-bastion: wrong group owner - https://phabricator.wikimedia.org/T97775#1285452 (10thcipriani) @ArielGlenn I just updated my patch to allow a per-repo override of `deployment_repo_group` in the `repo_config` pillar which overri... [18:04:07] seems to be related to uploadstash / thumbnails [18:04:39] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 21.43% of data above the critical threshold [500.0] [18:04:45] 6operations, 6Analytics-Engineering, 5Patch-For-Review: Install enchant on stat1002 and stat1003 - https://phabricator.wikimedia.org/T99030#1285453 (10Ottomata) Note that both role::statistics::cruncher and role::statistics::private include the module class statistics::compute. I would just ensure these pac... [18:05:09] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [18:08:37] (03PS2) 10Dzahn: stats: add role for enchant on stat1002/1003 [puppet] - 10https://gerrit.wikimedia.org/r/210847 (https://phabricator.wikimedia.org/T99030) [18:09:22] T13|mobile: pls report on phab. this is obviousely the wrong channel... 
[18:10:08] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [18:10:22] It's already been reported. Was just checking in with one of the developers of the app to get an eta on being able to use it again. [18:10:41] mutante: did you see my comment about the enchant thing? [18:11:00] RECOVERY - High load average on labstore1001 is OK Less than 50.00% above the threshold [16.0] [18:11:23] ottomata: i just did now :) [18:11:26] :) [18:12:22] ottomata: ok, thanks, i'll amend and put it in statistics::compute [18:12:49] cool [18:13:03] statistics crap is confusing [18:13:06] i know. [18:13:45] stat1002 (private) and stat1003 (cruncher) are basically the same, except stat1002 has private data, and also hadoopish clients to the analytics cluster (which has private data) [18:14:25] i kept thinking maybe it should be role research::something when it's for halfak's things [18:14:59] ottomata: gotcha about the private data difference, that was my guess [18:15:31] so researchers need both [18:16:19] often, yes, usually its just easiest to share things like package installs between the two [18:16:28] researchers usually have access to both [18:16:30] well ha [18:16:38] except the 'researchers' group [18:16:42] does not have access to stat1002 [18:16:50] that group exists soley to give access to a pw file on stat1003 [18:17:07] and we encourage people to use that and stat1003 to connect to the mysql reserach slaves [18:17:12] :p that last part is fun [18:17:12] but they could technically do it from anywhere [18:18:13] ottomata: so technically the GRANT is too broad on the mysql side? 
[18:19:23] oh for sure [18:19:31] we've talked about this several times though [18:19:37] * halfak watches on [18:19:37] we have no good way to manage real people users in mysql [18:19:48] there is a single mysql user account that everybody users [18:19:49] uses* [18:20:10] (03CR) 10OliverKeyes: Add fluorine rsync connector (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/209684 (owner: 10OliverKeyes) [18:20:30] yea, but besides having just one user, it's about where it can connect from [18:20:30] i think i've posited making pupet manage the users, but if i remember right, sean didn't want the mysql users to be automated, since puppet could potentially do something really nasty [18:20:36] ohoh [18:20:38] i mean, i guess? [18:20:51] that is still more for sean or whoever to manage [18:21:01] but not as much as individual accounts [18:21:33] yea, i just meant the "do it from anywhere" part [18:22:05] well, anywhere in prod network [18:22:14] e.g. stat1002, bast1001, etc. [18:22:28] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [18:25:00] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2120.98796876 [18:25:08] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [18:26:52] 6operations, 10Analytics-Cluster: Kafka Broker disk usage is imbalanced - https://phabricator.wikimedia.org/T99105#1285537 (10Ottomata) 3NEW [18:30:08] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [18:35:08] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [18:40:09] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [18:41:17] (03PS3) 10John F. Lewis: Add deployment server role to mira [puppet] - 10https://gerrit.wikimedia.org/r/209874 (https://phabricator.wikimedia.org/T95436) [18:41:28] (03PS4) 10John F. 
Lewis: Add deployment server role to mira [puppet] - 10https://gerrit.wikimedia.org/r/209874 (https://phabricator.wikimedia.org/T95436) [18:42:54] (03PS1) 10John F. Lewis: Add mira to mediawiki-installation dsh [puppet] - 10https://gerrit.wikimedia.org/r/210938 (https://phabricator.wikimedia.org/T95436) [18:44:39] (03Abandoned) 10John F. Lewis: graphite: use use HTTPS by default [puppet] - 10https://gerrit.wikimedia.org/r/198564 (owner: 10John F. Lewis) [18:55:08] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [18:58:30] hey valhallasw [19:00:08] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [19:00:40] 7Puppet, 6Phabricator: Puppet lock files fail because tag names are treated like dirs - https://phabricator.wikimedia.org/T98411#1285648 (10mmodell) Hmm I'm not sure whether it's better to filter/escape the tag in puppet or just use a different tag naming convention, e.g. `2015-06-06.1` [19:01:12] 7Puppet, 6Phabricator: Puppet lock files fail because tag names are treated like dirs - https://phabricator.wikimedia.org/T98411#1285649 (10mmodell) a:3mmodell [19:01:23] 7Puppet, 6Phabricator: Puppet lock files fail because tag names are treated like dirs - https://phabricator.wikimedia.org/T98411#1267229 (10mmodell) p:5Triage>3Normal [19:02:31] (03PS34) 10Yuvipanda: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) [19:03:00] (03CR) 10Yuvipanda: "Got rid of the PyYAML thing. As for tools. - this jives with tools.manifest being that, and if we rename we can rename them all later. 
Thi" (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [19:03:14] (03PS3) 10Yuvipanda: Add simple debian packaging [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210850 [19:03:22] 6operations, 6Phabricator: Activate OAuth Server Application in Phabricator for phragile login - https://phabricator.wikimedia.org/T98954#1285653 (10mmodell) I don't know if we really need help from #operations, I can probably set this up. [19:04:04] (03CR) 10Yuvipanda: "patchsets and approaches are not set in stone - let's not get caught up with git-buildpackage right now. We can use it for this later if d" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210850 (owner: 10Yuvipanda) [19:05:08] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [19:09:58] (03PS4) 10Yuvipanda: Add simple debian packaging [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210850 [19:10:08] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [19:13:18] jouncebot: next [19:13:18] In 3 hour(s) and 46 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150514T2300) [19:15:09] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 297 seconds ago with 0 failures [19:15:32] !log debugging on tin / mw1017 for nostalgiawiki issue [19:15:41] Logged the message, Master [19:24:03] !log legoktm Synchronized php-1.26wmf5/skins/Nostalgia/skin.json: touch (duration: 00m 17s) [19:24:09] Logged the message, Master [19:24:35] Krenair: ^ should work without debug now [19:25:09] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [19:25:49] legoktm, works for me! [19:25:51] what'd you do? [19:25:56] I just touched the file [19:26:01] ... 
huh [19:26:08] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [19:26:13] bad cache got stuck somewhere [19:26:48] and cache invalidation is based on mtime() of all the queued files [19:30:08] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 228 seconds ago with 0 failures [19:35:21] (03PS5) 10Yuvipanda: Add simple debian packaging [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210850 [19:37:30] (03CR) 10Yuvipanda: "W: tools-webservice: binary-without-manpage usr/bin/webservice-new" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210850 (owner: 10Yuvipanda) [19:41:26] HaeB: https://phabricator.wikimedia.org/T97378 [19:43:07] (03PS13) 10Yuvipanda: Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 (https://phabricator.wikimedia.org/T98818) [19:43:18] (03PS14) 10Yuvipanda: Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 (https://phabricator.wikimedia.org/T98818) [19:43:23] !log mass unsubcription in listadmins list, resulting in unsupressed mass unsubscribe notices to all listadmin email address (sorry about the emails!) [19:43:28] Logged the message, Master [19:45:15] ori: awesome :) [19:45:29] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 50.00% of data above the critical threshold [35.0] [19:49:08] Coren: mark ^ is from *two* rsyncs running - one whch isn’t at IDLE priority [19:49:12] is that expected? [19:49:18] 6operations, 6Phabricator, 5Patch-For-Review: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#1285796 (10mmodell) 5Open>3Resolved As far as I can tell this is resolved, please reopen if there are further problems. 
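(editor's note) On the 19:26:48 remark that cache invalidation is based on the mtime of the files: the log does not show the actual MediaWiki code, so the following is only a generic Python sketch of that technique, with hypothetical names (newest_mtime_ns, MtimeCache). It shows why touching skin.json was enough to force a rebuild: the cached value is reused only while the newest modification time among the input files is unchanged.

```python
import os


def newest_mtime_ns(paths):
    """Return the most recent modification time (in ns) among the files."""
    return max(os.stat(p).st_mtime_ns for p in paths)


class MtimeCache:
    """Cache a built value per file set, rebuilt when any file's mtime moves."""

    def __init__(self):
        self._store = {}  # tuple of paths -> (mtime stamp, built value)

    def get(self, paths, build):
        paths = tuple(paths)
        stamp = newest_mtime_ns(paths)
        cached = self._store.get(paths)
        if cached is None or cached[0] != stamp:
            # Either never built, or a file changed (or was touched) since.
            cached = (stamp, build(paths))
            self._store[paths] = cached
        return cached[1]
```

(editor's note) A scheme like this cannot notice a bad entry that was built from unchanged files, which is consistent with the "bad cache got stuck somewhere" diagnosis: bumping the mtime is the only signal it listens to.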
[19:50:32] 7Puppet, 6Phabricator: Puppet lock files fail because tag names are treated like dirs - https://phabricator.wikimedia.org/T98411#1285801 (10mmodell) [19:50:59] Coren: mark okay, definitely *two* of them running, one in a screen and one not in a screen [19:51:20] * yuvipanda considers killing the one not in a screen [19:55:09] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [19:55:43] People are getting logged out, logged in. https://www.wikidata.org/wiki/Wikidata:Contact_the_development_team#Logged_in_but_not_logged_in [20:00:08] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 202 seconds ago with 0 failures [20:05:52] (03PS3) 10Dzahn: stats: add role for enchant on stat1002/1003 [puppet] - 10https://gerrit.wikimedia.org/r/210847 (https://phabricator.wikimedia.org/T99030) [20:06:15] (03PS4) 10Dzahn: stats: add enchant and myspell packages [puppet] - 10https://gerrit.wikimedia.org/r/210847 (https://phabricator.wikimedia.org/T99030) [20:06:46] (03PS5) 10Dzahn: stats: add enchant and myspell packages [puppet] - 10https://gerrit.wikimedia.org/r/210847 (https://phabricator.wikimedia.org/T99030) [20:07:23] (03CR) 10jenkins-bot: [V: 04-1] stats: add enchant and myspell packages [puppet] - 10https://gerrit.wikimedia.org/r/210847 (https://phabricator.wikimedia.org/T99030) (owner: 10Dzahn) [20:09:51] (03PS6) 10Dzahn: stats: add role for enchant on stat1002/1003 [puppet] - 10https://gerrit.wikimedia.org/r/210847 (https://phabricator.wikimedia.org/T99030) [20:10:40] (03CR) 10jenkins-bot: [V: 04-1] stats: add role for enchant on stat1002/1003 [puppet] - 10https://gerrit.wikimedia.org/r/210847 (https://phabricator.wikimedia.org/T99030) (owner: 10Dzahn) [20:11:53] (03PS7) 10Dzahn: stats: add role for enchant on stat1002/1003 [puppet] - 10https://gerrit.wikimedia.org/r/210847 (https://phabricator.wikimedia.org/T99030) [20:15:13] (03PS16) 10Yuvipanda: Add lighttpd server types [software/tools-webservice] - 
10https://gerrit.wikimedia.org/r/210833 (https://phabricator.wikimedia.org/T98818) [20:15:46] (03PS8) 10Dzahn: statistics: add spell checker packages [puppet] - 10https://gerrit.wikimedia.org/r/210847 (https://phabricator.wikimedia.org/T99030) [20:17:03] (03PS9) 10Dzahn: statistics: add spell checker packages [puppet] - 10https://gerrit.wikimedia.org/r/210847 (https://phabricator.wikimedia.org/T99030) [20:18:44] (03CR) 10Dzahn: [C: 032] ""I put a spell on you because you're mine" [puppet] - 10https://gerrit.wikimedia.org/r/210847 (https://phabricator.wikimedia.org/T99030) (owner: 10Dzahn) [20:19:23] (03PS17) 10Yuvipanda: Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 (https://phabricator.wikimedia.org/T98818) [20:19:38] (03PS18) 10Yuvipanda: Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 (https://phabricator.wikimedia.org/T98818) [20:22:49] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [20:23:37] 6operations, 6Analytics-Engineering, 5Patch-For-Review: Install enchant on stat1002 and stat1003 - https://phabricator.wikimedia.org/T99030#1285868 (10Dzahn) @ottomata thanks, added to statistics::compute. also, used "ensure_packages" like for the other existing ones @halfak done, the packages have been ins... [20:24:26] halfak: ^ done. dpkg -l | grep myspell [20:25:09] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [20:25:20] mutante, thanks! [20:25:39] halfak: all language dictionaries, only exception that -tools package [20:25:51] " This package contains a the munch/unmunch tools of hunspell and ispellaff2myspell for converting ispell affix files for myspell/hunspell format. " ??:) [20:26:14] mutante, yeah, that doesn't sound like one that I was actually looking for. 
[20:26:44] halfak: alright, then should be good to go on both servers [20:27:43] 6operations, 6Analytics-Engineering, 5Patch-For-Review: Install enchant on stat1002 and stat1003 - https://phabricator.wikimedia.org/T99030#1285873 (10Dzahn) 5Open>3Resolved [20:30:08] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [20:35:08] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 1 failures [20:40:09] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:41:37] um, jgage yt? i need a sanity check! [20:42:03] 10Ops-Access-Requests, 6operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1285899 (10dr0ptp4kt) Team, I have received approval for the access request from Lila Tretikov. @ArielGlenn, regarding the domain capping, would you create a new tracking ticket on that? I'll follo... [20:42:29] PROBLEM - puppet last run on mw2029 is CRITICAL Puppet has 1 failures [20:43:25] valhallasw: look at my patches now :P [20:43:35] yuvipanda: no, I'm first going to re-apply puppet [20:43:35] :P [20:44:28] valhallasw: fine, after that :) [20:47:47] 6operations, 10Wikimedia-DNS: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1285914 (10Dzahn) After thinking about it some more, i would now suggest https://legal.wikimedia.org as a landing page for the legal team in general and then something like https://legal.wikimedia.org/nsa for... [20:50:19] PROBLEM - Persistent high iowait on labstore1001 is CRITICAL 55.56% of data above the critical threshold [35.0] [20:52:06] 6operations, 10Wikimedia-DNS: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1285924 (10Heather) Let's not put nsa anywhere it might be mistaken for a partnership. 
legal.wikimedia.org/stopsurveillance was a suggestion, though personally I find surveillance a little hard to spell :P [21:00:20] RECOVERY - puppet last run on mw2029 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:03:31] 10Ops-Access-Requests, 6operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1285969 (10Dzahn) adding @JAlexander because i know he knows about the 1000 domains limit. He mentioned it to me. [21:05:24] 6operations, 10GoogleLogin: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1285976 (10Dzahn) 3NEW [21:06:22] 6operations, 6Labs, 10hardware-requests: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1285990 (10Andrew) 3NEW a:3RobH [21:07:34] 6operations, 10GoogleLogin: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1286012 (10Dzahn) [21:08:29] 10Ops-Access-Requests, 6operations: Additional Webmaster tools access - https://phabricator.wikimedia.org/T98283#1286023 (10Dzahn) >>! In T98283#1285899, @dr0ptp4kt wrote: > @ArielGlenn, regarding the domain capping, would you create a new tracking ticket on that? I'll follow up on email about that. Done. pl... [21:08:58] 6operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 5Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1286025 (10ArielGlenn) now down to 78 jobs! do we consider this done? [21:10:20] 6operations, 10Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1286032 (10JohnLewis) 3NEW a:3RobH [21:11:03] !log elastic1015 es-tool restart-fast [21:11:09] Logged the message, Master [21:12:05] 6operations, 10ops-codfw: degraded RAID / disk fail on es2010 - https://phabricator.wikimedia.org/T98982#1286044 (10Dzahn) >>! In T98982#1285026, @Papaul wrote: > @Dzahn Disk replacement complete @Papaul Thank you very much. 
Confirmed it looks good in monitoring again. RAID OK 2015-05-14 21:10:02 0d... [21:12:19] 6operations, 10ops-codfw: degraded RAID / disk fail on es2010 - https://phabricator.wikimedia.org/T98982#1286045 (10Dzahn) 5Open>3Resolved [21:14:19] !log ori Synchronized php-1.26wmf5/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php: I3df6713a1: Log request times to StatsD (duration: 00m 15s) [21:14:25] Logged the message, Master [21:14:26] 6operations: More adjustments to fundraising tech email groups - https://phabricator.wikimedia.org/T99137#1286049 (10K4-713) 3NEW [21:14:32] !log ori Synchronized php-1.26wmf6/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php: I3df6713a1: Log request times to StatsD (duration: 00m 13s) [21:14:36] 6operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 5Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1286056 (10ori) 5Open>3Resolved a:3ori [21:14:37] Logged the message, Master [21:16:36] 6operations, 10Wikimedia-Mailing-lists: Rename all lists will -l suffixes - https://phabricator.wikimedia.org/T99138#1286062 (10JohnLewis) 3NEW [21:19:52] 6operations, 10Wikimedia-DNS: Redirect for Wikimedia v NSA - https://phabricator.wikimedia.org/T97341#1286092 (10Dzahn) >>! In T97341#1285924, @Heather wrote: > legal.wikimedia.org/stopsurveillance > was a suggestion, though personally I find surveillance a little hard to spell :P Agree, not very easy to spe... [21:20:39] 6operations, 10ops-codfw: degraded RAID / disk fail on es2010 - https://phabricator.wikimedia.org/T98982#1286094 (10Papaul) @Dzahn you welcome [21:20:50] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 40.00% of data above the critical threshold [500.0] [21:21:34] 6operations, 10Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1286097 (10Legoktm) What's the actual point to this? Breaking everyone's mail filters? [21:21:43] is Wikipedia down? 
i'm getting a 503 [21:21:52] 6operations: Google Webmaster Tools - 1000 domain limit - https://phabricator.wikimedia.org/T99132#1286099 (10Krenair) [21:22:14] @apergos: ^ [21:22:31] tomorrow [21:22:42] works for me [21:22:44] apergos: he meant the 503 not gwt [21:22:47] I guess [21:22:49] oh [21:22:50] yeah, wfm too [21:22:57] well can I pass that to someone? [21:23:05] it is midnight-o-clock here [21:23:05] where are you Tom_jsdzxzd? can you traceroute it? [21:23:06] probably! go to sleep, apergos [21:23:21] oh huh, it's not the whole site, just one page that i was trying to load [21:23:26] in Washington DC [21:23:33] Tom_jsdzxzd: which page? [21:23:49] thanks for picking it up yuvi. [21:23:54] * apergos clocks out [21:24:10] any search for a non-existent article seems to be doing it [21:24:10] https://en.wikipedia.org/w/index.php?search=foo+asdasdfasgd&title=Special%3ASearch&go=Go [21:24:19] yeah [21:24:26] known bug? [21:24:26] manybubbles: ^ [21:24:32] search 503s? [21:24:39] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 0 below the confidence bounds [21:24:39] An error has occurred while searching: Search is currently too busy. Please try again later. 
[21:24:53] I'm just getting the standard WMF 503 page [21:25:20] hhvm.log is being flooded [21:25:20] ori: yeah, I get that with wikimedia-debug on [21:25:24] and a 503 without [21:25:30] it's unusable [21:25:32] manybubbles: that's my fault [21:25:33] 2015-05-14 23:25:03 mw1235 dewiki fatal INFO: [fc1289ce] /w/api.php?action=query&format=json&generator=prefixsearch&redirects=true&gpssearch=ge&gpsnamespace=0&gpslimit=20&list=search&srsearch=ge&srnamespace=0&srwhat=text&srinfo=suggestion&srprop=&sroffset=0&srlimit=1&prop=pageterms%7Cpageimages&wbptterms=description&piprop=thumbnail&pithumbsize=96&pilimit=20&continue= ErrorException from line 264 of /srv/mediawiki/php-1.26wmf5 [21:25:33] /includes/exception/MWExceptionHandler.php: Fatal Error: Class undefined: CirrusSearch\RequestContext [21:26:27] I also see "pool-queuefull" warnings on fatalmonitor [21:27:14] missing \ ? [21:27:16] !log ori Synchronized php-1.26wmf5/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php: (no message) (duration: 00m 12s) [21:27:21] Logged the message, Master [21:27:44] !log ori Synchronized php-1.26wmf6/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php: (no message) (duration: 00m 12s) [21:27:49] Logged the message, Master [21:27:51] yeah, that seems to have fixed it [21:27:56] Tom_jsdzxzd: ^ can you try now? [21:28:23] yup, working now [21:28:31] (03CR) 10Merlijn van Deen: [C: 032] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [21:28:31] thanks y'all [21:28:43] valhallasw: \o/ finally :) [21:28:49] manybubbles: sorry [21:28:49] yuvipanda: 'this has no external consumers' *yet :-p [21:28:52] hhvm.log looks sane now [21:29:11] valhallasw: I would like tools-manifest to become a consumer of this at some point instead of shelling out... 
[21:29:30] well, there's still the regular stream of "Recursion detected in RequestContext::getLanguage" or cirrussearch parse errors [21:29:39] but it's back to normal [21:29:54] yuvipanda: eh, why is there debian stuff in the lighttpd changeset? [21:29:59] wat [21:30:03] * yuvipanda checks [21:30:04] 6operations, 10Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1286122 (10JohnLewis) The point is to finally standardise all mailing lists, a project that has been open for a few years now. [21:30:34] valhallasw: where are you seeing that? [21:30:41] valhallasw: https://gerrit.wikimedia.org/r/#/c/210833/18 looks clean tome [21:31:12] yuvipanda: wtf. set reference version to patchset 12 [21:31:16] and suddenly debian [21:31:26] oh, maybe the order of the patches changed? [21:31:59] valhallasw: oh yes, they did at some point [21:32:29] I realized I'll do far less rebasing if it's in this order :P [21:32:40] (03CR) 10Merlijn van Deen: [C: 032] Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 (https://phabricator.wikimedia.org/T98818) (owner: 10Yuvipanda) [21:33:44] 6operations, 10Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1286141 (10Legoktm) That's a nice goal, but what's the cost? Are links going to break? Will people have to change their mail filters? Are gmane and external mirrors going to have to be updated? All... [21:33:54] (03CR) 10Merlijn van Deen: [C: 032] "why not python 3? 
:(" [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210850 (owner: 10Yuvipanda) [21:34:00] (03CR) 10jenkins-bot: [V: 04-1] Add simple debian packaging [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210850 (owner: 10Yuvipanda) [21:34:02] (03CR) 10jenkins-bot: [V: 04-1] Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 (https://phabricator.wikimedia.org/T98818) (owner: 10Yuvipanda) [21:34:08] valhallasw: because needs to run on precise [21:34:12] and I don't want to do python 3.2 [21:34:44] err [21:34:51] what was added to python 3.3 that you're using? :P [21:34:58] it's not like you need unicode literals [21:35:08] valhallasw: well, mostly I don't want to deal with knowing the difference :P [21:35:13] !log es-tool restart-fast on elastic1016 [21:35:20] Logged the message, Master [21:35:28] legoktm: uh, jenkins-bot not hitting submit button [21:35:51] valhallasw: basically, 'yuvi panda is lazy and does not want to deal with 3 python versions in his head (2.7, 3.2 and 3.4)' [21:36:15] yuvipanda: if you package it for 3 it's only two versions, and effectively only one (3.2) :P [21:36:26] the future, man! [21:36:35] valhallasw: yeah, but my poor head! [21:37:08] valhallasw: also will need to package any py3 dependencies we might need on precise [21:37:19] trusty has a lot more of those by default [21:37:30] it's a glorified shell script :P [21:37:41] valhallasw: the solution is of course to move to trusty :D I should probably keep it 2/3 compatible tho [21:38:17] yuvipanda: you need to give it submit privledges on the repo [21:38:23] legoktm: oh, hmm [21:38:58] legoktm: done. how do I trigger it again? [21:39:02] !log I'm going to be done doing rolling restarts for a couple of hours. If someone wants to pick them up and do the next one after the cluster goes green again then be my guest. 
[21:39:02] recheck doesn't work for +2s right [21:39:08] Logged the message, Master [21:39:08] yuvipanda: remove +2, re +2 [21:39:13] valhallasw: ^ [21:40:55] 6operations, 10Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1286173 (10JohnLewis) On a technical side, lists and emails will not break as aliases and redirects will be put in place to keep them as they are so user behaviour is not a dependent factor. [21:41:04] * yuvipanda goes afk for a bit to shower and head to office [21:41:22] (03CR) 10Merlijn van Deen: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [21:41:23] (03CR) 10Merlijn van Deen: [C: 032] Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [21:41:41] (03Merged) 10jenkins-bot: Initial commit [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210196 (https://phabricator.wikimedia.org/T98440) (owner: 10Yuvipanda) [21:41:48] (03Merged) 10jenkins-bot: Add simple debian packaging [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210850 (owner: 10Yuvipanda) [21:41:50] (03Merged) 10jenkins-bot: Add lighttpd server types [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/210833 (https://phabricator.wikimedia.org/T98818) (owner: 10Yuvipanda) [21:43:10] (03PS1) 10John F. Lewis: mailman: exim alias and redirect to wikidata-l [puppet] - 10https://gerrit.wikimedia.org/r/211047 [21:43:38] (03CR) 10John F. Lewis: [C: 04-1] "Tuesday window deploy." [puppet] - 10https://gerrit.wikimedia.org/r/211047 (owner: 10John F. 
Lewis) [21:43:38] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [21:48:41] 6operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 5Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1286203 (10Mlaffs) And, for what it's worth, still template edits from back as far as April 19th that haven't filtered through. I don't know w... [21:50:18] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0] [21:59:55] 6operations, 10MediaWiki-JobQueue, 10MediaWiki-JobRunner, 5Patch-For-Review: enwiki's job is about 28m atm and increasing - https://phabricator.wikimedia.org/T98621#1286219 (10Technical13) @Mlaffs You would have assumed very wrong... From my understanding, the job queue is not a linear, easy to follow thi... [22:00:10] PROBLEM - check_puppetrun on tellurium is CRITICAL puppet fail [22:00:10] PROBLEM - check_puppetrun on indium is CRITICAL puppet fail [22:00:19] PROBLEM - check_puppetrun on silicon is CRITICAL puppet fail [22:05:08] PROBLEM - check_puppetrun on tellurium is CRITICAL puppet fail [22:05:09] PROBLEM - check_puppetrun on indium is CRITICAL puppet fail [22:05:18] RECOVERY - check_puppetrun on silicon is OK Puppet is currently enabled, last run 282 seconds ago with 0 failures [22:07:09] (03PS1) 10Ori.livneh: 1:100 request profiling via xhprof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211051 [22:08:47] (03CR) 10GWicke: "Ping!" 
[puppet] - 10https://gerrit.wikimedia.org/r/207341 (https://phabricator.wikimedia.org/T97500) (owner: 10GWicke) [22:10:09] RECOVERY - check_puppetrun on tellurium is OK Puppet is currently enabled, last run 199 seconds ago with 0 failures [22:10:09] RECOVERY - check_puppetrun on indium is OK Puppet is currently enabled, last run 239 seconds ago with 0 failures [22:10:47] (03CR) 10Legoktm: [C: 04-1] 1:100 request profiling via xhprof (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211051 (owner: 10Ori.livneh) [22:12:28] (03PS7) 10GWicke: Enable group1 wikis in RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) [22:12:43] (03PS2) 10Ori.livneh: 1:100 request profiling via xhprof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211051 [22:13:00] 6operations: test email creation task for rob - https://phabricator.wikimedia.org/T99146#1286261 (10RobH) 3NEW [22:13:22] mutante: ^ https://phabricator.wikimedia.org/T99146#1286261 [22:13:27] testwiki is dead (503s everywhere)... [22:13:27] it seems it associates the first project tag [22:13:30] but not all project tags [22:14:12] robh: ah ! or because it's a people project? [22:14:27] testing now [22:15:13] tto, works for me? [22:15:31] Krenair: Yeah, just came up again, it seems [22:15:45] 6operations: test email creation task for rob - reverse project order - https://phabricator.wikimedia.org/T99147#1286272 (10RobH) 3NEW [22:15:49] (03PS3) 10Ori.livneh: 1:1000 request profiling via xhprof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211051 [22:15:50] May 14 22:15:29 mw1017: #012Fatal error: Class undefined: CirrusSearch\RequestContext in /srv/mediawiki/php-1.26wmf6/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php on line 159 [22:15:54] I thought ori fixed this? 
[22:16:30] 6operations: test email creation task for rob - reverse project order - https://phabricator.wikimedia.org/T99147#1286279 (10RobH) 5Open>3Invalid a:3RobH [22:16:38] 6operations: test email creation task for rob - https://phabricator.wikimedia.org/T99146#1286281 (10RobH) 5Open>3Invalid a:3RobH [22:16:45] Krenair: yes. odd. [22:17:38] [tin:~] $ awk 'NR==159' /srv/mediawiki/php-1.26wmf6/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php [22:17:38] \RequestContext::getMain()->getStats()->timing( 'CirrusSearch.requestTime', $took ); [22:18:03] but not on mw1017, oddly [22:20:09] !log ori Synchronized php-1.26wmf6/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php: (no message) (duration: 00m 15s) [22:20:15] Logged the message, Master [22:20:26] (03CR) 10Ori.livneh: [C: 032] 1:1000 request profiling via xhprof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211051 (owner: 10Ori.livneh) [22:22:49] (03CR) 10Ori.livneh: [V: 032] 1:1000 request profiling via xhprof [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211051 (owner: 10Ori.livneh) [22:23:39] 6operations, 10Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1286308 (10Ariconte) Whoa! Everything that WMF ops does not control will break for sure..... People's filters, mailing list archives, sieve scripts, ...... I think this needs more planning and mor... [22:23:55] !log deployed RESTBase v0.6.3 (fd942ac38ad) [22:24:00] Logged the message, Master [22:25:17] 6operations: More adjustments to fundraising tech email groups - https://phabricator.wikimedia.org/T99137#1286309 (10Dzahn) p:5Triage>3Normal [22:25:24] 6operations: More adjustments to fundraising tech email groups - https://phabricator.wikimedia.org/T99137#1286311 (10Dzahn) a:3Dzahn [22:26:10] (03CR) 10GWicke: "This is now ready to go." 
[puppet] - 10https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) (owner: 10GWicke) [22:26:37] !log ori Synchronized wmf-config/StartProfiler.php: Icbf826a7: 1:1000 request profiling via xhprof (duration: 00m 12s) [22:26:43] Logged the message, Master [22:27:03] (03CR) 10Ori.livneh: [C: 031] Enable group1 wikis in RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/198433 (https://phabricator.wikimedia.org/T93452) (owner: 10GWicke) [22:27:18] 6operations: More adjustments to fundraising tech email groups - https://phabricator.wikimedia.org/T99137#1286316 (10Dzahn) @K4-713 done. Yes, you are right ,being in the engineers group automatically gets you fr-tech@. Here's the whole thing for completeness: ``` # fundraising aliases - RT 3580, 5595, 6348... [22:28:21] K4-713: can't ping you in phabricator using "@nickname" format :p [22:28:39] @K4-713 is not recognized somehow [22:28:40] mutante: Yeah, apparently I am a bug. :) [22:28:52] K4-713: well, the fr-tech alias thing is done [22:28:59] Thank you! [22:29:03] sure, yw [22:29:16] Pretty sure that's going to be all for a little while. :) [22:29:23] 6operations: More adjustments to fundraising tech email groups - https://phabricator.wikimedia.org/T99137#1286318 (10Dzahn) 5Open>3Resolved [22:31:16] 7Blocked-on-Operations, 10RESTBase, 5Patch-For-Review: Deploy RESTBase to group1 wikis - https://phabricator.wikimedia.org/T93452#1286320 (10GWicke) [22:31:28] 6operations, 10Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1286321 (10JohnLewis) Archives will not be broken. People's filters will also not necessarily be broken as people will more than likely continue to us the wikidata-l address when sending emails. The... 
[22:33:39] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0] [22:36:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [22:40:08] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [22:40:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 22.22% of data above the critical threshold [20000.0] [22:40:18] PROBLEM - Varnishkafka Delivery Errors per minute on cp4010 is CRITICAL 22.22% of data above the critical threshold [20000.0] [22:42:09] PROBLEM - Varnishkafka Delivery Errors per minute on cp4013 is CRITICAL 11.11% of data above the critical threshold [20000.0] [22:42:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0] [22:43:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4015 is CRITICAL 11.11% of data above the critical threshold [20000.0] [22:43:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [22:44:07] hmm i wondered if those varnishkafka alerts might be a canary for problems with the zayo link, but smokeping looks ok [22:44:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [22:45:09] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [22:45:32] 6operations, 3Roadmap, 10Wikimedia-Mailing-lists, 7notice, 7user-notice: Mailing list maintenance window - 2015-05-19 17:00 UTC to 19:00 UTC - https://phabricator.wikimedia.org/T99098#1286356 (10gpaumier) [22:46:48] RECOVERY - Varnishkafka Delivery Errors per minute on cp4010 is OK Less than 1.00% above the threshold [0.0] [22:46:59] RECOVERY - Varnishkafka Delivery Errors per minute on cp4013 is OK Less than 1.00% 
above the threshold [0.0] [22:47:18] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0] [22:47:18] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [22:48:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [22:49:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [22:53:08] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 22.22% of data above the critical threshold [20000.0] [22:53:13] jgage: they generally are, yes [22:53:29] but we depooled traffic out of ulsfo earlier today anyways because of various "network sucks" problems anyways [22:53:59] yeah, i just noticed that [22:56:24] interesting how much traffic still flows in & out when it's depooled [22:56:28] RECOVERY - Varnishkafka Delivery Errors per minute on cp4015 is OK Less than 1.00% above the threshold [0.0] [22:59:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [23:00:04] RoanKattouw, ^d, rmoen, Dereckson: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150514T2300). Please do the needful. [23:00:20] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [23:00:38] Hi. [23:02:07] (03CR) 10Chad: [C: 032] Add *.bl.uk to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210922 (https://phabricator.wikimedia.org/T98734) (owner: 10Dereckson) [23:02:09] <^d> let's go! [23:02:14] (03Merged) 10jenkins-bot: Add *.bl.uk to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210922 (https://phabricator.wikimedia.org/T98734) (owner: 10Dereckson) [23:02:24] Whee. 
[23:03:02] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 17s) [23:03:06] <^d> Dereckson: First one in place now ^ [23:03:08] Logged the message, Master [23:03:14] (03CR) 10Chad: [C: 032] Logo configuration on ur.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210934 (https://phabricator.wikimedia.org/T97510) (owner: 10Dereckson) [23:03:20] (03Merged) 10jenkins-bot: Logo configuration on ur.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210934 (https://phabricator.wikimedia.org/T97510) (owner: 10Dereckson) [23:03:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:04:58] !log demon Synchronized w/static/images/project-logos/urwikiquote.png: (no message) (duration: 00m 14s) [23:05:03] Logged the message, Master [23:05:10] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 11s) [23:05:15] Logged the message, Master [23:05:35] <^d> Dereckson: And the urwikiquote one ^ [23:06:30] grmbl of cache, still the former logo [23:06:49] PROBLEM - HHVM rendering on mw1169 is CRITICAL - Socket timeout after 10 seconds [23:06:49] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [23:06:50] 210922 works. [23:07:08] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:07:39] PROBLEM - Apache HTTP on mw1169 is CRITICAL - Socket timeout after 10 seconds [23:07:52] er... [23:08:06] I think we should revert 210934. [23:08:19] They created a white-background logo and not a transparent one. [23:08:35] And with the ori change, they can't reupload a new Wiki.png file to fix that. 
[23:08:36] 6operations: Implement CWDM between knams and cnams - https://phabricator.wikimedia.org/T98971#1286430 (10Krenair) [23:09:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:09:32] Does someone have an opinion if I should prepare or not a revert commit for https://ur.wikiquote.org/wiki/%D8%B5%D9%81%D8%AD%DB%81_%D8%A7%D9%88%D9%84 logo issue? [23:11:56] what is the logo issue? [23:11:58] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [23:12:01] is there a task for this? [23:12:09] (03PS1) 10Dereckson: Revert I01e82115d8a0a57b7d0cb5a235fda6bd75240345 - ur.wikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211057 [23:12:48] PROBLEM - Varnishkafka Delivery Errors per minute on cp4014 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:12:55] ori: a task to set this logo as the default ur.wikiquote one at T97510 [23:14:01] ori: the task were created before your change, so it were an acceptable Wiki.png request, but now, the community can't fix errors like this one. [23:14:18] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [23:15:12] ^d: I support change 210934 revert, which could be enacted by change 211057, and then a new change with the transparent logo. That will avoid to break interface and visual identity. [23:15:56] ori: [23:15:59] RECOVERY - Varnishkafka Delivery Errors per minute on cp4014 is OK Less than 1.00% above the threshold [0.0] [23:16:13] So we need to revert the config change request, ask the community the new file, do a new commit. 
[23:16:24] yeah, makes sense [23:16:26] fine by me [23:16:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:17:08] PROBLEM - HHVM busy threads on mw1169 is CRITICAL 44.44% of data above the critical threshold [115.2] [23:18:09] wooo [23:19:31] (03PS1) 10Andrew Bogott: Added a simple IP-aliasing script for the pdns recursor. [puppet] - 10https://gerrit.wikimedia.org/r/211059 [23:20:18] (03CR) 10jenkins-bot: [V: 04-1] Added a simple IP-aliasing script for the pdns recursor. [puppet] - 10https://gerrit.wikimedia.org/r/211059 (owner: 10Andrew Bogott) [23:20:22] !log Depooled mw1169; HHVM deadlock à la T89912. Leaving it depooled to investigate. [23:20:29] Logged the message, Master [23:21:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [23:22:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:22:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:23:20] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:23:39] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:24:12] (03PS3) 10Andrew Bogott: Added a simple IP-aliasing script for the pdns recursor. [puppet] - 10https://gerrit.wikimedia.org/r/211059 [23:24:14] (03PS1) 10Andrew Bogott: Ensure => present rather than 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/211060 [23:25:00] (03CR) 10jenkins-bot: [V: 04-1] Added a simple IP-aliasing script for the pdns recursor. 
[puppet] - 10https://gerrit.wikimedia.org/r/211059 (owner: 10Andrew Bogott) [23:25:04] (03CR) 10jenkins-bot: [V: 04-1] Ensure => present rather than 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/211060 (owner: 10Andrew Bogott) [23:25:49] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [23:26:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [23:26:58] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0] [23:27:11] (03PS2) 10Andrew Bogott: Ensure => present rather than 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/211060 [23:27:13] (03PS4) 10Andrew Bogott: Added a simple IP-aliasing script for the pdns recursor. [puppet] - 10https://gerrit.wikimedia.org/r/211059 [23:27:19] ^d: could you merge https://gerrit.wikimedia.org/r/#/c/211057/ to revert the logo back to the previous one? [23:27:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [23:27:53] (03CR) 10jenkins-bot: [V: 04-1] Ensure => present rather than 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/211060 (owner: 10Andrew Bogott) [23:27:56] (03CR) 10jenkins-bot: [V: 04-1] Added a simple IP-aliasing script for the pdns recursor. [puppet] - 10https://gerrit.wikimedia.org/r/211059 (owner: 10Andrew Bogott) [23:28:05] 6operations, 10Wikimedia-Mailing-lists: Rename Wikidata-l to Wikidata - https://phabricator.wikimedia.org/T99136#1286503 (10Legoktm) >>! In T99136#1286321, @JohnLewis wrote: > Archives will not be broken. People's filters will also not necessarily be broken as people will more than likely continue to us the wi... 
[23:28:39] PROBLEM - HHVM queue size on mw1169 is CRITICAL 44.44% of data above the critical threshold [80.0] [23:32:21] (03PS3) 10Andrew Bogott: Ensure => present rather than 'latest' [puppet] - 10https://gerrit.wikimedia.org/r/211060 [23:32:23] (03PS5) 10Andrew Bogott: Added a simple IP-aliasing script for the pdns recursor. [puppet] - 10https://gerrit.wikimedia.org/r/211059 [23:33:49] (03PS3) 10BBlack: Revert "Add dummy "/preconnect" URL endpoint on restbase varnishes" [puppet] - 10https://gerrit.wikimedia.org/r/207341 (https://phabricator.wikimedia.org/T97500) (owner: 10GWicke) [23:34:01] (03CR) 10Chad: [C: 032] Revert I01e82115d8a0a57b7d0cb5a235fda6bd75240345 - ur.wikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211057 (owner: 10Dereckson) [23:34:07] (03Merged) 10jenkins-bot: Revert I01e82115d8a0a57b7d0cb5a235fda6bd75240345 - ur.wikiquote logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/211057 (owner: 10Dereckson) [23:34:39] (03CR) 10BBlack: [C: 032] Revert "Add dummy "/preconnect" URL endpoint on restbase varnishes" [puppet] - 10https://gerrit.wikimedia.org/r/207341 (https://phabricator.wikimedia.org/T97500) (owner: 10GWicke) [23:34:46] (03PS2) 10Aaron Schulz: Removed "refreshLinks" from $wgJobBackoffThrottling [mediawiki-config] - 10https://gerrit.wikimedia.org/r/210857 [23:35:17] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 11s) [23:35:18] <^d> Dereckson: ^^^ [23:35:21] Thanks. [23:35:57] <^d> yw [23:36:34] And 211057 works, interface is correct again. 
[23:37:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4001 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:41:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp4001 is OK Less than 1.00% above the threshold [0.0] [23:44:45] * bordercat gives ^d https://gerrit.wikimedia.org/r/#/c/210857/ [23:45:02] * ^d ignores [23:45:58] PROBLEM - Varnishkafka Delivery Errors per minute on cp4004 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:47:40] PROBLEM - Varnishkafka Delivery Errors per minute on cp4003 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:48:38] PROBLEM - Varnishkafka Delivery Errors per minute on cp4017 is CRITICAL 11.11% of data above the critical threshold [20000.0] [23:50:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp4004 is OK Less than 1.00% above the threshold [0.0] [23:51:39] PROBLEM - Varnishkafka Delivery Errors per minute on cp4002 is CRITICAL 22.22% of data above the critical threshold [20000.0] [23:51:48] RECOVERY - Varnishkafka Delivery Errors per minute on cp4017 is OK Less than 1.00% above the threshold [0.0] [23:52:30] RECOVERY - Varnishkafka Delivery Errors per minute on cp4003 is OK Less than 1.00% above the threshold [0.0] [23:52:44] 6operations, 10ops-esams: Implement CWDM between knams and enams - https://phabricator.wikimedia.org/T98971#1286735 (10faidon) p:5Triage>3Normal [23:53:11] 6operations, 10ops-esams: Implement CWDM between knams and esams - https://phabricator.wikimedia.org/T98971#1282372 (10faidon) [23:55:29] RECOVERY - Persistent high iowait on labstore1001 is OK Less than 50.00% above the threshold [25.0] [23:56:29] RECOVERY - Varnishkafka Delivery Errors per minute on cp4002 is OK Less than 1.00% above the threshold [0.0] [23:56:40] (03PS1) 10Andrew Bogott: Make the DNS server for .wmflabs configurable; point it to the new designate/pdns/mysql server for the recursor class. 
[puppet] - 10https://gerrit.wikimedia.org/r/211063 [23:58:37] (03PS2) 10Andrew Bogott: Make the DNS server for .wmflabs configurable [puppet] - 10https://gerrit.wikimedia.org/r/211063