[00:04:07] operations, hardware-requests, Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1469455 (tstarling) We could buy new servers, immediately configure MW to write to the new cluster, then recompress the old cluster, and decommission it when recompression is done (say 3...
[00:18:18] operations, Release-Engineering, Database: Audit all existing code to ensure that any extension currently or previously adding blobs to ES has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#1469515 (Legoktm) I see usage of the `ExternalSt...
[00:18:42] (PS1) MaxSem: Enable geo features tracking everywhere [mediawiki-config] - https://gerrit.wikimedia.org/r/226253
[00:21:32] operations, Release-Engineering, Database: Audit all existing code to ensure that any extension currently or previously adding blobs to ES has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#1469519 (Jdforrester-WMF)
[00:21:34] operations, Release-Engineering, Database: Re-compress External Storage in production using trackBlobs.php and recompressTracked.php - https://phabricator.wikimedia.org/T106387#1469518 (Jdforrester-WMF)
[00:30:17] PROBLEM - Hadoop NodeManager on analytics1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[00:32:17] RECOVERY - Hadoop NodeManager on analytics1039 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[00:34:22] operations: Update wikimedia apt repo to include debs for shiny-server - https://phabricator.wikimedia.org/T106435#1469550 (EBernhardson)
[00:51:23] (CR) Thcipriani: [C: 1] "Seems like there shouldn't be any production impact (outside of accepting a new possible X-header)" [puppet] - https://gerrit.wikimedia.org/r/158016 (https://phabricator.wikimedia.org/T72181) (owner: Dduvall)
[00:56:42] operations, Deployment-Systems, Patch-For-Review: Trebuchet doesn't like when a deployer server is also a minion, a edge case for scap - https://phabricator.wikimedia.org/T67549#1469568 (thcipriani) @fgiunchedi works as expected on deployment-prep deploying the test repo. LGTM.
[01:17:15] ori: I want to add sampling to StatsD in MediaWiki (https://phabricator.wikimedia.org/T106457) and the current version of liuggio/statsd-php-client only accepts sampling as a parameter to send()
[01:17:32] which would you hate less?
[01:18:23] update the library from 1.0.12 to 1.0.16, have a separate BufferingStatsdDataFactory for each sample rate, or have BufferingStatsdDataFactory juggling sample rates internally and return one batch per sample rate?
[01:20:36] actually updating to 1.0.13 would be enough
[02:03:19] !log LocalisationUpdate failed (1.26wmf14) at 2015-07-22 02:03:18+00:00
[02:03:19] !log LocalisationUpdate failed (1.26wmf15) at 2015-07-22 02:03:19+00:00
[02:03:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:03:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:07:33] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 22 02:07:33 UTC 2015 (duration 7m 32s)
[02:07:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:53] !log l10nupdate Synchronized php-1.26wmf14/cache/l10n: (no message) (duration: 07m 01s)
[02:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:37:45] !log LocalisationUpdate completed (1.26wmf14) at 2015-07-22 02:37:45+00:00
[02:37:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:51:02] (PS1) Springle: depool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/226257
[02:51:34] (CR) Springle: [C: 2] depool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/226257 (owner: Springle)
[02:51:40] (Merged) jenkins-bot: depool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/226257 (owner: Springle)
[02:52:40] !log springle Synchronized wmf-config/db-eqiad.php: depool db1071 (duration: 00m 11s)
[02:52:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:04:09] !log l10nupdate Synchronized php-1.26wmf15/cache/l10n: (no message) (duration: 10m 33s)
[03:04:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:10:24] !log LocalisationUpdate completed (1.26wmf15) at 2015-07-22 03:10:23+00:00
[03:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:56:30] (PS2) MZMcBride: Don't match Phabricator task IDs inside URLs [puppet] - https://gerrit.wikimedia.org/r/226234 (https://phabricator.wikimedia.org/T75997) (owner: Ricordisamoa)
[04:12:17] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 11.11% of data above the critical threshold [100000000.0]
[04:14:05] !log upgrade db1071 trusty
[04:14:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:23:45] (PS1) Springle: repool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/226263
[04:24:05] (CR) Springle: [C: 2] repool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/226263 (owner: Springle)
[04:24:12] (Merged) jenkins-bot: repool db1071 [mediawiki-config] - https://gerrit.wikimedia.org/r/226263 (owner: Springle)
[04:25:08] !log springle Synchronized wmf-config/db-eqiad.php: repool db1071, warm up (duration: 00m 12s)
[04:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:28:02] !log ori Synchronized php-1.26wmf14/extensions/Scribunto/common/Base.php: Live-hack I53dd1ecb to test impact (duration: 00m 13s)
[04:28:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:35:30] !log deployed small restbase hotfix d96210f2
[04:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:43:41] !log ori Synchronized php-1.26wmf14/extensions/Scribunto/common/Base.php: Revert: Live-hack I53dd1ecb to test impact (duration: 00m 12s)
[04:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:45:58] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0]
[04:50:27] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 14.29% of data above the critical threshold [500.0]
[04:50:37] PROBLEM - puppet last run on cp3010 is CRITICAL puppet fail
[04:50:39] operations, Wikimedia-DNS, Wikimedia-Language-setup: nan and minnan subdomain redirects are a mess - https://phabricator.wikimedia.org/T86915#1469825 (Glaisher) >>! In T86915#1468231, @Purodha wrote: > No. > We should not introduce, nor perpetuate wrong language codes. > A redirect must go in th oppos...
[05:01:28] PROBLEM - puppet last run on db2055 is CRITICAL puppet fail
[05:14:26] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures
[05:22:11] !log ori Synchronized php-1.26wmf15/extensions/Scribunto/common/Base.php: Cherry-pick I53dd1ecb (duration: 00m 13s)
[05:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:22:49] !log ori Synchronized php-1.26wmf14/extensions/Scribunto/common/Base.php: Cherry-pick I53dd1ecb (duration: 00m 13s)
[05:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[05:27:06] RECOVERY - puppet last run on db2055 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[05:35:21] operations, Wikimedia-DNS, Wikimedia-Language-setup: nan and minnan subdomain redirects are a mess - https://phabricator.wikimedia.org/T86915#1469843 (Purodha) ! In T86915#1469825, @Glaisher wrote: >Once T30442 is resolved, we'll definitely add redirects from zh-min-nan to nan as otherwise, hundreds o...
[05:44:31] (CR) MZMcBride: "https://git.wikimedia.org/commit/operations%2Fmediawiki-config.git/4da178971317d3551a661b0bd176197f75518496 is the last relevant reference" [mediawiki-config] - https://gerrit.wikimedia.org/r/222289 (owner: John F. Lewis)
[05:45:33] (CR) Muehlenhoff: "If there's no intended use case with two ES processes running, I think we should configure ES to use a fixed port instead of a dynamic ra" [puppet] - https://gerrit.wikimedia.org/r/224095 (https://phabricator.wikimedia.org/T104962) (owner: Muehlenhoff)
[05:48:35] (CR) MZMcBride: "Before dbtree got moved into operations/software/dbtree.git, it lived in operations/software.git. https://git.wikimedia.org/blobdiff/opera" [mediawiki-config] - https://gerrit.wikimedia.org/r/222289 (owner: John F. Lewis)
[05:51:47] PROBLEM - puppet last run on ruthenium is CRITICAL Puppet has 6 failures
[05:55:47] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[06:17:37] RECOVERY - puppet last run on ruthenium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:18:36] operations, Wikimedia-Language-setup, Wikimedia-Site-requests: Rename zh-min-nan -> nan - https://phabricator.wikimedia.org/T30442#1469861 (Glaisher) See {T86915} where I've proposed to remove minnan and zh-cfr domains.
[06:28:43] (PS1) Muehlenhoff: Add ferm rules for dbstore systems [puppet] - https://gerrit.wikimedia.org/r/226267 (https://phabricator.wikimedia.org/T104699)
[06:30:47] PROBLEM - puppet last run on mw1251 is CRITICAL puppet fail
[06:31:27] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures
[06:31:27] PROBLEM - puppet last run on holmium is CRITICAL Puppet has 1 failures
[06:31:37] PROBLEM - puppet last run on mw1170 is CRITICAL Puppet has 2 failures
[06:31:37] PROBLEM - puppet last run on mw2021 is CRITICAL Puppet has 1 failures
[06:31:59] PROBLEM - puppet last run on lvs1003 is CRITICAL Puppet has 1 failures
[06:32:17] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures
[06:32:17] PROBLEM - puppet last run on mw2018 is CRITICAL Puppet has 1 failures
[06:32:46] PROBLEM - puppet last run on db2055 is CRITICAL Puppet has 1 failures
[06:33:07] PROBLEM - puppet last run on wtp2017 is CRITICAL Puppet has 1 failures
[06:33:07] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 1 failures
[06:33:36] PROBLEM - puppet last run on mw1119 is CRITICAL Puppet has 1 failures
[06:33:38] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:41:07] PROBLEM - Hadoop NodeManager on analytics1029 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[06:48:58] RECOVERY - Hadoop NodeManager on analytics1029 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[06:56:06] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures
[06:56:47] RECOVERY - puppet last run on wtp2017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:48] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 1 second ago with 0 failures
[06:57:07] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:07] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:16] RECOVERY - puppet last run on mw1119 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:17] RECOVERY - puppet last run on mw1170 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:26] RECOVERY - puppet last run on mw2021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:27] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:57:46] RECOVERY - puppet last run on lvs1003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:06] RECOVERY - puppet last run on mw2018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:27] RECOVERY - puppet last run on db2055 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:36] RECOVERY - puppet last run on mw1251 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:03:17] operations: Reduce rpcbind use - https://phabricator.wikimedia.org/T106477#1469915 (MoritzMuehlenhoff) NEW
[07:22:37] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 22 07:22:36 UTC 2015 (duration 22m 35s)
[07:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:36:34] (PS5) Giuseppe Lavagetto: ganglia: standardize has_ganglia [puppet] - https://gerrit.wikimedia.org/r/225880
[07:50:18] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 5 below the confidence bounds
[07:54:47] (CR) Giuseppe Lavagetto: [C: 2] ganglia: standardize has_ganglia [puppet] - https://gerrit.wikimedia.org/r/225880 (owner: Giuseppe Lavagetto)
[08:03:41] <_joe_> !log repooling mw1158-60
[08:03:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:18:17] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[08:18:47] Who broke teh puppets?
[08:18:51] operations, Beta-Cluster, Labs, Monitoring: Setup (simple) catchpoint monitoring and metrics for enwiki betacluster just like production - https://phabricator.wikimedia.org/T97865#1469958 (hashar) Open>declined a:hashar From a reply I made to ops-l: > I thought Catchpoint to be super cheap...
[08:18:52] (at least in beta...)
[08:19:06] * ostriches looks suspiciously at _joe_
[08:20:01] <_joe_> ostriches: what broke puppet?
[08:20:08] Error: Failed to apply catalog: Could not find dependent Service[ganglia-monitor] for File[/etc/ganglia/conf.d/redis.pyconf] at /etc/puppet/modules/redis/manifests/ganglia.pp:12
[08:20:23] (on deployment-* nodes in beta)
[08:20:39] <_joe_> uhm I'm pretty sure there is a "include ganglia" there, right?
[08:20:42] <_joe_> lemme see
[08:21:07] <_joe_> ostriches: on all nodes?
[08:21:08] Shouldn't has_ganglia be false everywhere in labs anyway?
[08:21:22] <_joe_> yes, but classes like redis monitoring assume you have it
[08:21:30] deployment-elastic01, deployment-redis01
[08:22:34] I wonder if moving it to standard:: is what broke it for us
[08:22:40] (PS1) Giuseppe Lavagetto: redis::monitoring::ganglia: include ganglia [puppet] - https://gerrit.wikimedia.org/r/226271
[08:22:40] <_joe_> nope
[08:22:47] <_joe_> this ^^ is the solution
[08:23:38] <_joe_> you see that same error in deployment-elastic01?
[08:23:52] Yeah, but with elasticsearch.pyconf
[08:24:01] <_joe_> oh I see
[08:24:21] <_joe_> other places?
[08:24:49] Looks like just redis and elastic that's yelling.
[08:24:57] <_joe_> the problem there is that those classes assume you're using ganglia somewhere else and don't make explicit inclusions
[08:29:04] <_joe_> ostriches: I'll fix both of them
[08:29:35] Thx
[08:30:49] <_joe_> ostriches: ouch, my fault man
[08:31:30] <_joe_> lol
[08:32:24] (PS2) Muehlenhoff: ferm rules for elasticsearch [puppet] - https://gerrit.wikimedia.org/r/224095 (https://phabricator.wikimedia.org/T104962)
[08:37:14] (CR) Chad: "+1 to binding to 9200/9300 only and not a range. We don't run multiple ES processes on a single node (except by accident really) and locki" [puppet] - https://gerrit.wikimedia.org/r/224095 (https://phabricator.wikimedia.org/T104962) (owner: Muehlenhoff)
[08:37:52] (PS2) Giuseppe Lavagetto: ganglia: correctly use variable, not bareword [puppet] - https://gerrit.wikimedia.org/r/226271
[08:39:09] _joe_: sigil mistakes suck :p
[08:39:33] (PS2) Muehlenhoff: Enable firejail for graphoid [puppet] - https://gerrit.wikimedia.org/r/219801 (https://phabricator.wikimedia.org/T103095)
[08:39:37] <_joe_> ostriches: perl -pe fu gone wrong
[08:40:09] <_joe_> also, I should make that better
[08:41:44] (PS3) Giuseppe Lavagetto: ganglia: correctly use variable, not bareword [puppet] - https://gerrit.wikimedia.org/r/226271
[08:41:59] (CR) Muehlenhoff: [C: 2 V: 2] Enable firejail for graphoid [puppet] - https://gerrit.wikimedia.org/r/219801 (https://phabricator.wikimedia.org/T103095) (owner: Muehlenhoff)
[08:42:18] (PS4) Giuseppe Lavagetto: ganglia: correctly use variable, not bareword [puppet] - https://gerrit.wikimedia.org/r/226271
[08:42:33] (CR) Giuseppe Lavagetto: [C: 2 V: 2] ganglia: correctly use variable, not bareword [puppet] - https://gerrit.wikimedia.org/r/226271 (owner: Giuseppe Lavagetto)
[08:43:02] _joe_: Already cherry picked into beta, works again
[08:43:19] operations, Labs: bond0 connection on labstore1001 is unpuppetized - https://phabricator.wikimedia.org/T92622#1469979 (yuvipanda) No bonds on that anymore, afaik - the system was re-installed. Should we mark this as invalid? The bond isn't really needed now either, is it?
[08:43:37] operations, ops-eqiad, Incident-20150401-LabsNFS-Overload: Inspect and diagnose labstore1001's H800 controler - https://phabricator.wikimedia.org/T95293#1469982 (yuvipanda) @coren did this happen?
[08:47:58] operations, ops-eqiad, Incident-20150401-LabsNFS-Overload: Inspect and diagnose labstore1001's H800 controler - https://phabricator.wikimedia.org/T95293#1469999 (yuvipanda)
[08:49:32] operations, ops-eqiad, Labs, Labs-Infrastructure: labstore1002 issues while trying to reboot - https://phabricator.wikimedia.org/T98183#1470001 (yuvipanda) Any updates on this? It's currently the main labstore server - is it considered reliable now? Did we swap out any hardware?
[08:50:06] operations, ops-eqiad, Labs, Labs-Infrastructure, Labs-Sprint-102: Locate and assign some MD1200 shelves for proper testing of labstore1002 - https://phabricator.wikimedia.org/T101741#1470006 (yuvipanda) Did this happen? @Coren updates?
[09:00:18] (PS2) Giuseppe Lavagetto: ganglia: rename ganglia_new to ganglia [puppet] - https://gerrit.wikimedia.org/r/225881
[09:00:44] operations, ops-eqiad, Incident-20150401-LabsNFS-Overload: Verify visually that the labstore shelves' wiring is stable - https://phabricator.wikimedia.org/T94828#1470011 (yuvipanda) Was this done / should this be done, @Coren?
[09:01:45] operations, Labs: Investigate heavy NFS users and see if they can move IO to local storage - https://phabricator.wikimedia.org/T96065#1470013 (yuvipanda)
[09:02:27] operations, Labs, Labs-Sprint-101: Make Labs NFS alerts paging - https://phabricator.wikimedia.org/T101650#1470018 (yuvipanda) Can this be marked resolved?
[09:03:01] operations, RESTBase-Cassandra: setup an alertable threshold for Cassandra heap dumps - https://phabricator.wikimedia.org/T106346#1470020 (fgiunchedi) a:fgiunchedi >>! In T106346#1467624, @Eevans wrote: > Do you mean for this to be in addition to alerts, or as an alternative? > OOMs should be a very ex...
[09:04:44] operations, Services, Patch-For-Review: Service containment for nodejs-based services with firejail - https://phabricator.wikimedia.org/T101870#1470028 (MoritzMuehlenhoff)
[09:04:47] operations, Graphoid, Services, Patch-For-Review: Confine Graphoid with firejail - https://phabricator.wikimedia.org/T103095#1470026 (MoritzMuehlenhoff) Open>Resolved Firejail for Graphoid has been enabled in production on sca100[12].
[09:06:32] (PS1) Muehlenhoff: Remove firejail conditional [puppet] - https://gerrit.wikimedia.org/r/226273 (https://phabricator.wikimedia.org/T101870)
[09:06:47] operations, Labs: Recover home folders and /data/project from wikimetrics1 - https://phabricator.wikimedia.org/T103530#1470036 (yuvipanda) Open>Resolved a:yuvipanda
[09:07:14] operations, Labs, Labs-Sprint-102, Labs-Sprint-103, and 3 others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1470040 (yuvipanda)
[09:11:24] operations, Graphite: improve graphite failover - https://phabricator.wikimedia.org/T88997#1470050 (fgiunchedi) a:fgiunchedi
[09:14:10] (PS3) Giuseppe Lavagetto: ganglia: rename ganglia_new to ganglia [puppet] - https://gerrit.wikimedia.org/r/225881
[09:20:38] PROBLEM - puppet last run on ms-fe2003 is CRITICAL puppet fail
[09:35:10] operations, HHVM: Create new HHVM package for HHVM 3.6.5 + patches - https://phabricator.wikimedia.org/T106483#1470090 (Joe) NEW a:Joe
[09:41:09] Ops-Access-Requests, operations: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1470142 (fgiunchedi) hi, there's a few things we'll need before proceeding, * The shell account doesn't seem to exist (yet?) on labs, @srijan can you access some machines in labs? spe...
[09:46:29] (PS2) Filippo Giunchedi: Make compaction alert less sensitive to short-time spikes [puppet] - https://gerrit.wikimedia.org/r/226113 (owner: GWicke)
[09:46:40] (CR) Filippo Giunchedi: [C: 2 V: 2] Make compaction alert less sensitive to short-time spikes [puppet] - https://gerrit.wikimedia.org/r/226113 (owner: GWicke)
[09:48:48] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:54:29] (PS3) Filippo Giunchedi: access: shell account for Trey Jones [puppet] - https://gerrit.wikimedia.org/r/226077 (owner: Matanya)
[09:54:36] (CR) Filippo Giunchedi: [C: 2 V: 2] access: shell account for Trey Jones [puppet] - https://gerrit.wikimedia.org/r/226077 (owner: Matanya)
[09:55:59] hey!
[09:56:11] can someone tell me the Ubuntu version we are using for the puppet master ?
[09:56:16] is that Precise or Trusty?
[09:57:41] (PS2) Filippo Giunchedi: access: add Trey Jones to statistics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/226115 (owner: Matanya)
[09:58:03] hashar: palladium
[09:58:14] yeah I am wondering whether it is Trusty or Precise :)
[09:58:53] (CR) Filippo Giunchedi: [C: 2 V: 2] access: add Trey Jones to statistics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/226115 (owner: Matanya)
[09:58:59] oops, sorry :)
[09:59:33] matanya: hehe no worries, it was easier to fix on the fly
[09:59:42] matanya: thanks for your help though!
[10:00:29] sure
[10:00:46] godog: can you answer hashar by any chance ?
[10:00:58] hashar: palladium is precise, related T98129
[10:01:10] matanya: hehe yeah I was as we speak :)
[10:02:04] godog: thanks!
[10:02:29] godog: and one last request for now, would you be willing to bring https://phabricator.wikimedia.org/T106447 in ops meeting next week ?
[10:02:44] Ops-Access-Requests, operations, Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1470221 (fgiunchedi) Open>Resolved a:fgiunchedi @tjones you should be set, more access documentation at https://wikitech.wikimedia.org/wiki/SSH_access also thanks @ma...
[10:04:15] matanya: yup I'll bring it up
[10:04:25] thank you
[10:04:31] Ops-Access-Requests, operations: access request for server side uploads - https://phabricator.wikimedia.org/T106447#1470238 (fgiunchedi) p:Triage>Normal
[10:04:55] Ops-Access-Requests, operations: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1470243 (fgiunchedi) p:Triage>Normal
[10:05:08] np!
[10:17:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.69% of data above the critical threshold [500.0]
[10:18:06] 5xx just had a huge spike
[10:18:30] because of thumbnails
[10:18:57] <_joe_> jynus: thumbnails?
[10:19:05] <_joe_> are you sure?
[10:19:56] <_joe_> I don't think we do 200K thumbnails an hour
[10:20:41] maybe those were the only 500 left after the fact, let me backtrack
[10:29:44] operations, Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1470253 (mark) A clarification is needed here I think: # Labs is setup so it equals the level of access of the rest of the Internet. So Labs hosts can only access production hosts that th...
[10:32:10] operations, Continuous-Integration-Infrastructure: Upload new Zuul .deb package on apt.wikimedia.org for precise-wikimedia - https://phabricator.wikimedia.org/T106499#1470255 (hashar) NEW
[10:38:51] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[10:39:30] (PS6) Giuseppe Lavagetto: Patch to uniqify filename of eval()'d code [debs/hhvm] - https://gerrit.wikimedia.org/r/219125 (https://phabricator.wikimedia.org/T102937) (owner: EBernhardson)
[10:40:51] operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1470272 (mark) a:RobH Rob, can you take a look at this? SMSGlobal is consistently unreliable for us, and we either need to get it fixed soon or move to another solution...
[10:41:43] operations, Traffic, Performance: enwiki Main_Page timeouts - https://phabricator.wikimedia.org/T104225#1470283 (mark)
[10:42:25] operations, Release-Engineering, Database: Audit all existing code to ensure that any extension currently or previously adding blobs to ES has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#1470286 (PleaseStand) >>! In T106388#1469515, @L...
[10:42:46] (CR) Giuseppe Lavagetto: [C: 2 V: 2] "Great work, thanks" [debs/hhvm] - https://gerrit.wikimedia.org/r/219125 (https://phabricator.wikimedia.org/T102937) (owner: EBernhardson)
[10:45:42] Blocked-on-Operations, operations, Services: Migrate SCA cluster to Jessie - https://phabricator.wikimedia.org/T96017#1470292 (mark) p:High>Normal We've recently discussed about how to move forward with this, and it will likely involve a new temporary service cluster indeed.
[10:51:17] operations, HHVM: Create new HHVM package for HHVM 3.6.5 + patches - https://phabricator.wikimedia.org/T106483#1470308 (Joe)
[10:51:18] operations, HHVM: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1470307 (Joe)
[10:54:16] (CR) Mobrovac: [C: 1] Remove firejail conditional [puppet] - https://gerrit.wikimedia.org/r/226273 (https://phabricator.wikimedia.org/T101870) (owner: Muehlenhoff)
[10:57:50] PROBLEM - puppet last run on heze is CRITICAL Puppet has 1 failures
[11:05:41] PROBLEM - Hadoop NodeManager on analytics1038 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[11:06:25] (PS1) Giuseppe Lavagetto: Perserve session handler on session_destroy [debs/hhvm] - https://gerrit.wikimedia.org/r/226286 (https://phabricator.wikimedia.org/T97675)
[11:17:10] operations, OCG-General-or-Unknown: ocg alarm ocg_job_status_queue 'flapping' - https://phabricator.wikimedia.org/T97524#1470358 (fgiunchedi) p:Triage>Normal I don't see the flapping alerts and much lower queues, out of curiosity what changed?
[11:24:11] RECOVERY - puppet last run on heze is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[11:27:16] operations, Mathoid, RESTBase, Services: Document and hook up public mathoid end point in RB - https://phabricator.wikimedia.org/T102030#1354112 (mobrovac)
[11:27:51] RECOVERY - Hadoop NodeManager on analytics1038 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[11:37:11] !log springle Synchronized wmf-config/db-eqiad.php: raise db1071 to normal load (duration: 00m 12s)
[11:37:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:55:45] operations, Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1470493 (Matanya) Can specific instances be whitelisted ?
[11:57:06] matanya: mark has just said no!
[11:57:27] he said via proxy
[11:57:41] or at least that was my understanding
[11:58:13] yup
[11:58:14] operations, Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1470496 (mark) >>! In T95714#1470493, @Matanya wrote: > Can specific instances be whitelisted ? I'm afraid not, no... That's too much of a security loophole, given the more dynamic and less...
[11:58:50] yeah but not a proxy in labs
[11:59:01] matanya: but yeah download from mediawiki app servers to labs via a proxy might not work
[11:59:19] that's hairy, security wise
[11:59:42] thank you mark, i'll tell you the usecase and you can come up with an idea?
[11:59:52] (prevent the x,y problem)
[12:00:12] yes, it needs to be investigated case by case
[12:00:57] legoktm has shown the use case:
[12:00:58] HTTPS_PROXY=url-downloader.wikimedia.org:8080 curl https://tools.wmflabs.org/legobot/hi.txt
[12:00:58] curl: (56) Received HTTP code 403 from proxy after CONNECT
[12:00:59] :(
[12:01:06] mark: i have TB sized videos i'd like to upload to commons, that requires downloading them from labs to local pc, upload to terbium by someone with rights, and then do server side upload
[12:01:40] that seems to be a one time thing?
[12:01:48] not quite
[12:02:06] it is in the dozens currently
[12:03:29] mark: I also have the wikimania videos to upload
[12:03:37] operations, Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1470498 (hashar) The use case from T78167 is for wgCopyUploadsDomain: ``` legoktm@terbium:~$ HTTPS_PROXY=url-downloader.wikimedia.org:8080 curl https://tools.wmflabs.org/legobot/hi.txt curl: (...
[12:03:39] in the url downloader proxy, whitelisting a specific public labs host may be reasonable
[12:03:47] but that would need a thorough review by the security team
[12:04:06] note that that's unrelated to the above networking ticket
[12:04:16] the url downloader ticket is a public host and can reach labs just fine
[12:04:27] it's just not setup to allow requests to labs
[12:04:32] I see
[12:04:37] needs a new ticket ?
[12:05:28] no, https://phabricator.wikimedia.org/T78167 is fine
[12:05:41] but i think we can close T95714 as wontfix
[12:05:47] or whatever that is these days in phab ;)
[12:06:18] weird that it suddenly works hashar
[12:06:20] matanya: there is also a ticket to do the video transcoding of Wikimania videos at https://phabricator.wikimedia.org/T106112
[12:06:39] matanya: so potentially the raw files would be uploaded there, transcoded and then manually uploaded to commons
[12:06:39] yes, i asked for it
[12:07:13] but the reason there is labs doesn't have enough disk space, not the upload problem
[12:07:27] mark: maybe the url downloader is now whitelisting the labs reverse proxy || tools.wmflabs.org
[12:07:42] possibly, should be able to find it in gerrit then
[12:07:54] though it might be redundant now, since wikimania 2015 had very few videos
[12:08:36] operations, Security-Team: Production cluster can't access labs cluster - https://phabricator.wikimedia.org/T95714#1470502 (Matanya) Open>declined a:Matanya Per @mark comments above.
[12:10:18] that wikimania video problem does come back every year indeed ;)
[12:10:52] and will get worse as video quality grows
[12:10:59] actually
[12:11:12] at some point video resolution won't grow anymore as people can't see the difference anymore
[12:11:20] so perhaps in 10 years we'll have more bandwidth and not a lot higher quality
[12:11:30] and the problem will be gone ;)
[12:12:13] i heard this in 2005 when people asked for hi-res images, so we should have hope, as that one is indeed resolved
[12:12:29] mark: you can only dream ;)
[12:15:11] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail
[12:19:50] operations: Evaluate traffic flow between the Jobrunners and the Cirrus cluster - https://phabricator.wikimedia.org/T105705#1470526 (mark) That's noise, not a problem at all. :)
[12:43:31] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:49:32] operations: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1470593 (fgiunchedi) a:fgiunchedi
[12:52:14] operations, Varnish: upload.wikimedia.org returns HTTP status code 501 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1470600 (Joe) NEW
[12:53:05] operations: Stop a poolcounter server fail from being a SPOF for the service and the api (and the site) - https://phabricator.wikimedia.org/T105378#1470608 (fgiunchedi) see also related poolcounter tickets: * {T32452} * {T65027}
[12:53:35] (PS2) Giuseppe Lavagetto: Perserve session handler on session_destroy [debs/hhvm] - https://gerrit.wikimedia.org/r/226286 (https://phabricator.wikimedia.org/T97675)
[12:55:10] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 12 failures
[12:55:33] (CR) Giuseppe Lavagetto: [C: 2 V: 2] Perserve session handler on session_destroy [debs/hhvm] - https://gerrit.wikimedia.org/r/226286 (https://phabricator.wikimedia.org/T97675) (owner: Giuseppe Lavagetto)
[12:59:46] operations, Icinga, Patch-For-Review: Icinga check to detect saturation of nf_conntrack - https://phabricator.wikimedia.org/T105154#1470621 (MoritzMuehlenhoff) Open>Resolved This is enabled in prod since last week.
[13:00:10] PROBLEM - check_puppetrun on boron is CRITICAL Puppet has 12 failures
[13:02:36] operations, Varnish, Wikimedia-log-errors: upload.wikimedia.org returns HTTP status code 501 for truncated urls, not 404 - https://phabricator.wikimedia.org/T106517#1470639 (jcrespo) The main issue here, in my opinion, is the log noise this creates, then; associating project.
[13:04:42] 6operations, 6Labs, 7Database, 7Tracking: (Tracking) Database replication services - https://phabricator.wikimedia.org/T50930#1470642 (10jcrespo) [13:04:48] 6operations, 10Deployment-Systems, 7HHVM: HHVM lock-ups - https://phabricator.wikimedia.org/T89912#1470645 (10fgiunchedi) did this get fixed upstream? afaik we're not experiencing hhvm lockups now in production even on big deploys and there was work around statcache [13:05:10] RECOVERY - check_puppetrun on boron is OK Puppet is currently enabled, last run 63 seconds ago with 0 failures [13:15:07] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 7Nodepool, 5Patch-For-Review: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1470664 (10hashar) [13:15:09] 6operations, 5Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1470663 (10hashar) [13:15:26] 6operations, 5Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1470666 (10hashar) [13:15:28] 6operations, 10hardware-requests, 5Continuous-Integration-Isolation: eqiad: 2 hardware access request for CI isolation on labsnet - https://phabricator.wikimedia.org/T93076#1470667 (10hashar) [13:16:06] 6operations, 5Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1179075 (10hashar) The hardware has been allocated for Nodepool, so I removed the blocking task {T93706} Most of the puppet patches have been merged, the one left over is the systemd confi... 
[13:17:50] (03PS1) 10Giuseppe Lavagetto: Bump changelog [debs/hhvm] - 10https://gerrit.wikimedia.org/r/226293 [13:25:22] 6operations: Packet loss alerting - https://phabricator.wikimedia.org/T83196#1470685 (10fgiunchedi) [13:26:55] 6operations, 5Continuous-Integration-Isolation: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1470690 (10hashar) scandium is going to host the Zuul mergers. On the [[ https://www.mediawiki.org/wiki/Continuous_integration/Architecture/Isolation#Architecture_overv... [13:27:50] (03CR) 10Muehlenhoff: "A few comments inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [13:28:52] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 12 data above and 0 below the confidence bounds [13:33:13] 6operations: Packet loss alerting - https://phabricator.wikimedia.org/T83196#1470695 (10fgiunchedi) some external monitoring is performed by catchpoint nowadays and `check_ripe_atlas`, internal network loss between datacenters still stands though (e.g. by integrating smokeping and icinga or simple checks in icinga) [13:33:39] 6operations, 7Monitoring: internal network packet loss alerting - https://phabricator.wikimedia.org/T83196#1470696 (10fgiunchedi) [13:36:12] (03CR) 10Hashar: "Thanks. Maybe nodepool should be run as a normal process instead of daemon mode. 
This way we no more need the pid file and rely on system" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [13:39:41] (03CR) 10Hashar: [C: 031] "Indeed there is no inbound connections expected beside the mysql one which only listen on 127.0.0.1 anyway and nodepool is configured to p" [puppet] - 10https://gerrit.wikimedia.org/r/226118 (owner: 10Muehlenhoff) [13:43:22] (03PS2) 10Muehlenhoff: Enable ferm on labnodepool [puppet] - 10https://gerrit.wikimedia.org/r/226118 [14:04:22] !log added cython_0.20.1+git90-g0e6e38e-1ubuntu2~precise1 to precise-wikimedia on carbon (required for activemq backport on precise) [14:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:06:52] 6operations: Manage Appveyor account - https://phabricator.wikimedia.org/T104306#1470743 (10MoritzMuehlenhoff) p:5Triage>3Low [14:11:04] (03PS4) 10Giuseppe Lavagetto: ganglia: rename ganglia_new to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/225881 [14:11:40] <_joe_> about to merge ^^ [14:12:38] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] ganglia: rename ganglia_new to ganglia [puppet] - 10https://gerrit.wikimedia.org/r/225881 (owner: 10Giuseppe Lavagetto) [14:16:16] exceptions (puppet/salt-call) look familiar to anyone? https://phabricator.wikimedia.org/P1034 [14:16:58] i'm trying to get trebuchet deploys to work for those repos in deployment-prep [14:18:44] (03PS7) 10Hashar: nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) [14:18:53] PROBLEM - puppet last run on cp3021 is CRITICAL puppet fail [14:19:10] PROBLEM - puppet last run on ms-fe1001 is CRITICAL Puppet has 1 failures [14:19:11] PROBLEM - puppet last run on db1066 is CRITICAL puppet fail [14:19:15] (03CR) 10Hashar: "It is no running in interactive mode (-d). 
systemd ExecStop can use $MAINPI" [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [14:19:22] PROBLEM - puppet last run on lvs1002 is CRITICAL Puppet has 1 failures [14:19:39] (03PS1) 10Giuseppe Lavagetto: role::ganglia::web: fully qualify class inclusion [puppet] - 10https://gerrit.wikimedia.org/r/226301 [14:19:41] PROBLEM - puppet last run on uranium is CRITICAL puppet fail [14:19:55] (03CR) 10Hashar: [C: 031] "It is now running in interactive mode (-d). systemd ExecStop can use $MAINPID, so pass it to the graceful-stop script to iterate." [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [14:20:02] PROBLEM - puppet last run on lvs1005 is CRITICAL puppet fail [14:20:31] PROBLEM - puppet last run on lvs3003 is CRITICAL Puppet has 1 failures [14:20:36] !log enabling puppet on labnodepool1001.eqiad.wmnet [14:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:21:56] (03CR) 10Giuseppe Lavagetto: [C: 032] role::ganglia::web: fully qualify class inclusion [puppet] - 10https://gerrit.wikimedia.org/r/226301 (owner: 10Giuseppe Lavagetto) [14:23:30] RECOVERY - puppet last run on lvs1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:23:50] RECOVERY - puppet last run on uranium is OK Puppet is currently enabled, last run 2 seconds ago with 0 failures [14:25:34] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me (although PermissionsStartOnly=true might just as well be dropped in the revised version)" [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [14:26:55] (03PS8) 10Hashar: nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) [14:27:01] PROBLEM - puppet last run on palladium is CRITICAL puppet fail [14:27:07] Could not retrieve catalog from remote 
server: Error 400 on SERVER: Could not find template 'ganglia_new/gmond.conf.erb' at /etc/puppet/modules/ganglia_new/manifests/monitor/config.pp:8 on node db1066.eqiad.wmnet [14:28:02] now it works [14:28:38] jynus: probably because ganglia_new was renamed ganglia :) [14:29:17] oh, I see now [14:29:30] RECOVERY - puppet last run on db1066 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:29:42] (03CR) 10Hashar: "I dropped PermissionsStartOnly=true, that was only needed to chown /var/run/nodepool which is no more used" [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [14:33:30] (03PS1) 10Hashar: nodepool: fix erb expansion for dib_cache_dir [puppet] - 10https://gerrit.wikimedia.org/r/226302 [14:35:13] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: tjones needs access to stat1002 - https://phabricator.wikimedia.org/T106175#1470786 (10TJones) Thanks @fgiunchedi & @Matanya, and thanks for the link. [14:44:13] 6operations, 10CirrusSearch, 6Discovery: Decide on and document the implementation for multi-DC CirrusSearch - https://phabricator.wikimedia.org/T105708#1470800 (10chasemp) >>! In T105708#1449782, @Manybubbles wrote: > Looking for a README style document on how it works and how to set it up. Does this exist... [14:44:20] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [14:44:32] moritzm: mind merging it in ? 
:-) [14:44:52] hashar: will do [14:44:55] RECOVERY - puppet last run on lvs1005 is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures [14:45:32] (03PS9) 10Muehlenhoff: nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [14:45:50] (03CR) 10Muehlenhoff: [C: 032 V: 032] nodepool: systemd wrapper [puppet] - 10https://gerrit.wikimedia.org/r/224102 (https://phabricator.wikimedia.org/T96867) (owner: 10Hashar) [14:46:06] RECOVERY - puppet last run on cp3021 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:46:17] (03PS2) 10Hashar: nodepool: fix erb expansion for dib_cache_dir [puppet] - 10https://gerrit.wikimedia.org/r/226302 [14:46:27] RECOVERY - puppet last run on lvs3003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:46:35] RECOVERY - puppet last run on ms-fe1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:47:16] RECOVERY - puppet last run on palladium is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:49:17] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [14:49:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/226302 (owner: 10Hashar) [14:52:08] 6operations: salt '*' test.ping after upgrade fails on many hosts - https://phabricator.wikimedia.org/T83095#1470826 (10fgiunchedi) @arielglenn I'd think this has been fixed long ago? 
[14:52:09] 6operations: salt '*' test.ping after upgrade fails on many hosts - https://phabricator.wikimedia.org/T83095#1470828 (10fgiunchedi) [14:59:32] 6operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#1470847 (10hashar) [14:59:34] 6operations, 5Continuous-Integration-Isolation: install/deploy labnodepool1001 - https://phabricator.wikimedia.org/T95045#1470849 (10hashar) [14:59:36] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 7Nodepool: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1470850 (10hashar) [14:59:39] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 7Nodepool, 5Patch-For-Review: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1470844 (10hashar) 5Open>3Resolved Thanks to @Muehlenhoff for the final review of the systemd integration. We now... [15:00:04] manybubbles anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150722T1500). Please do the needful. [15:00:19] 6operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#1470852 (10hashar) [15:00:33] doesn't look like there are any patches up for SWAT today [15:04:21] maybe swat that new wiki setup we talked about yesterday ? [15:07:28] thcipriani: less work for you ;) [15:10:05] (03PS6) 10Eevans: WIP: Cassanra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) [15:16:22] let me give you some work, then [15:16:39] (03PS2) 10Jcrespo: remove db-secondary.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222289 (owner: 10John F. Lewis) [15:17:16] (03CR) 10Jcrespo: [C: 032] remove db-secondary.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/222289 (owner: 10John F. 
Lewis) [15:17:50] jynus: what's the point of saying 'let me give you work' to then merge it yourself? :) [15:17:59] :-) [15:18:08] oh, wait until everihting breaks! [15:18:22] true! [15:18:41] (03PS7) 10Eevans: WIP: Cassanra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) [15:19:10] untracked files: "w/405.php" [15:19:46] I guess tracking is a method not allowed [15:20:21] jynus: That's been untracked for awhile... [15:20:39] (03PS8) 10Jakob: Add Phragile module. [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) [15:20:42] ok, no problem for me, will leave it untouched [15:22:00] (03CR) 10Jakob: Add Phragile module. (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [15:22:16] (03PS3) 10EBernhardson: Add statsd reporting plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/223202 [15:24:23] thcipriani: Hi [15:24:31] Mjbmr: hello [15:24:49] thcipriani: can you review this https://gerrit.wikimedia.org/r/220358 [15:26:10] !log jynus Synchronized docroot/noc: removing db-secondary.php from the list of symlinks to maintain (duration: 00m 12s) [15:26:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:25] Any ops about who can touch DNS for me? [15:27:40] !log jynus Synchronized wmf-config: removing db-secondary.php (duration: 00m 12s) [15:27:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:28:30] ostriches: yup, what's up? [15:28:39] https://phabricator.wikimedia.org/rODNS948faf3c44219a55ae6543f7b053e1b249f51c34 was merged, but https://phabricator.wikimedia.org/T106305#1464443 [15:29:29] ostriches: looking [15:29:56] Mjbmr: looks fine to me, are you wanting to get this out for SWAT? [15:30:15] thcipriani: well, yeah. 
[15:30:46] heh, kk [15:31:23] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220358 (owner: 10Mjbmr) [15:34:13] JohnFLewis, I do not see anything strange after the sync, but I wouldn't be surprised there is an obscure dependency that only arises after months :-) [15:34:56] jynus: This is Wikimedia - of course they will be an issue because of something merged 6 months prior :p [15:35:19] * ostriches sets a calendar reminder [15:36:50] hmm, looks like zuul is having some trouble... [15:38:14] is it broken? [15:39:43] there were some upgrades yesterday that had to be reverted that have been reapplied [15:39:50] so maybe [15:41:11] YuviPanda: any news on the apk download server ? [15:41:28] (03PS3) 10Filippo Giunchedi: Touch project domain templates to enable azb subdomain [dns] - 10https://gerrit.wikimedia.org/r/226079 (https://phabricator.wikimedia.org/T106305) (owner: 10Mjbmr) [15:41:36] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Touch project domain templates to enable azb subdomain [dns] - 10https://gerrit.wikimedia.org/r/226079 (https://phabricator.wikimedia.org/T106305) (owner: 10Mjbmr) [15:42:36] Mjbmr ostriches ^ [15:43:16] JohnFLewis, the other changes you asked me for review are ok, but will take a while, as the dns changes will be one of the last things to do (otherwise we won't be able to login) [15:43:25] godog: I do? [15:43:43] (03CR) 10Thcipriani: Update the logo of lrcwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220358 (owner: 10Mjbmr) [15:43:54] Mjbmr: oh I was pointing to your merge dns change from grrrit-wm [15:44:02] jynus: yeah I saw. once you remove them from puppet they'll be fine as cmjohnson1 just needs the WMF* mgmt addresses anyway [15:44:04] (03CR) 10Thcipriani: [C: 032 V: 032] "SWAT try II" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220358 (owner: 10Mjbmr) [15:44:12] godog: oh, ok. 
[15:44:23] *remove as in once you're done with them as db*.eqiad.wmnet as hostname [15:44:36] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Upgrade beta to Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106164#1470935 (10EBernhardson) a:3EBernhardson [15:44:37] (03Merged) 10jenkins-bot: Update the logo of lrcwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220358 (owner: 10Mjbmr) [15:44:48] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [15:45:15] Mjbmr: ok, got it working [15:45:32] godog: thx [15:45:37] thcipriani: thanks, can you sync it? [15:46:02] (03CR) 10EBernhardson: "updated to now match es 1.7.0" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/223202 (owner: 10EBernhardson) [15:46:21] godog: Sweet, it's resolving to incubator now. We can take over from here and get the wiki setup. [15:47:04] ostriches: no problem! [15:47:37] for a while there was a lot of mobile urls failing [15:47:51] !log thcipriani Synchronized w/static/images/project-logos/lrcwiki.png: SWAT: Update the logo of lrcwiki [[gerrit:220358]] (duration: 00m 13s) [15:47:55] 10Ops-Access-Requests, 6operations: access request for server side uploads - https://phabricator.wikimedia.org/T106447#1470942 (10Krenair) >>! In T106447#1469355, @Legoktm wrote: > I think adding matanya to the "restricted" group would work? I support this request fwiw. Yep. It's all done on terbium and needs... [15:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:58] ^ Mjbmr check please [15:48:05] thcipriani: I've got a meeting in ~10m, but we'll poke azbwiki and get it setup after if you've got time still [15:48:41] ostriches: I've got a meeting at 9:30, but after that though... [15:48:48] thcipriani: I think you should run sync-file w/static/images/project-logos/lrcwiki.png [15:49:35] thcipriani: 1h from now wfm, my day's pretty open. 
[15:49:44] ostriches, do you have a ticket already? [15:49:46] Mjbmr: I did run that: sync-file w/static/images/project-logos/lrcwiki.png "SWAT: Update the logo of lrcwiki [[gerrit:220358]]" [15:50:07] thcipriani: thanks! [15:50:10] jynus: T106305 [15:50:14] Mjbmr: yw [15:50:15] ostriches, thanks [15:50:57] (have to create my own for db-labs) [15:51:30] Gotcha. It's a new language wikipedia so nothing special about it. [15:52:08] yes , the earlier I do it the easier it is for me my part [15:52:30] <_joe_> !log uploaded hhvm_3.6.5+dfsg1-1+wm1 to reprepro [15:52:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:54:08] jynus: If you need to get the labsdb stuff done first then make it a blocking task and I can wait until it's resolved. [15:54:19] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well - https://phabricator.wikimedia.org/T97051#1470965 (10fgiunchedi) I took a quick look at this, it should really happening automatically, for that to happen `authdns-gen-z... [15:54:25] no, actually it has to be after [15:54:32] but shortly after :-) [15:54:33] Ah, but soon after. 
[15:54:34] Gotcha [15:54:37] do not ask [15:57:49] 6operations, 10Wikimedia-DNS: DNS zones do not get re-generated when adding new language - https://phabricator.wikimedia.org/T84684#1470976 (10fgiunchedi) [15:57:50] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: adding new languages to DNS langs.tmpl doesn't work until zone template is edited as well - https://phabricator.wikimedia.org/T97051#1470978 (10fgiunchedi) [16:01:27] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [16:03:43] <_joe_> !log installed the hhvm 3.6.5 on deployment-prep [16:03:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:10:21] 10Ops-Access-Requests, 6operations: access request for server side uploads - https://phabricator.wikimedia.org/T106447#1471029 (10Krenair) I would note that @matanya already has #WMF-NDA access. [16:12:31] 6operations, 7HHVM: Create new HHVM package for HHVM 3.6.5 + patches - https://phabricator.wikimedia.org/T106483#1471055 (10Joe) [16:12:32] 6operations, 10Gather, 10MobileFrontend, 7HHVM, and 2 others: [facebook/hhvm] Incorrect return value from eval, Closure generated in first eval pass is returned in the second eval pass #5502 - https://phabricator.wikimedia.org/T102937#1471054 (10Joe) [16:12:48] 6operations, 10Gather, 10MobileFrontend, 7HHVM, and 2 others: [facebook/hhvm] Incorrect return value from eval, Closure generated in first eval pass is returned in the second eval pass #5502 - https://phabricator.wikimedia.org/T102937#1377848 (10Joe) [16:12:49] 6operations, 7HHVM, 5Patch-For-Review: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1471062 (10Joe) [16:13:09] <_joe_> ebernhardson: your patch to hhvm is live in beta, in case you want to verify [16:13:23] <_joe_> anomie, bd808: your required backport is there too [16:13:36] _joe_: awesome! 
[16:13:39] _joe_: Thanks [16:14:18] <_joe_> it's going to prod in a few days, first on canary systems, then everywhere [16:14:23] _joe_: thanks, i've let the mobile team know so they can try reverting their hacks [16:15:06] <_joe_> sorry it took so long :( [16:15:21] thanks _joe_! [16:15:32] things look good for the scalers too, no? [16:15:48] 10Ops-Access-Requests, 6operations: access request for server side uploads - https://phabricator.wikimedia.org/T106447#1471076 (10hoo) I'm not a huge fan of this given that shell upload is a bad workaround and should be eliminated, rather than being propagated further. I don't have a problem with matanya havi... [16:16:14] <_joe_> ori: not completely, there is a high level of 5xxs right now, and I think it has something to do with majority of the cluster being on HHVM [16:18:05] (03PS1) 10Tim Landscheidt: Tools: Fix mail address for webservice jobs [puppet] - 10https://gerrit.wikimedia.org/r/226311 (https://phabricator.wikimedia.org/T106462) [16:19:26] hmm, ok. i'll try to investigate [16:19:40] (03CR) 10Tim Landscheidt: "Tested on tools-bastion-01; pre:" [puppet] - 10https://gerrit.wikimedia.org/r/226311 (https://phabricator.wikimedia.org/T106462) (owner: 10Tim Landscheidt) [16:20:06] memcached errors are still really high for the keys "zhwiki:preprocess-hash:b878bc90c624257155bc0aa0e4b4e0c5:1" and " zhwiki:preprocess-hash:715b4007d719254ecd0ba9a09e83dfe1:1" [16:21:03] bd808: any ideas why? [16:21:09] and when did it start? [16:21:39] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 15.38% of data above the critical threshold [500.0] [16:21:43] They were the big error source 24 hours ago. Haven't gone farther back than that. 
[16:22:05] <_joe_> ori: so what I'm seeing right now is that https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Casquette-IMG_0922.jpg/600px-Casquette-IMG_0922.jpg returns a 503, but if you request that to the imagescalers, they respond with a 302 [16:22:17] There were some logs yesterday about them being too big [16:22:24] grr [16:22:24] <_joe_> which seems not to be liked by varnish, evidently [16:22:38] _joe_: i have to run to a dentist appt, can you copy that url to the task? [16:22:40] <_joe_> ori: I'll comment on the HHVM/imagescalers ticket [16:22:41] i'll look at it as soon as i get back [16:22:42] thanks! [16:22:56] <_joe_> and good luck with the dentist :) [16:28:08] 6operations: mwdeploy does not have the same user ID on all Apaches - https://phabricator.wikimedia.org/T79786#1471169 (10fgiunchedi) [16:29:59] (03CR) 10Merlijn van Deen: [C: 031] Tools: Fix mail address for webservice jobs [puppet] - 10https://gerrit.wikimedia.org/r/226311 (https://phabricator.wikimedia.org/T106462) (owner: 10Tim Landscheidt) [16:33:06] jdlrobson: it's going to prod in a 'few days', first on canary systems then everywhere [16:34:18] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [16:34:48] 7Blocked-on-Operations, 6operations, 6Commons, 6Multimedia, and 5 others: Convert eqiad imagescalers to HHVM, Trusty - https://phabricator.wikimedia.org/T84842#1471228 (10Joe) New imagescalers fun found today: an high level of 5xx were going on today, so I intercepted a few urls that were returning a 503... [16:37:30] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "Package looks good." 
[debs/hhvm] - 10https://gerrit.wikimedia.org/r/226293 (owner: 10Giuseppe Lavagetto) [16:41:44] 6operations, 7HHVM, 5Patch-For-Review: Custom session handler corrupted by session_destroy, "Failed to initialize storage module" - https://phabricator.wikimedia.org/T97675#1471267 (10Anomie) Testing on beta, it appears that (once deployed) the new package will fix this bug. @hashar: Please upgrade the CI s... [16:42:19] godog, why did you restrict editing of https://phabricator.wikimedia.org/T83095 ? [16:42:47] (03PS2) 10Giuseppe Lavagetto: hhvm: expire APC keys after 2 days [puppet] - 10https://gerrit.wikimedia.org/r/225858 (https://phabricator.wikimedia.org/T104769) [16:43:02] (03PS1) 10EBernhardson: Fix IndentationError in es-tool [puppet] - 10https://gerrit.wikimedia.org/r/226312 [16:43:28] <_joe_> anomie: I was wondering - might this be related to people losing sessions when editing? [16:43:49] _joe_: context for "this"? [16:44:19] <_joe_> oh, err, sorry [16:44:31] <_joe_> the corruption of session data that the patch fixed [16:45:15] could i pester anyone for an easy +2 in operations/puppet? There is a python script called es-tool that has an indentation error: https://gerrit.wikimedia.org/r/#/c/226312/1 [16:45:26] <_joe_> ebernhardson: I can take a look [16:45:58] _joe_: thanks [16:46:24] (03PS3) 10Giuseppe Lavagetto: hhvm: expire APC keys after 2 days [puppet] - 10https://gerrit.wikimedia.org/r/225858 (https://phabricator.wikimedia.org/T104769) [16:46:40] (03CR) 10Giuseppe Lavagetto: [C: 032] hhvm: expire APC keys after 2 days [puppet] - 10https://gerrit.wikimedia.org/r/225858 (https://phabricator.wikimedia.org/T104769) (owner: 10Giuseppe Lavagetto) [16:47:00] 6operations: salt '*' test.ping after upgrade fails on many hosts - https://phabricator.wikimedia.org/T83095#1471325 (10fgiunchedi) [16:47:11] Krenair: too much queue duty :( fixed [16:50:32] _joe_: if it happened to be related, when should we see the effects on https://phabricator.wikimedia.org/T102199 ? 
2 days? :) [16:51:12] _joe_: It seems unlikely, since MediaWiki currently seems to work around the bug (the only place it calls session_destroy(), it then calls session_set_save_handler() again before session_start()) [16:51:53] <_joe_> Nemo_bis: apparently not [16:53:52] 6operations: setup redirection for mediawiki.gr, wikibooks.gr, wikisource.gr - https://phabricator.wikimedia.org/T83190#1471398 (10fgiunchedi) [16:54:10] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Upgrade beta to Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106164#1471401 (10EBernhardson) Ran into small issue with es-tool, there is an IndentationError in the current script. Waiting on https://gerrit.wikimedia.org/r/226312... [16:54:32] (03PS2) 10Filippo Giunchedi: Fix IndentationError in es-tool [puppet] - 10https://gerrit.wikimedia.org/r/226312 (owner: 10EBernhardson) [16:54:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] Fix IndentationError in es-tool [puppet] - 10https://gerrit.wikimedia.org/r/226312 (owner: 10EBernhardson) [16:54:41] (03CR) 10Giuseppe Lavagetto: [C: 032] Fix IndentationError in es-tool [puppet] - 10https://gerrit.wikimedia.org/r/226312 (owner: 10EBernhardson) [16:54:46] _joe_: haha [16:54:48] <_joe_> lol at the same time [16:54:52] :) [16:54:57] ebernhardson: you have plenty of +2s! [16:55:14] {{done}} [16:55:37] excellent, beta cluster instances should run puppet on their own in a "few minutes" ? 
[16:56:02] will also be doing prod, but next week not today [16:56:36] _joe_: well yeah, chances were very low :) [16:56:38] correct, I think the beta puppet master will update by itself from production [16:57:38] <_joe_> ok, off for today, see you tomorrow [16:58:02] godog, thanks [17:00:19] 6operations: cp1021-36, cp1041-42 status - https://phabricator.wikimedia.org/T83075#1471465 (10fgiunchedi) [17:00:48] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 23.08% of data above the critical threshold [500.0] [17:02:18] godog, and also thank you for opening up more tickets [17:04:01] Krenair: np, it is mostly outdated stuff tho [17:05:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 11 data above and 3 below the confidence bounds [17:09:45] 6operations, 6Labs: virbr0 interface present in some virt hosts - https://phabricator.wikimedia.org/T83732#1471530 (10fgiunchedi) this is still happening on `labvirt1007` ``` root@palladium:~# salt -b 10 'virt*' cmd.run 'ip a l | grep virbr' Executing run on ['virt1006.eqiad.wmnet', 'virt1002.eqiad.wmnet', '... [17:09:47] 6operations, 6Labs: virbr0 interface present in some virt hosts - https://phabricator.wikimedia.org/T83732#1471533 (10fgiunchedi) [17:15:08] PROBLEM - Hadoop NodeManager on analytics1031 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [17:15:32] hm! [17:17:08] RECOVERY - Hadoop NodeManager on analytics1031 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [17:19:48] the increase of 5XXs correlates with HHVM restarting on several nodes [17:20:16] should I be worried about the uptick in hhvm fatals since ~16:52? https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor [17:20:58] <_joe_> jynus: yes, because puppet [17:21:13] <_joe_> greg-g: what kind of fatals? 
[17:21:19] _joe_: see the link [17:22:02] "Lost parent, LightProcess exiting" [17:22:09] <_joe_> greg-g: linked with restarts [17:22:10] (sorry, it wasn't loading for me completely, now it did) [17:22:20] <_joe_> that's just lightprocess exiting [17:22:30] <_joe_> it gets registered to the hhvm error log [17:22:36] is there a reason it just started? [17:22:40] <_joe_> so those are not really errors at all [17:22:52] <_joe_> greg-g: yeah, I changed a config key for hhvm in puppet [17:22:58] (hhvm only graph/data: https://logstash.wikimedia.org/#dashboard/temp/AU62ygMts2oZVZyjCZAn ) [17:23:04] <_joe_> with a change earlier [17:23:25] so, nothing to stop mukunda deploying? :) [17:23:57] <_joe_> I don't think so, but lemme check one thing [17:24:02] * greg-g really dislikes "meh, not interesting" issues spamming the fatal log [17:25:04] <_joe_> well, we could decide to patch hhvm to avoid this... [17:25:39] How can I force a purge of a static image? [17:26:21] <_joe_> greg-g: go on :) [17:26:43] r:) [17:26:57] -r [17:34:02] Krenair: good question ... [17:34:13] * twentyafterfour wants to know this one too ;) [17:34:16] Apparently these things are cached for a year. :/ [17:34:44] whose idea was it anyway? [17:34:50] can never have too much cache [17:35:18] SquidUpdate::purge( array( 'https://url/to/static/image' ) ); was tried and didn't appear to have any effect [17:36:19] twentyafterfour: If you're not caching it in 8 places, you're not caching it enough [17:39:38] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [17:40:03] ostriches: of course. 
Because 8 layers of cache makes purge much more straightforward and consistent [17:43:01] oh https://gerrit.wikimedia.org/r/#q,I8c9a6a56730f9e4a4f3caf956fd2ee12a39c9cde,n,z [17:43:11] (03CR) 10Legoktm: "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/226101 (https://phabricator.wikimedia.org/T106051) (owner: 10Ori.livneh) [17:43:45] also read my first comment on https://gerrit.wikimedia.org/r/#q,I1e4f348f03f8644eca6998313ab92c94a093cc33,n,z [17:45:04] or maybe append a question mark at the end of url in InitialiseSettings.php [17:52:17] Krenair: it looks like purging a file from cache involves sending a HTTP PURGE request from an IP on the allowed ACL [17:53:44] then again there is a lot of conflicting documentation on wikitech. [17:53:46] https://wikitech.wikimedia.org/wiki/Varnish#One-off_purges [17:55:31] 6operations, 6Release-Engineering, 7Database: Audit all existing code to ensure that any extension currently or previously adding blobs to ES has been registering a reference in the text table (and fix up if wrong) - https://phabricator.wikimedia.org/T106388#1471749 (10Legoktm) [18:00:04] twentyafterfour greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150722T1800). Please do the needful. [18:00:19] (03PS9) 10Rush: Add Phragile module. [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [18:00:30] jouncebot: You're always so punctual. 
[18:05:01] (03PS1) 1020after4: group1 wikis to 1.26wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226322 [18:05:18] (03CR) 1020after4: [C: 032] group1 wikis to 1.26wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226322 (owner: 1020after4) [18:05:25] (03Merged) 10jenkins-bot: group1 wikis to 1.26wmf15 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226322 (owner: 1020after4) [18:08:09] !log twentyafterfour Synchronized php-1.26wmf15/extensions/MobileFrontend/includes/MobileFrontend.hooks.php: deploy https://gerrit.wikimedia.org/r/#/c/226313/ (duration: 00m 13s) [18:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:08:56] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.26wmf15 [18:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:09:31] twentyafterfour, I don't think -g bits would work these days anyway? [18:09:55] probably not [18:15:23] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1471811 (10RobH) 3NEW a:3RobH [18:15:42] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1471821 (10RobH) [18:15:44] 6operations, 10hardware-requests: server for wikimania video transcoding - https://phabricator.wikimedia.org/T106112#1459481 (10RobH) [18:18:43] 6operations: Import Wikimania 2015 Videos - https://phabricator.wikimedia.org/T106565#1471831 (10RobH) 3NEW [18:18:47] !log running populateContentModel.php --ns=all --table=page on all medium wikis [18:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:18:58] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1471844 (10RobH) [18:18:59] 6operations, 10hardware-requests: server 
for wikimania video transcoding - https://phabricator.wikimedia.org/T106112#1459481 (10RobH) [18:19:01] 6operations: Import Wikimania 2015 Videos - https://phabricator.wikimedia.org/T106565#1471843 (10RobH) [18:20:24] 6operations: Import Wikimania 2015 Videos - https://phabricator.wikimedia.org/T106565#1471831 (10RobH) [18:20:25] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1471811 (10RobH) [18:20:27] 6operations, 10hardware-requests: server for wikimania video transcoding - https://phabricator.wikimedia.org/T106112#1471854 (10RobH) 5Open>3Resolved Task T106563 is for the installation of this host. Please use T106565 for ongoing discussion on the technical implications of the actual video import (since... [18:20:43] jouncebot: next [18:20:43] In 1 hour(s) and 39 minute(s): Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150722T2000) [18:21:11] twentyafterfour: You done with your window? [18:21:17] 6operations: Import Wikimania 2015 Videos - https://phabricator.wikimedia.org/T106565#1471863 (10RobH) [18:21:25] ostriches: yes [18:21:29] Thx [18:21:34] * ostriches takes the deploy baton [18:25:44] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1471867 (10RobH) There is an ongoing discussion on what vlan this can live in, installation cannot progress until resolved. [18:26:26] What goes on in the 'maintenance' RT queue? [18:26:42] notices of carrier outages [18:26:45] vendor outages, etc.. 
[18:27:00] as soon as procurement moves out of rt, we'll move maint at same time [18:27:06] but its easier to let it sit so far ;D [18:27:35] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1471869 (10RobH) a:5RobH>3None [18:30:07] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1471883 (10RobH) I'm happy to install this system once the import process (and thus what vlan this has to sit on) are determined. [18:30:33] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1471886 (10yuvipanda) One option is to have this be a test bed for T95185, other is to make it a regular machine with prod access. [18:30:54] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1471890 (10leila) @fgiunchedi thank you for the follow up. >>! In T106407#1470142, @fgiunchedi wrote: > * As per https://wikitech.wikimedia.org/wiki/Requesting_shell_access we'll need... [18:33:47] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1471907 (10srijan) Yes I can access the machines as given in https://wikitech.wikimedia.org/wiki/Help:Getting_Started#Project_Instances [18:34:29] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Upgrade beta to Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106164#1471915 (10EBernhardson) Upgrade completed, but I'm not sure if this was a reasonable test of freezing/thawing the indexes. I did a reboot of a single machine... [18:35:02] robh, okay, thanks [18:35:21] bblack, hey [18:35:24] you around? [18:38:11] matanya around? [18:38:16] Krenair: he might be on vacation [18:39:10] _joe_: BTW, did you/anyone convert the last few image scalers to HHVM? 
[18:39:13] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1471919 (10yuvipanda) If this requires shell access / scp to terbium for actual upload, then this should just be a machine in prod. Not sure about the security implications of that, h... [18:39:21] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1471921 (10Krenair) dn: uid=srijan,ou=people,dc=wikimedia,dc=org sshPublicKey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDjZ2mI7ERfrBUiVlAF21aR7DrIuQJXsx/YLNkP3h3UEcp298jNl9h+3XTE2YE0T0JYj... [18:39:58] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1471924 (10yuvipanda) @matanya @legoktm would this require scp / rsync access to terbium? [18:40:14] James_F, there was more fun re. those eqiad zend imagescalers today [18:40:24] https://phabricator.wikimedia.org/T84842#1471228 [18:42:13] (03PS2) 10Chad: Configuration for South Azerbaijani Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225835 (https://phabricator.wikimedia.org/T106305) (owner: 10Mjbmr) [18:42:39] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Upgrade beta to Elasticsearch 1.7.0 - https://phabricator.wikimedia.org/T106164#1471929 (10EBernhardson) Based on the server logs (/var/log/elasticsearch/beta-search.log) it looks like all writes were stopped as expected. [18:42:49] (03CR) 10Thcipriani: [C: 032] Configuration for South Azerbaijani Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225835 (https://phabricator.wikimedia.org/T106305) (owner: 10Mjbmr) [18:42:55] (03Merged) 10jenkins-bot: Configuration for South Azerbaijani Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225835 (https://phabricator.wikimedia.org/T106305) (owner: 10Mjbmr) [18:43:36] thcipriani: are you guys creating a new wiki? 
[18:44:03] Mjbmr: yup [18:44:10] !log demon Synchronized database lists: azbwiki++ (duration: 00m 13s) [18:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:44:36] thcipriani: https://gerrit.wikimedia.org/r/225214 must be done first. [18:44:58] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: azbwiki++ [18:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:45:20] Mjbmr: kk, thanks for the heads up [18:45:24] Mjbmr: Doing it now [18:47:57] !log demon Synchronized w/static/images/project-logos/azbwiki.png: azbwiki++ (duration: 00m 12s) [18:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:48:18] !log demon Synchronized wmf-config/InitialiseSettings.php: azbwiki++ (duration: 00m 12s) [18:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:48:43] !log demon Synchronized langlist: azbwiki++ (duration: 00m 12s) [18:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:12] !log demon Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 13s) [18:49:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:51:19] 6operations, 10Deployment-Systems, 7HHVM: HHVM lock-ups - https://phabricator.wikimedia.org/T89912#1471986 (10swtaarrs) I'm not aware of any fixes for this specific issue. I had the original author of StatCache take a look at @BBlack's comments and he said it shouldn't be possible without some kind of memory... [18:51:43] !log demon Started scap: azbwiki namespace stuff [18:51:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:51:58] Mjbmr: Merged and scapping so i18n rebuilds. [18:52:09] ostriches: thanks. 
[18:52:48] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [18:53:08] 6operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1472002 (10RobH) Logging into smsglobal, and looking @ the outgoing reports, shows a lot of yellow (sent, but not confirmed received) and red (failed). Tracking down @fgiunchedi as an example... [18:53:09] ostriches: https://gerrit.wikimedia.org/r/225834 also need to be done and a backport. [18:53:20] Krenair: Fun [18:53:46] 6operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1472004 (10RobH) @Joe: Do you happen to recall which service you tested and liked the best? I'm still trying to find old tasks from then to find my lists. [18:55:14] jzerebecki: you around? [18:55:24] s [18:55:45] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1472015 (10srijan) [18:55:57] jzerebecki: I don't know to make a backport for https://gerrit.wikimedia.org/r/225834 [18:58:55] Mjbmr: after the cherry pick is merged to the wmf branch. run composer update wikibase/wikibase in the wmf branch of wikidata. [19:01:15] Mjbmr: if i'm araound then I can do that (updating wikidata that is). [19:01:44] jzerebecki: can you do the cherry-pick also please? [19:02:26] Mjbmr: yes, but who is going to merge it? [19:02:35] jzerebecki: chad [19:02:40] k [19:02:42] 6operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1472065 (10RobH) seems we already had wikimedia.pagerduty.com. Chase advised that OIT has an account, as well as his own demo account from a few months ago. he is going to track it down and... 
[19:07:19] (03CR) 10Ori.livneh: [C: 032 V: 032] Version update: 7.4.3 -> 7.6.2 [software/sentry] - 10https://gerrit.wikimedia.org/r/225827 (https://phabricator.wikimedia.org/T105374) (owner: 10Gergő Tisza) [19:08:03] (03CR) 10Ori.livneh: [C: 031] Icinga: fix varnishncsa warning on text & mobile caches [puppet] - 10https://gerrit.wikimedia.org/r/226110 (owner: 10Gage) [19:11:34] legoktm: Do you have a minute to talk about extjson and the math extension? [19:11:39] PROBLEM - Apache HTTP on mw1160 is CRITICAL - Socket timeout after 10 seconds [19:11:43] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 and eventlogging for legoktm - https://phabricator.wikimedia.org/T106184#1472087 (10Catrope) Approved [19:11:58] physikerwelt1: sure, what's up? [19:12:08] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1472092 (10srijan) [19:13:07] PROBLEM - Apache HTTP on mw1159 is CRITICAL - Socket timeout after 10 seconds [19:13:15] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1468417 (10srijan) @Krenair I have added a separate key in the task description. [19:13:23] the constants that determine the rendering mode have been around for quite some time. It would be nice if there was a way to register them before the user config i.e. 
LocalSettings.php has been evaluated [19:13:38] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.095 second response time [19:14:06] legoktm: the proposed changeset uses callback which seems to be evaluated only after the user defined settings have been read https://gerrit.wikimedia.org/r/#/c/187654/18/extension.json [19:14:20] (03CR) 10Nikerabbit: [C: 031] Add wgSitename and wgMetaNamespace for pnbwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225322 (owner: 10Amire80) [19:14:58] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.095 second response time [19:15:01] 6operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1472118 (10chasemp) emailed to follow up on reactivating [19:15:59] physikerwelt1: yeah, that's not going to work. The constants could go into the PHP file, so users using that would still get backwards compatability, and people who use wfLoadExtension() would need to use a string like $wgDefaultUserOptions['math'] = 'png' or 'mathml', etc. [19:16:04] ostriches, did you do WikimediaMaintenance/filebackend/setZoneAccess.php ? [19:16:36] Yerp [19:16:56] but just not populateSitesTable? [19:17:16] 6operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1472126 (10RobH) We're getting the old demo account reactivated for testing. [19:18:22] Krenair: I did the foreach wikidataclient.dblist bit by accident before realizing it wouldn't do a whole lot :\ [19:18:25] legoktm: ok there is no hook that is evaluated earlier i.e. right after wfLoadExtension was called [19:18:39] ostriches, hm. why would it not do much? 
[19:19:28] (03PS1) 10Jforrester: Support VisualEditor configuration for new/auto account enablement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226336 [19:19:30] (03PS1) 10Jforrester: Enable VisualEditor for auto-created accounts on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226337 [19:19:32] (03PS1) 10Jforrester: Enable VisualEditor for 5% of new accounts on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226338 [19:20:07] Krenair: It's not in wikidataclient.dblist yet, and needed https://gerrit.wikimedia.org/r/#/c/225834/ I thought [19:20:55] So, all other wikidata clients got a bonus sites table refresh, heh [19:20:58] (03CR) 10Jforrester: [C: 04-1] "Not just yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226338 (owner: 10Jforrester) [19:21:00] Just not the one I wanted! [19:21:08] heh [19:21:14] 6operations, 10Wikimedia-Logstash: reinstall logstash1001-1003 - https://phabricator.wikimedia.org/T97545#1472138 (10RobH) a:5RobH>3Cmjohnson I'm going to re-assign this from myself to @cmjohnson; as it is onsite disk swaps and server relocations followed by reinstallation. Chris: The next steps are outli... [19:21:16] physikerwelt1: nope. One day the configuration will be in the database, where it won't be able to use things like PHP constants [19:21:29] I don't think I even got a chance to review your config addition for azbwiki before you merged it [19:22:01] It was mostly not my patch [19:22:08] All I did was add the wikiversions.json entry [19:22:43] oh, I didn't add it to wikidataclient.dblist sorry :/ [19:22:58] 6operations, 10Analytics-Cluster: Build new latest stable (0.8.2.1?) 
Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1472141 (10Ottomata) 3NEW a:3Ottomata [19:23:02] 6operations, 10ops-codfw, 10hardware-requests, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1472149 (10RobH) a:5RobH>3Papaul Assigning this task to Papaul. @Papaul: Please shut down es2004 and replace the faulty memory. I ordered it in pairs, si... [19:23:10] 6operations, 10ops-codfw, 7Database: Faulty memory on es2004 (purchase one module) - https://phabricator.wikimedia.org/T103843#1472158 (10RobH) [19:23:36] 6operations, 10Analytics-Cluster: Build new latest stable (0.8.2.1?) Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1472162 (10Ottomata) [19:23:47] Mjbmr, ostriches, thcipriani: Please tell me you at least checked this against [[wikitech:Add_a_wiki]]'s list of dblists? [19:25:24] legoktm: ok makes a lot sense. Is there a standard way for string constants? For example what's the canonical translation of MW_MATH_CHECK_ALWAYS to a string constant? (I would use just 'always') [19:25:28] Yeah, which is why it's in s3.dblist, all.dblist, wikipedia.dblist, and small.dblist [19:25:53] securepollglobal.dblist [19:25:58] physikerwelt1: no standard yet, you're the first ;) [19:26:14] Krenair: It's not my first wiki creation you know :) [19:26:56] (03PS1) 10Mjbmr: Add azbwiki to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226343 [19:27:10] 10Ops-Access-Requests, 6operations: access request for server side uploads - https://phabricator.wikimedia.org/T106447#1472164 (10RobH) As @matanya is a volunteer, and thus has no direct manager, its been suggested requests such as this have require @mark to sign off as 'manager.' 
[19:27:21] (03CR) 10KartikMistry: [C: 031] Add wgSitename and wgMetaNamespace for pnbwikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225322 (owner: 10Amire80) [19:28:00] legoktm: in addition if we merge this change it has to be ensured that all the customized language settings will still work as before. In general I'm a little worried that this change could cause a lot of trouble if no serious review happens [19:28:11] is scap done yet? [19:28:28] robh: yt? [19:29:01] 10Ops-Access-Requests, 6operations, 10Analytics-Cluster: Sudo permissions for hdfs user madhuvishy on analytics-hadoop - https://phabricator.wikimedia.org/T104020#1472168 (10RobH) Please note that @madhuvishy already has an existing shell account and has signed L3. As such, this request still requires both... [19:29:25] 6operations: Remove default ciphers in OpenSSL / OpenSSL June security updates - https://phabricator.wikimedia.org/T101082#1472170 (10MoritzMuehlenhoff) 5Open>3Resolved All systems have been updated. [19:30:00] !log updated remaining Ubuntu systems for openssl/export grade update [19:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:17] legoktm: currently a user called paladox is submitting patches and I'm not sure if the user is aware of the possible global implications [19:31:09] physikerwelt1: yes, I don't trust his patches. 
[19:34:09] (03PS8) 10Eevans: WIP: Cassanra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) [19:34:41] !log demon Finished scap: azbwiki namespace stuff (duration: 42m 57s) [19:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:35:02] Mjbmr: Yerp ^ [19:35:19] 'k [19:39:03] 6operations, 6Labs, 10Labs-Infrastructure: rename holmium to labdns1002 - https://phabricator.wikimedia.org/T106303#1472227 (10RobH) [19:39:07] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1472224 (10RobH) 5Open>3Resolved a:3RobH Allocating wmf4575 as labdns1001; will create the tasks for installation and link them. [19:41:11] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labdns1001 - https://phabricator.wikimedia.org/T106584#1472236 (10RobH) 3NEW a:3RobH [19:42:09] PROBLEM - Apache HTTP on mw1160 is CRITICAL - Socket timeout after 10 seconds [19:43:52] just fixed some replication issues on dbstore200[12] [19:44:00] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.090 second response time [19:44:48] 6operations, 10ops-eqiad: server labdns1001/wmf4575 - apply label and update racktables - https://phabricator.wikimedia.org/T106585#1472274 (10RobH) 3NEW a:3Cmjohnson [19:45:06] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1472283 (10mark) Is that all that box will do, backup dns for Labs? 
[19:46:03] (03PS1) 10Yuvipanda: labstore: Add timeout of 5s to urlopen call to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/226349 (https://phabricator.wikimedia.org/T106076) [19:46:16] (03PS3) 10Yuvipanda: labstore: Use safe_load vs load for yaml loading [puppet] - 10https://gerrit.wikimedia.org/r/223074 [19:46:20] 6operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, and 2 others: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1472288 (10awight) [19:46:22] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Use safe_load vs load for yaml loading [puppet] - 10https://gerrit.wikimedia.org/r/223074 (owner: 10Yuvipanda) [19:46:34] 6operations, 6Labs, 10Labs-Infrastructure: rename holmium to labdns1002 - https://phabricator.wikimedia.org/T106303#1472291 (10RobH) [19:46:38] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1472289 (10RobH) 5Resolved>3Open I misread this allocation. I thought it was a replacement, not a backup. This needs more discussion, as Mark's question demonstrates.... [19:46:40] (03PS2) 10Yuvipanda: labstore: Add timeout of 5s to urlopen call to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/226349 (https://phabricator.wikimedia.org/T106076) [19:46:50] (03CR) 10Yuvipanda: [C: 032 V: 032] labstore: Add timeout of 5s to urlopen call to wikitech [puppet] - 10https://gerrit.wikimedia.org/r/226349 (https://phabricator.wikimedia.org/T106076) (owner: 10Yuvipanda) [19:48:26] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1472298 (10RobH) Also, this sits in a public vlan, unlike other labs boxes. Could a simple labs dns system exist in a ganeti vm or does mixing those two stacks seem horrible? 
[19:48:27] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1472301 (10Cmjohnson) [19:48:30] 6operations, 10ops-eqiad: relocate wmf3544 from row d into any other row - https://phabricator.wikimedia.org/T105904#1472299 (10Cmjohnson) 5Open>3Resolved Relocated to c7 ge-7/0/12 updated racktables and switch [19:51:37] PROBLEM - Apache HTTP on mw1159 is CRITICAL - Socket timeout after 10 seconds [19:53:38] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 5.630 second response time [19:55:51] (03PS9) 10Nemo bis: WIP: Cassandra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [19:57:14] ori, hey [19:57:15] around? [19:57:26] 6operations, 10ops-eqiad, 10Traffic, 5Patch-For-Review: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1472353 (10Cmjohnson) Ordered thermal paste and it should be here in 3 days. Let's work on fixing them next week. I will ping you and we go do in c... 
[19:57:37] !log re-attributed edits to User:Mirwin~enwiki (T106069) [19:57:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:57:52] 6operations, 7Monitoring: Switch Icinga from smsglobal to pagerduty - https://phabricator.wikimedia.org/T106589#1472363 (10RobH) 3NEW a:3RobH [19:57:57] yay [19:58:07] 6operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1363592 (10RobH) [19:58:08] 6operations, 7Monitoring: Switch Icinga from smsglobal to pagerduty - https://phabricator.wikimedia.org/T106589#1472373 (10RobH) [19:58:27] mirwin's ancient edits are very interesting; many only available to Meta-Wiki sysops though [19:59:16] 6operations, 7Monitoring: Switch Icinga from smsglobal to pagerduty - https://phabricator.wikimedia.org/T106589#1472363 (10RobH) [19:59:18] 6operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1472390 (10RobH) 5Open>3stalled On the issue of smsglobal failures. I've called and opened a ticket for the issue. However, as their side shows sent, I expect nothing of consequence from... [19:59:23] 6operations: Investigate smsglobal delivery failures from 2015-06-13 weekend - https://phabricator.wikimedia.org/T102396#1472395 (10RobH) p:5High>3Low [20:00:04] gwicke cscott arlolra subbu: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150722T2000). 
[20:00:09] (03PS5) 10Yuvipanda: [WIP] labstore: Rewrite of replica-addusers.pl [puppet] - 10https://gerrit.wikimedia.org/r/223564 [20:02:12] 6operations, 10Fundraising-Backlog, 10MediaWiki-extensions-CentralNotice, 10MediaWiki-extensions-Translate, and 3 others: Publishing translations for central notice banners fails - https://phabricator.wikimedia.org/T104774#1472408 (10DStrine) [20:02:36] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1472419 (10Legoktm) >>! In T106563#1471924, @yuvipanda wrote: > @matanya @legoktm would this require scp / rsync access to terbium? Yes, or any other host that lets you run maintenan... [20:02:41] 6operations, 7Monitoring: Switch Icinga from smsglobal to pagerduty - https://phabricator.wikimedia.org/T106589#1472420 (10RobH) Notes: Pagerduty has an email service, so we can email to email@ourportal.pagerduty.com and it will then follow the escalation path within pagerduty. As such, we'll have to create a... [20:03:34] 6operations, 6Labs, 10Labs-Infrastructure, 10hardware-requests: New server: labdns1001 - https://phabricator.wikimedia.org/T106147#1472428 (10RobH) [20:03:37] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labdns1001 - https://phabricator.wikimedia.org/T106584#1472427 (10RobH) [20:04:06] 6operations, 6Labs, 10Labs-Infrastructure: install/setup labdns1001 - https://phabricator.wikimedia.org/T106584#1472236 (10RobH) [20:04:08] 6operations, 10ops-eqiad: server labdns1001/wmf4575 - apply label and update racktables - https://phabricator.wikimedia.org/T106585#1472432 (10RobH) 5Open>3declined the server allocation is not settled, i created this task prematurely, closing. 
[20:09:46] (03PS10) 10Eevans: Cassanra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) [20:10:08] PROBLEM - Apache HTTP on mw1159 is CRITICAL - Socket timeout after 10 seconds [20:10:24] (03PS1) 1020after4: Check for l10n cache before sync-wikiversions [tools/scap] - 10https://gerrit.wikimedia.org/r/226353 (https://phabricator.wikimedia.org/T100573) [20:11:51] _joe_, did you fix https://phabricator.wikimedia.org/T84842#1471228 ? that URL works for my browser [20:11:59] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.092 second response time [20:19:54] jouncebot: next [20:19:54] In 2 hour(s) and 40 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150722T2300) [20:20:27] I'm going to deploy https://gerrit.wikimedia.org/r/226388 [20:22:20] ostriches, I wonder why the newprojects list message didn't get set [20:22:46] sent* [20:23:47] (03PS11) 10Eevans: Cassanra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) [20:26:22] (03CR) 10Reedy: "Woo :D" (031 comment) [tools/scap] - 10https://gerrit.wikimedia.org/r/226353 (https://phabricator.wikimedia.org/T100573) (owner: 1020after4) [20:26:27] PROBLEM - Apache HTTP on mw1159 is CRITICAL - Socket timeout after 10 seconds [20:26:36] !log twentyafterfour Synchronized php-1.26wmf15/includes/libs/MultiHttpClient.php: Deploy https://gerrit.wikimedia.org/r/#/c/226388/ (duration: 00m 12s) [20:26:40] Reedy, any idea why? [20:26:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:26:54] newproject message? 
[20:27:01] yeah [20:27:03] it's supposed to be 15 minute delayed by default [20:27:07] PROBLEM - Apache HTTP on mw1160 is CRITICAL - Socket timeout after 10 seconds [20:27:08] not anymore [20:27:16] https://gerrit.wikimedia.org/r/#/c/218712/2/addWiki.php https://gerrit.wikimedia.org/r/#/c/218766/2/modules/scap/files/notifyNewProjects [20:28:19] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.109 second response time [20:28:59] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.082 second response time [20:34:47] PROBLEM - Apache HTTP on mw1159 is CRITICAL - Socket timeout after 10 seconds [20:35:05] about to deploy parsoid. there are no fires burning, right? [20:35:33] ostriches, you wouldn't happen to still have the addWiki console output would you? [20:36:38] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.057 second response time [20:37:00] 10Ops-Access-Requests, 6operations: Requesting access to stat1003 for Srijankedia - https://phabricator.wikimedia.org/T106407#1472569 (10leila) @fgiunchedi >>! In T106407#1470142, @fgiunchedi wrote: > * @srijan are you employed by WMF or volunteer? If it is the latter we'll need also https://wikitech.wikimedi... [20:44:21] 6operations: Update people.wikimedia.org with the 2015 Wikimania group photo - https://phabricator.wikimedia.org/T106598#1472598 (10Legoktm) 3NEW [20:45:27] (03PS23) 10BryanDavis: Add role::labs::mediawiki_vagrant [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T106213) [20:45:49] 6operations, 7Easy: Update people.wikimedia.org with the 2015 Wikimania group photo - https://phabricator.wikimedia.org/T106598#1472605 (10Legoktm) [20:48:30] jynus, hi, any updates on the kartatherian? 
I might need to update the config puppet pretty soon btw [20:48:31] YuviPanda: here now for a short period [20:48:59] bd808: bah, needs manual rebase [20:49:37] legoktm already replied on the ticket [20:49:49] matanya: cool, so I guess this needs to be in prod [20:49:56] right [20:49:57] matanya: did you see https://phabricator.wikimedia.org/T106444#1469738 btw? [20:50:17] i have, don't know what to say, it is a question for godog [20:50:41] added him [20:55:39] !log updated Parsoid to version 6befc44e [20:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:57:55] 6operations, 6Labs: lvm 'others20150715' snapshot full on labstore1001 - https://phabricator.wikimedia.org/T106601#1472660 (10yuvipanda) 3NEW [21:00:48] (03PS1) 10Alex Monk: Config for replacement of wgVisualEditorNamespaces with an associative array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226414 [21:08:53] Krenair: It failed at the Cirrus index creation. I live-hacked it and tried re-running it so it would do the newprojects notif. [21:08:55] Guess my hack didn't work [21:09:06] 6operations, 10Continuous-Integration-Infrastructure: Upload new Zuul .deb package on apt.wikimedia.org for precise-wikimedia - https://phabricator.wikimedia.org/T106499#1472709 (10hashar) [21:09:58] 6operations, 10Continuous-Integration-Infrastructure: Upload new Zuul .deb package on apt.wikimedia.org for precise-wikimedia - https://phabricator.wikimedia.org/T106499#1472719 (10hashar) 5Open>3stalled There is some nasty regression in it so marking T106531 as a blocker. Will craft a new package with //... [21:19:28] 6operations, 6Discovery, 10Wikimedia-Logstash, 7Elasticsearch: Update Wikimedia apt repo to include debs for Elasticsearch & Logstash on jessie - https://phabricator.wikimedia.org/T98042#1472781 (10bd808) With T106126, we should be importing https://download.elastic.co/elasticsearch/elasticsearch/elasticse... 
[21:23:01] ostriches, I sent it manually and it seems to have worked [21:23:07] Probably due to missing globals [21:23:12] Hmm [21:23:12] (03PS1) 10Yurik: Change kartotherian config params [puppet] - 10https://gerrit.wikimedia.org/r/226419 [21:23:26] wouldn't have been needed when we ran it as a random script piped straight into mwscript eval.php [21:24:16] I'll submit a patch [21:24:32] (03PS1) 10Yurik: Fixed possible port mistype [puppet] - 10https://gerrit.wikimedia.org/r/226421 [21:26:26] https://azb.wikipedia.org/w/api.php?action=query&list=allusers [21:32:31] 6operations, 7HTTPS: ldap-codfw.wikimedia.org & ldap-eqiad.wikimedia.org expire in September 2015 - https://phabricator.wikimedia.org/T106604#1472814 (10Krenair) [21:32:38] 6operations, 7HTTPS, 7LDAP: ldap-codfw.wikimedia.org & ldap-eqiad.wikimedia.org expire in September 2015 - https://phabricator.wikimedia.org/T106604#1472817 (10Krenair) [21:35:07] ori, hello? [21:35:47] (03CR) 10Jforrester: "I guess this works. Maybe we should set up the $wmgVisualEditorAvailableNamespaces array in InitialiseSettings.php too? Or were you thinki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226414 (owner: 10Alex Monk) [21:36:54] Krenair: hey [21:36:57] what's up? [21:37:28] ori, static logos... how can we force varnish to stop caching old versions of specific updated ones? [21:38:06] restart varnish [21:38:36] Reedy: Thank you for that. :-P [21:38:38] BAN them [21:38:40] (03PS2) 10Alex Monk: Config for replacement of wgVisualEditorNamespaces with an associative array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226414 [21:38:41] np [21:38:42] which ones? [21:38:43] Yeah [21:38:55] (03CR) 10Alex Monk: "I didn't think we'd get anything useful out of that?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226414 (owner: 10Alex Monk) [21:39:00] you can ban specific assets from varnihs [21:39:03] "one off purges" [21:39:35] Reedy, okay... do we have any up-to-date documentation on how we can ban urls? 
[21:39:47] no. which ones do you need banned? [21:39:58] https://wikitech.wikimedia.org/wiki/Varnish#One-off_purges [21:39:59] Or not [21:40:00] hahah [21:40:10] https://lrc.wikipedia.org/static/images/project-logos/lrcwiki.png was updated [21:40:11] "Don't do this." [21:40:32] I think https://lrc.wikipedia.org/w/static/images/project-logos/lrcwiki.png is the newer version [21:43:58] (03PS24) 10BryanDavis: Add role::labs::mediawiki_vagrant [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T106213) [21:44:21] Krenair: how does it look now? [21:44:30] (03CR) 10Jforrester: "> I didn't think we'd get anything useful out of that?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226414 (owner: 10Alex Monk) [21:44:33] (03CR) 10BryanDavis: "PS24 was a manual rebase needed by submodule changes." [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T106213) (owner: 10BryanDavis) [21:44:50] (03CR) 10Yuvipanda: [C: 032] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/193665 (https://phabricator.wikimedia.org/T106213) (owner: 10BryanDavis) [21:49:14] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1472988 (10GWicke) @robh, @fgiunchedi: Sorry if you sent a link already, but is there a procurement ticket for the SSDs? What is the status on that order? [21:50:00] ori, I still see two different versions [21:50:11] not me. clear your browser cache [21:50:48] jgage: yt? [21:52:34] ori, done, and tried a different browser. still different [21:53:19] esams vs eqiad? [21:53:59] Krenair: try one more time [21:54:53] gwicke: are you asking about the replacement ssds or the defective ones? 
[21:54:58] ref: https://phabricator.wikimedia.org/T102557#1472988 [21:55:07] robh: the replacements [21:55:10] both are valid questions mind you, just clarifying =] [21:55:11] Reedy, could be, I'm hitting esams [21:55:21] ahh, so, i can tell you that 3 of the 6 hit the dc last week [21:55:27] although, if the RMAs come back that would also be interesting [21:55:28] and the other 3 were amazon order and had shipping issues [21:55:40] i fixed it today,a nd the other 3 will arrive to chris either tomororw or friday [21:55:45] ori, still an issue, and also I tried curl: Last-Modified: Tue, 16 Jun 2015 16:05:14 GMT [21:55:46] ah, cool! [21:55:58] Krenair: file a task and assign it to me. sorry [21:56:00] baically another newegg order [21:56:03] ok [21:56:11] robh: we are getting closer to re-trying the expansion with smaller instances, so good timing [21:56:12] i'll update task with this info as well, but just wanted to make sure its what you were asking =] [21:56:49] robh: thanks! [21:57:16] 6operations: install / setup tungsten for temp use - wikimania 2015 video transcoding - https://phabricator.wikimedia.org/T106563#1473038 (10RobH) https://rt.wikimedia.org/Ticket/Display.html?id=9473 tracks the ssd order for quantity 6, summary: 3 of the 6 are onsite and the other 3 will arrive on friday (latest... [21:57:17] quite welcome [21:57:58] robh: I think that's a different ticket [21:58:08] arrhghasdfasdjk [21:58:10] i have too many tabs! [21:58:14] https://phabricator.wikimedia.org/T102557 is the C* one [21:58:23] yea, thx for noticing, i didnt ;D [21:59:09] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1473043 (10RobH) https://rt.wikimedia.org/Ticket/Display.html?id=9473 tracks the order. 
summary: 3 of the 6 are @ eqiad, and the other 3 will be onsite either tomorrow or Friday [21:59:12] there we go [22:01:00] 6operations, 7Varnish: Figure out purging of static logo updates - https://phabricator.wikimedia.org/T106620#1473054 (10Krenair) 3NEW a:3ori [22:01:06] (03CR) 10Gilles: "We now have a nice dashboard covering thumbnail generation stats: https://grafana.wikimedia.org/#/dashboard/db/media" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224393 (https://phabricator.wikimedia.org/T105680) (owner: 10Gilles) [22:01:42] 6operations, 7Varnish: Figure out purging of static logo updates - https://phabricator.wikimedia.org/T106620#1473064 (10Krenair) [22:02:20] (03CR) 10Reedy: [C: 032] Add azbwiki to wikidataclient.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226343 (owner: 10Mjbmr) [22:03:49] 6operations, 7Varnish: Figure out purging of static logos for updates - https://phabricator.wikimedia.org/T106620#1473071 (10Krenair) [22:03:57] robh hiyaaa [22:04:08] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:04:13] CRITICAL [22:04:58] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [22:05:10] geez the bot has so many problems and is so drama prone :p [22:05:27] Is jenkins broken? [22:08:16] (03CR) 10Reedy: [V: 032] "Screw you jenkins." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226343 (owner: 10Mjbmr) [22:09:02] !log reedy Synchronized database lists: Add azbwiki to wikidataclient.dblist (duration: 00m 11s) [22:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:09:09] 6operations, 10Analytics-Cluster: Build new latest stable (0.8.2.1?) 
Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1473094 (10Ottomata) [22:09:12] 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1473093 (10Ottomata) [22:09:15] 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1473099 (10Ottomata) [22:09:18] 6operations, 7Easy: Update people.wikimedia.org with the 2015 Wikimania group photo - https://phabricator.wikimedia.org/T106598#1473096 (10Krenair) I'd use https://commons.wikimedia.org/wiki/File:Wikimania_2015_%E2%80%93_Hackathon_group_photo.jpg this year actually, since it's for technical users (I don't thin... [22:09:37] !log running in screen as reedy on tin foreachwikiindblist wikidataclient.dblist extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https [22:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:09:54] Reedy, oh, I think hashar posted something about that to wikitech-l earlier? [22:10:07] I still don't read mailing list :P [22:10:28] (03CR) 10Mobrovac: [C: 04-1] "Nice work! One small comment in-lined." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [22:10:33] Reedy: https://lists.wikimedia.org/pipermail/wikitech-l/2015-July/082503.html [22:10:41] Reedy, https://lists.wikimedia.org/pipermail/wikitech-l/2015-July/082503.html [22:10:44] force-merges can break zuul [22:14:57] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [22:16:07] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. 
[22:16:35] could someone +2 https://mail.google.com/mail/u/0/#inbox/14eb7a73085c7d3c - it should be a noop for the maps [22:16:41] (03CR) 10Alex Monk: [C: 04-1] "Not ready, see task" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225509 (https://phabricator.wikimedia.org/T98625) (owner: 10Alex Monk) [22:17:01] yurik, gmail link? :) [22:17:01] 6operations, 10Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1473136 (10Ottomata) I was able to build a 0.8.1.1 Jessie package using Alex's debianization. I did it by adding the noted missing dependency jars to ext/ and cha... [22:17:12] LOLOLOLOL [22:17:16] sorry [22:17:25] i hope its not a key :) [22:17:54] Krenair, https://gerrit.wikimedia.org/r/226419 [22:18:24] I don't think it is [22:18:27] (03PS2) 10Yuvipanda: Change kartotherian config params [puppet] - 10https://gerrit.wikimedia.org/r/226419 (owner: 10Yurik) [22:18:34] (03CR) 10Yuvipanda: [C: 032 V: 032] Change kartotherian config params [puppet] - 10https://gerrit.wikimedia.org/r/226419 (owner: 10Yurik) [22:18:49] yurik: ^ [22:22:30] legoktm: since I'm the first one to handle the string constant problem I spend a little more work on a solution that should be easy to generalize https://gerrit.wikimedia.org/r/#/c/226436/ [22:23:09] YuviPanda, thx! [22:24:29] YuviPanda, also, there is this patch - but i'm less sure about it - for some reason there is a wrong port number in it. https://gerrit.wikimedia.org/r/#/c/226421/ [22:24:41] any idea why it might be? [22:24:55] (03PS2) 10Yuvipanda: Fixed possible port mistype [puppet] - 10https://gerrit.wikimedia.org/r/226421 (owner: 10Yurik) [22:24:55] akosiaris is afk for the next week [22:25:19] (03CR) 10Yuvipanda: [C: 032 V: 032] "I suspect so too" [puppet] - 10https://gerrit.wikimedia.org/r/226421 (owner: 10Yurik) [22:25:41] thx! 
[22:26:19] * yurik is waiting for caboom [22:26:48] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1473173 (10GWicke) @robh, thanks! @fgiunchedi, @cmjohnson: It might be worth installing and possibly testing those new SSDs in one of the boxes that currently have broken disks.... [22:27:22] (03CR) 10Jforrester: [C: 031] Config for replacement of wgVisualEditorNamespaces with an associative array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226414 (owner: 10Alex Monk) [22:27:47] I filled up the rest of our spaces in swat today, but I have another few things to do so I'll likely end up doing more things afterwards [22:28:19] physikerwelt1: hmm apparently you're using about half a TB on NFS. we don't have any space constraints or anything, but is that all needed atm? [22:28:30] (03CR) 10Alex Monk: "Chris, can you check through this?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225840 (https://phabricator.wikimedia.org/T61702) (owner: 10Alex Monk) [22:28:44] 6operations, 10Analytics-Cluster, 10Fundraising Tech Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1473174 (10awight) @ellery: Ping, we're waiting to see if you think we should care about the statistical differences between udp2log and kafkatee? [22:29:03] matanya: do you still need the files in NFS for the video project? [22:29:32] gwicke: i can add the 2 disks tomorrow morning...which one do you prefer? [22:29:55] cmjohnson: you mean which node? [22:30:11] yes [22:30:23] let me check which one has the intels right now [22:30:33] 1008 has intels right now iirc [22:30:50] yeah, ack [22:31:02] so any of the other two would be good [22:31:20] k [22:31:36] gives us the option to start testing with the intel disks on 1008 once the instance size is down some more [22:31:46] without blocking on the disk test [22:32:09] cmjohnson: thanks! 
[22:32:54] yw...I will do that first thing tomorrow [22:39:16] 6operations, 10CirrusSearch, 6Discovery, 5codfw-rollout: Implement multi-DC support in CirrusSearch - https://phabricator.wikimedia.org/T105709#1473188 (10chasemp) [22:41:38] 6operations, 6Discovery, 5codfw-rollout: Cirrus search in codfw - https://phabricator.wikimedia.org/T105703#1473195 (10chasemp) [22:41:51] 6operations, 10Analytics-Cluster, 10Fundraising Tech Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1473197 (10awight) a:5AndyRussG>3awight [22:44:49] (03PS1) 10Brion VIBBER: Enable 240p Theora and WebM video transcodes for low-bandwidth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226445 (https://phabricator.wikimedia.org/T104063) [22:49:00] greg-g, i need to do some work on the maps cluster, its in prod, but there is no outside access yet, so its not production in that sense [22:49:32] 10Ops-Access-Requests, 6operations: New production ssh key for awight - https://phabricator.wikimedia.org/T106625#1473201 (10awight) 3NEW [22:50:20] greg-g, the updates are done via git deploy start|sync, so i shouldn't conflict with anyone due to git deploy global tin lock [22:52:16] !log populateSitesTable.php finished [22:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:54:08] yurik: kk [22:54:14] yurik: log what you're doing :) [22:54:22] greg-g, oki, will do [22:59:27] (03CR) 10BryanDavis: Cassanra logstash setup (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [23:00:04] RoanKattouw ostriches rmoen Krenair: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150722T2300). Please do the needful. [23:00:05] James_F gilles jdlrobson Krenair: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. 
Please be available during the process. [23:00:05] (03PS12) 10Eevans: Cassanra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) [23:00:21] ok [23:01:02] anyone else want to do that? [23:01:14] * James_F waves. [23:01:49] * RoanKattouw is in a meeting [23:02:22] Is rmoen actively handling these? [23:02:50] (03PS13) 10Eevans: Cassanra logstash setup [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) [23:02:53] ok, I'll do it [23:03:20] Krenair: I'm not [23:04:24] (03CR) 10Eevans: Cassanra logstash setup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [23:04:26] rmoen: weren't you going to start joining in? [23:04:52] greg-g: yes, can I do tomorrow ? [23:04:56] :) of course [23:05:03] VE still requires submodule update so I'll do those two in one go [23:05:05] just curious/misunderstanding [23:05:26] (03CR) 10Eevans: Cassanra logstash setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [23:05:39] Sigh. I guess I should re-add the docs for these. [23:06:07] In case you don't already know: git submodule update has a --recursive option [23:08:05] we're only updating VE-MW, nothing from VE core today [23:09:45] (03CR) 10Mobrovac: Cassanra logstash setup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [23:18:59] PROBLEM - puppet last run on mw2109 is CRITICAL puppet fail [23:19:37] !log krenair Synchronized php-1.26wmf15/extensions/VisualEditor: https://gerrit.wikimedia.org/r/#/c/226447/ (duration: 00m 13s) [23:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:20:10] James_F, ^ [23:20:28] Thanks [23:20:50] Krenair: Can confirm that everything seems fine. 
[23:21:04] thanks [23:21:49] other than jenkins, which is snafu [23:21:53] Yeah. [23:22:06] gilles, hey, around? [23:22:16] Krenair: yes [23:22:55] (03CR) 10Alex Monk: [C: 032] Re-enable thumbnail chaining with a single reference thumbnail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224393 (https://phabricator.wikimedia.org/T105680) (owner: 10Gilles) [23:23:23] (03Merged) 10jenkins-bot: Re-enable thumbnail chaining with a single reference thumbnail [mediawiki-config] - 10https://gerrit.wikimedia.org/r/224393 (https://phabricator.wikimedia.org/T105680) (owner: 10Gilles) [23:24:05] gilles, syncing [23:24:07] please check [23:24:08] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/224393/ (duration: 00m 13s) [23:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:35] !log krenair Synchronized wmf-config/InitialiseSettings-labs.php: https://gerrit.wikimedia.org/r/#/c/224393/ (duration: 00m 12s) [23:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:24:57] Krenair: Worth merging https://gerrit.wikimedia.org/r/#/c/226336/ too? [23:25:16] Krenair: (Empty config for the code we just pushed.) [23:26:00] looks like a no-op, will do with my block of about 5 patches [23:26:39] actually I think that's an underestimate [23:28:13] Kk. [23:28:19] (And yes, it should be a no-op.) [23:28:58] Krenair: checking... [23:29:22] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1473365 (10awight) [23:29:49] 6operations, 6Analytics-Backlog, 6Performance-Team, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1473368 (10ori) To investigate this, @Krinkle and I collected 10 minutes' worth of requests to `poweredby... 
[23:31:45] Krenair: looks good, thank you [23:32:05] gilles, great, np [23:32:20] jdlrobson, hey, around? [23:34:04] (03CR) 10BryanDavis: "We can test this out in the beta cluster via a cherry-pick when you think it is basically done to see what the logstash events look like a" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/226025 (https://phabricator.wikimedia.org/T100970) (owner: 10Eevans) [23:35:01] Hm. Florian isn't around either. [23:35:34] wait, wtf? [23:35:39] looks like this was already done? [23:37:12] 6operations, 6Analytics-Backlog, 6Release-Engineering, 7Varnish: Verify traffic to static resources from past branches does indeed drain - https://phabricator.wikimedia.org/T102991#1473421 (10ori) [23:37:48] Krenair: Think someone +2'ed the extension branch in preparation? [23:38:00] no, I would've noticed it earlier if they'd done so [23:38:13] and would be yelling at twentyafterfour [23:38:18] :p [23:38:30] Hmm. [23:38:43] 6operations, 10CirrusSearch, 6Discovery, 3Discovery-Cirrus-Sprint: Request Elasticsearch hardware for secondary CirrusSearch in codfw - https://phabricator.wikimedia.org/T105707#1473430 (10chasemp) Why 24 boxes here and not 23 or $number? Is there load analysis I can use to justify? I don't understand w... [23:39:39] James_F, 18:08 logmsgbot: twentyafterfour Synchronized php-1.26wmf15/extensions/MobileFrontend/includes/MobileFrontend.hooks.php: deploy https://gerrit.wikimedia.org/r/#/c/226313/ (duration: 00m 13s) [23:39:58] !log deployed kartotherian [23:40:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:40:24] Krenair: Ah. 
[23:40:26] (03CR) 10Alex Monk: [C: 032] Add namespace aliases for shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118654 (https://phabricator.wikimedia.org/T58169) (owner: 10Gerrit Patch Uploader) [23:40:35] (03CR) 10Alex Monk: [C: 032] Use NewUserMessage on gomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225409 (https://phabricator.wikimedia.org/T106169) (owner: 10Alex Monk) [23:40:42] (03CR) 10Alex Monk: [C: 032] Add rollbacker to arwikisource and autopatrolled+rollbacker to arwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225519 (https://phabricator.wikimedia.org/T97271) (owner: 10Alex Monk) [23:40:49] (03Merged) 10jenkins-bot: Add namespace aliases for shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/118654 (https://phabricator.wikimedia.org/T58169) (owner: 10Gerrit Patch Uploader) [23:40:55] (03CR) 10Alex Monk: [C: 032] Revert "Only use the RSS proxy on foundationwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225717 (https://phabricator.wikimedia.org/T90513) (owner: 10Alex Monk) [23:41:04] (03CR) 10Alex Monk: [C: 032] Stop meta bureaucrats from removing sysop or bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225722 (https://phabricator.wikimedia.org/T106291) (owner: 10Alex Monk) [23:41:15] (03Merged) 10jenkins-bot: Use NewUserMessage on gomwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225409 (https://phabricator.wikimedia.org/T106169) (owner: 10Alex Monk) [23:41:40] (03Merged) 10jenkins-bot: Add rollbacker to arwikisource and autopatrolled+rollbacker to arwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225519 (https://phabricator.wikimedia.org/T97271) (owner: 10Alex Monk) [23:42:03] (03Merged) 10jenkins-bot: Revert "Only use the RSS proxy on foundationwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225717 (https://phabricator.wikimedia.org/T90513) (owner: 10Alex Monk) [23:42:25] (03Merged) 10jenkins-bot: Stop meta bureaucrats from 
removing sysop or bureaucrat [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225722 (https://phabricator.wikimedia.org/T106291) (owner: 10Alex Monk) [23:42:27] (03CR) 10Alex Monk: [C: 032] Support VisualEditor configuration for new/auto account enablement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226336 (owner: 10Jforrester) [23:42:55] (03Merged) 10jenkins-bot: Support VisualEditor configuration for new/auto account enablement [mediawiki-config] - 10https://gerrit.wikimedia.org/r/226336 (owner: 10Jforrester) [23:45:47] RECOVERY - puppet last run on mw2109 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:47:16] 6operations, 10ops-eqiad, 10Analytics-Cluster, 5Patch-For-Review: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1473511 (10Cmjohnson) 1042-1045 have base install w/out puppet certs. [23:47:20] !log krenair Synchronized wmf-config/CommonSettings.php: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=171578&oldid=171570 (duration: 00m 12s) [23:47:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:47:47] !log krenair Synchronized wmf-config/InitialiseSettings.php: https://wikitech.wikimedia.org/w/index.php?title=Deployments&diff=171578&oldid=171570 (duration: 00m 12s) [23:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:48:19] 6operations, 6Discovery, 10Maps: Grant log file access to Yurik & Maxsem on maps-test200{1-4} - https://phabricator.wikimedia.org/T106629#1473520 (10Yurik) p:5Triage>3High [23:48:48] 6operations, 6Discovery, 10Maps, 3Discovery-Maps-Sprint: Grant log file access to Yurik & Maxsem on maps-test200{1-4} - https://phabricator.wikimedia.org/T106629#1473526 (10MaxSem) [23:49:07] hm, warning spam [23:49:12] about portals [23:50:06] could someone set right perms on https://phabricator.wikimedia.org/T106629 -- i can't see why my service is not starting [23:50:18] thx :) 
[23:51:17] !log krenair Synchronized wmf-config/InitialiseSettings.php: partially revert https://gerrit.wikimedia.org/r/#/c/118654/ (duration: 00m 12s) [23:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:51:30] okay, that cut down on the warnings [23:51:43] still one going through [23:52:16] actually, this one makes no sense [23:54:30] 6operations, 10CirrusSearch, 6Discovery, 5codfw-rollout: Implement multi-DC support in CirrusSearch - https://phabricator.wikimedia.org/T105709#1473600 (10EBernhardson) [23:54:57] !log krenair Synchronized wmf-config/InitialiseSettings.php: re-try reverted portion of https://gerrit.wikimedia.org/r/#/c/118654/ using NS IDs instead of not-necessarily-defined constants which were causing warning flood (duration: 00m 13s) [23:55:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:57:06] that looks better [23:58:37] (03CR) 10Alex Monk: [C: 032] Remove old VE namespaces on testwiki and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225541 (https://phabricator.wikimedia.org/T92797) (owner: 10Alex Monk) [23:59:01] (03Merged) 10jenkins-bot: Remove old VE namespaces on testwiki and mediawikiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/225541 (https://phabricator.wikimedia.org/T92797) (owner: 10Alex Monk)