[00:00:48] (03PS3) 1020after4: Move maniphest status settings into custom/wmf-defaults.php [puppet] - 10https://gerrit.wikimedia.org/r/205797 (https://phabricator.wikimedia.org/T548) [00:04:14] mhh gwicke urandom restbase p99 skyrocketed at 19.00 UTC :| http://grafana.wikimedia.org/#/dashboard/db/restbase [00:04:24] PROBLEM - Disk space on analytics1021 is CRITICAL: DISK CRITICAL - free space: / 1068 MB (3% inode=91%) [00:05:09] godog: looking [00:05:47] hmm i'll take a look at analytics1021 [00:05:52] mean also increased significantly [00:06:16] yeah I just noticed by looking at icinga [00:07:19] the C* metrics look fairly normal [00:09:09] godog: that was not long after the downgrade to 2.1.3 [00:09:28] the restart for that started at 18:31 [00:09:40] precisely [00:12:17] logstash for rb shows spikes of logs upon restarts, but back to normal levels after https://logstash.wikimedia.org/#dashboard/temp/pBtal9lZQUaKbSBfea2IMA [00:12:17] 6operations, 5Patch-For-Review: Install fonts-wqy-zenhei on all mediawiki app servers - https://phabricator.wikimedia.org/T84777#1394854 (10Dzahn) it was reverted in https://gerrit.wikimedia.org/r/#/c/219297/ [00:12:37] 6operations, 5Patch-For-Review: Install fonts-wqy-zenhei on all mediawiki app servers - https://phabricator.wikimedia.org/T84777#1394855 (10Dzahn) a:5Dzahn>3None [00:12:47] godog: yeah, same story from the 5xx metrics [00:13:16] and the C* logs are also looking normal [00:14:41] (03CR) 10Dzahn: "it exists in trusty: http://packages.ubuntu.com/trusty/fonts-wqy-zenhei" [puppet] - 10https://gerrit.wikimedia.org/r/219297 (owner: 10Ori.livneh) [00:14:52] there were some compaction fixes in 2.1.7, but this a larger change in latencies than I'd expect from that [00:14:53] !log experimenting with HHVM shutdown via /stop on the admin server on mw1041 [00:14:58] Logged the message, Master [00:16:05] (03CR) 10Dzahn: "where did we get these errors?" 
[puppet] - 10https://gerrit.wikimedia.org/r/219297 (owner: 10Ori.livneh) [00:18:23] godog: let me perform a rolling restart of the restbase instances, to rule out that it's one worker with a semi-hanging driver connection to cassandra [00:18:44] PROBLEM - HHVM rendering on mw1041 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [00:18:54] yep [00:19:00] analytics1021 disk space warning: i emptied kafkaServer-gc.log, will discuss recent increase in disk use with ottomata in our meeting tomorrow morning [00:19:15] PROBLEM - Apache HTTP on mw1041 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.018 second response time [00:19:48] !log rolling restart of restbase instances to rule out backend connections as a source for high p99 latencies [00:19:50] Woo, Pygments live on mediawiki.org [00:19:52] Logged the message, Master [00:19:58] 6operations: Move static-bugzilla from zirconium to ganeti - https://phabricator.wikimedia.org/T101734#1394861 (10Dzahn) [00:20:44] RECOVERY - HHVM rendering on mw1041 is OK: HTTP OK: HTTP/1.1 200 OK - 66010 bytes in 0.127 second response time [00:21:14] RECOVERY - Apache HTTP on mw1041 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [00:21:18] (03PS1) 10BBlack: Merge branch 'master' into wmf [software/nginx] (wmf) - 10https://gerrit.wikimedia.org/r/220356 [00:21:20] (03PS1) 10BBlack: Apply ssl_stapling_file fix, re-ported [software/nginx] (wmf) - 10https://gerrit.wikimedia.org/r/220357 [00:21:36] (03CR) 10BBlack: [C: 032 V: 032] Merge branch 'master' into wmf [software/nginx] (wmf) - 10https://gerrit.wikimedia.org/r/220356 (owner: 10BBlack) [00:21:46] (03CR) 10BBlack: [C: 032 V: 032] Apply ssl_stapling_file fix, re-ported [software/nginx] (wmf) - 10https://gerrit.wikimedia.org/r/220357 (owner: 10BBlack) [00:23:42] godog: restart is done [00:24:29] but, based on the distribution of timeout reports in logstash I'm 
sceptical about the single-working-connection theory [00:26:18] interestingly, http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-cf-latencyrate does not show much of a change [00:27:44] hard to tell from those though, being from cassandra they started flowing again after the restart [00:28:11] if you look at the 24 hour view you see the numbers from 2.1.7 as well [00:28:28] none of those are anywhere near 5000 [00:29:49] ah yeah that's true [00:30:58] 6operations, 10vm-requests, 5Patch-For-Review: EQIAD: 1 VM request for planet - https://phabricator.wikimedia.org/T101899#1394873 (10Dzahn) getting IP now, installer starts, getting console. problem then was after installer is finished and instance reboots it goes into a cycle and PXE boots again and again... [00:35:53] 10Ops-Access-Requests, 6operations, 10SEO: GWT accounts - https://phabricator.wikimedia.org/T103567#1394889 (10Dzahn) [00:36:13] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 7 below the confidence bounds [00:36:49] 10Ops-Access-Requests, 6operations, 10SEO: GWT accounts - https://phabricator.wikimedia.org/T103567#1394891 (10Dzahn) @chasemp is the option "Access Request" from the Security menu the appropriate choice here? 
[00:39:19] (03PS1) 10Mjbmr: Update the logo of lrcwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220358 [00:40:08] 6operations, 6Labs, 10wikitech.wikimedia.org, 7Ipv6: Set IPv6 PTR for wikitech-static - https://phabricator.wikimedia.org/T103621#1394899 (10Dzahn) login at https://mycloud.rackspace.com/ (somehow) [00:43:06] 6operations, 10SEO: GWT accounts - https://phabricator.wikimedia.org/T103567#1394907 (10Dzahn) [00:43:31] !log experimenting with httpd on mw1041 again [00:43:35] Logged the message, Master [00:45:24] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 7 below the confidence bounds [00:47:27] godog: our config specifies pretty much immediate retry [00:47:37] and default is one retry at most [00:50:21] 10Ops-Access-Requests, 6operations, 7LDAP: Request "wmf" group assignments for account "sniedzielski" - https://phabricator.wikimedia.org/T103191#1394928 (10Dzahn) @Niedzielski Thanks for understanding. Sorry, it was our fault to do it wrong in the first place. 
[00:51:09] 10Ops-Access-Requests, 6operations, 6Discovery, 10SEO, 3Discovery-Analysis-Sprint: Get Oliver Keyes access to Google Webmaster Tools for all Wikimedia domains - https://phabricator.wikimedia.org/T101157#1394930 (10Dzahn) [00:52:47] 10Ops-Access-Reviews, 6operations: Review access to sodium for John Lewis - https://phabricator.wikimedia.org/T102124#1394935 (10Dzahn) 5Open>3declined a:3Dzahn [00:52:48] 10Ops-Access-Requests, 6operations: Requesting access to sodium for John Lewis - https://phabricator.wikimedia.org/T102075#1394937 (10Dzahn) [00:53:11] gwicke: heh with that pattern jumping the p99 I can't think of anything except a timeout [00:53:56] (03PS3) 10Dzahn: labstore: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/219070 [00:54:42] (03CR) 10Dzahn: [C: 032] labstore: switch to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/219070 (owner: 10Dzahn) [00:57:10] godog: it's definitely the driver timeout [00:57:52] the value matches what we saw for 5xx p99 before the upgrade [00:58:31] so it basically means that ~1% of requests are now succeeding on second try [01:01:10] now restarting the cassandra instances to rule that out as well [01:01:47] !log rolling restart of cassandra instances to rule out a single node in funky state causing elevated p99 latency [01:01:52] Logged the message, Master [01:13:17] restart has finished now [01:15:56] godog: looks promising [01:16:15] p99 of 1000 requests from tin is 46ms now [01:16:47] was 1209 before [01:16:56] yeah the mean was cropping up as well and going back down now [01:17:24] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [01:27:01] gwicke: still not seeing p99 decreasing in grafana tho [01:27:24] yeah, 4xx p99 has moved, but 2xx not yet [01:28:03] (03PS5) 10Wpmirrordev: Extend maximum allowed mediawiki version to 1.26 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/171976 [01:30:55] and 4xx p99 has pretty much reverted now [01:31:41] 
godog: to me it looks like nothing is too badly on fire right now; the latency isn't great, but there is a chance that it'll fix itself after a while [01:33:41] if it doesn't fix itself until tomorrow, then we could consider moving back to 2.1.7 [01:33:55] despite the lack of reliable metrics [01:34:51] to me no metrics is a non-starter btw [01:35:26] we still have the critical request metrics [01:35:42] the ones we actually have the alerts on [01:36:36] as well as host metrics and nodetool [01:39:16] and no way to look back for sure at what was happening to cassandra say, 10m ago [01:39:41] yeah, not internals [01:39:50] only bottom line, IO, network [01:39:51] !log ori Synchronized php-1.26wmf11/extensions/SyntaxHighlight_GeSHi: I0e5f2d3b2 (duration: 00m 13s) [01:39:56] Logged the message, Master [01:40:18] godog: http://underscoopfire.com/wp-content/uploads/2011/01/zack-morris.jpg [01:41:02] ori: yeah we've had plenty of cassandra/restbase timeouts [01:41:58] we are trying to lower the load per instance, but hardware is not our friend [01:47:45] PROBLEM - puppet last run on wtp2001 is CRITICAL puppet fail [01:52:25] (03PS1) 10Ori.livneh: Move performance.wikimedia.org out of operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/220369 (https://phabricator.wikimedia.org/T101974) [01:52:38] (03CR) 10Ori.livneh: [C: 032] Move performance.wikimedia.org out of operations/puppet [puppet] - 10https://gerrit.wikimedia.org/r/220369 (https://phabricator.wikimedia.org/T101974) (owner: 10Ori.livneh) [01:59:16] (03PS1) 10Ori.livneh: Set www-data as owner of performance docroot [puppet] - 10https://gerrit.wikimedia.org/r/220371 [01:59:42] (03CR) 10Ori.livneh: [C: 032 V: 032] Set www-data as owner of performance docroot [puppet] - 10https://gerrit.wikimedia.org/r/220371 (owner: 10Ori.livneh) [02:03:56] (03PS1) 10Ori.livneh: Use Apache 2.4 grant syntax for perf.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/220372 [02:04:16] (03CR) 10Ori.livneh: [C: 032 V: 032] Use 
Apache 2.4 grant syntax for perf.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/220372 (owner: 10Ori.livneh) [02:05:25] ori: \o/ thanks [02:05:49] paravoid: (1) I still need to do the same for xenon [02:05:52] (2) You need to go to SLEEP [02:06:08] it's 5 AM! [02:07:32] bblack: I happened to spot https://gerrit.wikimedia.org/r/220297 today and all I kept shouting in my head was "now you have two problems." [02:07:37] :P [02:09:14] RECOVERY - puppet last run on wtp2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:10:33] Katie: you missed the other 3-4 related patches before it. I think we went from about 22 problems to 67 problems or so :) [02:10:47] Heh. [02:11:42] help rid our world of regexes: clean up the public canonical URI-space we expose and honor in MediaWiki :) [02:13:40] My mind is still boggling a bit at the former logic of basically reusing the request URI. Ori's change looks really good, though. I tested it a bit today. [02:14:07] yeah I like it too [02:24:42] !log l10nupdate Synchronized php-1.26wmf10/cache/l10n: (no message) (duration: 07m 21s) [02:24:55] Logged the message, Master [02:28:16] !log LocalisationUpdate completed (1.26wmf10) at 2015-06-24 02:28:16+00:00 [02:28:21] Logged the message, Master [02:31:04] PROBLEM - Disk space on analytics1021 is CRITICAL: DISK CRITICAL - free space: / 1068 MB (3% inode=91%) [02:54:49] !log l10nupdate Synchronized php-1.26wmf11/cache/l10n: (no message) (duration: 10m 34s) [02:54:56] Logged the message, Master [03:00:45] !log LocalisationUpdate completed (1.26wmf11) at 2015-06-24 03:00:45+00:00 [03:00:50] Logged the message, Master [03:14:49] (03PS1) 10BBlack: ciphersuites: re-order ECDSA ahead of RSA [puppet] - 10https://gerrit.wikimedia.org/r/220377 [03:16:21] Katie: https://phabricator.wikimedia.org/rMW8d9243cf34f1f9ffa3be145349e4c6edae4a5b7a is a bit strange [03:19:21] current google advice also mentions using absolutes, but probably more to avoid silly misconfig on our 
end than due to requiring it on theirs: https://support.google.com/webmasters/answer/139066?hl=en [03:19:41] yeah i'm going to revert it [03:19:43] personally, I think the absolutes are well worth the few extra bytes to avoid ambiguity [03:20:48] that same link also says "A server-side 301 redirect is the best way to ensure that users and search engines are directed to the correct page" [03:21:41] (so I still think it makes sense, in addition to all these other protections, that at least for view fetches, we might do "if request.uri != internalcanonicaluri then redirect") [03:30:52] (the debatable part is who knows if it really strengthens the signal over rel=canonical which seems to be working fine, and wherever those links do exist it adds an extra reload -> wasted time/reqs) [04:15:26] (03PS1) 10KartikMistry: Enable ContentTranslation in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220385 (https://phabricator.wikimedia.org/T103639) [04:21:04] PROBLEM - puppet last run on mc2007 is CRITICAL puppet fail [04:36:15] (03CR) 10Dzahn: [C: 031] "https://blog.cloudflare.com/ecdsa-the-digital-signature-algorithm-of-a-better-internet/" [puppet] - 10https://gerrit.wikimedia.org/r/220377 (owner: 10BBlack) [04:36:45] RECOVERY - puppet last run on mc2007 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [05:02:54] RECOVERY - Disk space on analytics1021 is OK: DISK OK [05:03:28] !log removed old logs and did 'apt-get clean' on analytics1021 to make space [05:03:33] Logged the message, Master [05:09:07] (03PS1) 10Springle: repool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220397 [05:10:24] (03CR) 10Springle: [C: 032] repool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220397 (owner: 10Springle) [05:10:29] (03Merged) 10jenkins-bot: repool db1045 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220397 (owner: 10Springle) [05:10:54] PROBLEM - Hadoop NodeManager on analytics1020 is CRITICAL: PROCS 
CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [05:12:37] !log springle Synchronized wmf-config/db-eqiad.php: repool db1045 (duration: 00m 13s) [05:12:41] Logged the message, Master [05:17:50] bblack: https://gerrit.wikimedia.org/r/#/c/220398/ [05:19:41] ^ +1, and I'm out for the nite. cya :) [05:19:48] bye! [05:25:04] RECOVERY - Hadoop NodeManager on analytics1020 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [05:27:35] <_joe_> good night brandon [05:46:32] !log LocalisationUpdate ResourceLoader cache refresh completed at Wed Jun 24 05:46:32 UTC 2015 (duration 46m 31s) [05:46:37] Logged the message, Master [05:51:41] (03PS2) 10Ori.livneh: Use cronolog and logrotate to avoid Puppetmaster Apache reloads [puppet] - 10https://gerrit.wikimedia.org/r/219788 [05:51:56] (03PS3) 10Ori.livneh: Use cronolog and logrotate to avoid Puppetmaster Apache reloads [puppet] - 10https://gerrit.wikimedia.org/r/219788 [05:52:02] (03CR) 10Ori.livneh: [C: 032 V: 032] Use cronolog and logrotate to avoid Puppetmaster Apache reloads [puppet] - 10https://gerrit.wikimedia.org/r/219788 (owner: 10Ori.livneh) [05:56:11] _joe_: seems to work [05:58:50] (03CR) 10Matanya: "Totally fine by me. I will just verify with Aaron." 
[puppet] - 10https://gerrit.wikimedia.org/r/218905 (owner: 10Matanya) [06:00:35] PROBLEM - puppet last run on rdb2001 is CRITICAL Puppet has 1 failures [06:01:44] PROBLEM - puppet last run on cp3048 is CRITICAL Puppet has 1 failures [06:01:44] PROBLEM - puppet last run on cp3007 is CRITICAL Puppet has 1 failures [06:02:04] PROBLEM - puppet last run on cp4011 is CRITICAL Puppet has 1 failures [06:02:13] PROBLEM - puppet last run on cp3038 is CRITICAL Puppet has 1 failures [06:02:24] PROBLEM - puppet last run on cp4016 is CRITICAL Puppet has 1 failures [06:02:34] PROBLEM - puppet last run on cp1053 is CRITICAL Puppet has 1 failures [06:02:35] PROBLEM - puppet last run on cp1068 is CRITICAL Puppet has 1 failures [06:02:53] PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 1 failures [06:03:44] PROBLEM - puppet last run on db2058 is CRITICAL Puppet has 1 failures [06:04:04] PROBLEM - puppet last run on db2047 is CRITICAL Puppet has 1 failures [06:04:33] PROBLEM - puppet last run on lvs1003 is CRITICAL Puppet has 1 failures [06:04:34] PROBLEM - puppet last run on es2010 is CRITICAL Puppet has 1 failures [06:04:34] PROBLEM - puppet last run on mw1236 is CRITICAL Puppet has 1 failures [06:04:43] PROBLEM - puppetmaster https on labcontrol1001 is CRITICAL - Socket timeout after 10 seconds [06:04:54] PROBLEM - puppet last run on mw1255 is CRITICAL Puppet has 1 failures [06:05:13] PROBLEM - puppet last run on mw2110 is CRITICAL Puppet has 1 failures [06:05:46] PROBLEM - puppet last run on mw2196 is CRITICAL Puppet has 1 failures [06:05:46] PROBLEM - puppet last run on mw2024 is CRITICAL Puppet has 1 failures [06:05:53] PROBLEM - puppet last run on mw1199 is CRITICAL Puppet has 1 failures [06:16:14] RECOVERY - puppet last run on db2058 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:16:33] RECOVERY - puppet last run on rdb2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:16:33] RECOVERY - puppet last run 
on cp4016 is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:16:44] RECOVERY - puppet last run on cp1068 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:16:44] RECOVERY - puppet last run on cp1053 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:16:44] RECOVERY - puppet last run on lvs1003 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:16:54] RECOVERY - puppet last run on mw1236 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [06:16:54] RECOVERY - puppet last run on es2010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:16:55] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 0 seconds ago with 0 failures [06:17:14] RECOVERY - puppet last run on mw1255 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [06:17:27] (03PS1) 10Ricordisamoa: Enable the SandboxLink extension on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220408 (https://phabricator.wikimedia.org/T103643) [06:17:33] RECOVERY - puppet last run on mw2110 is OK Puppet is currently enabled, last run 4 seconds ago with 0 failures [06:17:43] RECOVERY - puppet last run on cp3007 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:17:43] RECOVERY - puppet last run on cp3048 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:18:03] RECOVERY - puppet last run on cp4011 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:18:03] RECOVERY - puppet last run on mw2196 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:18:04] RECOVERY - puppet last run on mw2024 is OK Puppet is currently enabled, last run 26 seconds ago with 0 failures [06:18:04] RECOVERY - puppet last run on cp3038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:18:04] RECOVERY 
- puppet last run on mw1199 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:18:14] RECOVERY - puppet last run on db2047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:07] <_joe_> ori: uhm except it doesn't [06:51:41] (03PS15) 10Giuseppe Lavagetto: varnish: add generation of the dynamic list of directors [puppet] - 10https://gerrit.wikimedia.org/r/217818 (https://phabricator.wikimedia.org/T97975) [06:52:42] <_joe_> I'm going to verify this is effectively a noop, then merge it ^^ [07:04:33] (03PS16) 10Giuseppe Lavagetto: varnish: add generation of the dynamic list of directors [puppet] - 10https://gerrit.wikimedia.org/r/217818 (https://phabricator.wikimedia.org/T97975) [07:04:50] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] "Let's go" [puppet] - 10https://gerrit.wikimedia.org/r/217818 (https://phabricator.wikimedia.org/T97975) (owner: 10Giuseppe Lavagetto) [07:22:57] (03PS1) 10Giuseppe Lavagetto: ganglia: eqiad defaults to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/220411 [07:23:37] 6operations, 10Deployment-Systems, 10RESTBase, 6Release-Engineering, 6Services: Get ops feedback regarding the use of SSH for deployment system control channel. - https://phabricator.wikimedia.org/T102687#1395514 (10MoritzMuehlenhoff) >>! In T102687#1381061, @fgiunchedi wrote: > I can see ssh working, in... [07:27:25] 6operations, 10Deployment-Systems, 10RESTBase, 6Release-Engineering, 6Services: Get ops feedback regarding the use of SSH for deployment system control channel. - https://phabricator.wikimedia.org/T102687#1395521 (10Joe) I strongly oppose to using mcollective, FWIW. I tried it in the past and it's way w... 
[07:29:57] (03PS2) 10Giuseppe Lavagetto: ganglia: eqiad defaults to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/220411 [07:31:21] (03CR) 10Giuseppe Lavagetto: [C: 032] ganglia: eqiad defaults to ganglia_new [puppet] - 10https://gerrit.wikimedia.org/r/220411 (owner: 10Giuseppe Lavagetto) [07:38:52] (03PS1) 10Giuseppe Lavagetto: varnish: enable confd of a few codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/220415 [07:39:48] (03CR) 10Giuseppe Lavagetto: [C: 032] varnish: enable confd of a few codfw hosts [puppet] - 10https://gerrit.wikimedia.org/r/220415 (owner: 10Giuseppe Lavagetto) [07:42:35] <_joe_> uhm a few puppet failures incoming :( [07:44:25] <_joe_> nothing tragic though [07:44:38] (03PS1) 10Giuseppe Lavagetto: confd: create sub-directories [puppet] - 10https://gerrit.wikimedia.org/r/220416 [07:45:01] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] confd: create sub-directories [puppet] - 10https://gerrit.wikimedia.org/r/220416 (owner: 10Giuseppe Lavagetto) [07:45:24] PROBLEM - puppet last run on cp2005 is CRITICAL Puppet has 2 failures [07:46:30] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/220098 (https://phabricator.wikimedia.org/T103491) (owner: 10Hashar) [07:47:14] RECOVERY - puppet last run on cp2005 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [07:57:25] good morning [07:59:29] morning [08:03:37] (03PS1) 10Giuseppe Lavagetto: varnish: fix directors template [puppet] - 10https://gerrit.wikimedia.org/r/220420 [08:05:41] PROBLEM - confd service on cp2002 is CRITICAL: NRPE_CHECK_SYSTEMD_STATE CRITICAL - Service is in state activating [08:05:52] PROBLEM - confd service on cp2003 is CRITICAL: NRPE_CHECK_SYSTEMD_STATE CRITICAL - Service is in state activating [08:05:52] PROBLEM - confd service on cp2004 is CRITICAL: NRPE_CHECK_SYSTEMD_STATE CRITICAL - Service is in state activating [08:05:55] (03CR) 10Giuseppe Lavagetto: [C: 032] varnish: fix directors template [puppet] - 10https://gerrit.wikimedia.org/r/220420 (owner: 10Giuseppe Lavagetto) [08:06:02] PROBLEM - confd service on cp2001 is CRITICAL: NRPE_CHECK_SYSTEMD_STATE CRITICAL - Service is in state activating [08:09:38] <_joe_> this is known as well ^^ [08:09:52] <_joe_> I need to add the SRV records in codfw as well [08:09:58] is confd related to etcd ? [08:10:06] <_joe_> so it's kinda expected [08:10:08] <_joe_> hashar: yes [08:10:49] sounds like puppet templates + notify => Service['foo'] [08:10:49] <_joe_> it's a daemon that watches etcd for changes in keys and generates files based on templates [08:10:58] <_joe_> yeah something like that [08:11:13] that is solely for varnish right ? or do you guys plan to extend its use to other areas? 
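The behaviour _joe_ describes at 08:10:49 (watch keys for changes, regenerate a file from a template) can be sketched roughly as below. This is only an illustration under stated assumptions, not confd's actual implementation: the `DIRECTOR_TEMPLATE` text, the `pool` schema (hostname mapped to a `pooled` flag) and the `Watcher` class are all hypothetical, and a plain dict stands in for etcd.

```python
from string import Template

# Hypothetical template; the real confd/varnish templates are not shown in the log.
DIRECTOR_TEMPLATE = Template("director $name {\n$backends}\n")


def render_directors(name, pool):
    """Render a config fragment listing only the pooled hosts.

    `pool` maps hostname -> {"pooled": bool}; this schema is an assumption.
    """
    backends = "".join(
        f"  backend {host};\n"
        for host, state in sorted(pool.items())
        if state["pooled"]
    )
    return DIRECTOR_TEMPLATE.substitute(name=name, backends=backends)


class Watcher:
    """Re-render on every change notification; report whether the output differs."""

    def __init__(self, name):
        self.name = name
        self.last_rendered = None

    def on_change(self, pool):
        rendered = render_directors(self.name, pool)
        changed = rendered != self.last_rendered
        self.last_rendered = rendered
        # A real daemon would write the file and reload the service only when changed.
        return changed, rendered
```

Comparing the rendered output with the previous one is what would let such a daemon skip a service reload when a key changes without affecting the generated file.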
[08:11:38] <_joe_> hashar: varnish and pybal for now [08:11:45] \O/ [08:12:00] <_joe_> so yes, we have a way to programmatically pool/depool services [08:12:12] PROBLEM - confd service on cp2005 is CRITICAL: NRPE_CHECK_SYSTEMD_STATE CRITICAL - Service is in state activating [08:12:19] <_joe_> https://wikitech.wikimedia.org/wiki/Conftool [08:18:14] (03PS1) 10Jcrespo: Repool es1004 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220422 [08:19:13] (03CR) 10Jcrespo: [C: 032] Repool es1004 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220422 (owner: 10Jcrespo) [08:19:36] _joe_: thanks for the link. I have slightly enhanced the presentation :D [08:19:38] (03PS1) 10Giuseppe Lavagetto: etcd: add client SRV records to all datacenters [dns] - 10https://gerrit.wikimedia.org/r/220423 [08:19:52] <_joe_> hashar: thanks a lot [08:20:15] if you ever have some spare time, the etcd article is currently a stub :D https://wikitech.wikimedia.org/wiki/Etcd [08:20:50] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: add client SRV records to all datacenters [dns] - 10https://gerrit.wikimedia.org/r/220423 (owner: 10Giuseppe Lavagetto) [08:23:03] <_joe_> yes I know [08:23:07] <_joe_> thank you for that too [08:23:30] <_joe_> hashar: actually, next quarter I should close up a few things regarding this project, docs is one of them [08:23:50] we all have the same problem [08:24:02] I have been procrastinating on writing the nodepool doc for a couple weeks already :-(((( [08:24:05] https://wikitech.wikimedia.org/wiki/Nodepool <-- lame [08:27:06] !log jynus Synchronized wmf-config/db-eqiad.php: Repool es1004 (duration: 00m 14s) [08:27:11] Logged the message, Master [08:38:30] (03CR) 10Lokal Profil: "Thanks for the feedback. I'll implement the changes and push a new patch." 
[puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil) [08:38:59] 6operations, 10Deployment-Systems, 10RESTBase, 6Release-Engineering, 6Services: Get ops feedback regarding the use of SSH for deployment system control channel. - https://phabricator.wikimedia.org/T102687#1395679 (10mobrovac) >>! In T102687#1395514, @MoritzMuehlenhoff wrote: > How many services are we ta... [08:43:50] (03PS1) 10Muehlenhoff: Add cdbs to default packages [puppet] - 10https://gerrit.wikimedia.org/r/220427 [08:46:41] RECOVERY - confd service on cp2004 is OK: NRPE_CHECK_SYSTEMD_STATE OK - Service confd is in the desired state (active - running) [08:46:52] RECOVERY - confd service on cp2001 is OK: NRPE_CHECK_SYSTEMD_STATE OK - Service confd is in the desired state (active - running) [08:47:24] <_joe_> \o/ [08:47:30] (03CR) 10Addshore: [C: 04-1] Add Phragile module. (0310 comments) [puppet] - 10https://gerrit.wikimedia.org/r/218930 (https://phabricator.wikimedia.org/T101235) (owner: 10Jakob) [08:47:42] RECOVERY - confd service on cp2005 is OK: NRPE_CHECK_SYSTEMD_STATE OK - Service confd is in the desired state (active - running) [08:48:12] RECOVERY - confd service on cp2002 is OK: NRPE_CHECK_SYSTEMD_STATE OK - Service confd is in the desired state (active - running) [08:48:22] RECOVERY - confd service on cp2003 is OK: NRPE_CHECK_SYSTEMD_STATE OK - Service confd is in the desired state (active - running) [08:52:18] _joe_: does it make sense to add es to conftool ? [08:52:52] <_joe_> es? [08:52:59] the bd's [08:53:02] *db's [08:53:08] <_joe_> not now [08:53:14] in general [08:54:06] what exactly? [08:54:18] jynus Synchronized wmf-config/db-eqiad.php: Repool es1004 [08:54:32] this is a good example ^ ? [08:55:48] I would say no, at least not now [08:56:05] care reasoning ? 
[08:56:30] it may be changed to another system in the future [08:57:19] thanks [08:57:57] <_joe_> matanya: we have to think carefully about a model for that, for now it's just gonna be pybal and varnish [08:58:10] ok, makes sense [08:58:25] there were suggestions about proxies/pool of connections [09:00:02] <_joe_> jynus: yeah there are a ton of things we need to consider before changing anything [09:00:05] also think that all of those are above the application [09:00:14] dbs are below [09:00:27] <_joe_> but the idea of having mediawiki's "state" managed via conftool is tempting [09:00:27] and more stateful [09:00:43] <_joe_> "state" as in "dynamic config" [09:00:58] <_joe_> so the list of redis servers, the lists of dbs, etc [09:01:10] <_joe_> sometime in the future :) [09:01:44] my definition above is that repooling at the wrong time will bring down the application :-) [09:01:59] I am not against that, I am willing to do that [09:02:07] but we require more features [09:02:09] <_joe_> jynus: well, that is true now as well, right? [09:02:20] <_joe_> jynus: oh, surely so [09:02:42] yes, but what is the problem if a varnish suddenly disappears? [09:02:52] <_joe_> jynus: how can that happen? [09:02:53] lvs will notice, won't it? [09:03:20] <_joe_> I mean I don't see why that would happen if not with someone actively removing it from the pool [09:03:27] think of a hw problem [09:03:32] <_joe_> oh yes [09:03:36] <_joe_> lvs will notice [09:03:41] that is my point [09:03:48] <_joe_> jynus: my idea is to leave all the logic in mediawiki [09:04:02] mysql suddenly disappearing is more than likely [09:04:09] even without hw problems [09:04:14] <_joe_> just make it fetch the config from e.g. 
a json array we generate rather than from a php file [09:04:23] <_joe_> so no difference at all from now [09:04:27] _joe_, yes [09:04:34] again, I am all for it [09:04:40] <_joe_> maybe I don't get your point [09:04:47] but we need more features [09:04:51] <_joe_> I mean at the app level, nothing would change [09:05:01] but the list is ON the app [09:05:11] <_joe_> jynus: and it will remain like that [09:05:13] why do we need another list? [09:05:45] <_joe_> I mean, we could - if we want - remove it from mediawiki-config and have it generated on the single server from data in conftool [09:05:56] <_joe_> that would mean you don't need a deploy to change state [09:06:01] because then I do not understand matanya's suggestion [09:06:07] we would still need that [09:06:23] if that doesn't change [09:06:27] <_joe_> which is very desirable if we want to use repoauth mode sooner or later? [09:06:52] <_joe_> jynus: say that you read the list of databases from a json file rather than from the php variable [09:07:05] <_joe_> we could make that file change via conftool [09:07:35] but you still have to deploy that file [09:07:37] <_joe_> and we won't need to re-deploy the app (or a single file) [09:07:50] <_joe_> no that would be taken care of by confd or something similar [09:08:08] <_joe_> you issue a conftool command -> confd changes the list immediately everywhere [09:08:26] <_joe_> but well, this is well in the future [09:08:37] so my point is [09:08:40] <_joe_> and will need some discussions with devs as well [09:08:43] I want that change to happen [09:08:51] I do not see it in the near future [09:08:58] <_joe_> oh me neither [09:09:25] jynus: i was suggesting exactly what _joe_ describes. (knowing it is far in the future) [09:09:31] <_joe_> well, when we decide to move to use HHVM's repoauth mode this will make our lives _much_ easier [09:09:50] mainly because it is below the app on the stack [09:10:20] <_joe_> but again, in the future. 
right now I need to test my pet [09:10:31] <_joe_> who's working _very_ well AFAICS for now [09:11:06] 6operations, 5Patch-For-Review, 7discovery-system: conftool-syncer is too slow in production - https://phabricator.wikimedia.org/T103482#1395700 (10Joe) 5Open>3Resolved [09:12:03] 6operations, 5Patch-For-Review, 7discovery-system: confctl fails if only one data is set - https://phabricator.wikimedia.org/T103481#1395702 (10Joe) 5Open>3Resolved [09:12:18] snapshoting is failing to connect to labswiki [09:16:17] (03PS3) 10Muehlenhoff: contint: no more install openjdk-6 [puppet] - 10https://gerrit.wikimedia.org/r/220098 (https://phabricator.wikimedia.org/T103491) (owner: 10Hashar) [09:16:35] !log performing a master failover of es1008 into es1009 [09:16:39] Logged the message, Master [09:16:59] ^switchover, normal maintenance [09:17:07] !log apt-get upgrade on gallium and lanthanum [09:17:11] Logged the message, Master [09:17:48] (03CR) 10Muehlenhoff: [C: 032] contint: no more install openjdk-6 [puppet] - 10https://gerrit.wikimedia.org/r/220098 (https://phabricator.wikimedia.org/T103491) (owner: 10Hashar) [09:18:37] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Remove Java 6 from CI Jenkins slaves - https://phabricator.wikimedia.org/T103491#1395712 (10hashar) [09:24:00] !log removing java 6 from gallium and lanthanum https://phabricator.wikimedia.org/T103491 [09:24:04] Logged the message, Master [09:42:15] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Remove Java 6 from CI Jenkins slaves - https://phabricator.wikimedia.org/T103491#1395761 (10hashar) Purged the packages from gallium and lanthanum. Same for labs machines with: salt '*' cmd.run 'apt-get remove --yes --purge openjdk... 
[09:43:07] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Remove Java 6 from CI Jenkins slaves - https://phabricator.wikimedia.org/T103491#1395765 (10hashar) [09:46:30] (03PS1) 10Giuseppe Lavagetto: hiera: enable dynamic directors in codfw, fix ganglia regexes [puppet] - 10https://gerrit.wikimedia.org/r/220433 [09:46:42] PROBLEM - Disk space on analytics1021 is CRITICAL: DISK CRITICAL - free space: / 1061 MB (3% inode=91%) [09:47:00] (03PS1) 10Muehlenhoff: Enable firejail containment for zotero [puppet] - 10https://gerrit.wikimedia.org/r/220434 (https://phabricator.wikimedia.org/T98852) [09:47:53] (03CR) 10Giuseppe Lavagetto: [C: 032] hiera: enable dynamic directors in codfw, fix ganglia regexes [puppet] - 10https://gerrit.wikimedia.org/r/220433 (owner: 10Giuseppe Lavagetto) [09:51:40] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Remove Java 6 from CI Jenkins slaves - https://phabricator.wikimedia.org/T103491#1395787 (10hashar) Java alternatives on labs instances no more shows java6: ``` salt '*' cmd.run 'ls -l /etc/alternatives/j*|grep java-6' i-0000063a.eqiad.... [09:51:47] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Remove Java 6 from CI Jenkins slaves - https://phabricator.wikimedia.org/T103491#1395788 (10hashar) [09:52:08] 6operations: Ensure kernel and OpenJDK fixes for leap second are present - https://phabricator.wikimedia.org/T103479#1395792 (10hashar) [09:52:11] 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Remove Java 6 from CI Jenkins slaves - https://phabricator.wikimedia.org/T103491#1395789 (10hashar) 5Open>3Resolved a:3hashar Java 6 is gone. Thank you @MoritzMuehlenhoff ! 
[09:53:17] <_joe_> ok, ganglia is broken for analytics and logstash, but I know why [09:53:23] <_joe_> I can fix this easily at least [09:57:40] 6operations: Ensure kernel and OpenJDK fixes for leap second are present - https://phabricator.wikimedia.org/T103479#1395808 (10hashar) [09:58:34] 6operations: Ensure kernel and OpenJDK fixes for leap second are present - https://phabricator.wikimedia.org/T103479#1391234 (10hashar) @MoritzMuehlenhoff I copy pasted your last comment ( T103479#1391521 ) to the task detail and added some checkbox in front of each machine. I guess you want to fill sub tasks... [10:00:35] (03CR) 10Addshore: rsync wikidata json dumps to labs /public/dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) (owner: 10Addshore) [10:01:19] (03PS1) 10Giuseppe Lavagetto: ganglia: use actual list of aggregators, not ganglia_class [puppet] - 10https://gerrit.wikimedia.org/r/220438 [10:03:23] (03PS7) 10Addshore: rsync wikidata json dumps to labs /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) [10:04:44] (03PS1) 10Jcrespo: Promote es1009 as the new master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220439 [10:06:25] 6operations, 10Wikimedia-Git-or-Gerrit: Remove Java 6 from ytterbium.wikimedia.org (Gerrit production host) - https://phabricator.wikimedia.org/T103668#1395849 (10hashar) 3NEW [10:06:27] (03PS2) 10Giuseppe Lavagetto: ganglia: use actual list of aggregators, not ganglia_class [puppet] - 10https://gerrit.wikimedia.org/r/220438 [10:06:34] (03CR) 10Jcrespo: [C: 032] Promote es1009 as the new master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220439 (owner: 10Jcrespo) [10:06:53] 6operations: Ensure kernel and OpenJDK fixes for leap second are present - https://phabricator.wikimedia.org/T103479#1395855 (10hashar) [10:07:39] 6operations: Ensure kernel and OpenJDK fixes for leap second are present - 
https://phabricator.wikimedia.org/T103479#1391234 (10hashar) ytterbium.wikimedia.org is Gerrit production host. Gerrit itself uses Java 7, so we can probably just purge Java 6: {T103668} [10:07:52] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] ganglia: use actual list of aggregators, not ganglia_class [puppet] - 10https://gerrit.wikimedia.org/r/220438 (owner: 10Giuseppe Lavagetto) [10:10:36] !log jynus Synchronized wmf-config/db-eqiad.php: Switchover master es1008 -> es1009 (duration: 00m 12s) [10:10:40] Logged the message, Master [10:14:47] <_joe_> ok ganglia for analytics is back [10:19:27] gah analytics1021 is filling its disk and i don't see what's doing it. root partition is filling at about 22MB/5mins, but /var/ only grows about 1MB during that time, and i don't see growth of /home, /tmp, etc. [10:19:32] i've been watching it with this: [10:19:33] while true ; do date ; df -B 1M / ; sudo du -sk -B 1M / --exclude=/proc --exclude=/var --exclude=/usr --exclude=/home --exclude=/tmp ; sudo du -sk -B 1M /usr /home /tmp ; sudo du -sk -B 1M --exclude=/var/spool/kafka /var ; sudo du -sk -B 1M /var/log; ls -lsh /tmp/hsperfdata_kafka/5103 /var/log/kafka/kafkaServer-gc.log /var/log/kafka/kafka.log ; echo -------------------- ; sleep 300 ; done [10:21:04] the only process that i see writing is kafka, but all of its spool partitions are mounted and the only other places it writes are those files in /tmp and /var/log/ [10:21:16] I do not think deleted files would be shown on du [10:21:40] just random posibility I found once [10:21:56] yeah i zeroed kafkaServer-gc.log and it still shows as large [10:22:12] lsof may help [10:22:30] yeah, i was examining that [10:22:43] i guess i'll restart kafka to see if it releases some filehandles [10:23:15] i think iotop shows easyly the relation between pids and io [10:27:28] well it shows that kafka is doing a lot of disk io, which is expected. but it's not clear to me why this kafka node is filling its disk when the others are not. 
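Note on the deleted-file hunch above: du walks the directory tree, so it cannot count a file that was unlinked while a process still holds it open, whereas df keeps counting its blocks until the last file handle is closed. That would explain a root partition that fills faster than any directory appears to grow. A minimal Python sketch of the lsof-style check follows, assuming a Linux-style /proc layout where each /proc/<pid>/fd/<n> is a symlink whose target ends in " (deleted)" for unlinked files. The function name find_deleted_open_files and its proc_root parameter are hypothetical, chosen for illustration, and are not part of any tool mentioned in the channel.

```python
import os

# Linux appends this marker to the link target of an unlinked open file.
DELETED_SUFFIX = " (deleted)"


def find_deleted_open_files(proc_root="/proc"):
    """Return (pid, fd, target, size_bytes) for open files that were unlinked.

    size_bytes is None when the size cannot be read.
    """
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        for fd in fds:
            link = os.path.join(fd_dir, fd)
            try:
                target = os.readlink(link)
            except OSError:
                continue  # descriptor closed while we were scanning
            if not target.endswith(DELETED_SUFFIX):
                continue
            try:
                # Stat through the fd link to get the size still held on disk.
                size = os.stat(link).st_size
            except OSError:
                size = None
            results.append((int(entry), int(fd), target, size))
    return results


if __name__ == "__main__":
    rows = find_deleted_open_files()
    # Largest space holders first; unknown sizes sort last.
    rows.sort(key=lambda r: r[3] or 0, reverse=True)
    for pid, fd, target, size in rows:
        print(pid, fd, size, target)
```

Run it as root to see descriptors owned by other users. Space held this way is only returned once the owning process closes the descriptor or exits, so truncating or deleting the path again does not help.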
[10:31:09] !log restarting kafka on analytics1021 [10:31:13] Logged the message, Master [10:34:20] <_joe_> I think I should add some form of automatic logging to the SAL to confctl [10:35:35] need to rewrite sal to not be shit [10:36:07] * Nemo_bis would be content enough if it respected sentence case [10:37:10] i dunno if it helps other people, but i find the l10nupdate messages in SAL to be unhelpful noise [10:37:19] PROBLEM - YARN NodeManager Node-State on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [10:37:19] i have to visually grep them out when i read it [10:37:35] <_joe_> they are. [10:37:41] <_joe_> useful I mean [10:38:03] well i'm glad they're useful to someone [10:38:57] <_joe_> jgage: need help with analytics1021? [10:39:08] RECOVERY - YARN NodeManager Node-State on analytics1013 is OK YARN NodeManager analytics1013.eqiad.wmnet:8041 Node-State: RUNNING [10:40:54] _joe_: thanks, i'll let you know in a few mins [10:41:01] can't tell yet whether restarting kafka helped [10:41:38] this node has always had problems; either this is another manifestation of that, or ottomata enabled something to help troubleshoot its problems [10:41:51] but even if it does fill its disk we have 3 others so the service will be ok [10:42:15] and i have a meeting with hin in 5.5 hours, so we'll discuss [10:42:30] which i guess means i should go to bed soon [10:43:18] PROBLEM - YARN NodeManager Node-State on analytics1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[10:43:42] and we can also discuss why the nodemanagers have been running out of memory lately :P [10:43:58] <_joe_> java 21577 jmxtrans 1w REG 9,3 20556482612 711545 /var/log/jmxtrans/jmxtrans.log (deleted) [10:44:09] <_joe_> I think this may be the issue [10:44:23] <_joe_> jmxtrans doesn't rotate correctly it seems [10:44:45] <_joe_> !log restarting jmxtrans on analytics1021 [10:44:50] Logged the message, Master [10:44:59] RECOVERY - YARN NodeManager Node-State on analytics1014 is OK YARN NodeManager analytics1014.eqiad.wmnet:8041 Node-State: RUNNING [10:45:22] blah jmx [10:45:47] heh suddenly there's a lot more disk space [10:45:49] <_joe_> gee [10:45:57] amazing! [10:46:00] <_joe_> it's defunct now I had to kill -9 it [10:46:03] i heart java, etc [10:46:25] <_joe_> no I heart ops [10:46:30] RECOVERY - Disk space on analytics1021 is OK: DISK OK [10:46:31] <_joe_> I guess it's our fault :P [10:47:05] <_joe_> actually, no, there is no logrotate rule for it [10:47:10] did we do something with jmxtrans recently? [10:47:12] <_joe_> so yes, log4j gone bananas [10:47:17] <_joe_> I heart java! [10:47:20] hehe [10:47:31] 4j$ -> bananas [10:47:56] <_joe_> YuviPanda: oh your learning FOUR I see [10:48:12] <_joe_> (syntax error intended) [10:48:14] ciao Ciao CIao ciAo [10:50:17] <_joe_> jgage: I see jmxtrans produces a ginormous amount of logs [10:50:30] <_joe_> you may want to set the loglevel to "REASONABLE" [10:51:31] haha [10:51:52] YuviPanda: that has already so many more primitives than https://en.wikipedia.org/wiki/Whitespace_(programming_language) [10:51:56] default: UNREASONABLE [10:57:00] (03CR) 10Phuedx: [C: 031] Enable browse prototype on English Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) (owner: 10Jdlrobson) [11:09:09] (03CR) 10Prtksxna: "It can stick around. Removing the config should make the icon disappear for now. 
We can see later whether we need to get rid of that code " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220121 (https://phabricator.wikimedia.org/T103283) (owner: 10Prtksxna) [11:25:03] 6operations, 10vm-requests, 5Patch-For-Review: EQIAD: 1 VM request for planet - https://phabricator.wikimedia.org/T101899#1396063 (10akosiaris) That error confuses me. It means the for some reason the repo did not have grub-pc (not jub the .deb missing, but the Packages files did not mention grub-pc either),... [11:34:44] (03CR) 10Alexandros Kosiaris: [C: 031] Remove subversion server support [puppet] - 10https://gerrit.wikimedia.org/r/219240 (owner: 10Chad) [11:37:01] Where do we file shell requests in phab now? [11:38:48] andre__: ^ [11:39:32] Reedy, what is a "shell request"? [11:40:17] config changes or maintenance scripts to be run for a wiki [11:40:30] I see the "Shell" project on phab is archived [11:41:42] Wikimedia-Site-requests [11:41:51] Reedy: config changes are Wikimedia-Site-Requests. Maintenance scripts are Wikimedia-General-Unknown and should block T31782 [11:42:06] heh, thanks [11:43:00] Reedy: if you like being sneaky as well, tag as ops so someone will definitely get it ;) [11:43:14] Ops won't deal with this [11:43:27] namespaceDupes for enwiki? 
[11:43:28] https://phabricator.wikimedia.org/T103672 [11:44:09] Oh that, they won't, nevermind [11:44:12] :D [11:44:15] I'll probably do it [11:44:24] I'm just not sure where to put the moved pages ;) [11:44:59] suffix of -moved or -relocated probably [11:45:56] -FIXORDELETEME [11:46:08] -NAMESPACECONFLICTFIXME [11:47:14] all the page ids are around a similar number, must've been a widerspread problem at one point [11:47:41] (03PS4) 10Phuedx: Enable browse prototype on test- and enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) (owner: 10Jdlrobson) [11:47:57] -WHYDONTYOULOVEMEASACONFLICT [11:48:02] Reedy: ^ [11:48:31] -WHYDOESPAGENAMINGALWAYSFEELLIKEABATTLEFIELD [11:49:07] -TOCONFLICTORNOTTOCONFLICTTHATISTHEQUESTION [11:50:17] -WILLITCONFLICT?THATISTHEQUESTION [11:52:07] I'll deal with it later [12:01:30] (03PS2) 10Lokal Profil: Add DCAT-AP for Wikibase [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) [12:04:51] (03PS1) 10Giuseppe Lavagetto: varnish::director: fix reload script [puppet] - 10https://gerrit.wikimedia.org/r/220443 [12:07:39] PROBLEM - Confd vcl based reload on cp2002 is CRITICAL: reload-vcl failed to run since 0 [12:08:56] (03PS2) 10Giuseppe Lavagetto: varnish::director: fix reload script [puppet] - 10https://gerrit.wikimedia.org/r/220443 [12:09:44] <_joe_> uhm, since 0 is not nice either [12:15:15] (03PS2) 10Muehlenhoff: Enable firejail containment for zotero [puppet] - 10https://gerrit.wikimedia.org/r/220434 (https://phabricator.wikimedia.org/T98852) [12:16:41] An user using content translate tool has a 500 error from Parsoid when it saves. Ops issue or code bug? 
[12:17:48] (03PS1) 10Hashar: nodepool: provide openstack env variables to system user [puppet] - 10https://gerrit.wikimedia.org/r/220444 (https://phabricator.wikimedia.org/T103673) [12:18:30] (03CR) 10jenkins-bot: [V: 04-1] nodepool: provide openstack env variables to system user [puppet] - 10https://gerrit.wikimedia.org/r/220444 (https://phabricator.wikimedia.org/T103673) (owner: 10Hashar) [12:19:07] (03PS3) 10Giuseppe Lavagetto: varnish::director: fix reload script and nagios check [puppet] - 10https://gerrit.wikimedia.org/r/220443 [12:20:14] (oh, now works again, after 40 minutes of F5 regular try to repost the form) [12:20:18] <_joe_> Dereckson: potentially both [12:21:40] (03CR) 10Giuseppe Lavagetto: [C: 032] varnish::director: fix reload script and nagios check [puppet] - 10https://gerrit.wikimedia.org/r/220443 (owner: 10Giuseppe Lavagetto) [12:22:21] (03CR) 10Lokal Profil: "In addition to the suggestions you can now also specify a target directory for the rdf (in case you don't want it with the dumps)." [puppet] - 10https://gerrit.wikimedia.org/r/219800 (https://phabricator.wikimedia.org/T103087) (owner: 10Lokal Profil) [12:23:38] RECOVERY - Confd vcl based reload on cp2002 is OK: reload-vcl successfully ran 0h, 14 minutes ago. 
[12:23:53] (03PS3) 10Muehlenhoff: Enable firejail containment for zotero [puppet] - 10https://gerrit.wikimedia.org/r/220434 (https://phabricator.wikimedia.org/T98852) [12:24:04] (03CR) 10Hashar: "Done manually on labnodepool in /var/lib/nodepool/.profile and:" [puppet] - 10https://gerrit.wikimedia.org/r/220444 (https://phabricator.wikimedia.org/T103673) (owner: 10Hashar) [12:25:30] (03PS2) 10Hashar: nodepool: provide openstack env variables to system user [puppet] - 10https://gerrit.wikimedia.org/r/220444 (https://phabricator.wikimedia.org/T103673) [12:27:52] (03PS4) 10Muehlenhoff: Enable firejail containment for zotero [puppet] - 10https://gerrit.wikimedia.org/r/220434 (https://phabricator.wikimedia.org/T98852) [12:31:43] (03PS9) 10Giuseppe Lavagetto: varnish: actually use the dynamic directors list [puppet] - 10https://gerrit.wikimedia.org/r/217820 (https://phabricator.wikimedia.org/T97975) [12:32:10] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 0 below the confidence bounds [12:37:19] PROBLEM - YARN NodeManager Node-State on analytics1015 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:38:26] (03PS10) 10Giuseppe Lavagetto: varnish: actually use the dynamic directors list [puppet] - 10https://gerrit.wikimedia.org/r/217820 (https://phabricator.wikimedia.org/T97975) [12:39:18] (03PS2) 10Alexandros Kosiaris: Add cdbs to default packages [puppet] - 10https://gerrit.wikimedia.org/r/220427 (owner: 10Muehlenhoff) [12:39:25] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add cdbs to default packages [puppet] - 10https://gerrit.wikimedia.org/r/220427 (owner: 10Muehlenhoff) [12:39:49] 6operations, 10ops-codfw: Labstore2001 controler or shelf failure - https://phabricator.wikimedia.org/T102626#1396270 (10coren) You would see all 60 disks on the H800 if everything was okay (foreign or not). This really needs to be isolated and fixed, or - alternately - the faulty shelves need to be removed f... 
[12:40:19] (03PS1) 10Hashar: nodepool: element to prepare an image for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/220445 [12:40:21] (03PS1) 10Hashar: nodepool: add diskimage 'devuser' element [puppet] - 10https://gerrit.wikimedia.org/r/220446 (https://phabricator.wikimedia.org/T102880) [12:40:28] PROBLEM - YARN NodeManager Node-State on analytics1014 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:41:05] (03CR) 10jenkins-bot: [V: 04-1] nodepool: element to prepare an image for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/220445 (owner: 10Hashar) [12:42:14] (03PS2) 10Hashar: nodepool: element to prepare an image for Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/220445 [12:43:05] (03Abandoned) 10Alexandros Kosiaris: build 5.3.4 for jessie, remove old patches [debs/ruby-jsduck] - 10https://gerrit.wikimedia.org/r/213954 (https://phabricator.wikimedia.org/T95008) (owner: 10Dzahn) [12:43:49] (03Abandoned) 10Alexandros Kosiaris: make external_networks actually be that, add private-networks [puppet] - 10https://gerrit.wikimedia.org/r/102114 (owner: 10ArielGlenn) [12:46:19] PROBLEM - YARN NodeManager Node-State on analytics1017 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [12:46:29] PROBLEM - Disk space on analytics1018 is CRITICAL: DISK CRITICAL - free space: / 1068 MB (3% inode=94%) [12:46:58] PROBLEM - YARN NodeManager Node-State on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[12:47:59] RECOVERY - YARN NodeManager Node-State on analytics1015 is OK YARN NodeManager analytics1015.eqiad.wmnet:8041 Node-State: RUNNING [12:48:46] (03PS11) 10Giuseppe Lavagetto: varnish: actually use the dynamic directors list [puppet] - 10https://gerrit.wikimedia.org/r/217820 (https://phabricator.wikimedia.org/T97975) [12:49:49] RECOVERY - YARN NodeManager Node-State on analytics1017 is OK YARN NodeManager analytics1017.eqiad.wmnet:8041 Node-State: RUNNING [12:50:19] RECOVERY - YARN NodeManager Node-State on analytics1020 is OK YARN NodeManager analytics1020.eqiad.wmnet:8041 Node-State: RUNNING [12:50:58] RECOVERY - YARN NodeManager Node-State on analytics1014 is OK YARN NodeManager analytics1014.eqiad.wmnet:8041 Node-State: RUNNING [12:57:46] 6operations, 10ops-codfw: Labstore2001 controler or shelf failure - https://phabricator.wikimedia.org/T102626#1396327 (10coren) After reconsideration, given that there is no data on that array that is not outdated (the eqiad copy is more recent), please replace the controler of labstore2001 with the new model... 
[12:57:49] PROBLEM - Disk space on labnodepool1001 is CRITICAL: DISK CRITICAL - /tmp/image.uqIG2O33/mnt/tmp/ccache is not accessible: Permission denied [12:59:48] RECOVERY - Disk space on labnodepool1001 is OK: DISK OK [13:02:27] (03PS12) 10Giuseppe Lavagetto: varnish: actually use the dynamic directors list [puppet] - 10https://gerrit.wikimedia.org/r/217820 (https://phabricator.wikimedia.org/T97975) [13:03:10] (03CR) 10jenkins-bot: [V: 04-1] varnish: actually use the dynamic directors list [puppet] - 10https://gerrit.wikimedia.org/r/217820 (https://phabricator.wikimedia.org/T97975) (owner: 10Giuseppe Lavagetto) [13:06:44] (03PS13) 10Giuseppe Lavagetto: varnish: actually use the dynamic directors list [puppet] - 10https://gerrit.wikimedia.org/r/217820 (https://phabricator.wikimedia.org/T97975) [13:11:25] (03CR) 10Hoo man: rsync wikidata json dumps to labs /public/dumps (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/215585 (https://phabricator.wikimedia.org/T100885) (owner: 10Addshore) [13:12:46] 6operations, 10Wikimedia-Git-or-Gerrit: Remove Java 6 from ytterbium.wikimedia.org (Gerrit production host) - https://phabricator.wikimedia.org/T103668#1396340 (10demon) Gerrit should already be using Java 7, hence the javaHome setting. The OpenJDK6 packages are probably just leftovers from before we swapped. [13:14:46] 6operations, 5Continuous-Integration-Isolation: Figure out fine sudo rules for the nodepool service - https://phabricator.wikimedia.org/T102281#1396342 (10hashar) So on `labnodepool1001.eqiad.wmnet` I run: sudo -u nodepool nodepool image-build ci-jessie-wikimedia The first sudo error is: ``` ++ id -u ++... 
[13:26:06] (03PS14) 10Giuseppe Lavagetto: varnish: actually use the dynamic directors list [puppet] - 10https://gerrit.wikimedia.org/r/217820 (https://phabricator.wikimedia.org/T97975) [13:28:44] 6operations, 5Continuous-Integration-Isolation: Figure out fine sudo rules for the nodepool service / diskimage-builder - https://phabricator.wikimedia.org/T102281#1396353 (10hashar) [13:32:46] 6operations, 5Continuous-Integration-Isolation: Figure out fine sudo rules for the nodepool service / diskimage-builder - https://phabricator.wikimedia.org/T102281#1396365 (10hashar) So diskimage-builder is a python utility that apply shell based templates. Some commands are run on the machine outside of the... [13:38:00] PROBLEM - YARN NodeManager Node-State on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [13:42:45] (03PS15) 10Giuseppe Lavagetto: varnish: actually use the dynamic directors list [puppet] - 10https://gerrit.wikimedia.org/r/217820 (https://phabricator.wikimedia.org/T97975) [13:43:38] PROBLEM - Disk space on analytics1012 is CRITICAL: DISK CRITICAL - free space: / 1061 MB (3% inode=94%) [13:45:18] RECOVERY - YARN NodeManager Node-State on analytics1020 is OK YARN NodeManager analytics1020.eqiad.wmnet:8041 Node-State: RUNNING [13:45:40] (03CR) 10BBlack: [C: 031] varnish: actually use the dynamic directors list [puppet] - 10https://gerrit.wikimedia.org/r/217820 (https://phabricator.wikimedia.org/T97975) (owner: 10Giuseppe Lavagetto) [13:48:48] (03PS16) 10Giuseppe Lavagetto: varnish: actually use the dynamic directors list [puppet] - 10https://gerrit.wikimedia.org/r/217820 (https://phabricator.wikimedia.org/T97975) [13:50:03] (03CR) 10BBlack: [C: 031] varnish: actually use the dynamic directors list [puppet] - 10https://gerrit.wikimedia.org/r/217820 (https://phabricator.wikimedia.org/T97975) (owner: 10Giuseppe Lavagetto) [13:52:18] PROBLEM - YARN NodeManager Node-State on analytics1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[13:57:39] whys it gotta be like this nodemanager [13:58:59] (03CR) 10Giuseppe Lavagetto: [C: 032] varnish: actually use the dynamic directors list [puppet] - 10https://gerrit.wikimedia.org/r/217820 (https://phabricator.wikimedia.org/T97975) (owner: 10Giuseppe Lavagetto) [13:59:18] RECOVERY - YARN NodeManager Node-State on analytics1020 is OK YARN NodeManager analytics1020.eqiad.wmnet:8041 Node-State: RUNNING [14:02:53] (03PS1) 10Giuseppe Lavagetto: varnish: add semicolon in template [puppet] - 10https://gerrit.wikimedia.org/r/220453 [14:03:38] (03CR) 10Giuseppe Lavagetto: [C: 032] varnish: add semicolon in template [puppet] - 10https://gerrit.wikimedia.org/r/220453 (owner: 10Giuseppe Lavagetto) [14:04:28] PROBLEM - puppet last run on cp2001 is CRITICAL Puppet has 1 failures [14:05:29] PROBLEM - puppet last run on cp2017 is CRITICAL Puppet has 1 failures [14:06:08] PROBLEM - puppet last run on cp2021 is CRITICAL Puppet has 1 failures [14:06:09] PROBLEM - puppet last run on cp2009 is CRITICAL Puppet has 1 failures [14:06:09] PROBLEM - puppet last run on cp2007 is CRITICAL Puppet has 1 failures [14:06:09] RECOVERY - puppet last run on cp2001 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [14:06:24] <_joe_> this is the semicolon missing, ignore please [14:07:18] PROBLEM - puppet last run on cp2023 is CRITICAL Puppet has 1 failures [14:07:39] PROBLEM - puppet last run on cp2024 is CRITICAL Puppet has 1 failures [14:07:49] PROBLEM - puppet last run on cp2026 is CRITICAL Puppet has 1 failures [14:12:08] 6operations, 10ops-codfw: Labstore2001 controler or shelf failure - https://phabricator.wikimedia.org/T102626#1396409 (10coren) I will provide an explcit wiring diagram shortly. 
[14:12:23] !log krenair Synchronized php-1.26wmf10/extensions/SemanticForms/includes/SF_AutoeditAPI.php: T103653 live hack (duration: 00m 13s) [14:12:27] Logged the message, Master [14:22:19] RECOVERY - puppet last run on cp2009 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:22:20] RECOVERY - puppet last run on cp2021 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [14:22:20] RECOVERY - puppet last run on cp2007 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [14:22:39] PROBLEM - puppet last run on cp2022 is CRITICAL Puppet has 1 failures [14:23:28] RECOVERY - puppet last run on cp2017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:23:29] RECOVERY - puppet last run on cp2023 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [14:23:49] RECOVERY - puppet last run on cp2024 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:36:04] ottomata: hey [14:36:43] that kafkatee pidfile issue needs some lovin', I found oxygen the other day and kafkatee didn't run there because of that [14:36:44] hiya [14:37:32] ok, so i should merge those and rebuild? i can do that real quick now [14:37:51] (03PS5) 10Phuedx: Enable browse prototype on test- and enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) (owner: 10Jdlrobson) [14:37:52] these are independent patches, you also probably need to merge master to debian after merging both [14:37:58] also... they're untested :) [14:38:08] untested in that you didn't run them? [14:38:14] or just untested in prod? [14:38:22] the former heh [14:38:28] oh ha! [14:38:33] uMmm, ok. [14:38:33] sorry! [14:38:38] still better than a bug report, right? :) [14:38:41] (03CR) 10Phuedx: "I've moved the static data into mobile.php to mirror what we do for the WikiGrok extension." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) (owner: 10Jdlrobson) [14:38:41] hahha ayes! [14:39:17] ha, ok! will test them. [14:40:04] (03PS1) 10BBlack: conftool-data: fixups for bits/parsoid [puppet] - 10https://gerrit.wikimedia.org/r/220463 [14:40:07] :D [14:40:33] (03PS1) 10Ottomata: Hack jmxtrans module to remove verbose jmxtrans log files [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/220464 [14:40:39] (03PS1) 10Jcrespo: Depool es2001 and es2002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220465 [14:41:08] (03CR) 10Ottomata: [C: 032] Hack jmxtrans module to remove verbose jmxtrans log files [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/220464 (owner: 10Ottomata) [14:41:14] (03PS6) 10Alexandros Kosiaris: WIP: Add new_wmf_service.py and tests [puppet] - 10https://gerrit.wikimedia.org/r/217548 (https://phabricator.wikimedia.org/T97036) [14:41:26] 7Blocked-on-Operations, 6operations, 10Parsoid, 6Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1396444 (10GWicke) [14:41:27] (03CR) 10Giuseppe Lavagetto: [C: 032] conftool-data: fixups for bits/parsoid [puppet] - 10https://gerrit.wikimedia.org/r/220463 (owner: 10BBlack) [14:41:36] (03PS1) 10Ottomata: Update jmxtrans module with log purge hack [puppet] - 10https://gerrit.wikimedia.org/r/220466 [14:41:51] (03PS2) 10Ottomata: Update jmxtrans module with log purge hack [puppet] - 10https://gerrit.wikimedia.org/r/220466 [14:42:42] (03CR) 10Ottomata: [C: 032] Update jmxtrans module with log purge hack [puppet] - 10https://gerrit.wikimedia.org/r/220466 (owner: 10Ottomata) [14:44:13] (03PS2) 10BBlack: conftool-data: fixups for bits/parsoid [puppet] - 10https://gerrit.wikimedia.org/r/220463 [14:44:31] (03CR) 10BBlack: [C: 032 V: 032] conftool-data: fixups for bits/parsoid [puppet] - 10https://gerrit.wikimedia.org/r/220463 (owner: 10BBlack) [14:49:08] has SWAT started? can I merge one db depool before that? 
[14:49:21] jynus: 10 minutes yet [14:49:30] thcipriani, thanks [14:49:32] ori: 9bb3219a67a671ee1f5da2d6f833e9a0e3c86eab is causing breakage on labcontrol1001, can you assist? [14:50:32] (03CR) 10Jcrespo: [C: 032] Depool es2001 and es2002 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220465 (owner: 10Jcrespo) [14:50:34] or godog if you know anything about it? [14:51:37] jynus: [14:51:40] jouncebot, next [14:51:40] In 0 hour(s) and 8 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150624T1500) [14:52:34] Krenair, nice tool, didn't know it [14:53:02] !log jynus Synchronized wmf-config/db-codfw.php: depool es2001 and es 2002 for maintenance (duration: 00m 13s) [14:53:06] Logged the message, Master [14:53:11] thcipriani: I might have something for SWAT too, I'll add it there if I can help getit merged in time [14:53:39] YuviPanda: sure, put it on the list [14:54:49] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected [14:54:57] otherdba doesn't have collisions with swat, so he didn't show me all the tricks [14:55:05] :-) [14:58:07] thcipriani: added! I'll try to make a submodule commit. [14:58:32] Krenair: do you know if wikitech is in grop1 or 2? [14:58:34] *group [14:58:37] group1 [14:58:55] YuviPanda: cool, thanks. [14:59:05] https://www.mediawiki.org/wiki/MediaWiki_1.26/Roadmap shows you this YuviPanda [14:59:27] phuedx: kart_ ping for SWAT here very soon :) [14:59:45] Krenair: ah, I had looked at that but was searching for 'wikitech' [14:59:49] it's still called labs... [14:59:57] so I need to backport this to wmf11 as well [15:00:04] manybubbles, anomie, ostriches, thcipriani, marktraceur: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150624T1500). 
[15:00:21] yes, you do [15:00:36] otherwise it'll get dropped from wikitech when the wmf11 train gets to labswiki later [15:00:51] thcipriani: here [15:01:00] kart_: cool, doing contenttranslation updates [15:01:13] wmf10first [15:01:30] RECOVERY - Disk space on analytics1018 is OK: DISK OK [15:03:38] hullo [15:03:52] thcipriani: hullo [15:04:03] phuedx|SAWT: Hi [15:04:17] thcipriani: do config changes get pushed to testwiki first and then to group1's? [15:04:34] 6operations, 10Wikimedia-Git-or-Gerrit: Remove Java 6 from ytterbium.wikimedia.org (Gerrit production host) - https://phabricator.wikimedia.org/T103668#1396511 (10hashar) Can you please purge the java6 packages? ``` ytterbium:~$ dpkg --get-selections *openjdk-6* openjdk-6-jre install openjdk-6-jre-headless... [15:04:54] that is: i'd like to push to testwiki first to be sure, this code has been lying around on wmflabs for a while but i'd like a quick check first [15:05:02] phuedx|SAWT: config changes generally go out to all wikis, I can pull to testwiki if you want to check. [15:05:06] kk [15:05:14] thcipriani: that'd be great, thanks [15:05:24] :/ [15:06:20] RECOVERY - puppetmaster https on labcontrol1001 is OK: HTTP OK: Status line output matched 400 - 287 bytes in 0.922 second response time [15:06:24] (03PS1) 10Ottomata: Restart jmxtrans when purging log files [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/220474 [15:07:31] * YuviPanda saws phuedx|SWAT [15:07:38] PROBLEM - puppet last run on analytics1020 is CRITICAL Puppet has 1 failures [15:07:47] hey YuviPanda [15:07:49] PROBLEM - puppet last run on analytics1040 is CRITICAL Puppet has 1 failures [15:07:51] how're you? [15:08:02] phuedx|SWAT: pretty good! except for minor injuries here and there :) [15:08:06] ? [15:08:08] oh noes [15:08:08] we're fellows in being swatted now! 
[15:08:46] !log thcipriani Synchronized php-1.26wmf10/extensions/ContentTranslation: SWAT: Enable publish button when the preference is not to use initial translation (duration: 00m 13s) [15:08:51] Logged the message, Master [15:08:53] (03PS1) 10Giuseppe Lavagetto: conftool: version 0.1.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/220475 [15:08:59] ^ kart_ check wmf10 please [15:09:02] <_joe_> bblack: ^^, FYI [15:09:19] <_joe_> bblack: I'll run the tests this time, though [15:09:21] <_joe_> :P [15:09:38] thcipriani: ok. testing. [15:09:59] PROBLEM - puppet last run on analytics1035 is CRITICAL Puppet has 1 failures [15:10:30] PROBLEM - puppet last run on analytics1030 is CRITICAL Puppet has 1 failures [15:10:39] PROBLEM - puppet last run on conf1001 is CRITICAL Puppet has 1 failures [15:10:51] (03CR) 10BBlack: [C: 031] "Seems conceptually right, but I don't know this code :)" [software/conftool] - 10https://gerrit.wikimedia.org/r/220475 (owner: 10Giuseppe Lavagetto) [15:11:19] PROBLEM - puppet last run on analytics1038 is CRITICAL Puppet has 1 failures [15:11:45] thcipriani: Give me one more minute. First tests are okay. Checking again. [15:11:48] my fault, fixing ^ [15:11:56] kk [15:12:06] thcipriani: was submodule update Okay? Any issues there? [15:12:09] PROBLEM - puppet last run on analytics1002 is CRITICAL Puppet has 1 failures [15:12:31] (03PS2) 10Giuseppe Lavagetto: conftool: version 0.1.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/220475 [15:12:39] kart_: everything went fine there, apart from the unexplained how it got merged piece. [15:12:39] PROBLEM - puppet last run on conf1003 is CRITICAL Puppet has 1 failures [15:12:58] PROBLEM - puppet last run on analytics1022 is CRITICAL Puppet has 1 failures [15:13:18] thcipriani: okay. Amir is testing another article now. [15:13:41] (03PS1) 10Andrew Bogott: Don't specify period when calling cronolog. 
[puppet] - 10https://gerrit.wikimedia.org/r/220476 [15:13:42] <_joe_> mmmh can someone look at those puppet failures? [15:13:48] PROBLEM - puppet last run on analytics1013 is CRITICAL Puppet has 1 failures [15:14:08] ori: please read https://gerrit.wikimedia.org/r/#/c/220476/ when you arrive. [15:14:34] thcipriani: looks good on wmf10. Thanks! [15:14:35] !log disabled puppet on labcontrol1001 to hotfix https://gerrit.wikimedia.org/r/#/c/220476/ [15:14:39] Logged the message, Master [15:14:45] kart_: kk, wmf11 up next [15:14:59] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] conftool: version 0.1.1 [software/conftool] - 10https://gerrit.wikimedia.org/r/220475 (owner: 10Giuseppe Lavagetto) [15:15:18] PROBLEM - puppet last run on analytics1014 is CRITICAL Puppet has 1 failures [15:15:40] PROBLEM - puppet last run on analytics1001 is CRITICAL Puppet has 1 failures [15:15:40] PROBLEM - puppet last run on analytics1037 is CRITICAL Puppet has 1 failures [15:15:48] PROBLEM - puppet last run on analytics1032 is CRITICAL Puppet has 1 failures [15:15:59] PROBLEM - puppet last run on analytics1011 is CRITICAL Puppet has 1 failures [15:16:58] PROBLEM - puppet last run on analytics1028 is CRITICAL Puppet has 1 failures [15:17:15] !log thcipriani Synchronized php-1.26wmf11/extensions/ContentTranslation: SWAT: Enable publish button when the preference is not to use initial translation (duration: 00m 12s) [15:17:20] Logged the message, Master [15:17:23] ^ kart_ check please [15:17:50] well, if any group0s have contenttranslation... [15:17:57] thcipriani: did you merge two patches?? 
[15:18:12] (03PS2) 10Ottomata: Restart jmxtrans when purging log files [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/220474 [15:18:12] 7Puppet, 6Community-Liaison, 10MediaWiki-extensions-NavigationTiming, 6Performance-Team, and 3 others: Track state (region) - https://phabricator.wikimedia.org/T101819#1396569 (10Gilles) 5Open>3Resolved [15:18:23] kart_: looked like there were two submodule bumps on wmf11 and wmf10 [15:18:30] oh. wait. updated two files right? [15:18:41] (it) [15:19:14] for wmf11 it was d458936e64d4ae9f770844ed1f0534e9231887a2 and aaaf0d7c7863c77096f3ded88a49e801d285ac30 [15:19:17] (03CR) 10Ottomata: [C: 032] Restart jmxtrans when purging log files [puppet/jmxtrans] - 10https://gerrit.wikimedia.org/r/220474 (owner: 10Ottomata) [15:19:27] ok. We're good. [15:19:42] thcipriani: you can enable testwiki for wmf11 :) [15:19:54] (03PS1) 10Ottomata: Update jmxtrans module with restart of service during log purge hack [puppet] - 10https://gerrit.wikimedia.org/r/220479 [15:19:59] (03CR) 10jenkins-bot: [V: 04-1] Update jmxtrans module with restart of service during log purge hack [puppet] - 10https://gerrit.wikimedia.org/r/220479 (owner: 10Ottomata) [15:20:04] (03PS2) 10Ottomata: Update jmxtrans module with restart of service during log purge hack [puppet] - 10https://gerrit.wikimedia.org/r/220479 [15:20:56] (03CR) 10Ottomata: [C: 032] Update jmxtrans module with restart of service during log purge hack [puppet] - 10https://gerrit.wikimedia.org/r/220479 (owner: 10Ottomata) [15:20:59] PROBLEM - puppet last run on analytics1031 is CRITICAL Puppet has 1 failures [15:21:13] kart_: right, ok, doing the config [15:21:29] PROBLEM - puppet last run on analytics1012 is CRITICAL Puppet has 1 failures [15:21:39] PROBLEM - puppet last run on analytics1021 is CRITICAL Puppet has 1 failures [15:21:39] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220385 (https://phabricator.wikimedia.org/T103639) (owner: 10KartikMistry) 
[15:22:09] PROBLEM - puppet last run on conf1002 is CRITICAL Puppet has 1 failures [15:22:11] (03Merged) 10jenkins-bot: Enable ContentTranslation in testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220385 (https://phabricator.wikimedia.org/T103639) (owner: 10KartikMistry) [15:23:08] PROBLEM - puppet last run on analytics1029 is CRITICAL Puppet has 1 failures [15:23:19] RECOVERY - puppet last run on analytics1012 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:23:49] RECOVERY - Disk space on analytics1012 is OK: DISK OK [15:23:58] PROBLEM - puppet last run on analytics1034 is CRITICAL Puppet has 1 failures [15:23:58] PROBLEM - puppet last run on analytics1039 is CRITICAL Puppet has 1 failures [15:24:19] RECOVERY - puppet last run on analytics1035 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [15:24:32] ottomata: about? analytics puppet issues galore [15:24:35] !log thcipriani Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable ContentTranslation in testwiki [[gerrit:220385]] (duration: 00m 12s) [15:24:39] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures [15:24:39] Logged the message, Master [15:24:47] ^ kart_ ok now check test wiki [15:24:49] PROBLEM - puppet last run on analytics1019 is CRITICAL Puppet has 1 failures [15:24:54] thcipriani: I merged them in the wmf* branches, and made the submodule updates, but puthing them through git review is taking ages [15:24:59] PROBLEM - puppet last run on analytics1036 is CRITICAL Puppet has 1 failures [15:24:59] will continue poking [15:25:07] YuviPanda: k, thanks [15:25:14] <_joe_> ottomata: those criticals are yours [15:25:16] <_joe_> Error: /bin/rm /var/log/jmxtrans/*.log* returned 1 instead of one of [0] [15:25:19] <_joe_> Error: /Stage[main]/Jmxtrans/Exec[jmxtrans-log-purge]/returns: change from notrun to 0 failed: /bin/rm /var/log/jmxtrans/*.log* returned 1 instead of one of [0] [15:25:35] <_joe_> ottomata: 
and that exec is horribly wrong too [15:25:38] RECOVERY - puppet last run on analytics1020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:48] RECOVERY - puppet last run on analytics1040 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:25:58] yes _joe_, i know. [15:26:00] <_joe_> you remove the running log file, so it remains as a deleted file hanging up [15:26:05] see [15:26:11] recent merge that is making things more better [15:26:24] https://gerrit.wikimedia.org/r/#/c/220474/ [15:26:39] RECOVERY - puppet last run on analytics1030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:26:44] thcipriani: okay! [15:26:49] RECOVERY - puppet last run on conf1001 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:27:59] PROBLEM - DPKG on oxygen is CRITICAL: DPKG CRITICAL dpkg reports broken packages [15:28:19] thcipriani: looks cool :) [15:28:34] thcipriani: Thank you and thanks for listing my boring story. [15:28:42] listening* [15:28:49] RECOVERY - puppet last run on conf1003 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [15:28:56] kart_: not a boring story, a weird and disconcerting story. [15:29:07] phuedx|SWAT: ok, you're up [15:29:09] RECOVERY - puppet last run on analytics1022 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:29:11] yo [15:29:12] here [15:29:13] :) [15:29:18] RECOVERY - puppet last run on analytics1038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:29:30] even got an irc nick to boot [15:29:58] RECOVERY - puppet last run on analytics1013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:00] RECOVERY - puppet last run on analytics1002 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:30:11] (03PS2) 10Andrew Bogott: Specify 1minute rather than '1 minute' for cronolog period. 
[puppet] - 10https://gerrit.wikimedia.org/r/220476 [15:30:46] ori: update, https://gerrit.wikimedia.org/r/#/c/220476/ is much less concerning now but probably still of interest. [15:31:16] (03CR) 10Ori.livneh: [C: 032] Specify 1minute rather than '1 minute' for cronolog period. [puppet] - 10https://gerrit.wikimedia.org/r/220476 (owner: 10Andrew Bogott) [15:31:58] RECOVERY - puppet last run on analytics1037 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:31:59] RECOVERY - puppet last run on analytics1032 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:32:08] hm, he lurks [15:32:09] RECOVERY - puppet last run on analytics1011 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:32:09] andrewbogott: https://gerrit.wikimedia.org/r/#/c/218383/ and https://gerrit.wikimedia.org/r/#/c/218383/ and https://gerrit.wikimedia.org/r/#/c/218380/ :) [15:32:17] andrewbogott: for salt and puppet autosigning, I made those lsat week... [15:32:29] dammit [15:32:31] ok :) [15:32:43] I should’ve stayed in bed yesterday [15:33:10] RECOVERY - puppet last run on analytics1014 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:33:29] RECOVERY - Host labstore2001 is UPING OK - Packet loss = 0%, RTA = 43.42 ms [15:33:39] RECOVERY - puppet last run on analytics1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:34:12] YuviPanda: two of those are the same? [15:34:23] andrewbogott: copy paste failure [15:34:31] (03PS9) 10Chad: Allow text-lb to redirect svn access to Diffusion [puppet] - 10https://gerrit.wikimedia.org/r/219228 [15:34:31] andrewbogott: but there are three patches in a series. [15:34:49] RECOVERY - puppet last run on analytics1028 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:35:24] hm, YuviPanda, actually my patches clean up puppetsigner.py as well, so maybe I like them better. 
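Note on _joe_'s 15:26:00 remark about the jmxtrans-log-purge exec ("you remove the running log file, so it remains as a deleted file hanging up"): the sketch below is only an illustration of that POSIX behaviour, not the puppet code under discussion. It uses a temporary file in place of /var/log/jmxtrans/*.log, and the function name is hypothetical. Unlinking a file that a process still holds open removes the name but not the data, so the space is not freed until the descriptor is closed.

```python
import os
import tempfile


def unlink_while_open():
    """Delete a log file while it is still open and report what is left.

    Returns (link_count, size_in_bytes, path_still_exists).
    Assumes POSIX semantics (Linux and similar systems).
    """
    fd, path = tempfile.mkstemp(suffix=".log")
    try:
        os.write(fd, b"x" * 4096)   # the running daemon has already logged some data
        os.unlink(path)             # what an `rm` of the live log file does
        os.write(fd, b"more")       # the daemon can keep writing to the deleted file
        st = os.fstat(fd)           # the open descriptor still reports the file
        return st.st_nlink, st.st_size, os.path.exists(path)
    finally:
        os.close(fd)                # only now can the kernel reclaim the blocks


if __name__ == "__main__":
    print(unlink_while_open())
```

This is presumably why the change linked at 15:26:24 ("Restart jmxtrans when purging log files") restarts the service: a restarted process closes the old descriptor and opens a fresh log file.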
[15:35:31] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) (owner: 10Jdlrobson) [15:35:32] Also, you should add me as a reviewer to things so I know about them :) [15:35:38] andrewbogott: ouch, yes. I forgot :| [15:35:46] andrewbogott: I poked you on the phab ticket and assumed that's enough... [15:35:54] (03Merged) 10jenkins-bot: Enable browse prototype on test- and enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) (owner: 10Jdlrobson) [15:36:01] andrewbogott: hmm, I like having hiera flags be specific about what they do than be about where they are applied [15:36:02] moritzm: Can we get a vote from you about auto-signing in labs? [15:36:40] andrewbogott: so I'd prefer it the setting was named after what it was doing (autosigning) vs where it is turned on in... [15:36:50] but yeah, will need to amend to fix puppetsigner... [15:37:12] thcipriani: git review hates me... [15:37:18] can't seem to do the submodule bumps here... [15:37:26] it's been stuck for a while. [15:37:39] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:50] YuviPanda: ok, so probably you like my salt patches fine then? 
[15:37:53] (03PS1) 10Gilles: Enable TinyRGB ICC profile swapping [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220485 (https://phabricator.wikimedia.org/T100976) [15:37:55] phuedx|SWAT: ok, going to pull down to test wiki now [15:38:03] thcipriani: kk [15:38:19] RECOVERY - puppet last run on conf1002 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [15:38:39] ACKNOWLEDGEMENT - DPKG on oxygen is CRITICAL: DPKG CRITICAL dpkg reports broken packages ottomata testing new kafkatee package [15:38:39] ACKNOWLEDGEMENT - puppet last run on oxygen is CRITICAL Puppet has 1 failures ottomata testing new kafkatee package [15:38:48] (03CR) 10Yuvipanda: [C: 031] Switch on salt auto_accept for labs. [puppet] - 10https://gerrit.wikimedia.org/r/220306 (https://phabricator.wikimedia.org/T102504) (owner: 10Andrew Bogott) [15:39:40] RECOVERY - puppet last run on analytics1021 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:39:43] phuedx|SWAT: should be on testwiki now [15:39:50] PROBLEM - puppet last run on baham is CRITICAL Puppet has 1 failures [15:39:56] lemme know if you're ready to make like [15:40:09] PROBLEM - puppet last run on labvirt1006 is CRITICAL puppet fail [15:40:18] PROBLEM - puppet last run on mc2005 is CRITICAL Puppet has 1 failures [15:40:19] PROBLEM - puppet last run on radon is CRITICAL Puppet has 1 failures [15:40:19] PROBLEM - puppet last run on elastic1020 is CRITICAL Puppet has 1 failures [15:40:19] thcipriani: testing now [15:40:29] PROBLEM - puppet last run on cp1067 is CRITICAL Puppet has 2 failures [15:40:38] PROBLEM - puppet last run on cp3021 is CRITICAL Puppet has 1 failures [15:40:48] PROBLEM - puppet last run on elastic1003 is CRITICAL Puppet has 1 failures [15:40:48] PROBLEM - puppet last run on achernar is CRITICAL Puppet has 1 failures [15:40:49] RECOVERY - Host labstore2001 is UPING OK - Packet loss = 0%, RTA = 44.82 ms [15:40:49] PROBLEM - puppet last run on analytics1003 is CRITICAL 
Puppet has 1 failures [15:40:49] RECOVERY - puppet last run on analytics1031 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:41:00] PROBLEM - puppet last run on db1056 is CRITICAL Puppet has 1 failures [15:41:08] PROBLEM - puppet last run on mc1017 is CRITICAL Puppet has 1 failures [15:41:09] PROBLEM - puppet last run on analytics1018 is CRITICAL Puppet has 2 failures [15:41:32] YuviPanda: want me to try to make the bump patch? [15:41:38] PROBLEM - puppet last run on lvs3003 is CRITICAL Puppet has 1 failures [15:41:46] thcipriani: yes please... [15:42:00] thcipriani: I merged them into wmf10 and wmf11 [15:42:08] PROBLEM - puppet last run on db2054 is CRITICAL Puppet has 1 failures [15:42:09] RECOVERY - puppet last run on elastic1020 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:42:18] YuviPanda: kk [15:42:38] RECOVERY - puppet last run on elastic1003 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:42:39] thcipriani: tested on testwiki and it seems ok [15:42:39] PROBLEM - puppet last run on mw2064 is CRITICAL Puppet has 1 failures [15:42:44] :) [15:42:54] (03PS8) 10Rush: Setup a node pool file from etcd for lvs cluster [puppet] - 10https://gerrit.wikimedia.org/r/219481 [15:42:59] RECOVERY - puppet last run on analytics1019 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:42:59] RECOVERY - puppet last run on analytics1029 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:43:00] phuedx|SWAT: okie doke, going to sync live [15:43:08] RECOVERY - puppet last run on analytics1036 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:43:26] (03CR) 10Rush: "NOTE: I added 'include pybal::confd' in modules/lvs/manifests/balancer.pp" [puppet] - 10https://gerrit.wikimedia.org/r/219481 (owner: 10Rush) [15:43:44] 6operations, 10ops-codfw: Labstore2001 controler or shelf failure - 
https://phabricator.wikimedia.org/T102626#1396726 (10Papaul) New controller in place. All shelves connected the same way it was on old controller. [15:43:49] RECOVERY - puppet last run on analytics1039 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:43:50] RECOVERY - puppet last run on analytics1034 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:44:17] !log thcipriani Synchronized wmf-config: SWAT: Enable browse prototype on test- and enwiki [[gerrit:219451]] (duration: 00m 12s) [15:44:23] Logged the message, Master [15:44:29] PROBLEM - puppet last run on mw2010 is CRITICAL Puppet has 1 failures [15:44:38] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:44:39] PROBLEM - puppet last run on mw1027 is CRITICAL Puppet has 1 failures [15:44:42] phuedx|SWAT: seeing a lot of this in fatalmonitor: Undefined variable: wmgMFIsBrowseEnabled in /srv/mediawiki/wmf-config/mobile.php on line 245 [15:45:33] andrewbogott: was in a meeting [15:45:42] but had enough spare cycles to look over the patch [15:45:51] ori: no worries; things are fixed for me, as long as they aren’t now broken elsewhere :) [15:46:29] phuedx|SWAT: reverting [15:46:39] PROBLEM - puppet last run on mw2019 is CRITICAL Puppet has 1 failures [15:46:48] thcipriani: of course -- go go -- but i'm not sure what's causing that [15:46:58] PROBLEM - puppet last run on mw2081 is CRITICAL Puppet has 1 failures [15:48:28] RECOVERY - puppet last run on analytics1018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:15] !log rebooting es2001 and es2002 [15:49:21] Logged the message, Master [15:51:45] YuviPanda: https://gerrit.wikimedia.org/r/#/c/218380/1 doesn’t do anything does it? It sets an argument that isn’t definied in the class it’s passed to... [15:51:55] am I missing the other half of that patch? [15:52:56] andrewbogott: ouch. 
[15:52:59] andrewbogott: that seems like it... [15:53:07] ok, stay tuned, I’ll commit a new patchset. [15:53:20] andrewbogott: cool. :) [15:54:03] 6operations, 10ops-codfw: Replace H800 controller on labstore2001 - https://phabricator.wikimedia.org/T102786#1396775 (10Papaul) 5Open>3Resolved Installation of new controller complete [15:54:19] RECOVERY - puppet last run on baham is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [15:54:59] RECOVERY - puppet last run on cp1067 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:09] RECOVERY - puppet last run on cp3021 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [15:55:14] (03PS1) 10Thcipriani: Revert "Merge "Enable browse prototype on test- and enwiki"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220487 [15:55:19] RECOVERY - puppet last run on achernar is OK Puppet is currently enabled, last run 56 seconds ago with 0 failures [15:55:19] RECOVERY - puppet last run on analytics1003 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [15:55:19] RECOVERY - puppet last run on mw2064 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:55:19] RECOVERY - puppet last run on mw2010 is OK Puppet is currently enabled, last run 46 seconds ago with 0 failures [15:55:30] RECOVERY - puppet last run on db1056 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:55:31] RECOVERY - puppet last run on mw1027 is OK Puppet is currently enabled, last run 7 seconds ago with 0 failures [15:55:32] (03CR) 10Thcipriani: [C: 032] "SWAT revert" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220487 (owner: 10Thcipriani) [15:55:38] RECOVERY - puppet last run on mc1017 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [15:55:39] (03Merged) 10jenkins-bot: Revert "Merge "Enable browse prototype on test- and enwiki"" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/220487 (owner: 10Thcipriani) [15:55:59] RECOVERY - puppet last run on mw2081 is OK Puppet is currently enabled, last run 33 seconds ago with 0 failures [15:56:09] RECOVERY - puppet last run on lvs3003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:29] RECOVERY - puppet last run on labvirt1006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:39] RECOVERY - puppet last run on mc2005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:39] RECOVERY - puppet last run on radon is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:56:39] RECOVERY - puppet last run on db2054 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures [15:56:41] thcipriani: ping me when you've reverted on live [15:56:51] thcipriani: thanks btw :) [15:57:08] !log thcipriani Synchronized wmf-config: SWAT: Revert Enable browse prototype on test- and enwiki (duration: 00m 15s) [15:57:10] ^ phuedx: [15:57:13] Logged the message, Master [15:57:29] RECOVERY - puppet last run on mw2019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:57:50] YuviPanda: any chance I could bump your patch? [15:57:56] is it critical? [15:59:40] YuviPanda: if we make the backport to wmf11 now it'll roll at 11ish with the train [15:59:51] acceptable? [15:59:55] thcipriani: yup, totally! [16:00:15] kk, I'll make and merge that bump [16:00:24] thcipriani: ty! [16:00:37] thcipriani: mind giving me a poke when doing the train as well? [16:00:52] YuviPanda: that's twentyafterfour 's department [16:00:59] alright! [16:01:02] I'll keep an eye out :0 [16:01:03] :) [16:01:09] thcipriani: thanks for doing the bump! :) [16:01:13] eh? 
[16:01:32] * twentyafterfour was pretending to be canadian [16:01:54] (03PS9) 10Rush: Setup a node pool file from etcd for lvs cluster [puppet] - 10https://gerrit.wikimedia.org/r/219481 [16:01:55] thcipriani: did you see those when it was on testwiki too btw? [16:02:37] twentyafterfour: ran out of time for swat, YuviPanda has a submodule bump that needs a full scap, ok to merge the bump into wmf11 pre-train? [16:02:53] thcipriani: sure [16:03:05] twentyafterfour: okie doke, thanks! [16:03:15] I don't normally do a full scap other than tuesday but I can today if it's needed [16:04:04] there's time to do it now. there's nothing on the calendar for 2 hours [16:04:06] I tried to automate puppet service silence in icinga across the lvs fleet and icinga is not honoring the external command file [16:04:23] so some puppet run things may crop up for lvs here fyi [16:05:18] bd808: that is true. sigh. YuviPanda: Ok, doing now :) [16:05:31] oh? [16:06:17] (03CR) 10Rush: [C: 032] Setup a node pool file from etcd for lvs cluster [puppet] - 10https://gerrit.wikimedia.org/r/219481 (owner: 10Rush) [16:06:39] yeah, bd808 pointed out that there's no deploys coming for the next couple hours, so SWAT can be a bit extended. [16:07:04] * bd808 is always making more work for folks [16:07:39] 6operations, 6Labs: Recover home folders and /data/project from wikimetrics1 - https://phabricator.wikimedia.org/T103530#1396826 (10yuvipanda) I'm going to: # Bring back /data/project # Copy the old contents of your home folders into /data/project/home This means your actual /home folders will not be on NFS,... 
[16:08:32] (03CR) 10Phuedx: Enable browse prototype on test- and enwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219451 (https://phabricator.wikimedia.org/T101155) (owner: 10Jdlrobson) [16:09:07] 6operations, 6Labs: Recover home folders and /data/project from wikimetrics1 - https://phabricator.wikimedia.org/T103530#1396827 (10yuvipanda) alright, if you run puppet on instances now you'll get /data/project back. I've copied over the contents of home folders into /data/project/home as well. [16:10:11] dang: 145994 Undefined variable: wmgMFIsBrowseEnabled in /srv/mediawiki/wmf-config/mobile.php on line 245 is huge on fatalmonitor, even though it's been reverted :( [16:10:40] wha!? [16:10:43] !log ori Synchronized php-1.26wmf11/includes/page/Article.php: I0e5f2d3b2: Revert r47388 / 8d9243cf3: Use Title::getLocalURL() for rel=canonical links (duration: 00m 13s) [16:10:47] Logged the message, Master [16:10:59] YuviPanda: I’m wrong, that patch was fine after all. [16:11:13] andrewbogott: did I make it take up autosign in a separate patch? [16:11:15] andrewbogott: oh [16:11:16] andrewbogott: right [16:11:21] andrewbogott: it takes an arbitrary list of arguments [16:11:23] I remember now! [16:11:24] it’s passed in as an arbitrary... [16:11:26] yeah :) [16:11:49] YuviPanda: did you get that submodule bump to wmf10? looks like it's merged now. [16:11:59] thcipriani: oh? no? [16:12:14] thcipriani: I merged it in the wmf10 and wmf11 branches, but not the submodule bumps [16:12:21] i merged the submodule bump [16:12:24] to 11 [16:12:34] ah! [16:12:39] who made the bump? me? [16:12:47] git review never returned so I assumed it didn't make it? [16:12:52] thcipriani: over what period does fatalmonitor operate? 
[16:12:57] (03PS2) 10Andrew Bogott: puppetmaster: Enable autosigning puppet certs for labs [puppet] - 10https://gerrit.wikimedia.org/r/218380 (https://phabricator.wikimedia.org/T102504) (owner: 10Yuvipanda) [16:12:59] (03PS2) 10Andrew Bogott: Switch on salt auto_accept for labs. [puppet] - 10https://gerrit.wikimedia.org/r/220306 (https://phabricator.wikimedia.org/T102504) [16:13:25] thcipriani: actually, firstly, is the number increasing? [16:13:34] YuviPanda: https://git.wikimedia.org/blobdiff/mediawiki%2Fcore/8c3d5909b4733663deb683c3bf4901b9d63c95d3/extensions%2FOpenStackManager [16:13:57] phuedx: yes it seem to be slowly creeping up, but the new code has been scapped out [16:14:01] thcipriani: oh, I see. [16:14:04] thcipriani: so that did vaguely work? [16:14:08] it was sky-rocketing [16:14:16] 6operations, 6Labs: Recover home folders and /data/project from wikimetrics1 - https://phabricator.wikimedia.org/T103530#1396833 (10mforns) Awesome, thanks! Personally, I don't need shared home folders much. [16:14:41] YuviPanda: yeah, there's something weird happening with submodule bumps happening today I think [16:15:10] (03PS1) 10Andrew Bogott: Remove some uses of scope.lookupvar by passing args more explicitly. [puppet] - 10https://gerrit.wikimedia.org/r/220489 [16:15:11] thcipriani: i'm still not sure how this notice could occur -- is there a test harness for mediawiki-config? the variable is clearly defined in initialisesettings in the same patch :/ [16:15:34] thcipriani, phuedx: https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor looks ok [16:15:51] (03CR) 10jenkins-bot: [V: 04-1] Remove some uses of scope.lookupvar by passing args more explicitly. 
[puppet] - 10https://gerrit.wikimedia.org/r/220489 (owner: 10Andrew Bogott) [16:15:56] jouncebot, next [16:15:56] In 1 hour(s) and 44 minute(s): MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150624T1800) [16:16:06] bd808: looking at the one on fluorine [16:16:32] (03Abandoned) 10Andrew Bogott: labs: Have salt master auto accept keys [puppet] - 10https://gerrit.wikimedia.org/r/218383 (https://phabricator.wikimedia.org/T102504) (owner: 10Yuvipanda) [16:16:50] thcipriani: should I do something? [16:16:56] (03Abandoned) 10Andrew Bogott: salt: Allow salt master to auto accept salt keys [puppet] - 10https://gerrit.wikimedia.org/r/218379 (https://phabricator.wikimedia.org/T102504) (owner: 10Yuvipanda) [16:16:57] TELL ME, SCAPMASTER! [16:17:07] The fatalmonitor script on fluorine can be deceiving. It just shows a summary of the last 1000 lines of hhvm.log [16:17:07] YuviPanda: no, I'm getting everything lined up on tin right now :) [16:17:13] are you deploying twentyafterfour? [16:17:14] WONDERFUL! [16:17:24] oh, that session is from yesterday [16:17:26] nevermind [16:17:41] so if that file isn't changing rapidly it shows things that happened quite a while ago [16:18:21] YuviPanda: ok, I think I’ve sorted out all those patches and made a reasonable 2-patch series. Still waiting for Moritz to approve though. [16:18:36] andrewbogott: +1 [16:19:00] thcipriani: the last line for that error was timestamped 15:56:58 (~20 minutes ago) [16:19:09] PROBLEM - Host labstore2001 is DOWN: PING CRITICAL - Packet loss = 100% [16:19:20] bd808: okie doke, good to know. phuedx everything should be fine after the revert, the line it's referring to doesn't exist anymore. [16:19:28] (03PS2) 10Andrew Bogott: Remove some uses of scope.lookupvar by passing args more explicitly. 
[puppet] - 10https://gerrit.wikimedia.org/r/220489 [16:21:58] YuviPanda: Ok, starting scap [16:21:58] (03CR) 10Andrew Bogott: Remove some uses of scope.lookupvar by passing args more explicitly. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/220489 (owner: 10Andrew Bogott) [16:22:02] Krinkle_: hey! [16:22:13] thcipriani: wheeee [16:22:29] !log thcipriani Started scap: SWAT: Automatically add to shell group when adding to a project [[gerrit:220468]] [16:22:34] Logged the message, Master [16:22:43] so confused about these auto-submodule bumps [16:23:06] auto-submodule bumps? [16:23:42] (03CR) 10Yuvipanda: Labs: More puppetization fixes for labstore* [puppet] - 10https://gerrit.wikimedia.org/r/218666 (https://phabricator.wikimedia.org/T102478) (owner: 10coren) [16:23:52] Krenair: yeah, just started happening today, lemme see if I can find a good example [16:25:15] Krenair: so like: https://github.com/wikimedia/mediawiki/commit/181ed3d667a83fc086b81425ea5989f6e2feddc7 without a corresponding gerrit patch: https://gerrit.wikimedia.org/r/#/q/181ed3d667a83fc086b81425ea5989f6e2feddc7,n,z [16:26:05] thcipriani, wat [16:26:14] PROBLEM - ensure confd service on lvs3001 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [16:26:27] how has this started happening automatically? I guess we need to update the docs now? [16:26:38] did we do a gerrit upgrade or something? [16:27:35] Krenair: Not that I'm aware of, I only became aware of it when I started looking at deployments lined up for SWAT this morning [16:30:30] thcipriani: which host are you seeing wmgMFIsBrowseEnabled issues on? [16:31:03] RECOVERY - DPKG on oxygen is OK: All packages OK [16:31:31] jdlrobson: just blew up in fatalmonitor post-scap [16:32:00] as phuedx stated i have no idea why it wouldn't be defined.. 
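Note on bd808's 16:17:07 and 16:17:41 explanation that fatalmonitor "just shows a summary of the last 1000 lines of hhvm.log": the sketch below is a hypothetical stand-in, not the actual script on fluorine. It assumes each log line starts with an HH:MM:SS timestamp followed by the message, which may not match the real hhvm.log format. It shows why a quiet log keeps reporting old errors: the tally covers the last n lines regardless of their age, so the newest timestamp has to be checked separately.

```python
from collections import Counter, deque


def summarize_tail(lines, n=1000):
    """Tally messages in the last n lines and report the newest timestamp.

    Each line is assumed to look like "HH:MM:SS message text".
    Returns (newest_timestamp_or_None, [(message, count), ...]).
    """
    tail = deque((line.rstrip("\n") for line in lines), maxlen=n)  # keep only the final n lines
    counts = Counter(line.split(" ", 1)[1] for line in tail if " " in line)
    newest = tail[-1].split(" ", 1)[0] if tail else None
    return newest, counts.most_common()


if __name__ == "__main__":
    log = ["15:40:00 Some other notice"] + [
        "15:56:58 Undefined variable: wmgMFIsBrowseEnabled"
    ] * 3
    newest, counts = summarize_tail(log)
    # If nothing new is written, the same counts are shown at any later time;
    # only the newest timestamp reveals that the errors have stopped.
    print(newest, counts)
```

Comparing the newest timestamp with the current time is what bd808 does by hand at 16:19:00, when he notes the last line for that error was about 20 minutes old.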
[16:32:14] andrewbogott: sure, I'll look into it tomorrow morning [16:32:24] moritzm: thanks [16:32:54] (03PS1) 10Giuseppe Lavagetto: varnish: allow picking which director is dynamic [puppet] - 10https://gerrit.wikimedia.org/r/220492 [16:33:02] jdlrobson: do you have access to https://logstash.wikimedia.org/#/dashboard/elasticsearch/fatalmonitor ? [16:33:13] <_joe_> bblack: ^^ this is a first stab at the problem which made the code better in general I think [16:33:29] i think so greg-g but i always take a while to remember howto use it [16:33:31] <_joe_> but lemme test it :P [16:34:06] jdlrobson: it's one of the best ways to diagnose issues :) [16:34:09] everyone should learn it [16:34:39] jdlrobson, you just log in with wikitech credentials [16:34:47] if you're in the wmf group it should work [16:34:56] https://logstash.wikimedia.org/#dashboard/temp/1PLreol6ScGWyZmJXXZ3Mg [16:35:00] thcipriani: jdlrobson ^ [16:35:16] and you do appear to be in that group: member: uid=jdlrobson,ou=people,dc=wikimedia,dc=org [16:35:53] greg-g, Krenair: thanks for that link [16:36:25] PROBLEM - DPKG on oxygen is CRITICAL: DPKG CRITICAL dpkg reports broken packages [16:36:33] it doesn't seem to give me any useful information... what is mw1136 ? [16:36:38] the host [16:36:46] mw1136.eqiad.wmnet [16:36:50] yeh but what does that mean :) [16:37:01] what do you mean? it's a hostname :) [16:37:16] maybe i should rephrase the question - how do i know which wikis this fatal is happening on? 
[16:37:27] i'm looking at the code and cannot understand why that variable would ever be undefined [16:37:32] it's in InitialiseSettings [16:37:37] sorry -- i'm in a meeting right now [16:37:53] in and out of here [16:38:23] RECOVERY - DPKG on oxygen is OK: All packages OK [16:38:28] jdlrobson: was only enabled for enwiki seemingly, should have been wmf10 branch [16:38:55] (03PS2) 10Giuseppe Lavagetto: varnish: allow picking which director is dynamic [puppet] - 10https://gerrit.wikimedia.org/r/220492 [16:40:19] bd808: oh boy scap failures [16:40:47] Krenair: a little late but no I'm not deploying right now [16:40:53] :) [16:41:04] bd808: https://gist.github.com/thcipriani/5a935ea0dee2217fa42c [16:41:14] thcipriani: like yesterday? [16:41:35] twentyafterfour: yeah, looks like [16:41:43] PROBLEM - Disk space on snapshot1001 is CRITICAL: DISK CRITICAL - free space: / 2 MB (0% inode=59%) [16:41:46] slightly different [16:41:49] thcipriani: not fun but not the end of the world. disk is full on snapshot1001.eqiad.wmnet [16:42:03] probably need to prune some old branches? [16:42:11] i don't really understand how this stuff works, but the only reason there would be a fatal is if InitaliseSettings.php was not running (or an old version was being executed) and mobile.php wasn't [16:42:16] but i don't understand how that's possible.. [16:42:46] thcipriani: i'm not sure i understand, the error is in the wmf-config, why would a branch affect that? [16:42:50] bd808: so do I trim branches and the re-scap? Or...? 
[16:42:52] ^^^ what jdlrobson said [16:43:06] bd808: I didn't get around to pruning branches yesterday [16:43:20] thcipriani: I think you're good actually, but twentyafterfour should trim some branches [16:43:43] * twentyafterfour has been meaning to automate that part [16:44:03] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [16:44:06] bd808: kk [16:45:44] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60576 bytes in 1.886 second response time [16:46:12] (03CR) 10Andrew Bogott: [C: 031] puppetmaster: Enable autosigning puppet certs for labs [puppet] - 10https://gerrit.wikimedia.org/r/218380 (https://phabricator.wikimedia.org/T102504) (owner: 10Yuvipanda) [16:49:29] (03CR) 10Matanya: "you will need a person with the ruby=foo" [puppet] - 10https://gerrit.wikimedia.org/r/220489 (owner: 10Andrew Bogott) [16:49:51] (03CR) 10Alex Monk: "And submit please?" [puppet] - 10https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) (owner: 10Alex Monk) [16:49:54] andrewbogott_afk: were you able to fix 9bb3219a67a671 and labcontrol1001 ? [16:50:35] (03PS1) 10Jcrespo: Repool es2001, es2002 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220494 [16:51:22] (03CR) 10Jcrespo: [C: 032] Repool es2001, es2002 after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220494 (owner: 10Jcrespo) [16:52:21] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1397025 (10Cmjohnson) @fgiunchedi I have swapped the 2 disks with 800Gb Intel s3700 ssds. [16:53:04] bd808: should we still keep old branches around for 5 weeks? Seems like I remember something about cache lifetime increasing recently [16:53:17] cmjohnson1: nice! [16:53:38] hopefully better results [16:54:14] either way we should get some clarity [16:55:04] cmjohnson1: sweet, thanks! [16:55:41] thcipriani: any... progress? 
[16:55:50] YuviPanda: one proxy left [16:55:55] then rebuild cdb [16:56:09] ah cool [16:59:35] (03PS1) 10Dzahn: Bugzilla: remove module, keep static version [puppet] - 10https://gerrit.wikimedia.org/r/220495 (https://phabricator.wikimedia.org/T103193) [17:03:29] (03PS1) 10Rush: confd: hiera specify confd::srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/220497 [17:03:50] PROBLEM - ensure confd service on lvs4004 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:03:51] PROBLEM - ensure confd service on lvs3003 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:04:03] !log thcipriani scap failed: OSError [Errno 2] No such file or directory: '/var/lock/scap' (duration: 41m 33s) [17:04:07] Logged the message, Master [17:04:38] (03CR) 10John F. Lewis: [C: 04-1] "Looks good though will break puppet on zirconium as it still calls for role::bugzilla." [puppet] - 10https://gerrit.wikimedia.org/r/220495 (https://phabricator.wikimedia.org/T103193) (owner: 10Dzahn) [17:04:40] 6operations, 10Wikimedia-Git-or-Gerrit: Remove Java 6 from ytterbium.wikimedia.org (Gerrit production host) - https://phabricator.wikimedia.org/T103668#1397069 (10MoritzMuehlenhoff) That won't work: There's a gerrit deb which depends on openjdk-6-jre [17:04:42] PROBLEM - ensure confd service on lvs4001 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:05:02] !log scap completed with the exception of snapshot1001 that's disk is full [17:05:05] ^ YuviPanda [17:05:06] Logged the message, Master [17:05:20] PROBLEM - ensure confd service on lvs4003 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:05:21] PROBLEM - ensure confd service on lvs3002 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:05:30] PROBLEM - ensure confd service on lvs1004 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:05:31] (03PS1) 10Dzahn: misc-web: delete Varnish config for dev.wikimedia [puppet] - 
10https://gerrit.wikimedia.org/r/220498 (https://phabricator.wikimedia.org/T305) [17:05:32] thcipriani: looking! [17:05:36] who's the best person to talk to about InitialiseSettings.php, when it gets run, and how it might not define a variable for mobile.php to consume? [17:05:42] PROBLEM - ensure confd service on lvs3004 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:06:29] (03PS1) 10Cmjohnson: Adding dns for labcontrol1002 and removing virt1000 name from mgmt [dns] - 10https://gerrit.wikimedia.org/r/220499 [17:06:43] (03CR) 10jenkins-bot: [V: 04-1] Adding dns for labcontrol1002 and removing virt1000 name from mgmt [dns] - 10https://gerrit.wikimedia.org/r/220499 (owner: 10Cmjohnson) [17:06:51] PROBLEM - ensure confd service on lvs4002 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:07:15] (03CR) 10Rush: [C: 032] confd: hiera specify confd::srv_dns [puppet] - 10https://gerrit.wikimedia.org/r/220497 (owner: 10Rush) [17:09:24] was there a change to nginx and submodule again [17:11:10] (03PS2) 10Dzahn: Bugzilla: remove module, keep static version [puppet] - 10https://gerrit.wikimedia.org/r/220495 (https://phabricator.wikimedia.org/T103193) [17:12:33] (03PS2) 10Cmjohnson: Adding dns for labcontrol1002 and removing virt1000 name from mgmt [dns] - 10https://gerrit.wikimedia.org/r/220499 [17:12:34] 7Blocked-on-Operations, 6operations, 10Parsoid: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1397086 (10ssastry) >>! In T102039#1393314, @cscott wrote: > But `git deploy service restart` worked fine when I was doing my... [17:12:41] RECOVERY - ensure confd service on lvs1004 is OK: PROCS OK: 1 process with args /usr/bin/confd [17:12:49] thcipriani: seems ok! [17:13:04] YuviPanda: nice! 
[17:13:20] ok I cleaned up one old branch [17:13:21] (03Abandoned) 10Dzahn: move public IP from virt1000 to labcontrol1002 [dns] - 10https://gerrit.wikimedia.org/r/220314 (owner: 10Dzahn) [17:13:32] (03Abandoned) 10Dzahn: remove virt1000 [dns] - 10https://gerrit.wikimedia.org/r/220311 (https://phabricator.wikimedia.org/T1002005) (owner: 10Dzahn) [17:13:50] 7Blocked-on-Operations, 6operations, 10Parsoid: Disabling agent forwarding breaks dsh based restarts for Parsoid (required for deployments) - https://phabricator.wikimedia.org/T102039#1397094 (10ssastry) Oh, as soon as I typed that .. I realized that this doesn't happen when ariel / akosiaris restart. So, ma... [17:15:01] PROBLEM - puppet last run on lvs1002 is CRITICAL puppet fail [17:15:06] and I shared my script that automates the process [17:15:13] (that is, I submitted it to gerrit) [17:15:24] https://gerrit.wikimedia.org/r/#/c/220500/ [17:16:31] PROBLEM - puppet last run on lvs1001 is CRITICAL puppet fail [17:17:01] (03CR) 10John F. Lewis: [C: 031] Bugzilla: remove module, keep static version [puppet] - 10https://gerrit.wikimedia.org/r/220495 (https://phabricator.wikimedia.org/T103193) (owner: 10Dzahn) [17:17:07] (03PS1) 1020after4: symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220503 [17:17:18] (03PS1) 10Dzahn: dev.wikimedia.org: delete puppet module [puppet] - 10https://gerrit.wikimedia.org/r/220504 (https://phabricator.wikimedia.org/T305) [17:19:19] (03CR) 10Cmjohnson: [C: 032] Adding dns for labcontrol1002 and removing virt1000 name from mgmt [dns] - 10https://gerrit.wikimedia.org/r/220499 (owner: 10Cmjohnson) [17:20:04] thcipriani, finished syncing? [17:20:11] jynus: yup [17:20:18] 6operations: Ensure kernel and OpenJDK fixes for leap second are present - https://phabricator.wikimedia.org/T103479#1397111 (10GWicke) @MoritzMuehlenhoff: is the plan to smear the leap, or will we let it happen normally? 
In principle it does make a difference to Cassandra as it's using timestamps for conflict... [17:20:20] thank you! [17:20:42] (03PS1) 10Rush: confd: hiera confd::srv_dns to eqiad/confd.yaml [puppet] - 10https://gerrit.wikimedia.org/r/220505 [17:21:29] 6operations, 5Patch-For-Review: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1397124 (10MoritzMuehlenhoff) Patch needs more work, latest installed systems didn't install 3.19 [17:21:41] twentyafterfour, I also see some log from you, can I quickly sync 1 file? [17:22:11] RECOVERY - Host labstore2001 is UPING OK - Packet loss = 0%, RTA = 43.75 ms [17:23:15] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "small comment, then LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/220505 (owner: 10Rush) [17:24:21] RECOVERY - puppet last run on lvs1002 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:24:45] phuedx|AFK: I might be able to to help you grok how multiversion & InitializeSettings works. It definitely has some non-intuitative quirks [17:24:47] !log jynus Synchronized wmf-config/db-codfw.php: repool es2001 and es2002 after maintenance (duration: 00m 13s) [17:24:51] Logged the message, Master [17:24:56] Robh: did you see the email to security@ ? 
[17:25:39] (03PS2) 10Rush: confd: hiera confd::srv_dns to eqiad/confd.yaml [puppet] - 10https://gerrit.wikimedia.org/r/220505 [17:27:52] ok, tomorrow I will be able to close T101084 [17:28:40] I will also restart db1018 to check performance next week without P_S [17:28:54] (03CR) 10Rush: [C: 032] confd: hiera confd::srv_dns to eqiad/confd.yaml [puppet] - 10https://gerrit.wikimedia.org/r/220505 (owner: 10Rush) [17:28:55] 6operations, 10ops-eqiad, 6Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Rename virt1000 to labcontrol1002, move to same subnet as labcontrol1001 - https://phabricator.wikimedia.org/T102646#1397149 (10Cmjohnson) Physically moved the server to row/rack C7 Completed dns changes for labcontrol1002 https://g... [17:29:47] (03PS3) 10Giuseppe Lavagetto: varnish: allow picking which director is dynamic [puppet] - 10https://gerrit.wikimedia.org/r/220492 [17:30:14] paravoid: yt? [17:30:25] Robh: ping [17:30:26] ottomata: yes but @ SoS :) [17:30:37] the irony! [17:30:42] 6operations: Ensure kernel and OpenJDK fixes for leap second are present - https://phabricator.wikimedia.org/T103479#1397168 (10MoritzMuehlenhoff) The patches to smear the leap second have only merged into the Linux kernel last week. We plan to disable NTP on the 29th, so the leap second won't be communicated t... [17:31:21] RECOVERY - puppet last run on lvs1001 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [17:31:59] haha [17:32:02] np paravoid :) [17:32:22] Is anyone online here subscribed to the security@ list? 
[17:32:57] I was cc'ed on a mail to security@ but I'm on vacation (and on my phone) [17:33:17] I'd like to make sure someone is following up on it promptly [17:34:55] cscott_phone: yes but I also see csteipp on my screen so I'm sure he'll follow up soon [17:35:10] PROBLEM - ensure confd service on lvs1001 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:35:10] cscott_phone: like, deal with it, not just respond to the email :) [17:35:20] cscott_phone: but thanks for caring! enjoy your vacation :) [17:35:21] hello, can I talk to someone about an XSS issue? [17:35:38] michael stone asked me to poke on this channel [17:36:08] wirtha: that's what I was poking them about above [17:36:33] ah okay, I just got on the channel so I don't have scrollback [17:36:51] PROBLEM - ensure confd service on lvs1002 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:36:52] wirtha: by second-hand report, csteipp is on it, I'm trying to get more direct confirmation [17:37:09] hi wirtha, i think csteipp should be here soon if you're able to stick around for a little while. he was on a train when you emailed. 
[17:37:35] yeah I can lurk for a while [17:37:43] great ok [17:38:01] PROBLEM - ensure confd service on lvs1003 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:38:31] PROBLEM - ensure confd service on lvs1004 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:39:04] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums: terbium et al - php-luasandbox must install without errors and luasandbox must be enabled - https://phabricator.wikimedia.org/T101583#1397204 (10faidon) [17:40:01] PROBLEM - ensure confd service on lvs1005 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:40:20] PROBLEM - ensure confd service on lvs1006 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/confd [17:42:00] (03PS1) 10Giuseppe Lavagetto: lvs: fix name of conftool cluster [puppet] - 10https://gerrit.wikimedia.org/r/220508 [17:42:10] <_joe_> chasemp: ^^ [17:42:17] yeah I see that [17:42:17] <_joe_> we should allow more than one process [17:42:33] was just trying to catch it to see what it was actually doing [17:42:37] but yeah on that now [17:42:38] <_joe_> before others shout at us [17:42:53] <_joe_> yeah I think it just forks [17:43:44] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs: fix name of conftool cluster [puppet] - 10https://gerrit.wikimedia.org/r/220508 (owner: 10Giuseppe Lavagetto) [17:45:30] (03PS1) 10Rush: confd: nrpe two procs is still sane [puppet] - 10https://gerrit.wikimedia.org/r/220510 [17:46:02] (03PS2) 10Rush: confd: nrpe two procs is still sane [puppet] - 10https://gerrit.wikimedia.org/r/220510 [17:46:09] (03CR) 10Rush: [C: 032] confd: nrpe two procs is still sane [puppet] - 10https://gerrit.wikimedia.org/r/220510 (owner: 10Rush) [17:46:19] (03CR) 10Rush: [V: 032] confd: nrpe two procs is still sane [puppet] - 10https://gerrit.wikimedia.org/r/220510 (owner: 10Rush) [17:50:28] paravoid: nm, i figured it out: was getting weird postinst errors after your changes, but it was due to an old upstart init file 
being in place, and the #DEBHELPER# include in the postinst script saw it and then got upset [17:51:01] RECOVERY - ensure confd service on lvs1003 is OK: PROCS OK: 2 processes with args /usr/bin/confd [17:51:22] RECOVERY - ensure confd service on lvs4003 is OK: PROCS OK: 2 processes with args /usr/bin/confd [17:51:51] RECOVERY - ensure confd service on lvs3004 is OK: PROCS OK: 2 processes with args /usr/bin/confd [17:53:01] paravoid: pong (Re: Hey!) [17:53:12] RECOVERY - ensure confd service on lvs1004 is OK: PROCS OK: 2 processes with args /usr/bin/confd [17:53:19] Krinkle: hi! :) [17:53:27] Krinkle: really nice work on the perf.wm.org stuff [17:54:39] paravoid: thx [17:54:40] RECOVERY - ensure confd service on lvs1005 is OK: PROCS OK: 2 processes with args /usr/bin/confd [17:54:48] I have a couple of ideas to throw at you :) [17:54:50] (03PS4) 10Andrew Bogott: Get rid of unnecessary WikitechPrivateSettings.php [puppet] - 10https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) (owner: 10Alex Monk) [17:54:51] RECOVERY - ensure confd service on lvs1006 is OK: PROCS OK: 2 processes with args /usr/bin/confd [17:55:21] RECOVERY - ensure confd service on lvs3003 is OK: PROCS OK: 2 processes with args /usr/bin/confd [17:55:35] paravoid: Sure :) I'm all ears [17:55:37] the first one is, 95p is so vastly higher than median usually, that it completely hides the variations in median [17:55:52] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1397255 (10BBlack) >>! In T102557#1394797, @GWicke wrote: > @bblack, we also have 18x1T Samsung 840 Evos in the Cassandra cluster, which have worked well so far and cost significa... 
[17:55:59] Note I took https://gdash.wikimedia.org/dashboards/frontend/ as start [17:56:04] this is something that we sometimes fix with a logScale, but in this case it might be better to just add boxes to hide/unhide [17:56:16] yeah, the other idea is this [17:56:22] Ah, interesting. Yeah [17:56:22] this is already superior to gdash in some ways [17:56:34] greg-g: uh, can I sync out a small fix for a flow fatal before the train? patch is https://gerrit.wikimedia.org/r/220514 [17:56:36] I just took for granted that we want those two sine everybody does that [17:56:37] good point :) [17:56:40] RECOVERY - ensure confd service on lvs4004 is OK: PROCS OK: 2 processes with args /usr/bin/confd [17:56:53] and I'd love if it we could merge efforts and places to look at [17:57:01] paravoid: Yeah. [17:57:11] RECOVERY - ensure confd service on lvs1001 is OK: PROCS OK: 2 processes with args /usr/bin/confd [17:57:14] paravoid: To be honest, I actually hope to replace this particular script relative soon. [17:57:17] you're already 90% there [17:57:29] replace with what? [17:57:35] paravoid: I like my script for its dynamic nature, but it doesn't scale for unrelated graphs. [17:57:51] It's built very procedurally. [17:58:16] legoktm: sure, just coordinate with twentyafterfour [17:58:17] replace it with something else that exists already or rewrite it you mean? [17:58:24] paravoid: I'm hoping to replace it with a Grafana dashboard. Which has these capabilities and more. And it renders as SVG interactive instead of PNG (it fetches JSON from graphite and renders-clientside) allowing you to freely zoom in etc. 
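Note on the p95-versus-median point above: a minimal sketch, with made-up numbers, of why a median shift nearly vanishes on a linear axis that also has to fit p95, and why a log scale (or hiding the p95 series) brings it back. The latencies, axis ranges and chart height are illustrative assumptions, not values from the dashboards discussed.

```python
import math

def linear_px(value, lo, hi, height):
    """Vertical pixel position of value on a linear axis spanning lo..hi."""
    return (value - lo) / (hi - lo) * height

def log_px(value, lo, hi, height):
    """Vertical pixel position of value on a log10 axis spanning lo..hi."""
    span = math.log10(hi) - math.log10(lo)
    return (math.log10(value) - math.log10(lo)) / span * height

# Hypothetical numbers: median moves from 200 ms to 300 ms while p95 sits
# near 8000 ms, so the axis has to reach roughly 10000 ms.
HEIGHT = 100  # chart height in pixels (assumed)
linear_shift = linear_px(300, 0, 10000, HEIGHT) - linear_px(200, 0, 10000, HEIGHT)
log_shift = log_px(300, 100, 10000, HEIGHT) - log_px(200, 100, 10000, HEIGHT)

print(f"median shift on linear axis: {linear_shift:.2f} px")
print(f"median shift on log axis:    {log_shift:.2f} px")
```

With these assumed values the 50% change in the median moves the line about 1 px on the linear axis and close to 9 px on the log axis.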
[17:58:25] (03Abandoned) 10Dzahn: virt1000: remove from site.pp and DHCP [puppet] - 10https://gerrit.wikimedia.org/r/220304 (https://phabricator.wikimedia.org/T102005) (owner: 10Dzahn) [17:58:34] oh [17:58:37] twentyafterfour: I have a small flow patch that I'm going to sync out before the train https://gerrit.wikimedia.org/r/220514 [17:58:40] honestly, as I was saying to ori [17:58:41] But.. grafana is always broken when I look at it [17:58:42] RECOVERY - ensure confd service on lvs3002 is OK: PROCS OK: 2 processes with args /usr/bin/confd [17:58:45] grafana is super slow for me [17:58:51] every button I click it throws javascript exceptions all over the place [17:58:54] the install we have is broken somehow [17:58:57] and broken often and with a weird UX [17:59:00] RECOVERY - ensure confd service on lvs1002 is OK: PROCS OK: 2 processes with args /usr/bin/confd [17:59:05] I like those simple dashboards that gdash provides [17:59:12] paravoid: Yeah, the UX to create new stuff is terrible in Grafana [17:59:27] but the end result I would argue is the best I've seen anywhere in Graphite world. [17:59:34] especially if we fix some of its issues, e.g. gdash doesn't currently allow time periods which is annoying [17:59:38] (hence https://phabricator.wikimedia.org/T98134 ) [18:00:02] RECOVERY - ensure confd service on lvs4001 is OK: PROCS OK: 2 processes with args /usr/bin/confd [18:00:04] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150624T1800). Please do the needful. [18:00:18] paravoid: Yeah, that's a big issue. 
Makes it unattractive to add new dashboards and makes existing ones too much scroll work [18:00:26] yup [18:00:42] something like your page seems fine to me, honestly [18:00:53] *to complement grafana, not replace it) [18:01:18] bd808: that'd be grand, thanks [18:01:31] (03PS6) 10Dzahn: mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) [18:01:53] i thought it was a fairly innocuous patch that replicated what we're already doing for other features (https://gerrit.wikimedia.org/r/#/c/219451/5) [18:02:10] paravoid: http://play.grafana.org/dashboard/db/graph-styles [18:02:18] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [18:02:33] !log legoktm Synchronized php-1.26wmf11/extensions/Flow/includes/Specials/SpecialEnableFlow.php: https://gerrit.wikimedia.org/r/#/c/220514/ (duration: 00m 15s) [18:02:37] paravoid: You can change the timer similarly to Kibana. And you can drag select inside a graph to focus on any arbirary subset of time [18:02:38] Logged the message, Master [18:02:52] twentyafterfour: I'm done [18:03:35] it was pushed to testwiki and appeared to work fine, then it moved to enwiki and notices abound [18:03:38] Krinkle: I should send you a screenshot from my CPU applet... [18:03:39] (03PS7) 10Dzahn: mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) [18:03:53] mutante: why are you copying everything?? 
[18:03:53] phuedx: I'll take a look [18:04:00] RECOVERY - ensure confd service on lvs4002 is OK: PROCS OK: 2 processes with args /usr/bin/confd [18:04:22] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: update font packages for jessie [puppet] - 10https://gerrit.wikimedia.org/r/218640 (https://phabricator.wikimedia.org/T102623) (owner: 10Dzahn) [18:04:44] mutante: keep the common set same, change just the differences with a conditional [18:05:04] mutante: this is hard to read/change atm [18:05:21] RECOVERY - ensure confd service on lvs3001 is OK: PROCS OK: 2 processes with args /usr/bin/confd [18:07:41] so that i'm not touching anything on any existing servers [18:07:42] ok [18:08:02] 6operations: linux 3.19 not installed by default on jessie - https://phabricator.wikimedia.org/T103721#1397288 (10fgiunchedi) 3NEW [18:08:07] 6operations, 10ops-eqiad, 6Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Rename virt1000 to labcontrol1002, move to same subnet as labcontrol1001 - https://phabricator.wikimedia.org/T102646#1397295 (10Andrew) 5Open>3Resolved thank you! I can do the install. [18:08:18] moritzm: ^ likely in-target [18:11:45] oops, that would be my fault then :) [18:12:00] (03PS1) 10Filippo Giunchedi: install-server: run lsb_release using chroot [puppet] - 10https://gerrit.wikimedia.org/r/220521 (https://phabricator.wikimedia.org/T103721) [18:12:23] paravoid: hehe I wasn't aware too in-target doesn't echo [18:16:21] paravoid: a couple minor updates just rolled out https://performance.wikimedia.org/navtiming/ [18:16:36] phuedx: I think you may have just been bitten by the way that thcipriani synced the changes. It looks like he did a `sync-dir wmf-config` which has a potential for problems when new vars are introduced in InitialiseSettings.php. [18:16:44] mostly just different defaults, but makes it more intuitive I hope [18:16:47] Krinkle: what is the drop in dns lookup? [18:16:53] matanya: I know, right! 
[18:17:02] matanya: For desktop it went back up a litlte, but for mobile it stayed down [18:17:06] massive massive drop [18:17:12] from 800ms to < 100ms [18:17:13] no idea [18:17:17] could be HTTPS related [18:17:22] was the same day we changed canonical [18:17:26] 6operations, 10ops-eqiad: analytics1016 down due to power issue(?) - https://phabricator.wikimedia.org/T103544#1397345 (10Cmjohnson) I requested a new system board replacement. I will update when I get more details Congratulations: Work Order SR912975302 was successfully submitted. [18:17:30] bd808: is this something i can fix on my side or is it a note for the next deployer? [18:17:32] perhaps due to ability to use HTTP2 or SPDY [18:17:38] was looking at this on the 10th and it was around the 700 [18:17:41] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:17:43] seen best in the last month view [18:17:51] bd808: you mean one file got there before the other one? [18:17:52] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [18:18:05] so not defined? [18:18:08] phuedx: There is a cache file for the values from that file on each MW host. When the timestamp on that cache file is newer than the timestamp on InitialiseSettings.php then cache is used instead of running InitialiseSettings.php. [18:18:26] ah [18:18:26] sync-dir doesn't touch the file automatically on tin [18:18:31] Krinkle: in the year view it is very very clear [18:18:43] which can mean that it has a stale timestamp [18:19:07] 6operations, 10ops-eqiad: analytics1016 down due to power issue(?) 
- https://phabricator.wikimedia.org/T103544#1397359 (10Cmjohnson) p:5Normal>3High [18:19:13] there is code in the scap system that touches the file after the sync finishes though [18:19:38] that applies to sync-dir too [18:20:12] but the safe thing to do is touch and sync InitialiseSettings first and then sync files that depend on new vars [18:21:45] thcipriani: I *think* they should have gotten applied at almost the same time (as fast as rsync can do the mv commands) [18:22:07] but I have seen this kind of error storm before on new vars [18:22:27] and it is almost always fixed by touching InitialiseSettings.php and syncing it again [18:23:12] bd808: ok, made a note of that for when this comes up in the future. Thanks for digging. [18:23:27] thanks bd808/thcipriani [18:23:37] i'm going to resubmit the patch and add it to the evening swat [18:24:00] There is logic in scap.tasks.sync_common that tries to solve this by touching on the hosts right after the rsync finishes but it doesn't always take effect before a smallish error storm [18:24:01] godog: nice, will have a look tomorrow [18:24:12] and all of that has been etched permanently into my brain at the expense of precious memories with my children or whatever :D [18:24:27] phuedx: hehe [18:24:34] moritzm: ok! I'll pester bblack in the meantime :) [18:25:18] bblack: we were talking about https://gerrit.wikimedia.org/r/#/c/220521/1 [18:25:38] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1397382 (10Cmjohnson) I am going to RMA the 2 Samsung disks from restbase1008 [18:26:04] thcipriani: it's a lame quirk of our system but in general when a new var is added to InitialiseSettings.php it is safest to sync that file, wait to see if things blow up and then sync other config files that need the new var [18:26:58] cache invalidation is hard yo :) [18:27:18] ^ that [18:27:19] that's the word on the streets. 
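Note on the cache behaviour bd808 describes above: a minimal sketch of a timestamp-based freshness check, assuming only what was said in the channel (the cache is used when its timestamp is newer than the settings file's, and touching the settings file forces a rebuild). The function name and file names are hypothetical; this is not the actual MediaWiki or scap code.

```python
import os
import tempfile

def cache_is_fresh(source_path, cache_path):
    """Return True when the cache file exists and is newer than the source."""
    if not os.path.exists(cache_path):
        return False
    return os.path.getmtime(cache_path) > os.path.getmtime(source_path)

def demo():
    """Show a synced-but-untouched source being masked by a newer cache."""
    with tempfile.TemporaryDirectory() as tmp:
        source = os.path.join(tmp, "settings.php")  # hypothetical file name
        cache = os.path.join(tmp, "settings.cache")  # hypothetical file name
        for path in (source, cache):
            with open(path, "w") as handle:
                handle.write("placeholder\n")

        # The source arrives with an old mtime preserved by the sync,
        # while the cache on the host was written later.
        os.utime(source, (1000, 1000))
        os.utime(cache, (2000, 2000))
        before_touch = cache_is_fresh(source, cache)

        # "Touching" the source gives it a newer mtime than the cache,
        # so the next check treats the cache as stale.
        os.utime(source, (3000, 3000))
        after_touch = cache_is_fresh(source, cache)
        return before_touch, after_touch

if __name__ == "__main__":
    print(demo())
```

In this sketch the cache still counts as fresh until the source is touched, which matches the advice above to touch and sync the settings file first and only then sync the files that depend on the new variable.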
[18:27:38] (03CR) 10Andrew Bogott: [C: 04-1] "Since the new primary labs controller is labcontrol1001, the substitution should be for that instead of 1002." [puppet] - 10https://gerrit.wikimedia.org/r/220309 (https://phabricator.wikimedia.org/T1002005) (owner: 10Dzahn) [18:27:43] 6operations, 10ops-eqiad, 10RESTBase: investigate new restbase machine disks timeouts - https://phabricator.wikimedia.org/T102557#1397386 (10fgiunchedi) @cmjohnson we'll run a few tests today/tomorrow to rule out the disks vs other components btw, I'm not 100% sure the disks are faulty yet [18:27:49] cscott_away: jgage: any updates? [18:31:31] RECOVERY - RAID on restbase1008 is OK Active: 6, Working: 6, Failed: 0, Spare: 0 [18:31:42] RECOVERY - salt-minion processes on restbase1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:31:42] RECOVERY - DPKG on restbase1008 is OK: All packages OK [18:31:45] (03PS1) 10Phuedx: Enable browse prototype on test- and enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220526 (https://phabricator.wikimedia.org/T101155) [18:32:12] RECOVERY - configured eth on restbase1008 is OK - interfaces up [18:32:31] RECOVERY - dhclient process on restbase1008 is OK: PROCS OK: 0 processes with command name dhclient [18:32:31] RECOVERY - Disk space on restbase1008 is OK: DISK OK [18:32:47] (03CR) 10Phuedx: "Jdlrobson's original patch is here: https://gerrit.wikimedia.org/r/#/c/219451/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220526 (https://phabricator.wikimedia.org/T101155) (owner: 10Phuedx) [18:33:39] phuedx: so testwiki doesn't get included in default ? [18:34:05] (03PS1) 10Andrew Bogott: Remove puppet defs for virt1000; add them for labcontrol1002. 
[puppet] - 10https://gerrit.wikimedia.org/r/220528 (https://phabricator.wikimedia.org/T103722) [18:34:27] (03PS1) 1020after4: group1 wikis to 1.26wmf11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220529 [18:35:12] (03CR) 1020after4: [C: 032] symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220503 (owner: 1020after4) [18:35:18] (03Merged) 10jenkins-bot: symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220503 (owner: 1020after4) [18:35:43] (03PS2) 10Andrew Bogott: Remove puppet defs for virt1000; add them for labcontrol1002. [puppet] - 10https://gerrit.wikimedia.org/r/220528 (https://phabricator.wikimedia.org/T103722) [18:36:06] jdlrobson: i followed the standard in the rest of the config [18:36:17] testwiki, test2wiki appear to be singled out [18:36:17] (03CR) 1020after4: [C: 032] group1 wikis to 1.26wmf11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220529 (owner: 1020after4) [18:36:19] throughout [18:36:23] (03Merged) 10jenkins-bot: group1 wikis to 1.26wmf11 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220529 (owner: 1020after4) [18:39:21] PROBLEM - puppet last run on mw2157 is CRITICAL Puppet has 1 failures [18:39:41] (03PS3) 10Dzahn: Bugzilla: remove module, keep static version [puppet] - 10https://gerrit.wikimedia.org/r/220495 (https://phabricator.wikimedia.org/T103193) [18:40:17] (03Abandoned) 10Andrew Bogott: Rename virt1000 to labcontrol1002. 
[puppet] - 10https://gerrit.wikimedia.org/r/219849 (https://phabricator.wikimedia.org/T102646) (owner: 10Andrew Bogott) [18:41:52] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.26wmf11 [18:42:00] Logged the message, Master [18:43:29] (03CR) 10Dzahn: [C: 032] Bugzilla: remove module, keep static version [puppet] - 10https://gerrit.wikimedia.org/r/220495 (https://phabricator.wikimedia.org/T103193) (owner: 10Dzahn) [18:44:31] (03CR) 10Dzahn: "merged on puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/219404 (https://phabricator.wikimedia.org/T102361) (owner: 10Alex Monk) [18:44:50] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [18:45:01] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [18:49:01] RECOVERY - Cassandra database on restbase1008 is OK: PROCS OK: 1 process with UID = 111 (cassandra), command name java, args CassandraDaemon [18:49:30] RECOVERY - puppet last run on restbase1008 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [18:50:12] RECOVERY - NTP on restbase1008 is OK: NTP OK: Offset -0.001926541328 secs [18:51:38] (03PS1) 10Giuseppe Lavagetto: confctl: allow regex expression and a global "all" [software/conftool] - 10https://gerrit.wikimedia.org/r/220536 [18:52:00] <_joe_> bblack: ^^ this will make us happy [18:55:07] (03CR) 10Steinsplitter: [C: 031] Allow a full text search button on Commons whenever possible [mediawiki-config] - 10https://gerrit.wikimedia.org/r/186916 (https://phabricator.wikimedia.org/T19471) (owner: 10Nemo bis) [18:55:40] RECOVERY - puppet last run on mw2157 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [18:56:14] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1397474 (10Eevans) Now that 
[[https://phabricator.wikimedia.org/T102557#1397025|restbase1008 is fitted with different disks]], I propose... [18:56:35] (03Abandoned) 10Dzahn: virtscripts: replace virt1000 with labcontrol1002 [puppet] - 10https://gerrit.wikimedia.org/r/220309 (https://phabricator.wikimedia.org/T1002005) (owner: 10Dzahn) [19:02:56] (03PS1) 10Dzahn: mariadb: drop grants for bugzilla [puppet] - 10https://gerrit.wikimedia.org/r/220548 (https://phabricator.wikimedia.org/T103193) [19:05:18] 10Ops-Access-Requests, 6operations: Requesting access to sodium for John Lewis - https://phabricator.wikimedia.org/T102075#1397495 (10Dzahn) [19:05:50] RECOVERY - RAID on graphite1002 is OK optimal, 2 logical, 4 physical [19:07:43] 6operations, 5Patch-For-Review: linux 3.19 not installed by default on jessie - https://phabricator.wikimedia.org/T103721#1397502 (10fgiunchedi) [19:07:45] 6operations, 5Patch-For-Review: Switch to Linux 3.19 by default on jessie hosts - https://phabricator.wikimedia.org/T100773#1397504 (10fgiunchedi) [19:13:24] (03PS1) 10Rush: Pybal: switching to confd sourced pool [puppet] - 10https://gerrit.wikimedia.org/r/220552 [19:24:22] (03PS1) 10Ori.livneh: Move Xenon static site to performance/docroot.git [puppet] - 10https://gerrit.wikimedia.org/r/220555 [19:24:40] (03PS1) 10Cmjohnson: adding labnet1002 to dhcpd with 10G Mac [puppet] - 10https://gerrit.wikimedia.org/r/220556 [19:26:29] (03PS2) 10Ori.livneh: Move Xenon static site to performance/docroot.git [puppet] - 10https://gerrit.wikimedia.org/r/220555 [19:26:36] (03CR) 10Ori.livneh: [C: 032 V: 032] Move Xenon static site to performance/docroot.git [puppet] - 10https://gerrit.wikimedia.org/r/220555 (owner: 10Ori.livneh) [19:30:46] 6operations: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1397561 (10Cmjohnson) I added the 10G Nic card, and updated dhcpd file with the correct MAC. I also disabled pxe for the 1Gb Eth. 
[19:33:40] 6operations: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1397578 (10Cmjohnson) I do want to check with someone about the network aspect. labnet1001 is currently using both xe-2/1/0 and xe-2/1/2. I believe we may have to make a direct connection to cr1 and cr2 to a different uplin... [19:33:53] (03PS1) 10Ori.livneh: Follow-up to I2af5c88f98: remove reference to xenon/theme [puppet] - 10https://gerrit.wikimedia.org/r/220558 [19:34:40] (03PS2) 10Ori.livneh: Follow-up to I2af5c88f98: remove reference to xenon/theme [puppet] - 10https://gerrit.wikimedia.org/r/220558 [19:34:52] (03CR) 10Ori.livneh: [C: 032 V: 032] Follow-up to I2af5c88f98: remove reference to xenon/theme [puppet] - 10https://gerrit.wikimedia.org/r/220558 (owner: 10Ori.livneh) [19:36:28] (03PS10) 10Chad: Allow text-lb to redirect svn access to Diffusion [puppet] - 10https://gerrit.wikimedia.org/r/219228 [19:36:50] Think we could get that merged today? [19:37:41] PROBLEM - puppet last run on fluorine is CRITICAL Puppet has 1 failures [19:40:51] PROBLEM - puppet last run on strontium is CRITICAL puppet fail [19:41:31] RECOVERY - puppet last run on fluorine is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [19:42:24] 6operations, 10Wikimedia-Git-or-Gerrit: Get rid of the gerrit Debian package and migrate to puppet - https://phabricator.wikimedia.org/T103735#1397604 (10hashar) 3NEW [19:43:49] 6operations, 10Wikimedia-Git-or-Gerrit: Remove Java 6 from ytterbium.wikimedia.org (Gerrit production host) - https://phabricator.wikimedia.org/T103668#1395849 (10hashar) ``` $ apt-cache show gerrit Package: gerrit Version: 2.7-rc2-507-g1e7090b-1 Architecture: all Maintainer: Ryan Lane In... [19:44:07] 6operations, 10Wikimedia-Git-or-Gerrit: Get rid of the gerrit Debian package and migrate to puppet - https://phabricator.wikimedia.org/T103735#1397615 (10demon) WFM. Just stuff the jars in a git repo and call it a day with trebuchet (or new deploy system). 
The rest can be in puppet. [19:44:17] hashar: I hate that stupid package. [19:46:40] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Build Debian package ruby-jsduck for Jessie - https://phabricator.wikimedia.org/T95008#1397616 (10hashar) 5Resolved>3Open Not yet :-( ``` integration-slave-jessie-1001# apt-get install ruby-jsduck Reading... [19:48:24] ostriches: so lets kill it! [19:50:21] indeed [19:55:10] RECOVERY - puppet last run on strontium is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:57:18] (03PS2) 10Giuseppe Lavagetto: confctl: allow regex expression and a global "all" [software/conftool] - 10https://gerrit.wikimedia.org/r/220536 [20:00:04] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Services – Parsoid / OCG / Citoid / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150624T2000). Please do the needful. [20:00:23] no parsoid deploy today [20:05:21] Anything for OCG/Citoid/RESTbase? [20:07:51] PROBLEM - puppet last run on strontium is CRITICAL puppet fail [20:15:01] RECOVERY - puppet last run on strontium is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:19:18] (03CR) 10BBlack: [C: 031] confctl: allow regex expression and a global "all" [software/conftool] - 10https://gerrit.wikimedia.org/r/220536 (owner: 10Giuseppe Lavagetto) [20:20:41] (03CR) 10BBlack: [C: 031] Pybal: switching to confd sourced pool [puppet] - 10https://gerrit.wikimedia.org/r/220552 (owner: 10Rush) [20:30:19] 6operations, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, 6Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1397806 (10Nemo_bis) 1.6.12 was in 2009, not that long ago. ;) [20:30:48] (03PS3) 10Andrew Bogott: Remove puppet defs for virt1000; add them for labcontrol1002. 
[puppet] - 10https://gerrit.wikimedia.org/r/220528 (https://phabricator.wikimedia.org/T103722) [20:32:34] (03CR) 10Andrew Bogott: [C: 032] Remove puppet defs for virt1000; add them for labcontrol1002. [puppet] - 10https://gerrit.wikimedia.org/r/220528 (https://phabricator.wikimedia.org/T103722) (owner: 10Andrew Bogott) [20:32:44] (03CR) 10BBlack: [C: 031] install-server: run lsb_release using chroot [puppet] - 10https://gerrit.wikimedia.org/r/220521 (https://phabricator.wikimedia.org/T103721) (owner: 10Filippo Giunchedi) [20:33:51] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [20:35:32] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60576 bytes in 0.292 second response time [20:40:23] gitblit ok? [20:40:38] it never is [20:44:31] java server/web apps never cease to amaze me with their stunningly horrible reliability. But maybe I'm just easily impressed. [20:47:56] (03CR) 10John F. Lewis: [C: 031] "Removal is good. I assume a DBA would revoke it or so." [puppet] - 10https://gerrit.wikimedia.org/r/220548 (https://phabricator.wikimedia.org/T103193) (owner: 10Dzahn) [20:48:21] that's becaues they were invented before we realized how to build things WebScale. Now we know better! [20:49:11] s/The network is the computer/The network is my demo web app with 3 users/ [20:49:45] (03PS1) 10coren: Make labstore configuration into a module [puppet] - 10https://gerrit.wikimedia.org/r/220618 (https://phabricator.wikimedia.org/T93781) [20:49:46] 6operations, 10ops-eqiad: analytics1016 down due to power issue(?) 
- https://phabricator.wikimedia.org/T103544#1397843 (10Cmjohnson) A new board has been shippped [20:50:16] (03Abandoned) 10coren: WIP: Proper labs_storage class [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren) [20:51:47] (03PS1) 10Andrew Bogott: Add partman recipe for labcontrol1002 [puppet] - 10https://gerrit.wikimedia.org/r/220619 [20:52:17] (03PS2) 10Andrew Bogott: Add partman recipe for labcontrol1002 [puppet] - 10https://gerrit.wikimedia.org/r/220619 [20:53:07] (03CR) 10Andrew Bogott: [V: 032] Add partman recipe for labcontrol1002 [puppet] - 10https://gerrit.wikimedia.org/r/220619 (owner: 10Andrew Bogott) [20:53:23] (03CR) 10Andrew Bogott: [C: 032] Add partman recipe for labcontrol1002 [puppet] - 10https://gerrit.wikimedia.org/r/220619 (owner: 10Andrew Bogott) [21:00:49] legoktm: HELP! https://phabricator.wikimedia.org/T97334#1397360 [21:03:51] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [21:04:04] YuviPanda: can it wait until after lunch? :P [21:04:34] (03PS2) 10Filippo Giunchedi: install-server: run lsb_release using chroot [puppet] - 10https://gerrit.wikimedia.org/r/220521 (https://phabricator.wikimedia.org/T103721) [21:04:39] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] install-server: run lsb_release using chroot [puppet] - 10https://gerrit.wikimedia.org/r/220521 (https://phabricator.wikimedia.org/T103721) (owner: 10Filippo Giunchedi) [21:04:40] legoktm: ok :) do remmeber to look tho [21:04:41] RECOVERY - Host mw2027 is UPING OK - Packet loss = 0%, RTA = 43.41 ms [21:09:55] !log start cassandra on restbase1008 [21:10:04] Logged the message, Master [21:14:07] 6operations, 10MediaWiki-File-management, 10MediaWiki-Tarball-Backports, 6Multimedia, and 6 others: InstantCommons broken by switch to HTTPS - https://phabricator.wikimedia.org/T102566#1397936 (10demon) That's 6 years ago, yeah pretty long. Also, it never supported InstantCommons, so who cares about PHP4... 
[21:16:39] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1397946 (10fgiunchedi) I went ahead and started cassandra 2.1.3 on restbase1008, the bootstrap/streaming failure is there from 1004 res... [21:17:02] PROBLEM - puppet last run on mc2012 is CRITICAL Puppet has 1 failures [21:27:22] andrewbogott: i wanted to check on that instance "relay-test" but i cant login on it [21:27:41] mutante: ok… [21:27:47] let me see if I can reach it via salt [21:28:03] oh, I have a root login. Let’s see... [21:28:25] andrewbogott: eh.. sorry, i think it's my fault with the keys [21:28:28] yeah, self-hosted puppet [21:28:47] puppet wasn’t updated, so it has references to hiera but doesn’t know how to use hiera. [21:28:53] Are you able to log in? If not I can probably fix it somewhat [21:29:25] andrewbogott: yes, i am on it now. i will either fix puppet or delete the instance and then do the same for sensu-01 and close that ticket [21:29:28] 6operations, 6Research-and-Data, 7Database: Test and fix db1047 BBU - https://phabricator.wikimedia.org/T103345#1398007 (10ggellerman) @jcrespo Could you please expand on what you need here from the Research team? Thanks! [21:29:33] ok thanks [21:29:49] the solution to the hiera thing is to just live-hack defaults to all the hiera calls it needs [21:30:28] tgr: ok to delete https://wikitech.wikimedia.org/wiki/Nova_Resource:Zotero project? 
says superseded by the citoid project, has no instances [21:30:30] alright, cool [21:30:54] YuviPanda: yes, thanks [21:31:00] tgr: cool [21:33:31] RECOVERY - puppet last run on mc2012 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:33:50] tgr: done [21:34:30] 6operations, 10RESTBase, 7Monitoring, 5Patch-For-Review: Detailed cassandra monitoring: metrics and dashboards done, need to set up alerts - https://phabricator.wikimedia.org/T78514#1398012 (10fgiunchedi) a proposed stopgap for the metrics that has been suggested by @ori is to whip up a diamond collector t... [21:36:50] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 836.405792693 [21:41:12] 6operations, 6Research-and-Data, 7Database: Test and fix db1047 BBU - https://phabricator.wikimedia.org/T103345#1398019 (10jcrespo) @ggellerman While the technical parts on this ticket will be handled by #Operations, there are 2 things: * Acknowledgement of a problem on one of the servers that I understand... [21:42:36] andrewbogott: both instances are deleted now. the projects themselves are still there but without instances. that's ok, right [21:42:49] mutante: do you still need the 'planet' project? [21:42:52] it’s fine — you can delete the projects if you want, or leave them. [21:42:56] marktraceur: what shall I do with orgchart? [21:43:00] if you think someone is going to reuse them [21:43:29] YuviPanda: Murder away [21:43:39] marktraceur: instance, project, everything? [21:43:50] Keep the data just in case? [21:44:04] marktraceur: is there a mysql or something running there you want me to dump from? [21:44:07] YuviPanda: "need" is a relative thing. not really if it's in the way but testing things is nice [21:44:10] cmjohnson1: I’m having partman trouble with virt1000/labcontrol1002. Do you know, should I be configuring hardware raid in bios? 
[21:44:12] YuviPanda: Mongodb [21:44:31] marktraceur: oh, hmm. can you just say all these on https://phabricator.wikimedia.org/T103137? [21:44:36] Sure [21:44:43] I'll dump the data and make a NFS copy and then delete everything [21:45:14] YuviPanda: why? does it use NFS? [21:45:42] mutante: there's nobody maintaining it, and the code is running off NFS [21:45:59] > orgchart can be murdered, nobody is maintaining it, if I had it to do again I wouldn't do it this way. The data is in mongodb, which can probably be dumped relatively trivially. Thanks! [21:47:09] twentyafterfour: what needs to be done to get review on a config change ? [21:47:13] YuviPanda: i dont understand, it looks like there is no instance in the project [21:47:25] where are you looking? [21:47:34] https://wikitech.wikimedia.org/wiki/Special:NovaInstance [21:47:35] mutante: oh, for planet? [21:47:37] yes [21:47:47] mutante: yes, there's no instance in the project, which is why I'm wondering if you still need the project :) [21:47:51] what made you say nobody is maintaining it [21:47:53] mutante: and also I would like to disable NFS for it [21:48:00] mutante: oh, I thought we were talking about orgchart :) [21:48:09] you can disable NFS because .. there is no instance [21:48:13] (apart from summoning Reedy) [21:48:15] mutante: if I disable NFS for the project, new instances you create will have no NFS [21:49:30] YuviPanda: ok, do it. it wasnt just me, there were other users on it and NFS just meant we could share code across instances [21:49:41] right. [21:49:54] i doubt the other user, hundfred, is still into it [21:50:30] yea, it doesnt matter either way [21:52:09] godog: re the jobchron, just move the syslog job to upstart ? [21:52:49] matanya: the syslog job? yeah should be enough to remove the >> redirection [21:53:50] godog: i.e /usr/sbin/service jobchron restart [21:54:04] without outputing to /dev/null? 
[21:56:05] YuviPanda: please give me the orgchart project [21:56:10] i'll maintain it [21:56:53] matanya: ah no that's just logrotate, I meant in jobchron's init script [21:57:02] ah, there [21:57:24] sad_trombone.wav for orgchart [21:57:25] matanya: do ask the owner (marktraceur) :) [21:57:36] we need something like that.. [21:57:51] YuviPanda: just did, see in -multimedia [21:58:18] the problem to solve is more organizational than technical [21:59:06] hence i am adopting it :D [21:59:31] yay :) cool matanya [22:00:49] godog: i apologize, 1 am, and i can't keep track of myself, modules/mediawiki/files/jobchron.conf is the init script, as it seems [22:00:56] but i see no redierction there [22:01:58] matanya: no worries, you should go to sleep though :) see JOBCHRON_LOGFILE [22:02:52] * matanya can't see >> if the cursor is on it ... [22:07:12] (03PS3) 10Matanya: jobchron: log rotate [puppet] - 10https://gerrit.wikimedia.org/r/218905 [22:07:29] 6operations, 10ops-eqiad, 7Database: Disk issue on db1028 - https://phabricator.wikimedia.org/T103230#1398110 (10jcrespo) p:5Triage>3Low @RobH This was detected after an unusual lag on replication. As you can see on the events log I just generated: ``` db1028:/home/jynus/events.log ``` This has gone f... [22:07:43] matanya: heh, alright. i am not killing it now [22:08:18] matanya: you should move stuff off NFS tho [22:09:27] (03PS1) 10Ori.livneh: Include cronolog in Apache module; use for Pybal logs [puppet] - 10https://gerrit.wikimedia.org/r/220631 [22:09:36] godog: ^ see what you think. I am not 100% sure it's the right approach. [22:10:47] 6operations, 10ops-eqiad, 7Database: Disk issue on db1028 - https://phabricator.wikimedia.org/T103230#1398118 (10jcrespo) Setting `SET GLOBAL innodb_flush_log_at_trx_commit = 1;` to go back to the status quo. [22:10:49] YuviPanda: why move it ? 
[22:12:16] ori: ok, will take a look in a few (unit not specified but no rush at all [22:12:31] (03CR) 10Aaron Schulz: "Do whatever is simplest" [puppet] - 10https://gerrit.wikimedia.org/r/218905 (owner: 10Matanya) [22:13:22] godog: ^ at your service [22:13:34] matanya: off NFS? https://phabricator.wikimedia.org/T102240 [22:13:55] gosh, i read you should move stuff off DNS tho [22:14:00] i must sleep [22:14:14] matanya: yes [22:14:55] YuviPanda: where do you want it? some local storage ? [22:14:59] matanya: thanks! I can't take a look at the moment, will do so later tho [22:15:08] godog: no rush [22:15:19] matanya: yup. /srv [22:15:22] k [22:17:39] YuviPanda: can you please add me to the project ? [22:17:53] marktraceur: ^ that ok? [22:17:57] (I'm not in -multimedia) [22:18:57] YuviPanda: http://bots.wmflabs.org/~wm-bot/logs/%23wikimedia-multimedia/20150624.txt [22:19:14] eh, those are interesting file names for Apache sites: [22:19:22] 50--etc-apache2-sites-enabled-planet-wikimedia-org.conf [22:19:38] that seems new'ish [22:20:02] but it's from last August.. eh.. cleaning up [22:30:18] !log zirconium - deleting unused apache configs, bugzilla, etherpad, ... [22:30:24] Logged the message, Master [22:32:43] !log zirconium - stop using 443 at all, rm NameVirtualHost *:443 [22:32:48] Logged the message, Master [22:33:06] 6operations, 10ops-eqiad, 10Traffic: eqiad: investigate thermal issues with some cp10xx machines - https://phabricator.wikimedia.org/T103226#1398162 (10BBlack) Re-ran my temp display command from above (just now, a few mins before this comment), and pattern looks unchanged overall (other than cp1059 happenin... [22:34:57] mutante: are you using this instance? 
https://wikitech.wikimedia.org/wiki/Nova_Resource:Dns [22:36:23] YuviPanda: no, i don't even know why it's called dzahn, while i'm not a user in the project [22:36:53] mutante: you're a member in it, not an admin though [22:36:56] oh, i am, just not admin [22:37:00] had to expand that ..yea [22:37:23] you even added yourself :P [22:37:33] mutante: :D [22:37:47] probably to test if i could [22:38:13] and you then made an instance which YuviPanda wants to kill. the circle of Labs! [22:40:37] mutante: so... do you still want that instance? :0 [22:40:38] :) [22:42:26] YuviPanda: 1 minute.. [22:43:28] gee, i find stuff like replica.my.cnf :) [22:43:48] yeah [22:43:56] I dunno why it was added for every project instead of it being opt in [22:44:00] so many junk users [22:44:40] legoktm: https://gerrit.wikimedia.org/r/#/c/220635/1 :D [22:44:45] YuviPanda: i'm shutting it down .. [22:44:50] i dont know about it [22:44:56] mutante: shutting down or deleting it? [22:45:03] shut down the instance [22:45:19] mutante: oh, so you want to keep it in 'shutoff' state in case someone else wants it? [22:45:28] mutante: I can look at who created it, sec [22:45:44] 6operations, 10Deployment-Systems, 7HHVM, 5Patch-For-Review, 15User-Bd808-Test: Scap should restart HHVM - https://phabricator.wikimedia.org/T103008#1398221 (10bd808) @ori did some additional operational testing up to and including using `--restart` across the entire WMF prod cluster. The restart functio... [22:45:49] i wonder why i'm not an admin but the instance has my name as if i di [22:45:52] d [22:46:08] > | user_id | dzahn | [22:46:08] mutante: you created it! :) [22:46:23] and then i removed myself from admin? [22:46:28] mutante: you might have? [22:46:30] YuviPanda: commented :) [22:46:33] you certainly created it tho [22:46:39] YuviPanda: delete it :p [22:46:47] doesn't the nova resource page history show you who added/removed who as a member/admin? [22:47:00] mutante: can you? 
:) otherwise I've to add myself to the project, then make myself admin, then delete it... [22:47:03] oh, you aren't projectadmin? [22:47:12] unless they added/removed themselves via ldap hax I guess? [22:47:31] YuviPanda: i'm not, and the "DNS" topic tells me it must be those other existing admins who were using it [22:47:44] mutante: maybe, yeah. but that instance was created by you :P [22:47:45] or maybe it had more instances but all others are alreayd deleted [22:47:47] let me delete it [22:48:02] mutante: you *are* projectadmin [22:48:05] mutante: the nova page lies [22:48:11] hah [22:48:16] they always lie [22:48:18] bunchofdicks [22:48:23] well, i have created many instances just because i helped people creating instances [22:48:45] I don't think you put your name on them... [22:48:57] mutante: so... you're ok with me deleting it? [22:49:02] yes, yes [22:49:06] ok [22:50:09] 6operations, 10ops-eqiad, 6Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Rename virt1000 to labcontrol1002, move to same subnet as labcontrol1001 - https://phabricator.wikimedia.org/T102646#1398232 (10Andrew) 5Resolved>3Open Chris, I turn out to be stumped with partman and also to have trapped this bo... [22:50:27] 6operations, 10ops-eqiad, 6Labs, 5Patch-For-Review, 3ToolLabs-Goals-Q4: Rename virt1000 to labcontrol1002, move to same subnet as labcontrol1001 - https://phabricator.wikimedia.org/T102646#1398235 (10Andrew) a:5Andrew>3Cmjohnson [22:51:07] YuviPanda: done [22:51:18] mutante: thanks! [22:52:08] YuviPanda: what's a good way to see all project memberships of a user? [22:52:20] don't know... [22:52:22] mutante: actually [22:52:25] mutante: on any instance [22:52:26] do 'groups' [22:52:50] YuviPanda: oh right, thanks. 
now i just need an instance, ,hehehe [22:53:01] that works [22:53:11] still has the "petcow" [22:54:03] omg, way too many projects [22:54:22] "svn" :p [22:54:44] 'svnadm' too [22:55:33] i am member in "project-akosiaristests" [22:56:57] (03CR) 1020after4: [C: 031] add dvidshub to whitelist upload URLs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/219599 (owner: 10Matanya) [22:57:16] interestingly i also have groups where the corresponding project doesn't exist (anymore) [22:57:37] * legoktm is adding some things to the swat window [22:57:42] no, these are not project- groups [22:59:32] 7Blocked-on-Operations, 6operations, 6Scrum-of-Scrums: terbium et al - php-luasandbox must install without errors and luasandbox must be enabled - https://phabricator.wikimedia.org/T101583#1398289 (10Mattflaschen) [23:00:04] RoanKattouw, ostriches, rmoen: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150624T2300). Please do the needful. [23:00:38] so...gerrit automatically made my submodule updates again [23:01:21] oh, RoanKattouw isn't here today [23:01:24] I can do swat [23:01:29] phuedx|zzZ: ...ping? [23:01:34] visualeditor, editor-engagement, chasetest, akosiaristests, pdbhandler, math, multimedia .. removing self from all of that and counting [23:02:15] andrewbogott: did you figure it out? [23:02:30] legoktm: I'm here for phudex [23:02:47] Either you or i can swat :) [23:02:57] rmoen: ok. why is the patch adding 531 lines of config...? [23:03:23] !log legoktm Synchronized php-1.26wmf11/extensions/UserMerge: https://gerrit.wikimedia.org/r/#/c/220638/ (duration: 00m 13s) [23:03:29] Logged the message, Master [23:04:09] legoktm: good question. It's an experiment I believe. 
Really I know nothing about it [23:04:33] other than it failed to deploy before because of an untouched settings file [23:04:56] !log legoktm Synchronized php-1.26wmf11/extensions/CentralAuth: https://gerrit.wikimedia.org/r/#/c/220637/ (duration: 00m 13s) [23:05:03] Logged the message, Master [23:06:06] rmoen: umm, I don't feel comfortable deploying that. do you want to? It looks like it's hardcoding article names, but I have no idea what's going to happen if an article gets moved or deleted... [23:06:21] Its for the browse (tag) experiement [23:06:29] legoktm: if you don't feel comfortable i'll do it [23:06:43] please [23:06:51] I just came back from leave otherwise I would know more about it. [23:07:07] https://www.mediawiki.org/w/index.php?title=Special%3ASearch&profile=all&search=browse+tag+mobile&fulltext=Search is not helpful :/ [23:07:44] I'm all done [23:08:32] (03PS2) 10Robmoen: Enable browse prototype on test- and enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220526 (https://phabricator.wikimedia.org/T101155) (owner: 10Phuedx) [23:09:54] 6operations, 10Fundraising Dash: Create sandbox site for Dash - https://phabricator.wikimedia.org/T87809#1398412 (10atgo) [23:10:01] (03CR) 10Robmoen: [C: 032] Enable browse prototype on test- and enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220526 (https://phabricator.wikimedia.org/T101155) (owner: 10Phuedx) [23:10:03] (03PS1) 10Ori.livneh: Add Pyglet, a syntax-highlighting micro-service(!) [puppet] - 10https://gerrit.wikimedia.org/r/220641 [23:10:07] (03Merged) 10jenkins-bot: Enable browse prototype on test- and enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220526 (https://phabricator.wikimedia.org/T101155) (owner: 10Phuedx) [23:10:51] (03CR) 10Yuvipanda: "Have you considered using nodejs for this?" [puppet] - 10https://gerrit.wikimedia.org/r/220641 (owner: 10Ori.livneh) [23:10:53] (03CR) 10jenkins-bot: [V: 04-1] Add Pyglet, a syntax-highlighting micro-service(!) 
[puppet] - 10https://gerrit.wikimedia.org/r/220641 (owner: 10Ori.livneh) [23:11:41] * legoktm slaps YuviPanda [23:12:02] YuviPanda: iojs remember? [23:12:09] no, they kind of merged together [23:12:14] keep up with the news man! [23:13:23] (03CR) 10Ori.livneh: "@Yuvipanda: Actually, I have. Wrapping highlight.js in a simple node.js service might be a good idea, because it will allow VisualEditor t" [puppet] - 10https://gerrit.wikimedia.org/r/220641 (owner: 10Ori.livneh) [23:13:25] !log rolling restart of Cassandra staging cluster [23:13:31] Logged the message, Master [23:14:06] (03CR) 10Yuvipanda: ":) fair enough!" [puppet] - 10https://gerrit.wikimedia.org/r/220641 (owner: 10Ori.livneh) [23:14:15] I know you were trolling :P [23:14:19] but it's not a bad idea [23:14:37] ori: :) [23:14:41] ori: +1 to not shelling out every time [23:14:51] ori: what would nodejs + highlight give over pygments? [23:15:02] does it have a better incremental parser? [23:15:25] (03PS2) 10Ori.livneh: Add Pyglet, a syntax-highlighting micro-service(!) [puppet] - 10https://gerrit.wikimedia.org/r/220641 [23:15:36] VisualEditor can run it without relying on the server [23:15:39] Parsoid ditto [23:15:42] pyglet is such a cute name [23:15:58] ori: oh, I see. client server equivalence. [23:15:59] intersting [23:16:03] hadn't thought of that [23:16:06] (03CR) 10jenkins-bot: [V: 04-1] Add Pyglet, a syntax-highlighting micro-service(!) [puppet] - 10https://gerrit.wikimedia.org/r/220641 (owner: 10Ori.livneh) [23:16:17] whatever jenkins is seeing, local puppet-lint is not [23:16:20] it's probably obvious [23:16:47] oh durr [23:16:48] $ [23:17:20] we should plot 'jenkins vs humans' gerrit votes [23:17:22] ori: why in ops/puppet instead of a "proper" repo? 
[23:17:22] !log rmoen Synchronized wmf-config/InitialiseSettings.php: Enable browse experiment on test and enwiki (duration: 00m 12s) [23:17:28] Logged the message, Master [23:17:43] bd808: that's his way of not using trebuchet [23:17:46] (03PS3) 10Ori.livneh: Add Pyglet, a syntax-highlighting micro-service(!) [puppet] - 10https://gerrit.wikimedia.org/r/220641 [23:17:50] YuviPanda: shhhhh [23:18:04] heh. I saw him find another way to avoid trebuchet earlier [23:18:12] bd808: it's a single file; if it grows beyond that something went wrong [23:18:14] !log rmoen Synchronized wmf-config/mobile.php: Enable browse experiment on test and enwiki (duration: 00m 14s) [23:18:16] but ops/puppet makes testing in mw-vagrant hard [23:18:20] Logged the message, Master [23:18:29] bd808: ohhhh [23:18:30] fair point [23:18:32] i forgot about that [23:20:45] (03CR) 10Faidon Liambotis: [C: 04-2] "Offtopic for operations/puppet." [puppet] - 10https://gerrit.wikimedia.org/r/220641 (owner: 10Ori.livneh) [23:20:46] sorry, no :) [23:21:22] ops/puppet has more than enough crap :) [23:21:28] 74 projects with NFS enabled [23:21:31] not that this is crap (I have no idea) [23:21:33] * YuviPanda continues the slog [23:21:47] that's 60 projects liberated! 
[23:21:50] YuviPanda: 73 have mutante as the project admin :) [23:21:57] JohnFLewis: :P [23:22:34] Ironically the only project he's not an admin on is the dzahn project [23:24:18] there is no such project, unless Special:NovaProject lies [23:24:48] Which is likely does [23:26:24] yea, i'm sorry for using it so much [23:26:31] (03PS2) 10Rush: Pybal: switching to confd sourced pool [puppet] - 10https://gerrit.wikimedia.org/r/220552 [23:27:17] YuviPanda: +2 [23:28:28] (03CR) 10Rush: [C: 032 V: 032] Pybal: switching to confd sourced pool [puppet] - 10https://gerrit.wikimedia.org/r/220552 (owner: 10Rush) [23:32:05] ori: https://phabricator.wikimedia.org/T103760 consent for NFS killing from something you're involved in :) [23:32:05] were [23:32:08] puppet3 migraiton [23:36:34] sure [23:37:07] (03PS1) 10BBlack: Get rid of the default_backend setting [puppet] - 10https://gerrit.wikimedia.org/r/220642 [23:37:09] (03PS1) 10BBlack: Get rid of unused director_options [puppet] - 10https://gerrit.wikimedia.org/r/220643 [23:37:11] (03PS1) 10BBlack: restructure varnish::instance's "directors" [puppet] - 10https://gerrit.wikimedia.org/r/220644 [23:37:13] (03PS1) 10BBlack: move text backend_random into "directors" [puppet] - 10https://gerrit.wikimedia.org/r/220645 [23:38:00] (03CR) 10jenkins-bot: [V: 04-1] restructure varnish::instance's "directors" [puppet] - 10https://gerrit.wikimedia.org/r/220644 (owner: 10BBlack) [23:38:11] (03CR) 10jenkins-bot: [V: 04-1] move text backend_random into "directors" [puppet] - 10https://gerrit.wikimedia.org/r/220645 (owner: 10BBlack) [23:38:26] (03PS4) 10coren: Labs: More puppetization fixes for labstore* [puppet] - 10https://gerrit.wikimedia.org/r/218666 (https://phabricator.wikimedia.org/T102478) [23:38:29] jenkins-bot: hey now, I said it was a WIP, no need to get all testy [23:39:08] 6operations, 6Performance-Team: Move performance's websites out of operations/puppet - https://phabricator.wikimedia.org/T101974#1398495 (10ori) 5Open>3Resolved 
[23:40:01] (03PS2) 10BBlack: move text backend_random into "directors" [puppet] - 10https://gerrit.wikimedia.org/r/220645 [23:40:03] (03PS2) 10BBlack: restructure varnish::instance's "directors" [puppet] - 10https://gerrit.wikimedia.org/r/220644 [23:42:13] (03PS5) 10coren: Labs: More puppetization fixes for labstore* [puppet] - 10https://gerrit.wikimedia.org/r/218666 (https://phabricator.wikimedia.org/T102478) [23:47:42] (03PS1) 10Filippo Giunchedi: diamond: add cassandra collector for basic metrics [puppet] - 10https://gerrit.wikimedia.org/r/220650 (https://phabricator.wikimedia.org/T78514) [23:54:13] (03PS1) 10Rush: pybal config pool file rearrange [puppet] - 10https://gerrit.wikimedia.org/r/220652 [23:54:25] (03PS3) 10BBlack: move text backend_random into "directors" [puppet] - 10https://gerrit.wikimedia.org/r/220645 [23:54:27] (03PS3) 10BBlack: restructure varnish::instance's "directors" [puppet] - 10https://gerrit.wikimedia.org/r/220644 [23:55:39] (03PS1) 10Jalexander: Add exception for ALA hackathon at WMF Office [mediawiki-config] - 10https://gerrit.wikimedia.org/r/220653 (https://phabricator.wikimedia.org/T103764) [23:57:42] (03PS2) 10Rush: pybal config pool file rearrange [puppet] - 10https://gerrit.wikimedia.org/r/220652 [23:57:43] (03PS2) 10coren: Make labstore configuration into a module [puppet] - 10https://gerrit.wikimedia.org/r/220618 (https://phabricator.wikimedia.org/T93781) [23:58:33] well that's an incredibly unhelpful patch /me fixes himself [23:58:37] (03CR) 10Rush: [C: 032 V: 032] pybal config pool file rearrange [puppet] - 10https://gerrit.wikimedia.org/r/220652 (owner: 10Rush) [23:58:41] (03PS4) 10BBlack: move text backend_random into "directors" [puppet] - 10https://gerrit.wikimedia.org/r/220645 [23:58:43] (03PS4) 10BBlack: restructure varnish::instance's "directors" [puppet] - 10https://gerrit.wikimedia.org/r/220644 [23:58:52] (03PS6) 10coren: Labs: More puppetization fixes for labstore* [puppet] - 10https://gerrit.wikimedia.org/r/218666 
(https://phabricator.wikimedia.org/T102478)