[02:24:22] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.2) (duration: 10m 41s)
[02:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:33:00] !log l10nupdate@tin ResourceLoader cache refresh completed at Mon May 23 02:33:00 UTC 2016 (duration 8m 38s)
[02:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[04:57:07] PROBLEM - puppet last run on ms-be2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:20:32] (PS1) KartikMistry: Enable Compact Language Links as default in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966)
[05:23:26] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:11:01] ouch just seen kafka1022, it probably needs https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Kafka/Administration#Purge_broker_logs
[06:14:25] Operations, Ops-Access-Requests, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: Add kartik to analytics-privatedata-users group - https://phabricator.wikimedia.org/T135704#2317094 (Joe)
[06:16:21] <_joe_> elukey: are you on it?
[06:16:44] _joe_ yep, needs a bit of coffee before purging logs because I don't want to make a mess :D
[06:16:49] *need
[06:16:59] the partition is already full
[06:19:10] (PS1) Giuseppe Lavagetto: admin: add kartik to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/290169 (https://phabricator.wikimedia.org/T135704)
[06:25:41] _joe_: Hi! I commented here and I think analytics-privatedata-users is not the right group for this access request :) it should be researchers
[06:26:12] https://phabricator.wikimedia.org/T135704#2313018
[06:27:01] madhuvishy: o/
[06:27:04] <_joe_> madhuvishy: uh, thanks! :P
[06:27:23] elukey: Morning :)
[06:27:43] <_joe_> madhuvishy: guilty as charged, I looked at the "review" ticket and not to the main one :P
[06:29:28] _joe_: :) np!
[06:30:28] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: puppet fail
[06:30:38] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:07] (PS2) Giuseppe Lavagetto: admin: add kartik to researchers [puppet] - https://gerrit.wikimedia.org/r/290169 (https://phabricator.wikimedia.org/T135704)
[06:31:19] !log Set kafka retention.ms=172800000 for the topic webrequest_upload to free some disk space on kafka1022
[06:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:31:27] PROBLEM - puppet last run on mw1090 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:33] Operations, Ops-Access-Requests, Analytics, ContentTranslation-Analytics, and 2 others: Add kartik to analytics-privatedata-users group - https://phabricator.wikimedia.org/T135704#2317203 (Joe) Thanks @madhuvishy I'll amend the ticket title.
[06:32:07] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:10] Operations, Ops-Access-Requests, Analytics, ContentTranslation-Analytics, and 2 others: Add kartik to researchers group - https://phabricator.wikimedia.org/T135704#2317204 (Joe)
[06:32:39] (CR) Giuseppe Lavagetto: [C: 2 V: 2] admin: add kartik to researchers [puppet] - https://gerrit.wikimedia.org/r/290169 (https://phabricator.wikimedia.org/T135704) (owner: Giuseppe Lavagetto)
[06:32:46] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:57] RECOVERY - Disk space on kafka1022 is OK: DISK OK
[06:34:07] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:36:58] thanks kafka, really glad that you understood and purged your logs so nicely
[06:37:19] uhhh _joe_ I might have spoken too soon - the ticket says wikishared db but also links to this ticket - https://phabricator.wikimedia.org/T122524 - where they seem to say that hadoop access is needed too. analytics-privatedata-users may be right after all. Sorry!
[06:42:20] !log Removed Kafka temp. override for webrequest_upload retention.ms after freeing some disk space.
[06:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:43:19] <_joe_> madhuvishy: uhm let me take a better look at permissions then
[06:43:28] <_joe_> not that it's a real issue
[06:44:04] _joe_: yeah, sure. so they said wikishared database in the ticket - but far along down the other ticket they've discovered they do need hadoop access
[06:44:11] <_joe_> ok, yes, it appears analytics-privatedata-users is the correct one.
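Note on the retention override logged at 06:31:19 and removed at 06:42:20: Kafka's `retention.ms` is expressed in milliseconds, so the temporary value works out to two days. A minimal sketch of that arithmetic follows; the function name is hypothetical and nothing here reflects how the override was actually applied.

```python
from datetime import timedelta

# Value taken from the 06:31:19 !log entry for the webrequest_upload topic.
RETENTION_MS = 172_800_000


def retention_as_timedelta(retention_ms: int) -> timedelta:
    """Convert a retention.ms value (milliseconds) into a timedelta.

    Hypothetical helper for illustration only.
    """
    return timedelta(milliseconds=retention_ms)


if __name__ == "__main__":
    retention = retention_as_timedelta(RETENTION_MS)
    print(f"retention.ms={RETENTION_MS} is {retention.days} days")
```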
[06:44:14] i was confused because this one didn't say it
[06:44:18] <_joe_> yep
[06:44:36] <_joe_> since he needs to be amire80's backup, I'd say let's go with that :P
[06:44:41] yes :)
[06:44:51] Operations, Ops-Access-Requests, Analytics, ContentTranslation-Analytics, and 2 others: Add kartik to analytics-privatedata-users - https://phabricator.wikimedia.org/T135704#2317227 (Joe)
[06:45:58] PROBLEM - puppet last run on elastic2009 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:46:16] Operations, Ops-Access-Requests, Analytics, ContentTranslation-Analytics, and 2 others: Add kartik to analytics-privatedata-users - https://phabricator.wikimedia.org/T135704#2307853 (Joe) Speaking with @madhuvishy, she realized that we were wrong: @Amire80 has permissions to query hadoop as well,...
[06:47:40] <_joe_> madhuvishy: also, help appreciated, it's pretty late in SF
[06:47:56] (PS1) Giuseppe Lavagetto: admin: add kartik to analytics-privatedata-users instead [puppet] - https://gerrit.wikimedia.org/r/290170 (https://phabricator.wikimedia.org/T135704)
[06:48:03] <_joe_> you should seriously not worry about access requests :P
[06:48:07] _joe_: ha ha i only confused you ;)
[06:49:23] (PS7) Gehel: WIP - Create necessary folders for Postgresql and Cassandra [puppet] - https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901)
[06:49:45] (CR) jenkins-bot: [V: -1] WIP - Create necessary folders for Postgresql and Cassandra [puppet] - https://gerrit.wikimedia.org/r/288215 (https://phabricator.wikimedia.org/T134901) (owner: Gehel)
[06:49:54] (PS4) Muehlenhoff: Allow CQL access for multi-instance AQS Cassandra setup [puppet] - https://gerrit.wikimedia.org/r/289830
[06:50:07] RECOVERY - puppet last run on elastic2009 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[06:50:11] (CR) Giuseppe Lavagetto: [C: 2] admin: add kartik to analytics-privatedata-users instead [puppet] - https://gerrit.wikimedia.org/r/290170 (https://phabricator.wikimedia.org/T135704) (owner: Giuseppe Lavagetto)
[06:50:12] _joe_: oh I care about them somewhat - I'd like folks getting access to running queries to be happy
[06:53:43] <_joe_> kart_: you should be able to access stat1002 and perform your queries
[06:54:07] Operations, Ops-Access-Requests, Analytics, ContentTranslation-Analytics, and 2 others: Add kartik to analytics-privatedata-users - https://phabricator.wikimedia.org/T135704#2317273 (Joe) Open>Resolved
[06:55:12] <_joe_> !log uploaded a new hhvm package for trusty linked to libicu52, T86096
[06:55:13] T86096: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096
[06:55:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:55:56] RECOVERY - puppet last run on mw1090 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:56:28] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:56:57] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[06:56:58] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:57:07] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:49] (PS1) Gehel: Increase time window on which graphite checks are done for WDQS [puppet] - https://gerrit.wikimedia.org/r/290171
[06:58:27] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:02] <_joe_> !log installed the new hhvm package on mw2017, T86096
[07:00:03] T86096: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096
[07:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:03:11] (PS2) Gehel: Increase time window on which graphite checks are done for WDQS [puppet] - https://gerrit.wikimedia.org/r/290171
[07:03:27] (CR) Gehel: [C: 2 V: 2] Increase time window on which graphite checks are done for WDQS [puppet] - https://gerrit.wikimedia.org/r/290171 (owner: Gehel)
[07:10:15] (PS5) Muehlenhoff: Allow CQL access for multi-instance AQS Cassandra setup [puppet] - https://gerrit.wikimedia.org/r/289830
[07:10:34] (CR) Muehlenhoff: [C: 2 V: 2] Allow CQL access for multi-instance AQS Cassandra setup [puppet] - https://gerrit.wikimedia.org/r/289830 (owner: Muehlenhoff)
[07:13:51] (PS1) Mobrovac: service::node: Remove firejail and deployment_manage_user [puppet] - https://gerrit.wikimedia.org/r/290172
[07:15:06] (CR) jenkins-bot: [V: -1] service::node: Remove firejail and deployment_manage_user [puppet] - https://gerrit.wikimedia.org/r/290172 (owner: Mobrovac)
[07:16:29] (PS2) Mobrovac: service::node: Remove firejail and deployment_manage_user [puppet] - https://gerrit.wikimedia.org/r/290172
[07:21:50] _joe_: thanks
[07:21:57] (CR) Mobrovac: "PCC says it's a no-op basically - https://puppet-compiler.wmflabs.org/2869/" [puppet] - https://gerrit.wikimedia.org/r/290172 (owner: Mobrovac)
[07:23:15] <_joe_> yw kart_
[07:26:18] (CR) Muehlenhoff: [C: 1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/290172 (owner: Mobrovac)
[07:28:15] (CR) Muehlenhoff: [C: 2 V: 2] service::node: Remove firejail and deployment_manage_user [puppet] - https://gerrit.wikimedia.org/r/290172 (owner: Mobrovac)
[07:56:39] Operations, DBA, Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2317370 (jcrespo) These also have been imported: ``` 610990450 templatelinks 119323634 externallinks 103124031 categorylinks 92999329 user_properties 80917852 imagelinks ``` Revision table is ongoing now, b...
[07:59:19] moritzm: thanks for gerrit/289830 !
[08:00:05] yw, let's pick up the second part when joal is back
[08:01:28] !log performing schema change on s7 T130692
[08:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:01:39] T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692
[08:11:13] Operations, Ops-Access-Requests, Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2317412 (mobrovac) I am currently the default go-to guy when it comes to SC* services. This is becoming a bottleneck and, more generally, is not a su...
[08:14:22] Operations, Ops-Access-Requests, Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2317415 (Joe) @mobrovac I agree in principle; also I guess the "puppet disabling" will not be needed anymore once we move every service fully to scap...
[08:20:10] !log install restarting hhvm on canary systems for librsvg update
[08:20:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:20:21] Operations, DBA: Upgrade m1 db servers - https://phabricator.wikimedia.org/T135973#2317417 (jcrespo)
[08:22:29] Operations, Performance-Team, Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2317434 (elukey) >>! In T129963#2314781, @elukey wrote: > As we were saying on IRC this "slow" behavior seems to be a risk for us, rather than a win, but i...
[08:27:25] Operations, Ops-Access-Requests, Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2317435 (mobrovac) >>! In T135548#2317415, @Joe wrote: > @mobrovac I agree in principle; also I guess the "puppet disabling" will not be needed anymo...
[08:28:30] (PS1) Gehel: Maps - Ensure that Redis listen on all interfaces. [puppet] - https://gerrit.wikimedia.org/r/290182
[08:32:46] (CR) Gehel: "Puppet compiler : https://puppet-compiler.wmflabs.org/2870/" [puppet] - https://gerrit.wikimedia.org/r/290182 (owner: Gehel)
[08:32:54] (CR) Gehel: [C: 2] Maps - Ensure that Redis listen on all interfaces. [puppet] - https://gerrit.wikimedia.org/r/290182 (owner: Gehel)
[08:35:38] (PS1) Jcrespo: Prepare db1016 and db2010 for jessie reimage; db misc puppet review [puppet] - https://gerrit.wikimedia.org/r/290187 (https://phabricator.wikimedia.org/T135973)
[08:37:35] (PS2) Jcrespo: Prepare db1016 and db2010 for jessie reimage; db misc puppet review [puppet] - https://gerrit.wikimedia.org/r/290187 (https://phabricator.wikimedia.org/T135973)
[08:43:33] (PS2) Hashar: contint: bump pip 7.0.1 -> 8.1.2 [puppet] - https://gerrit.wikimedia.org/r/289639
[08:44:26] !log rolling restart of restbase2* for openjdk-8 update
[08:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:45:29] (CR) Hashar: "Image creation is broken for now due to a HHVM / libicu T86096" [puppet] - https://gerrit.wikimedia.org/r/289639 (owner: Hashar)
[08:50:02] (CR) Jcrespo: [C: 2] Prepare db1016 and db2010 for jessie reimage; db misc puppet review [puppet] - https://gerrit.wikimedia.org/r/290187 (https://phabricator.wikimedia.org/T135973) (owner: Jcrespo)
[08:53:02] Operations, Traffic, Patch-For-Review: varnishmedia: repeated calls to flush_stats() - https://phabricator.wikimedia.org/T132474#2317462 (ema) Open>Resolved a:ema
[08:56:58] !log stopping, backing up and reimage db2010 T135973
[08:56:59] T135973: Upgrade m1 db servers - https://phabricator.wikimedia.org/T135973
[08:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:03:39] PROBLEM - puppet last run on graphite2001 is CRITICAL: CRITICAL: puppet fail
[09:10:12] Operations, ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2317494 (fgiunchedi) NEW
[09:10:39] RECOVERY - RAID on ms-be2012 is OK: OK: optimal, 13 logical, 13 physical
[09:21:44] (CR) Santhosh: [C: 1] Enable Compact Language Links as default in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) (owner: KartikMistry)
[09:29:40] RECOVERY - puppet last run on graphite2001 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures
[09:40:07] (CR) Filippo Giunchedi: "LGTM, a couple of nits" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) (owner: 20after4)
[09:51:29] (PS1) Elukey: Add Spark Dynamic executors support to Yarn. [puppet/cdh] - https://gerrit.wikimedia.org/r/290191 (https://phabricator.wikimedia.org/T101343)
[09:53:34] ACKNOWLEDGEMENT - puppet last run on ms-be2012 is CRITICAL: CRITICAL: Puppet has 1 failures Filippo Giunchedi sdk broken T135975
[09:55:25] (PS2) Elukey: Add Spark Dynamic executors support to Yarn. [puppet/cdh] - https://gerrit.wikimedia.org/r/290191 (https://phabricator.wikimedia.org/T101343)
[09:57:42] (PS3) Elukey: Add Spark Dynamic executors support to Yarn. [puppet/cdh] - https://gerrit.wikimedia.org/r/290191 (https://phabricator.wikimedia.org/T101343)
[10:00:02] !log reverted net.netfilter.nf_conntrack_tcp_timeout_time_wait on kafka1013 back to 65 (as set by default by puppet)
[10:00:03] (PS4) Elukey: Add Spark Dynamic executors support to Yarn. [puppet/cdh] - https://gerrit.wikimedia.org/r/290191 (https://phabricator.wikimedia.org/T101343)
[10:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:01:13] (PS5) Elukey: Add Spark Dynamic executors support to Yarn. [puppet/cdh] - https://gerrit.wikimedia.org/r/290191 (https://phabricator.wikimedia.org/T101343)
[10:06:46] Operations, HHVM, User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2317611 (Joe) So, today I upgraded beta and the test hosts in codfw (mw2017 and mw2099); I would like to perform the rolling upgrade of production on Thursday May 26th at 7:00...
[10:19:01] <_joe_> MatmaRex: around?
[10:19:36] <_joe_> MatmaRex: see my last comments on https://phabricator.wikimedia.org/T58041, any reason not to do a test forced run of updateCollations.php on one wiki?
[10:24:08] _joe_: hi
[10:24:24] _joe_: assuming the new index is in place, no reason not to
[10:24:40] PROBLEM - puppet last run on db2004 is CRITICAL: CRITICAL: puppet fail
[10:24:49] !log rolling restart of restbase1* for openjdk-8 update
[10:24:51] <_joe_> MatmaRex: the new index is in place on ptwiki, where I wanted to test it
[10:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:26:07] <_joe_> so, lets see
[10:31:05] (PS6) Elukey: Add Spark Dynamic executors support to Yarn. [puppet/cdh] - https://gerrit.wikimedia.org/r/290191 (https://phabricator.wikimedia.org/T101343)
[10:32:29] <_joe_> !log running updateCollations.php --force on ptwiki, T58041
[10:32:30] T58041: updateCollation.php script prohibitively slow for very large wikis - https://phabricator.wikimedia.org/T58041
[10:32:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:36:34] _joe_, is it running now?
[10:36:54] <_joe_> jynus: yes
[10:37:10] I see no lag on s2
[10:37:35] oh, now I see 1 second on the slower hosts
[10:37:57] <_joe_> it's running excruciatingly slow though
[10:38:10] <_joe_> less than 100k lines/5 minutes
[10:39:47] <_joe_> the original report said nlwiki took 3 hours after the first improvements, I think we're running slower than that.
[10:40:04] <_joe_> but this is a forced run
[10:47:58] the index is being used: "key: cl_collation_ext; Using index condition; Using where"
[10:48:17] <_joe_> yeah it's working fine and not killing the db
[10:48:29] <_joe_> I think it's already what I had expected.
[10:49:26] (PS7) Elukey: Add Spark Dynamic executors support to Yarn. [puppet/cdh] - https://gerrit.wikimedia.org/r/290191 (https://phabricator.wikimedia.org/T101343)
[10:49:29] Operations: kvm on ganeti instances getting stuck - https://phabricator.wikimedia.org/T134242#2317665 (akosiaris) alsafi is on the other hand up for 10 consecutive days without displaying any of the symptoms. I am thinking we 've probably have a decent fix finally. I 'll wait a couple more and if all goes we...
[10:50:07] <_joe_> when we do it for the hhvm / libicu upgrade, we could run it in parallel for different shards; It will probably take one week to run though so the impact is not going to be negligible for users
[10:51:01] RECOVERY - puppet last run on db2004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:51:38] is this normal, "cl_sortkey = '=/QK/I3/7Q7M+5M/I?/A\Z'" ?
[10:52:05] <_joe_> I have no idea
[10:54:07] Operations, Beta-Cluster-Infrastructure, Labs, Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2317668 (hashar) I have checked after the week-end and deployment-cache-upload04 shows the FD leak. Via `lsof -X -n|grep deleted`: * Lot of...
[10:55:39] the usage of the links* tables is, in my opinion, broken, and I have some ideas to fix them
[10:57:31] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 691 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6529767 keys - replication_delay is 691
[10:57:34] !log restarting hhvm on app servers in codfw for librsvg update
[10:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:59:49] jynus: _joe_: yeah, that sortkey is normal
[11:00:23] ICU sortkeys are binary data
[11:00:36] question (far fetched)
[11:01:04] is this implemented like this because mysql's collations is unreliable or does it solve a complete different issue?
[11:02:01] jynus: yes, or at least, because they were years ago when this was first implemented :) but also because we technically support a bunch of dbmses
[11:03:58] I think it is tied also with the fact that text is stored as binary
[11:04:00] you'd probably need to ask tim starling, i think he wrote most of the system. (myself and bawolff put it into active use)
[11:04:11] (CR) Alexandros Kosiaris: [C: -1] "I'd say also apply it on an openldap server in the same patch. Also a very minor inline comment. Rest LGTM" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/289824 (https://phabricator.wikimedia.org/T120919) (owner: Muehlenhoff)
[11:05:19] Operations, Beta-Cluster-Infrastructure, Labs, Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2317675 (Joe) @hashar the reason you see all those deleted "varnishd" lines is that varnish has been updated on disk but not restarted, which...
[11:05:28] Operations, Performance-Team, Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2317676 (ori) >>! In T129963#2317434, @elukey wrote: > He is suggesting to measure hit ratio continuously over the day with fixed time windows to catch differ...
[11:08:26] (PS2) Filippo Giunchedi: Update collector version (both branches) [software/cassandra-metrics-collector] - https://gerrit.wikimedia.org/r/289963 (owner: Eevans)
[11:08:35] (CR) Filippo Giunchedi: [C: 2 V: 2] Update collector version (both branches) [software/cassandra-metrics-collector] - https://gerrit.wikimedia.org/r/289963 (owner: Eevans)
[11:10:03] Operations, Beta-Cluster-Infrastructure, Labs, Traffic: deployment-cache-upload04 (m1.medium) / is almost full - https://phabricator.wikimedia.org/T135700#2317678 (Joe) So the problem - that we have in production too (!!!) is that the logrotate receipt calls ``` invoke-rc.d varnishlog reload ```...
[11:11:29] Operations, Beta-Cluster-Infrastructure, Labs, Traffic: Varnishlog doesn't properly rotates logs, varnish.log is empty since forever (was: deployment-cache-upload04 (m1.medium) / is almost full) - https://phabricator.wikimedia.org/T135700#2317679 (Joe) p:Low>High
[11:12:55] Operations, Performance-Team, Patch-For-Review: Update memcached package and configuration options - https://phabricator.wikimedia.org/T129963#2317681 (ori) >>! In T129963#2317434, @elukey wrote: > He is suggesting to measure hit ratio continuously over the day with fixed time windows to catch differ...
[11:15:09] (CR) Nikerabbit: [C: -1] Enable Compact Language Links as default in Beta (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) (owner: KartikMistry)
[11:18:03] /usr/sbin/varnishd (deleted) <--- I am such a newbie _joe_ !
[11:18:51] (PS2) Filippo Giunchedi: Updated cassandra-metrics-collector version(s) [puppet] - https://gerrit.wikimedia.org/r/289965 (owner: Eevans)
[11:18:58] (CR) Filippo Giunchedi: [C: 2 V: 2] Updated cassandra-metrics-collector version(s) [puppet] - https://gerrit.wikimedia.org/r/289965 (owner: Eevans)
[11:20:12] !log deploy new version of cassandra-metrics-collector T135385
[11:20:13] T135385: investigate carbon-c-relay stalls/drops towards graphite2002 - https://phabricator.wikimedia.org/T135385
[11:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:23:55] (PS1) Filippo Giunchedi: restbase: add restbase200[789] to conftool-data [puppet] - https://gerrit.wikimedia.org/r/290199 (https://phabricator.wikimedia.org/T132976)
[11:24:36] !log run puppet and roll-restart cassandra-metrics-collector on restbase codfw/eqiad
[11:24:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
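Note on the "/usr/sbin/varnishd (deleted)" entries discussed at 11:05:19 and 11:18:03: on Linux, when a running binary is replaced on disk, the `/proc/<pid>/exe` symlink target gains a " (deleted)" suffix until the process is restarted. A minimal sketch of spotting such processes follows; the function name and the `proc_root` parameter are hypothetical, and this is not the `lsof` invocation used in the ticket.

```python
import os


def find_stale_executables(proc_root="/proc"):
    """Return (pid, exe_target) pairs for processes whose binary was replaced.

    Hypothetical helper: it reads each /proc/<pid>/exe symlink and keeps the
    ones whose target ends in " (deleted)". Processes that cannot be inspected
    (permissions, already exited) are skipped.
    """
    stale = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are process IDs
        try:
            target = os.readlink(os.path.join(proc_root, entry, "exe"))
        except OSError:
            continue
        if target.endswith(" (deleted)"):
            stale.append((int(entry), target))
    return sorted(stale)


if __name__ == "__main__":
    for pid, target in find_stale_executables():
        print(pid, target)
```

Running it as root gives the fullest picture, since unprivileged users generally cannot read the `exe` link of other users' processes.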
[11:32:33] (CR) KartikMistry: Enable Compact Language Links as default in Beta (4 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) (owner: KartikMistry)
[11:32:49] (PS2) KartikMistry: Enable Compact Language Links as default in Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966)
[11:33:07] (CR) Filippo Giunchedi: [C: -1] "afaict restbase isn't deployed on restbase2009 yet, to be merged after that has happened" [puppet] - https://gerrit.wikimedia.org/r/290199 (https://phabricator.wikimedia.org/T132976) (owner: Filippo Giunchedi)
[11:38:14] /usr/sbin/varnishd (deleted) <--- I am such a newbie _joe_ !
[11:38:17] grr
[11:38:19] sorry
[11:42:25] !log restbase deploying 75a94ee to restbase2009
[11:42:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:42:34] godog: ^
[11:43:23] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6485992 keys - replication_delay is 0
[11:43:54] (CR) Mobrovac: [C: 1] "All the new nodes have an up-to-date version of restbase and will be included in all future deploys." [puppet] - https://gerrit.wikimedia.org/r/290199 (https://phabricator.wikimedia.org/T132976) (owner: Filippo Giunchedi)
[11:43:54] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[11:49:14] mobrovac: nice, thanks!
[11:49:35] np
[11:54:52] (PS3) Muehlenhoff: Add a new backup set to backup openldap databases and enable on serpens [puppet] - https://gerrit.wikimedia.org/r/289824 (https://phabricator.wikimedia.org/T120919)
[11:56:16] (CR) jenkins-bot: [V: -1] Add a new backup set to backup openldap databases and enable on serpens [puppet] - https://gerrit.wikimedia.org/r/289824 (https://phabricator.wikimedia.org/T120919) (owner: Muehlenhoff)
[12:08:57] (PS4) Muehlenhoff: Add a new backup set to backup openldap databases and enable on serpens [puppet] - https://gerrit.wikimedia.org/r/289824 (https://phabricator.wikimedia.org/T120919)
[12:09:52] (CR) Mobrovac: [C: 1] "The dependency has been deployed in prod with I406787829296149f91ebc2d4b456badda464f17d" [puppet] - https://gerrit.wikimedia.org/r/289092 (owner: GWicke)
[12:13:53] (CR) Filippo Giunchedi: [C: 2 V: 2] restbase: add restbase200[789] to conftool-data [puppet] - https://gerrit.wikimedia.org/r/290199 (https://phabricator.wikimedia.org/T132976) (owner: Filippo Giunchedi)
[12:15:03] (CR) Mobrovac: "PCC looking good - https://puppet-compiler.wmflabs.org/2873/" [puppet] - https://gerrit.wikimedia.org/r/289092 (owner: GWicke)
[12:15:27] !log filippo@palladium conftool action : set/pooled=yes; selector: restbase2007.codfw.wmnet
[12:15:33] !log filippo@palladium conftool action : set/pooled=yes; selector: restbase2008.codfw.wmnet
[12:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:15:37] !log filippo@palladium conftool action : set/pooled=yes; selector: restbase2009.codfw.wmnet
[12:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:16:49] Operations, ops-codfw, cassandra, Patch-For-Review: rack/setup/deploy restbase200[7-9] - https://phabricator.wikimedia.org/T132976#2318105 (fgiunchedi) Open>Resolved restbase200[789] bootstrapped each with two instances and restbase running, resolving
[12:25:53] (PS2) Elukey: Forward x-client-ip & user-agent to AQS [puppet] - https://gerrit.wikimedia.org/r/289092 (owner: GWicke)
[12:26:44] (PS5) Muehlenhoff: Add a new backup set to backup openldap databases and enable on serpens [puppet] - https://gerrit.wikimedia.org/r/289824 (https://phabricator.wikimedia.org/T120919)
[12:27:34] (CR) Elukey: [C: 2] "Change discussed on IRC with Marko, looks good to me." [puppet] - https://gerrit.wikimedia.org/r/289092 (owner: GWicke)
[12:31:03] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:32:38] (CR) Muehlenhoff: "This is now enabled on serpens. We can include the OIT LDAP mirror later on. Also dropped the additional require_package() on slapd." [puppet] - https://gerrit.wikimedia.org/r/289824 (https://phabricator.wikimedia.org/T120919) (owner: Muehlenhoff)
[12:33:17] !log stopping, backing up and reimage db1016 T135973 (it will also affect db2010 lag)
[12:33:18] T135973: Upgrade m1 db servers - https://phabricator.wikimedia.org/T135973
[12:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:38:12] Operations, Beta-Cluster-Infrastructure, Labs, Traffic: Varnishlog doesn't properly rotates logs, varnish.log is empty since forever (was: deployment-cache-upload04 (m1.medium) / is almost full) - https://phabricator.wikimedia.org/T135700#2318200 (Joe) A third option is we just stop varnishlog as...
[12:38:49] dbproxy1001 should now complain of a missing backend
[12:42:00] effectively, the proxy works
[12:44:18] !log restbase restarting to apply https://gerrit.wikimedia.org/r/#/c/289092/
[12:44:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:48:38] (CR) Elukey: [C: 1] "Puppet compiler looks good: http://puppet-compiler.wmflabs.org/2876/" [puppet] - https://gerrit.wikimedia.org/r/289981 (owner: Dzahn)
[12:50:02] (PS1) Ema: Do not run varnishlog on Varnish 3 [puppet] - https://gerrit.wikimedia.org/r/290208 (https://phabricator.wikimedia.org/T135700)
[12:53:28] (PS1) Giuseppe Lavagetto: varnish: don't run varnishlog on 3.x hosts [puppet] - https://gerrit.wikimedia.org/r/290209 (https://phabricator.wikimedia.org/T135700)
[12:56:21] (CR) Giuseppe Lavagetto: [C: -1] "This will end up with a duplicate declaration, you need to add the service declaration to a class, not a define." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/290208 (https://phabricator.wikimedia.org/T135700) (owner: Ema)
[12:56:43] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[12:57:02] !log rolling restart of cassandra on maps-test cluster for openjdk security update
[12:57:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:03:53] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 611 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6494725 keys - replication_delay is 611
[13:04:41] Operations, ops-eqiad, Analytics-Cluster: kafka1013 hardware crash - https://phabricator.wikimedia.org/T135557#2318255 (elukey) Adding a note from the SAL: ``` 10:00 reverted net.netfilter.nf_conntrack_tcp_timeout_time_wait on kafka1013 back to 65 (as set by default by puppet) ``` So now...
[13:06:33] (CR) Ema: [C: 1] varnish: don't run varnishlog on 3.x hosts [puppet] - https://gerrit.wikimedia.org/r/290209 (https://phabricator.wikimedia.org/T135700) (owner: Giuseppe Lavagetto)
[13:07:22] Operations, ops-eqiad, Analytics-Cluster: kafka1013 hardware crash - https://phabricator.wikimedia.org/T135557#2303007 (MoritzMuehlenhoff) And also: Why was net.netfilter.nf_conntrack_tcp_timeout_time_wait set to the kernel default value of 120? The value of 65 should have been set on system startup...
[13:07:24] (Abandoned) Ema: Do not run varnishlog on Varnish 3 [puppet] - https://gerrit.wikimedia.org/r/290208 (https://phabricator.wikimedia.org/T135700) (owner: Ema)
[13:07:38] (CR) Giuseppe Lavagetto: [C: 2] varnish: don't run varnishlog on 3.x hosts [puppet] - https://gerrit.wikimedia.org/r/290209 (https://phabricator.wikimedia.org/T135700) (owner: Giuseppe Lavagetto)
[13:12:17] Operations, Beta-Cluster-Infrastructure, Labs, Traffic, Patch-For-Review: Varnishlog doesn't properly rotates logs, varnish.log is empty since forever (was: deployment-cache-upload04 (m1.medium) / is almost full) - https://phabricator.wikimedia.org/T135700#2318265 (Joe) Open>Resolved
[13:16:50] (PS1) Addshore: DNM Clone grafana piechart plugin [puppet] - https://gerrit.wikimedia.org/r/290212 (https://phabricator.wikimedia.org/T121846)
[13:17:51] (CR) jenkins-bot: [V: -1] DNM Clone grafana piechart plugin [puppet] - https://gerrit.wikimedia.org/r/290212 (https://phabricator.wikimedia.org/T121846) (owner: Addshore)
[13:26:05] Operations, DBA: Physical location SPOF because of database server distribution on a single rack (D1) - https://phabricator.wikimedia.org/T111992#2318310 (Volans) Updated with the actual hostnames. @Cmjohnson for the ones in row C I don't know yet which hostnames are in C2 and which in C3. I'll re-chec...
[13:27:43] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2318325 (10Gilles) [13:29:34] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6486692 keys - replication_delay is 0 [13:32:15] 06Operations: Automated service restarts for common low-level system services - https://phabricator.wikimedia.org/T135991#2318329 (10MoritzMuehlenhoff) [13:35:43] 06Operations, 10ops-codfw: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2318354 (10Joe) @papaul my proposal would be: # Swap out all mw* servers in row A3 # Install 24 servers in A3 # Remove mw2041-mw2060 from row A4 # Replace them with the remaining 12 servers What will t... [13:39:58] (03PS2) 10Aude: Set interwiki sorting order for West Frisian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288101 (https://phabricator.wikimedia.org/T103207) [13:58:22] 06Operations, 06Performance-Team, 10Thumbor: Package and backport Thumbor dependencies in Debian - https://phabricator.wikimedia.org/T134485#2318405 (10Gilles) [13:59:15] (03PS1) 10Elukey: Add get_hits_ratio calculation to memcached's gmond agent. [puppet] - 10https://gerrit.wikimedia.org/r/290233 (https://phabricator.wikimedia.org/T129963) [14:08:04] 06Operations, 06Services, 10cassandra, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#2318423 (10fgiunchedi) >>! In T113714#2298161, @Eevans wrote: >>>! In T113714#2298063, @fgiunchedi wrote: >>>>! In T113714#2297469, @Eevans wrote: >>>>... 
[14:11:14] PROBLEM - puppet last run on cp2008 is CRITICAL: CRITICAL: puppet fail [14:12:44] (03PS1) 10Giuseppe Lavagetto: mediawiki: assign new eqiad appservers, install with jessie [puppet] - 10https://gerrit.wikimedia.org/r/290236 [14:14:23] (03PS1) 10Filippo Giunchedi: cassandra: add restbase2003 instances [puppet] - 10https://gerrit.wikimedia.org/r/290237 [14:20:43] (03PS2) 10Giuseppe Lavagetto: mediawiki: assign new eqiad appservers, install with jessie [puppet] - 10https://gerrit.wikimedia.org/r/290236 [14:23:41] 06Operations, 06Services, 10cassandra, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#2318465 (10Eevans) >>! In T113714#2318423, @fgiunchedi wrote: >>>! In T113714#2298161, @Eevans wrote: >>>>! In T113714#2298063, @fgiunchedi wrote: >>>>... [14:25:28] (03CR) 10Ottomata: [C: 031] "Only one nit! Haven't tested but I trust ya!" (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290191 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [14:29:16] (03PS8) 10Elukey: Add Spark Dynamic executors support to Yarn. [puppet/cdh] - 10https://gerrit.wikimedia.org/r/290191 (https://phabricator.wikimedia.org/T101343) [14:29:22] ottomata1: o/ --^ [14:30:34] (03CR) 10Ottomata: "Hm, ok so, the name of this variable is a little misleading. It is namespaced as aqs::, which leads me to believe it is a parameter on th" [puppet] - 10https://gerrit.wikimedia.org/r/289830 (owner: 10Muehlenhoff) [14:31:05] hiya! :) [14:32:40] elukey: sorry I didn't see ^^ aqs::hosts earlier, see my comment there [14:33:03] (03CR) 10Ottomata: [C: 032] Add Spark Dynamic executors support to Yarn. 
[puppet/cdh] - 10https://gerrit.wikimedia.org/r/290191 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [14:33:10] elukey: +2 on that, but didn't merge [14:33:44] ottomata: I was thinking to merge it with puppet disabled on the whole cluster, and then to re-enable/restart incrementally [14:34:09] and to leave the master nodes as last step [14:34:21] if anything goes wrong I can rollback easily [14:34:29] (also stopping Oozie might be good) [14:35:34] aye, hm, elukey it seems like a relatively safe change, restarting workers 1 by 1. i'd just merge it and apply it everywhere than restart one by one [14:35:41] prob don't need to stop oozie during worker restarts [14:35:50] never a bad idea during master restarts though, even though things should be ok [14:36:30] (03PS1) 10Mobrovac: lvs::monitor_services: s/eqiad/codfw/ in descriptions where needed [puppet] - 10https://gerrit.wikimedia.org/r/290242 [14:37:13] RECOVERY - puppet last run on cp2008 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:38:01] ottomata: all right got it, I was afraid that yarn would have restarted on yarn-site.xml changes [14:38:28] (03PS2) 10Filippo Giunchedi: cassandra: add restbase2003 instances [puppet] - 10https://gerrit.wikimedia.org/r/290237 [14:38:30] (03PS1) 10Filippo Giunchedi: cassandra: add restbase2005 instances [puppet] - 10https://gerrit.wikimedia.org/r/290243 (https://phabricator.wikimedia.org/T95253) [14:38:31] naw, i don't like service subscribes for stateful services like that [14:38:32] (03PS1) 10Filippo Giunchedi: cassandra: add restbase2006 instances [puppet] - 10https://gerrit.wikimedia.org/r/290244 (https://phabricator.wikimedia.org/T95253) [14:38:35] too unpredictable [14:38:54] all right merging! [14:39:25] (03PS1) 10Muehlenhoff: Rename Hiera variable for aqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/290246 [14:40:25] (03CR) 10Ottomata: [C: 031] "One, nit, other than that merge away! Thank you!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/290246 (owner: 10Muehlenhoff) [14:40:38] (03CR) 10Giuseppe Lavagetto: [C: 032] lvs::monitor_services: s/eqiad/codfw/ in descriptions where needed [puppet] - 10https://gerrit.wikimedia.org/r/290242 (owner: 10Mobrovac) [14:42:50] 06Operations, 10ops-codfw: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2318515 (10Papaul) @Joe Rob suggested we use mw2215 which is the next app server but I think we can reused the same name. What do you think. [14:43:40] !log filippo@palladium conftool action : set/pooled=no; selector: restbase2005.codfw.wmnet [14:43:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:40] !log reboot restbase2005 in single user mode for T113714 [14:44:41] T113714: Separate /var on restbase - https://phabricator.wikimedia.org/T113714 [14:44:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:49:49] (03CR) 10Nikerabbit: [C: 04-1] Enable Compact Language Links as default in Beta (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) (owner: 10KartikMistry) [14:52:35] (03PS2) 10Muehlenhoff: Rename Hiera variable for aqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/290246 [14:52:51] (03CR) 10Muehlenhoff: [C: 032 V: 032] Rename Hiera variable for aqs hosts [puppet] - 10https://gerrit.wikimedia.org/r/290246 (owner: 10Muehlenhoff) [14:53:37] (03PS1) 10Elukey: Add Spark dynamic executor setting to Yarn Namenodes. [puppet] - 10https://gerrit.wikimedia.org/r/290252 (https://phabricator.wikimedia.org/T101343) [14:55:16] (03PS1) 10Muehlenhoff: Add comment [puppet] - 10https://gerrit.wikimedia.org/r/290253 [14:55:25] (elukey am excited to try this one! 
:D ) [14:55:27] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add comment [puppet] - 10https://gerrit.wikimedia.org/r/290253 (owner: 10Muehlenhoff) [14:55:40] thanks moritzm :) [14:58:24] 06Operations, 07HHVM, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2318543 (10Anomie) If I'm reading the chart at http://site.icu-project.org/download correctly, it seems we're going from Unicode 6.0 in libicu48 to Unicode 6.3 in libicu52. That me... [14:59:53] (03PS2) 10Elukey: Add Spark dynamic executor setting to Yarn Namenodes. [puppet] - 10https://gerrit.wikimedia.org/r/290252 (https://phabricator.wikimedia.org/T101343) [15:00:04] anomie ostriches thcipriani marktraceur: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160523T1500). Please do the needful. [15:00:04] Urbanecm aude MatmaRex: A patch you scheduled for Morning SWAT(Max 8 patches) is about to be deployed. Please be available during the process. [15:00:27] Present [15:00:36] hi [15:01:05] I can SWAT today. [15:01:57] thcipriani: we would like https://gerrit.wikimedia.org/r/#/c/290216/ also in swat, but need to update the wikidata build [15:02:00] (03CR) 10Elukey: [C: 032 V: 032] Add Spark dynamic executor setting to Yarn Namenodes. [puppet] - 10https://gerrit.wikimedia.org/r/290252 (https://phabricator.wikimedia.org/T101343) (owner: 10Elukey) [15:03:04] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290000 (https://phabricator.wikimedia.org/T135774) (owner: 10Urbanecm) [15:03:48] (03Merged) 10jenkins-bot: Adjust groups permissions on fa.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290000 (https://phabricator.wikimedia.org/T135774) (owner: 10Urbanecm) [15:03:51] aude: ack. 
[15:04:04] 06Operations, 10Phabricator, 10Phabricator-Upstream: PHD ensuring umask goodness - https://phabricator.wikimedia.org/T91648#2318585 (10akosiaris) >>! In T91648#2316165, @Aklapper wrote: > Anyone knows if this is actually still an issue or if T128009 fixed this? > @akosiaris, @chasemp or anyone else? > > Ask... [15:06:05] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:290000|Adjust groups permissions on fa.wikipedia]] (duration: 00m 41s) [15:06:11] ^ Urbanecm check please [15:06:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:07:25] thcipriani, working [15:07:38] (03PS9) 10Thcipriani: Creation of page mover userright on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [15:08:07] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [15:08:53] (03Merged) 10jenkins-bot: Creation of page mover userright on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/286168 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [15:10:01] (03PS2) 10Filippo Giunchedi: cassandra: add restbase2005 instances [puppet] - 10https://gerrit.wikimedia.org/r/290243 (https://phabricator.wikimedia.org/T95253) [15:10:14] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] cassandra: add restbase2005 instances [puppet] - 10https://gerrit.wikimedia.org/r/290243 (https://phabricator.wikimedia.org/T95253) (owner: 10Filippo Giunchedi) [15:11:49] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:286168|Creation of page mover userright on enwiki]] (duration: 00m 30s) [15:11:55] ^ Urbanecm check please [15:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:13:04] Is Special:UserGroupRights cached? I can't see it. 
[15:13:24] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: Puppet has 1 failures [15:13:35] !log performing schema change on s3 T130692 [15:13:37] T130692: Add new indexes from eec016ece6d2b30addcdf3d3efcc2ba59b10e858 to production databases - https://phabricator.wikimedia.org/T130692 [15:13:42] Now working thcipriani. [15:13:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:14:44] Thanks for deploys thcipriani . [15:14:57] Urbanecm: hmm, FWIW I see extendedmover in https://en.wikipedia.org/wiki/Special:ListGroupRights [15:15:09] Urbanecm: thanks for checking the deploys. [15:15:15] Now me too :). [15:15:24] that's good :) [15:15:51] (03PS3) 10Thcipriani: Set interwiki sorting order for West Frisian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288101 (https://phabricator.wikimedia.org/T103207) (owner: 10Aude) [15:16:09] (03CR) 10KartikMistry: Enable Compact Language Links as default in Beta (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) (owner: 10KartikMistry) [15:16:16] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288101 (https://phabricator.wikimedia.org/T103207) (owner: 10Aude) [15:16:36] thcipriani: i might need a bit more time with updating the build [15:16:48] so maybe it's better to do after swat, depending how long it takes [15:16:55] (03Merged) 10jenkins-bot: Set interwiki sorting order for West Frisian Wikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/288101 (https://phabricator.wikimedia.org/T103207) (owner: 10Aude) [15:17:22] aude: kk, no problem. 
[15:17:28] (03PS2) 10Nikerabbit: ULS: Stop using /static/current [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) [15:17:32] (03PS3) 10KartikMistry: Enable Compact Language Links as default in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) [15:17:32] i have some concerns about changes composer is making [15:17:42] adding autoload_static.php [15:18:23] (03CR) 10Nikerabbit: [C: 04-2] "The ULS change should be rolling out with the next train. Let's visit next week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289652 (https://phabricator.wikimedia.org/T135806) (owner: 10Nikerabbit) [15:21:03] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:288101|Set interwiki sorting order for West Frisian Wikibooks]] (duration: 00m 25s) [15:21:05] ^ aude check please [15:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:21:26] thcipriani: think it's ok [15:22:08] (03PS2) 10Thcipriani: Final Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289109 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [15:22:57] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289109 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [15:23:31] (03Merged) 10jenkins-bot: Final Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289109 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [15:23:43] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: Puppet has 1 failures [15:24:40] 06Operations, 10ops-codfw: codfw old mw app server decomission - https://phabricator.wikimedia.org/T135468#2318691 (10RobH) Please do not reuse the old names for mw systems. 
Right now we know that higher # mw systems are newer systems, and its easier to do that for now. Please name these mw2215 up. When we... [15:25:01] (03PS4) 10KartikMistry: Enable Compact Language Links as default in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) [15:27:07] blerg that was weird. I got https://logstash.wikimedia.org/#/connectionFailed for a second. [15:27:32] (03PS1) 10Jcrespo: Apply firewall to db1016 [puppet] - 10https://gerrit.wikimedia.org/r/290257 (https://phabricator.wikimedia.org/T135973) [15:27:52] thcipriani: i have a patch for the build, though probably jenkins will take a while [15:28:10] idk if it will be too long (if so, suppose i could deploy it myself later) [15:29:00] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:289109|Final Commons configuration for $wgUploadDialog]] (duration: 00m 30s) [15:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:30:08] MatmaRex: whoa lots of errors... Undefined variable: wmgCustomUploadDialog in /srv/mediawiki/wmf-config/CommonSettings.php on line 1984 [15:30:16] trying to figure out if I did this out of order... [15:30:23] wut [15:31:03] thcipriani: aw shit, it probably needs the 'default' value [15:31:12] reverting [15:31:14] 'default' => false [15:31:16] sorry :/ [15:31:29] (yeah, revert it) [15:32:19] !log thcipriani@tin Synchronized wmf-config: SWAT: revert [[gerrit:289109|Final Commons configuration for $wgUploadDialog]] (duration: 00m 28s) [15:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:33:56] 06Operations, 10ops-codfw, 06DC-Ops, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#2318735 (10RobH) p:05Normal>03High Papaul is still using the mifi for ALL onsite work. Is there anything that either Papaul or I can do to move the setup of wifi in codfw along? 
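Note on the "Undefined variable: wmgCustomUploadDialog" errors above: the suggested fix in the channel is adding `'default' => false` to the setting. A minimal sketch of why a missing default leaves the variable undefined on most wikis follows. It is a simplified model in Python, not MediaWiki's actual settings-resolution code; the function names and the assumption that the original patch only set a `commonswiki` value are hypothetical.

```python
# Simplified model of per-wiki settings with a 'default' fallback.
# This is NOT MediaWiki's real resolution logic; all names are hypothetical.

_MISSING = object()  # sentinel: no value could be resolved for this wiki


def resolve_setting(per_wiki_values, wiki):
    """Return the wiki-specific value, else the 'default' entry, else _MISSING."""
    if wiki in per_wiki_values:
        return per_wiki_values[wiki]
    return per_wiki_values.get("default", _MISSING)


def extract_globals(settings, wiki):
    """Build the variables one wiki would see.

    A setting with neither a wiki-specific nor a 'default' value is simply
    left out, which is what later surfaces as an "undefined variable".
    """
    resolved = {}
    for name, per_wiki_values in settings.items():
        value = resolve_setting(per_wiki_values, wiki)
        if value is not _MISSING:
            resolved[name] = value
    return resolved


# Assumed shape of the first deploy: only one wiki has a value.
broken = {"wmgCustomUploadDialog": {"commonswiki": True}}

# Assumed shape of the redeploy: 'default' => false covers every other wiki.
fixed = {"wmgCustomUploadDialog": {"default": False, "commonswiki": True}}

if __name__ == "__main__":
    print("enwiki, no default:  ", extract_globals(broken, "enwiki"))
    print("enwiki, with default:", extract_globals(fixed, "enwiki"))
```

In this model any code that reads the variable unconditionally breaks on every wiki except the one with an explicit value, which matches the log spam that prompted the revert.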
[15:34:44] (03PS1) 10Thcipriani: Revert "Final Commons configuration for $wgUploadDialog" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290258 [15:34:53] (03PS1) 10Bartosz Dziewoński: Final Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290259 (https://phabricator.wikimedia.org/T134775) [15:35:13] thcipriani: this ^ is how it should've been, i think. i can schedule it for later. [15:35:14] (03CR) 10jenkins-bot: [V: 04-1] Final Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290259 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [15:35:18] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 702 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6501652 keys - replication_delay is 702 [15:35:43] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290258 (owner: 10Thcipriani) [15:36:51] (03Merged) 10jenkins-bot: Revert "Final Commons configuration for $wgUploadDialog" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290258 (owner: 10Thcipriani) [15:36:53] MatmaRex: I can get it out. Sorry didn't catch that in review :( [15:37:32] neither did i. oh well, it was luckily harmless, just some spammed logs. :) [15:37:47] (03PS2) 10Thcipriani: Final Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290259 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [15:37:51] if we have time to deploy it now, then i'm still here. 
thanks [15:38:09] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [15:39:34] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290259 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [15:40:14] (03Merged) 10jenkins-bot: Final Commons configuration for $wgUploadDialog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290259 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [15:42:13] thcipriani: https://gerrit.wikimedia.org/r/#/c/290256/ is ready if you want to take care of it [15:42:24] needs +2 and hten wait for jenkins again :/ [15:42:50] !log thcipriani@tin Synchronized wmf-config: SWAT: [[gerrit:290259|Final Commons configuration for $wgUploadDialog]] (duration: 00m 28s) [15:42:56] ^ MatmaRex check please [15:42:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:43:00] logs look fine :) [15:43:13] aude: looking [15:43:15] ok [15:43:31] just took some time to figure out the composer issue, but it's okay now [15:44:17] thcipriani: looks alright. thanks! [15:44:28] MatmaRex: cool, thanks for checking. [15:46:29] wowza. gate-and-submit sure doesn't seem happy :( [15:47:27] !log testing thread_pool_max_threads=2000 on db1076 (s2) T133333 [15:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:50] T133333: Audit new eqiad masters configuration - https://phabricator.wikimedia.org/T133333 [15:48:27] volans, I meant large as in traffic, not in server capacity :-) [15:48:38] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:49:20] jynus: do you prefer db1067? 
both have a weight of 500 :-) [15:49:24] 06Operations, 06Services, 10cassandra, 13Patch-For-Review, 07RESTBase-architecture: Separate /var on restbase - https://phabricator.wikimedia.org/T113714#2318801 (10Eevans) @fgiunchedi completed the conversion of 2005 to 2005-a (in what looks like ~15 minutes); Everything looks perfect. Good work @fgiun... [15:49:43] one enwiki large server or one api for one of the large wikis [15:51:39] 06Operations, 03Discovery-Search-Sprint: Check Icinga alert on CirrusSearch response time - https://phabricator.wikimedia.org/T134852#2279774 (10EBernhardson) a:03EBernhardson [15:53:36] is it just me, or has the etherpad become very janky [15:54:14] better db1073 jynus? I applied the rule start from s2 ;) [15:54:25] where janky here means disconnecting constantly [15:55:04] <_joe_> urandom: probably not just you [15:55:07] <_joe_> let me check [15:56:49] volans, we will apply it on all, but I just mentioned that a high traffic one would be a better test-bed [15:57:06] * aude watching zuul [15:57:10] <_joe_> urandom: works fine for me and etherpad itself seems ok [15:57:46] _joe_: yeah, for me as well now [15:58:13] _joe_: if you didn't do anything, then it must have seen you coming and was afraid [15:58:50] etherpad stutters to signal the ops meeting coming up! [15:59:06] aude: hopefully you've got a minute. 
I had to kill a job that was hung and blocking up the works for 30 minutes, which requeued all the jobs behind it :( [15:59:30] dbproxy1001 should go back to have 2 backends now [16:01:59] !log testing thread_pool_max_threads=2000 on db1072 (s1) [instead of db1076 (s2)] T133333 [16:02:00] T133333: Audit new eqiad masters configuration - https://phabricator.wikimedia.org/T133333 [16:02:02] !log restarting yarn on analytics10* hosts to pick up the new Spark shuffler process [16:02:04] 06Operations, 10ops-codfw: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2318828 (10Papaul) [16:02:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:02:15] thcipriani: ok [16:05:51] (03CR) 10Nikerabbit: [C: 04-1] Enable Compact Language Links as default in Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) (owner: 10KartikMistry) [16:06:01] (03CR) 10Bartosz Dziewoński: "Reverted in https://gerrit.wikimedia.org/r/#/c/290258/ , re-deployed in https://gerrit.wikimedia.org/r/#/c/290259/ ." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/289109 (https://phabricator.wikimedia.org/T134775) (owner: 10Bartosz Dziewoński) [16:06:08] (03CR) 10Bartosz Dziewoński: "Re-deployed in https://gerrit.wikimedia.org/r/#/c/290259/ ." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/290258 (owner: 10Thcipriani) [16:06:43] (03PS1) 10EBernhardson: Point CirrusSearch alerting at more useful metrics [puppet] - 10https://gerrit.wikimedia.org/r/290262 (https://phabricator.wikimedia.org/T134852) [16:10:56] (03CR) 10jenkins-bot: [V: 04-1] Point CirrusSearch alerting at more useful metrics [puppet] - 10https://gerrit.wikimedia.org/r/290262 (https://phabricator.wikimedia.org/T134852) (owner: 10EBernhardson) [16:12:02] !log thcipriani@tin Synchronized php-1.28.0-wmf.2/extensions/Wikidata/extensions/Wikibase/client/includes/Hooks/DataUpdateHookHandlers.php: [[gerrit:290256|Update Wikidata - fix file deletion issue on commons]] (duration: 00m 29s) [16:12:10] ^ aude check please [16:12:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:12:22] thanks [16:12:34] \o/ [16:12:40] it's good :) [16:13:22] aude: thank you! [16:13:36] thank you! (sorry it took so long) [16:13:56] 06Operations, 10netops: cr2-codfw LUCHIP/trinity_pio error messages - https://phabricator.wikimedia.org/T134932#2318924 (10faidon) Mark rebooted the FPC on Friday: > 13:45 mark: Enabled cr2-codfw et-0/* interfaces, reenabling OSPF/OSPF3 > 13:38 mark: Bringing cr2-codfw FPC 0 back up > 13:37 mark: Offlinin... 
[16:19:48] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: /page/mobile-sections/{title} (Get MobileApps Foobar page) is CRITICAL: Test Get MobileApps Foobar page returned the unexpected status 500 (expecting: 200) [16:19:48] PROBLEM - restbase endpoints health on restbase-test2001 is CRITICAL: /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [16:20:08] PROBLEM - restbase endpoints health on restbase-test2003 is CRITICAL: /page/summary/{title} (Get summary from storage) is CRITICAL: Test Get summary from storage returned the unexpected status 500 (expecting: 200): /media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Test Mathoid - check test formula returned the unexpected status 500 (expecting: 200) [16:20:18] <_joe_> urandom: ^^ the test cluster seems not happy [16:20:27] (03PS1) 10Mobrovac: RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 [16:21:50] (03CR) 10jenkins-bot: [V: 04-1] RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [16:23:40] (03PS2) 10Mobrovac: RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 [16:24:07] RECOVERY - restbase endpoints health on restbase-test2001 is OK: All endpoints are healthy [16:24:28] RECOVERY - restbase endpoints health on restbase-test2003 is OK: All endpoints are healthy [16:24:41] (03CR) 10jenkins-bot: [V: 04-1] RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [16:25:11] (03PS2) 10EBernhardson: Point CirrusSearch alerting at more useful metrics [puppet] - 10https://gerrit.wikimedia.org/r/290262 (https://phabricator.wikimedia.org/T134852) [16:25:26] _joe_: hrmm, moritzm was just bouncing Cassandra there [16:26:18] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [16:26:44] 
guess that was it. [16:26:51] (03PS3) 10Mobrovac: RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 [16:28:53] !log Bouncing RESTBase on restbase-test200[1-3].codfw.wmnet [16:29:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:32:11] !log Stashbot down due to backing elasticsearch cluster instability. Investigating. [16:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:42] (03CR) 10Mobrovac: "PCC looking good - https://puppet-compiler.wmflabs.org/2880/" [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [16:41:17] (03PS51) 10Alexandros Kosiaris: ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [16:45:37] !log Stashbot back online. Will continue to monitor for a while to see if ES cluster is happier. [16:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:47:53] (03CR) 10KartikMistry: Enable Compact Language Links as default in Beta (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) (owner: 10KartikMistry) [16:48:07] 06Operations, 10Analytics, 06Performance-Team, 10Traffic: A/B Testing solid framework - https://phabricator.wikimedia.org/T135762#2310174 (10Milimetric) I have a quick suggestion to make this play nice with client-side sampling. First, the problem: some Event Logging instrumentation randomly only sends b... 
[16:48:14] (03PS5) 10KartikMistry: Enable Compact Language Links as default in Beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290165 (https://phabricator.wikimedia.org/T134966) [16:52:08] (03PS3) 10Rush: labstore cleanup and role vs module arrange [puppet] - 10https://gerrit.wikimedia.org/r/289964 [16:52:19] (03PS4) 10Rush: labstore cleanup and role vs module arrange [puppet] - 10https://gerrit.wikimedia.org/r/289964 [16:52:36] bd808: Hi, what im thinking we can do with https://phabricator.wikimedia.org/T136010 and https://phabricator.wikimedia.org/T135161 is in mediawiki/vendor we can create a php 5.5 and php 5.6 folder. And add a php check from where we load mediawiki vendor in mw core. [16:52:51] So we ignore php 5.6 folder if we use php 5.5 <- [16:53:03] and ignore php 5.5 if we use php 5.6 -> [16:53:54] paladox: Let's go to #-releng. This isn't really a good conversation to have here [16:54:04] Ok [16:56:38] (03PS5) 10Rush: labstore cleanup and role vs module arrange [puppet] - 10https://gerrit.wikimedia.org/r/289964 [16:57:28] (03PS6) 10Rush: labstore cleanup and role vs module arrange [puppet] - 10https://gerrit.wikimedia.org/r/289964 [16:57:55] 06Operations, 10ops-eqiad: audit/remove two cross-connection patch cables - https://phabricator.wikimedia.org/T132945#2319153 (10RobH) 05Open>03Resolved [17:00:05] gehel: Respected human, time to deploy Weekly Wikidata query service deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160523T1700). Please do the needful. [17:02:06] SMalyshev: let me know when you're ready... [17:03:41] gehel: I think I'm done on tin. I did the checkout and fixed the submodules. So now we need to deploy it and check the GUI is ok [17:04:02] SMalyshev: want me to take over from there ? [17:04:11] gehel: yes [17:04:59] SMalyshev: is beta already up to date? [17:05:07] gehel: yes [17:06:12] SMalyshev: ok, so repo on tin is already fully updated, I just have to "git deploy"? Too easy! 
[17:07:40] gehel: yes. and check it worked fine :) I hope nothing weird happens. seems to be working fine on beta [17:08:19] (03CR) 10Alexandros Kosiaris: RESTBase: Set up rate limiting (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [17:08:32] gehel: no code changes, so no need for restarts [17:08:43] SMalyshev: I think that "git deploy" does not like new submodules... [17:09:00] gehel: hmm... what does it say? [17:09:45] SMalyshev: it is as precise and talkative as usual: "0/2 minions completed checkout" [17:10:06] gehel: hmm... maybe it's just slow? it happens [17:10:17] dis looking for his notes on debugging git deploy... [17:10:25] !log restarting Yarn Resource manager (master node) on analytics1001 to apply a new Spark configuration. The service will automatically failover to analytics1002 [17:10:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:25] gehel: is git deploy using git at all or just copies files? [17:11:46] SMalyshev: not entirely sure, but I think it is using git [17:11:55] SMalyshev: gehel it is using git [17:12:07] the targets get their git remotes set to point at tin [17:12:16] so they fetch from it and then checkout [17:12:39] submodules with git deploy are def a pain, sometimes it works, sometimes it doesn't [17:12:43] ottomata: and we probably need to do some manual steps if we changed a git submodule? [17:12:46] i am not totally sure of the magic incantation to make it work [17:13:09] like, you changed the submodules' sha that gets checked out? [17:13:09] ottomata: we agreed that sacrifying a goat was the right thing to do [17:13:19] this was deployed before w submodule succesfully? [17:13:56] !log rebooting labvirt1003 [17:13:58] I think we actually changed the submodule to point at another repo. SMalyshev can you confirm? [17:14:02] hm.... maybe it makes sense to remove gui dir on the target and re-checkout. 
because switching submodules was weird on tin too [17:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:14:07] gehel: yes [17:15:48] (03PS10) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [17:16:15] ottomata: you seem to understand this better than me... Can I just fix the submodules manually on the targets and expect the next deploy to work? [17:16:45] be back in 10 mins [17:16:47] PROBLEM - Host www.toolserver.org is DOWN: CRITICAL - Host Unreachable (www.toolserver.org) [17:18:49] eh, that is a labs instance itself [17:21:59] gehel: i doubt it [17:22:12] to make it easy on you gehel, i'd just remove the directories on both tin and the targets, if you can [17:22:20] then run pupppet on tin [17:22:22] and then on the targets [17:23:00] ottomata: removing the directory on target will mean downtime... not good. Unless I can do it one node after the other... [17:23:06] hola! we have a new domain: analytics.wikimedia.org and I was wondering how can we bust the varnish cache of it so our deployments are visible (cc bblack) [17:24:07] ottomata: to make your proposal work, should I delete the whole project directory? Or just the submodule? [17:25:36] gehel: you probably can [17:25:49] gehel: i'm not sure, but you could first try just the submodule [17:25:55] ottomata: I like the "probably"... [17:26:16] oh gehel, that probably can was meant for one node at a time [17:26:17] that would work [17:26:19] first tin [17:26:24] then each node on by one [17:26:26] would be fine [17:26:29] especially if you let puppet do it [17:26:59] if the dir is missing on the target, puppet's run of git deploy sync (or something) will notice and attempt to clone and checkout the version currently on tin [17:27:27] * SMalyshev back [17:27:30] ottomata: yep, the rolling restart should not be the issue... 
I'm checking to see if we have some transient files in the deployment directory before deleting anything, but that should work... [17:27:49] SMalyshev: and I thought this was too easy... [17:28:04] gehel: famous last words :) [17:28:27] hehe, gehel, you might as well try just deleting the submodule first [17:28:29] it might work! :) [17:28:34] make sure tin is in the correct state [17:28:42] tin was fine [17:28:44] delete the submodule on the target [17:29:08] still is as far as I can see [17:29:15] see if you can git deploy! :) [17:29:20] SMalyshev: just to check, do you see any issue with removing the whole deployment directory and starting from scratch again? It seems we have one log file in the deployment dir, and that would require a server restart ... [17:29:24] dunno though, hm it might be unhappy about an unclean working copy [17:29:31] (03PS1) 10CSteipp: Enable Ex:OATH on CentralAuth wikis, limited rights [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290271 (https://phabricator.wikimedia.org/T107605) [17:30:13] * gehel needs a coffee first, breaking production always goes better with coffee... [17:30:38] PROBLEM - puppet last run on cp4009 is CRITICAL: CRITICAL: puppet fail [17:34:01] (03PS2) 10Ori.livneh: Add get_hits_ratio calculation to memcached's gmond agent. [puppet] - 10https://gerrit.wikimedia.org/r/290233 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [17:34:27] (03CR) 10Ori.livneh: [C: 032 V: 032] Add get_hits_ratio calculation to memcached's gmond agent. 
[puppet] - 10https://gerrit.wikimedia.org/r/290233 (https://phabricator.wikimedia.org/T129963) (owner: 10Elukey) [17:35:06] !log putting wdqs1001 in maintenance to fix deployment issues [17:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:44:43] RECOVERY - Host www.toolserver.org is UP: PING OK - Packet loss = 0%, RTA = 2.82 ms [17:50:21] :) [17:51:14] (03PS4) 10Mobrovac: RESTBase: Set up rate limiting [puppet] - 10https://gerrit.wikimedia.org/r/290264 [17:52:58] (03CR) 10Mobrovac: RESTBase: Set up rate limiting (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [17:54:27] (03PS1) 10CSteipp: Enable Ex:OATHAuth on officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290278 (https://phabricator.wikimedia.org/T135889) [17:54:35] (03PS1) 10Catrope: Enable Flow beta feature on frwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290279 (https://phabricator.wikimedia.org/T135702) [17:55:04] (03CR) 10Mobrovac: "PCC looking good for PS4 too - https://puppet-compiler.wmflabs.org/2881/" [puppet] - 10https://gerrit.wikimedia.org/r/290264 (owner: 10Mobrovac) [17:55:24] RECOVERY - puppet last run on cp4009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:00:35] ottomata: since you seem to understand those deployments... how is the .gitmodule rewritten as part of the deploy? I have the submodule pointing to the wrong remote [18:02:20] (03PS1) 10Ori.livneh: Provision Diamond collector for Memcached [puppet] - 10https://gerrit.wikimedia.org/r/290282 [18:02:52] gehel: https://github.com/wikimedia/operations-puppet/blob/production/modules/deployment/files/modules/deploy.py#L316 [18:03:52] gehel: you might be able to fix the remote in .gitmodules manually yourself [18:04:33] ottomata: looking at the code, it should fail just the same at next deployment... 
[18:04:39] (03PS2) 10Ori.livneh: Provision Diamond collector for Memcached [puppet] - 10https://gerrit.wikimedia.org/r/290282 [18:04:53] gehel: oh if you modify it to something git doesn't expect? [18:04:58] (03CR) 10Ori.livneh: [C: 032 V: 032] Provision Diamond collector for Memcached [puppet] - 10https://gerrit.wikimedia.org/r/290282 (owner: 10Ori.livneh) [18:05:20] hm, gehel yeah i think you would do a git deploy first, so that at least the .gitmodules file matches the working copy, then maybe you can edit [18:05:21] dunno though [18:05:47] ottomata: we have a submodule name and path which are different, I'm checking the code to see if there isn't an assumption there... [18:05:58] OHHH [18:06:02] this gets really messy [18:06:15] i don't remember why or how, but i remember having a similar problem [18:06:17] we had it in scap too [18:06:24] i filed a bug report and i think they fixed it there [18:06:35] i wouldn't be surprised if it doesn't work for git deploy... [18:06:50] yep : https://github.com/wikimedia/operations-puppet/blob/production/modules/deployment/files/modules/deploy.py#L381 [18:07:01] module name and path should be the same... 
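[editor's note] The conclusion of the exchange above is that the deploy tooling seems to assume a submodule's name and its path are identical. A minimal sketch of how one might check a .gitmodules file for that mismatch before deploying is shown here. This is not the code from deploy.py; the function name, the sample submodule names, and the example.org URLs are all hypothetical and only illustrate the check.

```python
import re


def mismatched_submodules(gitmodules_text):
    """Return (name, path) pairs where a submodule's name differs from its path.

    Hypothetical helper for illustration only; it reads the plain
    .gitmodules text format and does not call git or the deploy tooling.
    """
    mismatches = []
    name = None
    for raw in gitmodules_text.splitlines():
        line = raw.strip()
        # Section header, e.g. [submodule "gui"]
        header = re.match(r'\[submodule "(.+)"\]$', line)
        if header:
            name = header.group(1)
            continue
        # The path option that belongs to the current section
        option = re.match(r'path\s*=\s*(.+)$', line)
        if option and name is not None:
            path = option.group(1).strip()
            if path != name:
                mismatches.append((name, path))
    return mismatches


# Made-up sample: the second submodule's name and path differ.
SAMPLE = '''[submodule "gui"]
\tpath = gui
\turl = https://example.org/gui.git
[submodule "wikidata-gui"]
\tpath = gui-deploy
\turl = https://example.org/gui-deploy.git
'''

print(mismatched_submodules(SAMPLE))
```

An empty result would suggest the name/path assumption holds for every submodule listed; any pair returned is a candidate for the kind of failure discussed above.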
[18:10:08] !log removing maintenance from wdqs1001 [18:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:22:01] (03PS1) 10Nuria: Ensure we pull latest on analytics.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/290284 (https://phabricator.wikimedia.org/T134506) [18:22:22] (03PS2) 10Nuria: Ensure we pull latest on analytics.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/290284 (https://phabricator.wikimedia.org/T134506) [18:23:29] (03CR) 10jenkins-bot: [V: 04-1] Ensure we pull latest on analytics.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/290284 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria) [18:24:03] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2320614 (10RobH) >>! In T131184#2299085, @EBernhardson wrote: > I'm thinking it will be simpler to give them a service cluster name, makes thi... 
[18:25:06] ebernhardson: So I'm not sure if its clear from the tasks, since they have a lot of moving bits, but the relforge systems are already onsite (as they were old restbase systems) and we're still pending the receiving of the upgraded disks and memory [18:25:27] i'll summarize on one shortly, just fyi =] [18:25:46] !log temporarily turning off pdns and recursor on holmium (https://phabricator.wikimedia.org/T106303) [18:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:26:33] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 06Labs, 10hardware-requests: eqiad: (2) Relevance forge servers - https://phabricator.wikimedia.org/T131184#2320621 (10EBernhardson) Service cluster documented [18:27:26] (03PS3) 10Ori.livneh: Ensure we pull latest on analytics.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/290284 (https://phabricator.wikimedia.org/T134506) (owner: 10Nuria) [18:32:45] PROBLEM - Check for gridmaster host resolution UDP on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [18:40:13] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 191 bytes in 0.009 second response time [18:41:17] (03PS3) 10Andrew Bogott: Rename holmium to labservices1002. 
[dns] - 10https://gerrit.wikimedia.org/r/255047 (https://phabricator.wikimedia.org/T106303) [18:42:20] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review, 07perfnotice: Support HTTP/2 - https://phabricator.wikimedia.org/T96848#2320679 (10Krinkle) [18:45:13] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 191 bytes in 0.009 second response time [18:45:14] (03PS1) 10Urbanecm: Typo in enwiki's extendedmovers settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290290 (https://phabricator.wikimedia.org/T133981) [18:46:03] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 1.541 second response time [18:46:21] andrewbogott: ^ is that related to holmium maybe? [18:46:41] unlikely but I'll check it out [18:46:48] you mean checker.tools.wmflabs.org right? [18:46:56] (03CR) 10Alex Monk: [C: 032] Typo in enwiki's extendedmovers settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290290 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [18:47:11] Can somebody deploy https://gerrit.wikimedia.org/r/290290 ? It's fix for T133981 . Thanks. 
[18:47:12] T133981: Creation of page mover userright - https://phabricator.wikimedia.org/T133981 [18:47:41] too slow Urbanecm :p [18:47:54] (03Merged) 10jenkins-bot: Typo in enwiki's extendedmovers settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290290 (https://phabricator.wikimedia.org/T133981) (owner: 10Urbanecm) [18:48:49] :) So thanks :) [18:49:01] !log krenair@tin Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/290290 (duration: 00m 38s) [18:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:49:31] Urbanecm, that fixed it [18:50:13] PROBLEM - check_listener_gc on thulium is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - string OK not found on https://127.0.0.1:443https://payments-listener.wikimedia.org/globalcollect - 191 bytes in 0.009 second response time [18:51:13] (03PS2) 10Dzahn: memcached: move logrotate,snapshot files into module [puppet] - 10https://gerrit.wikimedia.org/r/289981 [18:51:21] andrewbogott: yeah [18:52:04] RECOVERY - Check for gridmaster host resolution UDP on labs-ns1.wikimedia.org is OK: DNS OK - 0.061 seconds response time (tools-grid-master.tools.eqiad.wmflabs. 60 IN A 10.68.20.158) [18:52:22] (03CR) 10Dzahn: [C: 032] "thanks elukey" [puppet] - 10https://gerrit.wikimedia.org/r/289981 (owner: 10Dzahn) [18:52:29] YuviPanda: apparently it was, although I don't know how/why [18:52:33] RECOVERY - Puppet catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.479 second response time [18:53:04] andrewbogott: I guess the puppetmaster is hitting that somehow? 
[18:53:12] yeah [18:55:13] RECOVERY - check_listener_gc on thulium is OK: HTTP OK: HTTP/1.1 200 OK - 263 bytes in 0.011 second response time [18:57:41] (03PS11) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [18:59:35] (03CR) 10Ladsgroup: [C: 031] ores: Scap3 deployment configurations [puppet] - 10https://gerrit.wikimedia.org/r/280403 (owner: 10Ladsgroup) [18:59:54] PROBLEM - puppet last run on mc1017 is CRITICAL: CRITICAL: Puppet has 2 failures [19:01:14] PROBLEM - puppet last run on mc2001 is CRITICAL: CRITICAL: Puppet has 1 failures [19:01:35] PROBLEM - puppet last run on mc2005 is CRITICAL: CRITICAL: Puppet has 2 failures [19:01:43] i'm looking at that.. but i already confirmed it's noop [19:01:49] and puppet finishes fine [19:01:55] re: mc* [19:02:15] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [19:03:44] RECOVERY - puppet last run on mc2005 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [19:04:45] (03PS8) 10Dzahn: ircd: make check_ircd a critical (paging) icinga check [puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948) [19:05:34] RECOVERY - puppet last run on mc2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:06:11] (03CR) 10Dzahn: [C: 032] "As said in meeting, this wasn't the specific issue in this case but we can enable it anyways (and of course go back)." 
[puppet] - 10https://gerrit.wikimedia.org/r/290078 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn) [19:11:43] !log manually kicking stuck global renames (T135656) [19:11:44] T135656: GlobalRename is broken, presumably due to authmanager changes - https://phabricator.wikimedia.org/T135656 [19:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:38] (03PS12) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [19:27:04] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 4.074 second response time [19:27:42] (03PS1) 10Andrew Bogott: Revert "Remove labvirt1003 from the scheduler pool" [puppet] - 10https://gerrit.wikimedia.org/r/290296 [19:27:44] (03PS1) 10Dzahn: ocg: ocg1003 back to trusty installer [puppet] - 10https://gerrit.wikimedia.org/r/290297 [19:29:17] (03PS2) 10Andrew Bogott: Revert "Remove labvirt1003 from the scheduler pool" [puppet] - 10https://gerrit.wikimedia.org/r/290296 [19:30:04] PROBLEM - Puppet catalogue fetch on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:44] PROBLEM - Start a job and verify on Precise on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:04] PROBLEM - Start a job and verify on Trusty on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:12] PROBLEM - Test LDAP for query on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:54] ^ we are looking now [19:31:55] PROBLEM - NFS read/writeable on labs instances on checker.tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:34] (03PS2) 10Dzahn: ocg: ocg1003 back to trusty installer [puppet] - 10https://gerrit.wikimedia.org/r/290297 (https://phabricator.wikimedia.org/T84723) [19:32:45] RECOVERY - Puppet 
catalogue fetch on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 2.820 second response time [19:32:51] !log putting wdqs1001 in maintenance to fix deployment issues [19:33:25] (03PS13) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [19:33:44] RECOVERY - Test LDAP for query on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.229 second response time [19:34:24] RECOVERY - NFS read/writeable on labs instances on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.030 second response time [19:35:32] !log killed all mysqld process on Trusty CI slaves [19:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:37:55] (03CR) 10Andrew Bogott: [C: 032] Revert "Remove labvirt1003 from the scheduler pool" [puppet] - 10https://gerrit.wikimedia.org/r/290296 (owner: 10Andrew Bogott) [19:37:56] PROBLEM - WDQS HTTP on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 417 bytes in 0.001 second response time [19:39:16] PROBLEM - WDQS SPARQL on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 417 bytes in 0.039 second response time [19:40:30] ACKNOWLEDGEMENT - WDQS HTTP on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 417 bytes in 0.002 second response time daniel_zahn gehel !log putting wdqs1001 in maintenance to fix deployment issues [19:40:31] ACKNOWLEDGEMENT - WDQS SPARQL on wdqs1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 417 bytes in 0.039 second response time daniel_zahn gehel !log putting wdqs1001 in maintenance to fix deployment issues [19:41:24] jdlrobson: RoanKattouw: bunch of CI jobs failed due to an issue with mysql. Should be solved now and I have +2 the patches that failed [19:41:28] mutante: thanks! 
[19:41:31] Thanks [19:41:40] I found a browser bug in Chrome while trying to report this, video incoming :) [19:41:47] nicee [19:41:49] gehel: yw, it tells the #wikidata too [19:42:09] mutante: yep, I saw that... [19:42:46] RECOVERY - Start a job and verify on Trusty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.454 second response time [19:43:29] https://usercontent.irccloud-cdn.com/file/OMY9LTUS/chrome-textarea-scrollbars.mp4 [19:43:39] (03PS14) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [19:44:45] RECOVERY - Start a job and verify on Precise on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.333 second response time [19:52:56] PROBLEM - nova-compute process on labvirt1010 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [19:54:01] andrewbogott: ^ more fun or related? [19:54:28] no idea, I just noticed that [19:54:33] I can't imagine how it would relate [19:54:51] !log deployed latest WDQS version [19:54:57] PROBLEM - Check for gridmaster host resolution UDP on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [19:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:55:06] RECOVERY - nova-compute process on labvirt1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [19:55:35] !log putting wdqs1001 out of maintenance [19:55:36] PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL: CRITICAL - Plugin timed out while executing system call [19:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:56:24] andrewbogott: is "Recursive DNS on 208.80.154.20 is CRITICAL" you? 
could you silence it if so [19:56:45] yep [19:56:56] RECOVERY - Check for gridmaster host resolution UDP on labs-ns1.wikimedia.org is OK: DNS OK - 0.066 seconds response time (tools-grid-master.tools.eqiad.wmflabs. 60 IN A 10.68.20.158) [19:57:36] RECOVERY - Recursive DNS on 208.80.154.20 is OK: DNS OK: 0.079 seconds response time. www.wikipedia.org returns 208.80.154.224 [19:58:07] thanks lots happening hard to keep track atm [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160523T2000). Please do the needful. [20:05:15] PROBLEM - puppet last run on iridium is CRITICAL: CRITICAL: puppet fail [20:05:55] RECOVERY - Getent speed check on labstore1001 is OK: OK: getent group returns within a second [20:06:00] i wonder how it's possible that iridum has a puppet fail AND says puppet is disabled in motd [20:06:03] pick one [20:06:06] looks [20:07:12] (03PS3) 10Gehel: Point CirrusSearch alerting at more useful metrics [puppet] - 10https://gerrit.wikimedia.org/r/290262 (https://phabricator.wikimedia.org/T134852) (owner: 10EBernhardson) [20:07:24] RECOVERY - puppet last run on iridium is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:08:06] i did nothing except run puppet and there were no errors [20:09:18] (03CR) 10Gehel: [C: 032 V: 032] Point CirrusSearch alerting at more useful metrics [puppet] - 10https://gerrit.wikimedia.org/r/290262 (https://phabricator.wikimedia.org/T134852) (owner: 10EBernhardson) [20:10:25] PROBLEM - Host labsdb1008 is DOWN: PING CRITICAL - Packet loss = 100% [20:10:39] ACKNOWLEDGEMENT - Last backup of the maps filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-maps was exit-code cpettet known, should be fixed waiting on an overnight cycle to call it [20:10:52] looking at labsdb^^^ [20:10:56] ACKNOWLEDGEMENT - Last backup of 
the others filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-others was exit-code cpettet known, should be fixed waiting on an overnight cycle to call it [20:10:56] ACKNOWLEDGEMENT - Last backup of the tools filesystem on labstore1001 is CRITICAL: CRITICAL - Last run result for unit replicate-tools was exit-code cpettet known, should be fixed waiting on an overnight cycle to call it [20:15:02] (03PS15) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [20:18:08] !log starting mobileapps deploy [20:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:22:15] RECOVERY - Host labsdb1008 is UP: PING OK - Packet loss = 0%, RTA = 0.83 ms [20:22:40] (03PS1) 10Andrew Bogott: Don't override labsdnsconfig for labs. [puppet] - 10https://gerrit.wikimedia.org/r/290314 [20:23:37] 06Operations, 07HHVM, 07User-notice: Switch HAT appservers to trusty's ICU (or newer) - https://phabricator.wikimedia.org/T86096#2321028 (10hashar) Related, I have switched the CI instances to HHVM just after @Joe switched deployment-prep. [20:23:39] (03CR) 10Andrew Bogott: [C: 032] Don't override labsdnsconfig for labs. [puppet] - 10https://gerrit.wikimedia.org/r/290314 (owner: 10Andrew Bogott) [20:30:24] PROBLEM - puppet last run on palladium is CRITICAL: CRITICAL: puppet fail [20:32:25] (03PS16) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [20:32:35] RECOVERY - puppet last run on palladium is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:33:53] (03PS1) 10Andrew Bogott: Revert "Don't override labsdnsconfig for labs." [puppet] - 10https://gerrit.wikimedia.org/r/290346 [20:34:31] (03CR) 10Andrew Bogott: [C: 032 V: 032] Revert "Don't override labsdnsconfig for labs." 
[puppet] - 10https://gerrit.wikimedia.org/r/290346 (owner: 10Andrew Bogott) [20:36:21] !log mobileapps deployed cd76f5a [20:36:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:41:08] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 6483771 keys - replication_delay is 1 [20:42:48] (03PS1) 10Dzahn: let chromium use jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/290347 [20:43:01] (03PS1) 10Alex Monk: scap: add labtestwikitech to mediawiki-installation group [puppet] - 10https://gerrit.wikimedia.org/r/290348 [20:45:40] should labtestweb2001 have IPv6? silver has it [20:45:57] just because that seems a difference between otherwise identical servers [20:46:14] (03PS1) 10Andrew Bogott: Labs instances: Use labs-recursor0 as the primary. [puppet] - 10https://gerrit.wikimedia.org/r/290350 [20:48:42] (03PS1) 10Dzahn: labtestweb2001: add IPv6 like on silver [puppet] - 10https://gerrit.wikimedia.org/r/290351 [20:50:16] (03PS2) 10Dzahn: labtestweb2001: add IPv6 like on silver [puppet] - 10https://gerrit.wikimedia.org/r/290351 [20:51:47] (03CR) 10Andrew Bogott: [C: 032] Labs instances: Use labs-recursor0 as the primary. 
[puppet] - 10https://gerrit.wikimedia.org/r/290350 (owner: 10Andrew Bogott) [20:53:06] (03CR) 10Dzahn: "it might have to be added to network.pp too:" [puppet] - 10https://gerrit.wikimedia.org/r/290348 (owner: 10Alex Monk) [20:53:54] (03CR) 10Dzahn: "392 } elsif $::realm == 'labtest' {" [puppet] - 10https://gerrit.wikimedia.org/r/290348 (owner: 10Alex Monk) [20:59:29] PROBLEM - Check for gridmaster host resolution UDP on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [21:01:22] andrewbogott: ^ I put in downtown for 2 hours [21:01:35] chasemp: thanks [21:01:39] so many hostnames associated with that :/ [21:13:25] (03PS13) 10Andrew Bogott: Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 [21:14:11] (03CR) 10jenkins-bot: [V: 04-1] Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 (owner: 10Andrew Bogott) [21:14:48] (03PS14) 10Andrew Bogott: Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 [21:17:51] (03PS15) 10Andrew Bogott: Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 [21:18:18] (03CR) 10jenkins-bot: [V: 04-1] Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 (owner: 10Andrew Bogott) [21:19:33] (03CR) 10jenkins-bot: [V: 04-1] Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 (owner: 10Andrew Bogott) [21:21:14] (03PS16) 10Andrew Bogott: Rename holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 [21:23:41] (03CR) 10Andrew Bogott: [C: 032] Rename holmium to labservices1002. 
[dns] - 10https://gerrit.wikimedia.org/r/255047 (https://phabricator.wikimedia.org/T106303) (owner: 10Andrew Bogott) [21:32:01] (03PS17) 10Andrew Bogott: Rename Holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 [21:34:01] (03CR) 10Andrew Bogott: [C: 032] Rename Holmium to labservices1002 [puppet] - 10https://gerrit.wikimedia.org/r/254465 (owner: 10Andrew Bogott) [21:34:57] (03PS5) 10Dzahn: ircecho: add icinga process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948) [21:35:21] (03CR) 10Dzahn: "removed the irc server check, kept the irc bot check" [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn) [21:36:48] !log reimaging holmium to labservices1002. [21:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:37:29] (03PS6) 10Dzahn: ircecho: add icinga process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948) [21:37:41] (03CR) 10Dzahn: [C: 032] ircecho: add icinga process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/290077 (https://phabricator.wikimedia.org/T135948) (owner: 10Dzahn) [21:41:41] (03PS17) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [21:45:29] Request from 90.180.83.194 via cp3009 cp3009, Varnish XID 71655334 [21:45:29] Error: 503, Backend fetch failed at Mon, 23 May 2016 21:45:04 GMT [21:45:47] (phabricator) [21:49:25] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Puppet has 2 failures [21:51:36] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [21:51:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [21:52:15] 06Operations, 10DNS, 10Traffic, 
10Wikimedia-Apache-configuration, 10Wikimedia-General-or-Unknown: m.{project}.org portal/redirect consistency - https://phabricator.wikimedia.org/T78421#2321287 (10Jdlrobson) [21:52:18] 06Operations, 10MediaWiki-extensions-ZeroBanner, 06Reading-Web-Backlog, 10Traffic, and 4 others: m.wikipedia.org incorrectly redirects to en.m.wikipedia.org - https://phabricator.wikimedia.org/T69015#2321285 (10Jdlrobson) 05Open>03stalled Stalled per https://phabricator.wikimedia.org/T69015#2248241 [21:56:30] 06Operations, 10Ops-Access-Requests, 06Services: Expand sc-admins to provide sufficient coverage for sc* clusters - https://phabricator.wikimedia.org/T135548#2321301 (10RobH) I'm not 100% clear on the meeting result from today (my hangouts decided to lag and jitter). It seems there was no objection, but I'l... [21:57:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:58:15] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:05:57] PROBLEM - Check for gridmaster host resolution UDP on labs-ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:06:28] PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL: CRITICAL - Plugin timed out while executing system call [22:09:03] PROBLEM - mysqld processes on labservices1002 is CRITICAL: Connection refused or timed out [22:09:21] ^ new host, I'm marking it for downtime now [22:09:27] (03PS18) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747) [22:09:32] what's up guys? 
[22:10:07] volans: it's just a newly-imaged host having growing pains, you can disregard [22:10:10] sorry for the page [22:10:44] ok, we should really find a solution for newly added services, like adding them in downtime already :) [22:10:51] yeah [22:11:04] no prob, I was still on the laptop ;) [22:16:38] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:19:07] RECOVERY - Recursive DNS on 208.80.154.20 is OK: DNS OK: 0.055 seconds response time. www.wikipedia.org returns 208.80.154.224 [22:28:14] mutante: I'm told irc.wikimedia.org was restarted again? [22:29:13] Never mind, I found the tasks. [22:30:16] 06Operations, 10DBA, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2312863 (10Mattflaschen-WMF) >>! In T135851#2312924, @jcrespo wrote: > 1) document this limitation on mediawiki.org and recommend "best practices" Filed as {T136045}. I'm n... [22:31:57] 06Operations, 10DBA, 10MediaWiki-Database: Preserve InnoDB table auto_increment on restart - https://phabricator.wikimedia.org/T135851#2321418 (10Mattflaschen-WMF) [22:36:36] (03PS1) 10Foks: Adding Support and Safety and update OIT Bug: T136046 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) [22:38:07] PROBLEM - Host 208.80.154.20 is DOWN: CRITICAL - Host Unreachable (208.80.154.20) [22:39:52] ^ labs-recursor [22:52:51] (03PS2) 10Foks: Adding Support and Safety global user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) [22:53:26] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80707 MB (15% inode=99%) [22:54:16] (03CR) 10Alex Monk: [C: 04-1] "These aren't global groups, those are managed on-wiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 
10Foks) [22:54:50] (03CR) 10Alex Monk: Adding Support and Safety global user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [22:56:37] Is anyone able to copy a file around on the fundraising cluster? [22:56:46] RECOVERY - Host 208.80.154.20 is UP: PING OK - Packet loss = 0%, RTA = 1.27 ms [22:57:08] It should be straightforward. And would save us a lot of overnight failure emails [22:59:47] RECOVERY - Disk space on elastic1016 is OK: DISK OK [23:00:04] RoanKattouw ostriches Krenair MaxSem Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160523T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [23:01:19] I'll be monitoring RoanKattouw's SWAT. [23:03:43] Hello. [23:04:55] (03PS3) 10Alex Monk: Adding WMF Support and Safety user groups to meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [23:06:00] matt_flaschen: okay, let's go [23:06:45] 06Operations, 06Labs, 10Labs-Infrastructure, 13Patch-For-Review: Update tag and racktables for holmium: rename to labservices1002. 
- https://phabricator.wikimedia.org/T119533#2321490 (10Andrew) a:05Andrew>03None [23:08:27] (03PS2) 10Dereckson: Enable Flow beta feature on frwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290279 (https://phabricator.wikimedia.org/T135702) (owner: 10Catrope) [23:08:36] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290279 (https://phabricator.wikimedia.org/T135702) (owner: 10Catrope) [23:09:34] (03Merged) 10jenkins-bot: Enable Flow beta feature on frwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290279 (https://phabricator.wikimedia.org/T135702) (owner: 10Catrope) [23:09:52] (03CR) 10Legoktm: [C: 04-1] "globalgroupmembership has to be a global permission from a global group, it can't be a local wiki permission." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290366 (https://phabricator.wikimedia.org/T136046) (owner: 10Foks) [23:12:34] err, why am I on the lists again? [23:17:07] MaxSem: maybe bad copy/paste? I just copy the previous week's entries [23:24:07] PROBLEM - puppet last run on mw2163 is CRITICAL: CRITICAL: Puppet has 1 failures [23:24:33] !log dereckson@tin Synchronized php-1.28.0-wmf.2/extensions/Flow/includes/Data/Index/BoardHistoryIndex.php: More reliable post sorting (T119509, 1/2) (duration: 00m 26s) [23:24:34] T119509: Cleanup ptwikibooks conversion - https://phabricator.wikimedia.org/T119509 [23:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:26:06] !log dereckson@tin Synchronized php-1.28.0-wmf.2/extensions/Flow/maintenance/FlowRemoveOldTopics.php: More reliable post sorting (T119509, 2/2) (duration: 00m 34s) [23:26:07] T119509: Cleanup ptwikibooks conversion - https://phabricator.wikimedia.org/T119509 [23:26:09] matt_flaschen: here you are for the Flow change, please test ^ [23:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:32:17] matt_flaschen: all is fine? 
[23:32:45] Dereckson, we're testing now. [23:32:49] k [23:33:43] Dereckson, looks good. [23:33:53] Dereckson, you're going to do the frwikivoyage in a bit? [23:33:56] (03PS1) 10Kaldari: Set $wgSpamBlacklistEventLogging to true on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/290370 [23:35:36] k [23:35:40] right now: [23:36:07] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable Flow beta feature on frwikivoyage (T135702) (duration: 00m 25s) [23:36:08] T135702: Enable Flow on fr Wikivoyage as a Beta Feature - https://phabricator.wikimedia.org/T135702 [23:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:36:19] matt_flaschen: here you're for frwikivoyage ^ [23:40:14] Dereckson, thanks, it works. [23:40:43] Nice. Thanks for testing. [23:41:08] On T119509, there are several steps noted, is there something to do right now? [23:41:08] T119509: Cleanup ptwikibooks conversion - https://phabricator.wikimedia.org/T119509 [23:42:37] Dereckson: AFAIK we just needed that to fix a bug in one of those steps [23:42:48] So nothing right now [23:43:59] Okay. [23:47:06] RECOVERY - puppet last run on mw2163 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:58:40] (03PS19) 1020after4: keyholder key cleanup [puppet] - 10https://gerrit.wikimedia.org/r/289236 (https://phabricator.wikimedia.org/T132747)