[00:37:52] PROBLEM - puppet last run on ganeti2004 is CRITICAL puppet fail
[00:54:42] RECOVERY - puppet last run on ganeti2004 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures
[02:21:57] !log l10nupdate Synchronized php-1.26wmf7/cache/l10n: (no message) (duration: 06m 35s)
[02:22:15] Logged the message, Master
[02:27:06] !log LocalisationUpdate completed (1.26wmf7) at 2015-06-01 02:26:03+00:00
[02:27:12] Logged the message, Master
[02:43:07] !log l10nupdate Synchronized php-1.26wmf8/cache/l10n: (no message) (duration: 05m 37s)
[02:43:12] Logged the message, Master
[02:47:36] !log LocalisationUpdate completed (1.26wmf8) at 2015-06-01 02:46:32+00:00
[02:47:40] Logged the message, Master
[03:47:53] PROBLEM - puppet last run on mw2023 is CRITICAL puppet fail
[04:06:23] RECOVERY - puppet last run on mw2023 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures
[04:15:03] PROBLEM - puppet last run on db2069 is CRITICAL puppet fail
[04:27:54] (PS15) KartikMistry: CX: Log to logstash [puppet] - https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265)
[04:30:22] (PS16) KartikMistry: CX: Log to logstash [puppet] - https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265)
[04:31:53] RECOVERY - puppet last run on db2069 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:40:53] PROBLEM - puppet last run on mw2198 is CRITICAL puppet fail
[04:59:13] RECOVERY - puppet last run on mw2198 is OK Puppet is currently enabled, last run 27 seconds ago with 0 failures
[05:19:21] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jun 1 05:18:18 UTC 2015 (duration 18m 17s)
[05:19:25] Logged the message, Master
[06:13:40] operations, Analytics-Cluster, Fundraising Sprint Kraftwerk, Fundraising Sprint Lou Reed: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1324871 (AndyRussG) Hi!
I've checked this by comparing, from erbium: `/a/log/fundraising/logs/buffer/2015/bannerImp...
[06:20:57] operations, Patch-For-Review, database: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1324890 (jcrespo) a:jcrespo>Dzahn I have dropped the databases from the full m1 hierarchy of live data (db1001, db1016 and db2010). ``` DROP DATABASE cont...
[06:26:03] PROBLEM - puppet last run on cp3045 is CRITICAL puppet fail
[06:30:13] PROBLEM - puppet last run on mc2011 is CRITICAL puppet fail
[06:31:12] PROBLEM - puppet last run on cp1058 is CRITICAL puppet fail
[06:32:42] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 2 failures
[06:32:43] PROBLEM - puppet last run on cp3042 is CRITICAL Puppet has 1 failures
[06:32:43] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 1 failures
[06:33:04] PROBLEM - puppet last run on ms-fe2003 is CRITICAL Puppet has 1 failures
[06:34:03] PROBLEM - puppet last run on mw2017 is CRITICAL Puppet has 1 failures
[06:34:23] PROBLEM - puppet last run on mw2212 is CRITICAL Puppet has 1 failures
[06:34:43] PROBLEM - puppet last run on tin is CRITICAL Puppet has 2 failures
[06:34:53] PROBLEM - puppet last run on mw2143 is CRITICAL Puppet has 1 failures
[06:35:02] PROBLEM - puppet last run on mw2113 is CRITICAL Puppet has 1 failures
[06:35:03] PROBLEM - puppet last run on mw1235 is CRITICAL Puppet has 1 failures
[06:35:12] PROBLEM - puppet last run on mw2097 is CRITICAL Puppet has 1 failures
[06:35:12] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures
[06:35:13] PROBLEM - puppet last run on mw2022 is CRITICAL Puppet has 1 failures
[06:35:13] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures
[06:37:52] RECOVERY - puppet last run on cp1058 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures
[06:39:22] ^ checked, np
[06:40:12] RECOVERY - puppet last run on mw1235 is OK Puppet is currently enabled, last
run 1 minute ago with 0 failures
[06:41:20] <_joe_> jynus: It's the usual mod_passenger restart thingie
[06:41:33] * _joe_ makes the logrotate dance
[06:42:52] RECOVERY - puppet last run on cp3045 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures
[06:46:13] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:46:43] RECOVERY - puppet last run on mw2143 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures
[06:46:43] RECOVERY - puppet last run on mw2113 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures
[06:47:02] RECOVERY - puppet last run on mw2022 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:47:02] RECOVERY - puppet last run on mc2011 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures
[06:47:54] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures
[06:47:54] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:47:54] RECOVERY - puppet last run on mw2212 is OK Puppet is currently enabled, last run 50 seconds ago with 0 failures
[06:48:12] RECOVERY - puppet last run on tin is OK Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:48:13] RECOVERY - puppet last run on ms-fe2003 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:42] RECOVERY - puppet last run on mw2097 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:42] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:48:43] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:49:13] RECOVERY - puppet last run on mw2017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:23:51] (PS1)
Jcrespo: Adding an alias (m5-slave) to db1009 [dns] - https://gerrit.wikimedia.org/r/214993
[07:29:39] (PS1) Jcrespo: Backups of m5 300MB of data, as per request (T92693) Depends on https://gerrit.wikimedia.org/r/#/c/214993/ [puppet] - https://gerrit.wikimedia.org/r/214994
[07:35:39] (CR) Jcrespo: "Checking with you, Sean, because you were restructuring the backups/dbstore1." [puppet] - https://gerrit.wikimedia.org/r/214994 (owner: Jcrespo)
[07:59:05] !log restbase restart cassandra on restbase1001
[07:59:08] Logged the message, Master
[08:00:30] !log restbase restart cassandra on restbase1002
[08:00:39] Logged the message, Master
[08:05:41] !log restbase restart cassandra on restbase1003
[08:05:46] Logged the message, Master
[08:07:38] !log restbase restart cassandra on restbase1004
[08:07:42] Logged the message, Master
[08:09:39] !log restbase restart cassandra on restbase1005
[08:09:43] Logged the message, Master
[08:12:07] !log restbase restart cassandra on restbase1006
[08:12:11] Logged the message, Master
[08:18:14] !log Jenkins: upgrading git plugin from 1.5.0 to latest
[08:18:18] Logged the message, Master
[08:18:27] (PS1) Jcrespo: Add cnwikimedia to the list of wikis on labs [puppet] - https://gerrit.wikimedia.org/r/214995
[08:19:22] ^diff never ceases to surprise me
[08:21:09] (CR) Jcrespo: "Related: https://gerrit.wikimedia.org/r/#/c/214995/1" [puppet] - https://gerrit.wikimedia.org/r/214718 (https://phabricator.wikimedia.org/T96638) (owner: Tim Landscheidt)
[08:21:50] (CR) jenkins-bot: [V: -1] Add cnwikimedia to the list of wikis on labs [puppet] - https://gerrit.wikimedia.org/r/214995 (owner: Jcrespo)
[08:24:12] (PS2) Jcrespo: Add cnwikimedia to the list of wikis on labs [puppet] - https://gerrit.wikimedia.org/r/214995
[08:25:11] (CR) Mobrovac: "@Filippo, yep, tested and confirmed to work."
[puppet] - https://gerrit.wikimedia.org/r/213530 (https://phabricator.wikimedia.org/T99564) (owner: Mobrovac)
[08:29:20] operations, Labs, database: Santitize recent wikis: wikimania 2016 and cn.wikimedia.org at labs dbs - https://phabricator.wikimedia.org/T100441#1325049 (jcrespo) Thank you, @scfc, I will wait for @core to see how to proceed (either me or him does it). He may want to chime in due to related bug T96638. H...
[08:30:50] hi all, is there anyone here who could help by shutting down a GWToolset job?
[08:38:47] (CR) Mobrovac: CX: Log to logstash (1 comment) [puppet] - https://gerrit.wikimedia.org/r/213840 (https://phabricator.wikimedia.org/T89265) (owner: KartikMistry)
[08:51:08] chandres, where is that running?
[08:51:59] hi jynus , not sure I can answer, I’m not a really a tech guy :-) but the loh is this one https://commons.wikimedia.org/w/index.php?title=Special:Log&type=gwtoolset
[08:52:05] loh=log
[08:52:36] I think I saw that last week, it was saturating uplads to commons
[08:52:43] since thursday our server is overload and cannot recover while GWToolset is still doing requests
[08:52:58] possible, we have done a mistake by laucnhing two process at the same time :-(
[08:53:12] :-), no prob
[08:54:01] is it possible to stop this GWToolset task?
[08:54:50] akosiaris: https://gerrit.wikimedia.org/r/#/c/213840/16/modules/cxserver/templates/config.erb - is that ok to comment out stdout there?
[08:55:43] which url did you used to launch it, chandres?
[08:56:36] you mean the url of the files? the xml? sorry for my stupid questions :-(
[08:57:18] I am just not familiar with the extension, did you just uploaded the xml to commons?
[08:57:42] yes, it is uploaded during the process , here for me https://commons.wikimedia.org/wiki/GWToolset:Metadata_Mappings/Neuch%C3%A2tel_Herbarium/Penard.json
[08:58:08] here is the page where I do the mapping between the XML files and the Commons template https://commons.wikimedia.org/wiki/Special:GWToolset
[08:58:45] but I don’t know where the xml is stored, if it is on commons
[08:58:53] <_joe_> jynus: I suspect GWToolset uses the jobqueue, if so the only way to stop it is some mwscript magic
[08:59:26] hum, doesn’t look it is an easy solution? :-)
[08:59:49] <_joe_> chandres: nope I don't think so
[09:00:13] <_joe_> chandres: but I know exactly zero about GWToolset apart from "sometimes screws up commons" :)
[09:00:24] :-) oups
[09:00:51] ok, I’ will update the people on the GWToolset mailing to see if someone has already done that, but for the moment the only answer was « see with wikimedia-operations » :-)
[09:01:11] <_joe_> chandres: yes I'm searching the docs now
[09:03:44] <_joe_> chandres: do you have the URL of the XML file we should remove/stop processing of?
[09:04:02] or at least a name
[09:04:41] actually, the xml is on my computer and I don’t find it on commons, it’s named Penard.xml
[09:06:05] <_joe_> chandres: ok don't worry, it has been uploaded to swift, I'll try to look into this
[09:06:37] thanks a lot _joe_ and jynus !
[09:08:16] <_joe_> chandres: oh no promises of solving something :) I'm trying to figure out things as I go
[09:08:54] it’s already better than just waiting that GWToolset goes along the 8000 entries of the xml :-(
[09:09:10] <_joe_> chandres: I see most are failing
[09:10:17] yes, since thursday, it really looks like that at a moment our server was overloaded, and since is not able to « recover »
[09:10:35] https://phabricator.wikimedia.org/T87040
[09:10:38] more than 3300 have been done without problem, but now….
[09:10:42] I might try the cmmandlisted there:
[09:10:56] redis-cli -a $REDISPWD -h rdb1003 < /home/tgr/redis-gwtoolset-clear.txt if people don't mind
[09:11:12] _joe_:
[09:11:26] please!
[09:11:28] <_joe_> apergos: that would mean removing the whole gwtoolset history?
[09:11:34] yes I believe so
[09:11:43] <_joe_> apergos: lemme take a look at that txt file :)
[09:12:06] can we see how many jobs are currently running, aside from this one?
[09:12:22] <_joe_> jynus: yep
[09:12:52] i'm about to deploy graphoid service - has been misbehaving over the weekend. Should not affect any of the MW, only sca100x
[09:14:05] it has been running in the betacluster for the weekend, seems ok
[09:14:06] <_joe_> yurik: go on, np
[09:14:28] <_joe_> yurik: of course if it pages ops, you'll notice :P
[09:14:39] hehe :)
[09:14:55] * yurik notes to self not to enable ops paging in his services
[09:15:45] :-D
[09:17:52] @jynus I haven’t seen another job for GWToolset than mine in the last days, and the last 2000 entries in GWToolset are all mine
[09:22:19] operations, discovery-system, services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1325081 (GWicke) I know this is bikesheddy, but you asked for it ;) Nomenclature: - **dcs**: datacenters; ex: `eqiad` - **services**: functional ser...
[09:30:12] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 605
[09:30:58] this has been passed off to me, I'm going to have to dig around a while
[09:31:26] <_joe_> jynus: assume that's you?
^^
[09:31:50] yep, looking at db1008
[09:32:41] !log deployed latest graphoid service to sca100x
[09:32:45] Logged the message, Master
[09:32:48] nope, not me-fundraising
[09:35:12] RECOVERY - check_mysql on db1008 is OK: Uptime: 3962975 Threads: 1 Questions: 13694015 Slow queries: 26305 Opens: 63111 Flush tables: 2 Open tables: 64 Queries per second avg: 3.455 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:51:56] chandres, what day did you start those gwtoolset jobs?
[09:52:12] tuesday
[09:52:28] last tuesday
[09:55:10] operations, discovery-system, services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1325122 (Joe) @GWicke I think you have a point in considering a (host, port) tuple as a possible future expansion of the model, but I like to have a g...
[10:01:11] operations, RESTBase-Cassandra: configure less aggressive cassandra log rotation - https://phabricator.wikimedia.org/T100970#1325145 (Eevans) NEW
[10:18:47] chandres, sorry to take so long but I've found three metadata xml files of yours and they are all pretty short
[10:19:04] certainly not 8k entries
[10:21:16] operations, discovery-system, services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1325217 (GWicke) > Apart from that, right now pybal does not support defining an host:port combination for the backends separately for each instance,...
[10:21:21] chandres_ did you do this as you, i.e. User:Chandres ?
[10:21:33] operations, discovery-system, services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1325218 (GWicke) > Apart from that, right now pybal does not support defining an host:port combination for the backends separately for each instance,...
[10:24:51] chandres: what user did you do this as?
[10:24:56] User:Chandres ?
[10:27:46] I cannot remember if it was user:chandres or user:Neuchâtel Herbarium :-(
[10:27:56] alternatively, chandres, is the first entry in your xml http://lan.wikimedia.ch/penard/Collection_Penard_MHNG_Specimen_05bis-3-2.tif ??
[10:28:32] so it should be this account I used
[10:28:38] do you want the original xml?
[10:28:50] no just give me the first item, pastebin it or something
[10:28:57] I can check to see if the xml we found is the right one
[10:29:02] maybe!
[10:29:25] dn't need all 8500 entries :-)
[10:29:53] :-)
[10:29:56]
[10:29:57]
[10:29:59]
[10:30:00] Penard
[10:30:02] Rösel
[10:30:03] http://lan.wikimedia.ch/penard/Collection_Penard_MHNG_Specimen_05bis-3-2.tif
[10:30:05] Amoeba proteus
[10:30:06] Natural History Museum of Geneva
[10:30:08] YES
[10:30:08] Amoeba from the Collection Penard MHNG
[10:30:09] Thierry Arnet
[10:30:10] {{Pénard}}
[10:30:11] Collection_Penard_MHNG_Specimen_05bis-3-2_Amoeba proteus
[10:30:12]
[10:30:13] found it, going to delete it right now
[10:31:10] http://pastebin.com/cKsZYV2F
[10:31:29] gone.
[10:31:44] now there's the matter of the job queue entries. ugh
[10:33:37] the errors are going down on the logs
[10:34:43] I have to travel to Bern, will be back in 1 hour, thanks a lot !!!
[10:34:50] veeery slowly
[10:36:32] btw, Thank you to everyone working on this.
I'm sorry that gwtoolset has no cancel job button or (probably) no docs to speak of for what to do when things go wrong
[10:40:12] bawolff, consider this a bug report :-)
[10:40:33] jynus: lol, I will
[10:41:07] * bawolff shifts blame - not my tool, I just somewhat inherited dealing with bug reports on it ;)
[10:41:47] but yes, we desperately need some sane way of dealing with situations like this
[10:48:00] I suffered from confirmation bias, no change in error rate (it could be the queue)
[10:54:35] jynus: Well for starters I filed https://phabricator.wikimedia.org/T100972
[10:58:20] ok, that should have cleaned it up
[10:59:28] there should be a script that, given the xml file name and the user account, can 1) find it, 2) find the redis job queue entrie(s), 3) delete the swift copy of the xml file, 4) delete all related baggage from redis
[11:01:24] maybe copying and pasting exactly that on T100972, plus a link to such a tool on the extension description page
[11:03:28] copy and pasted
[11:27:52] <_joe_> bawolff: regarding the docs, I guess we can bake up something.
[11:45:36] back, I just read the log of the room and will try to do my best to help fill the bug report
[11:46:34] chandres: Well I already filed the report at https://phabricator.wikimedia.org/T100972, I don't think there's anything more you need to do (But if you see any missing details, feel free to add them)
[11:49:25] thanks bawolff I will do like this
[12:01:21] (PS1) GWicke: Add basic alerts on RESTBase error rates and storage latencies [puppet] - https://gerrit.wikimedia.org/r/215004 (https://phabricator.wikimedia.org/T78514)
[12:02:34] operations, HTTPS, Patch-For-Review: replace git's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100827#1325380 (Chmarkine) git.wikimedia.org is behind misc-web. Is this cert still needed?
[12:03:05] (CR) GWicke: "How would we target those alerts to the services alert group?"
[puppet] - https://gerrit.wikimedia.org/r/215004 (https://phabricator.wikimedia.org/T78514) (owner: GWicke)
[12:12:56] (PS1) Andrew Bogott: Add service IP for labs-recursor0.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/215006
[12:22:33] !log added firmware-nonfree 0.44~wmf1 for jessie-wikimedia on carbon
[12:22:37] Logged the message, Master
[12:23:14] operations: Backport & test firmware-linux 0.44 - https://phabricator.wikimedia.org/T100771#1325409 (MoritzMuehlenhoff) I built and imported 0.44~wmf1 to apt.wikimedia.org. I also compared the hash values of the fw files to Brandon's previous bnx2x package and they're identical. The tg3 controller on baham d...
[12:25:44] PROBLEM - puppet last run on eventlog1001 is CRITICAL Puppet has 6 failures
[12:28:06] (PS8) Andrew Bogott: Added a simple IP-aliasing script for the pdns recursor. [puppet] - https://gerrit.wikimedia.org/r/211059
[12:29:59] (PS9) Andrew Bogott: Added a simple IP-aliasing script for the pdns recursor. [puppet] - https://gerrit.wikimedia.org/r/211059
[12:31:22] operations, Analytics-Cluster: Build Kafka 0.8.1.1 package for Jessie and upgrade Brokers to Jessie. - https://phabricator.wikimedia.org/T98161#1325431 (Ottomata) I think we should either backport or stick them manually in lib/. I don't think we should use Archiva for building this if we don't have to.
[12:31:23] (PS10) Andrew Bogott: Added a labs-specific dns recursor.
[puppet] - https://gerrit.wikimedia.org/r/211059
[12:35:58] operations, discovery-system, services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1325438 (GWicke) (removed double-post from shaky cell phone connection on train)
[12:42:32] RECOVERY - puppet last run on eventlog1001 is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures
[12:47:35] aude, hey
[12:47:48] fyi I am deploying https://gerrit.wikimedia.org/r/#/c/215009/ now, hopefully before your window
[12:56:17] (CR) Andrew Bogott: [C: 2] Add service IP for labs-recursor0.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/215006 (owner: Andrew Bogott)
[12:59:01] operations, Analytics, Security, Traffic: Purge > 90 days stat1002:/a/squid/archive/api - https://phabricator.wikimedia.org/T92338#1325520 (Ottomata) AFAIK, not intentionally, but who knows what kind of stuff users send in POST data.
[12:59:46] operations, Tracking: Make ircecho much better (Tracking) - https://phabricator.wikimedia.org/T95052#1325526 (MoritzMuehlenhoff)
[12:59:48] operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#1325525 (MoritzMuehlenhoff)
[12:59:50] operations, Patch-For-Review: Convert ircecho init script to a systemd unit - https://phabricator.wikimedia.org/T95055#1325524 (MoritzMuehlenhoff) Open>Resolved
[13:00:05] aude: Dear anthropoid, the time has come. Please deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150601T1300).
[13:00:34] !log krenair Synchronized php-1.26wmf8/extensions/WikimediaMessages/WikimediaMessages.hooks.php: https://gerrit.wikimedia.org/r/#/c/215011/ - fix EditPageCopyrightWarning (duration: 00m 16s)
[13:00:38] Logged the message, Master
[13:01:17] (done)
[13:01:55] Krenair: ok :)
[13:05:09] (PS1) Aude: Enable Wikibase arbitrary access on wikisource and itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/215012 (https://phabricator.wikimedia.org/T98756)
[13:14:36] gilles, James_F|Away: Planning to merge all those extension changes and prepare an extension-bump for core? You could even merge the extension-bumps into 214741, I'd think.
[13:15:09] anomie: I shall do that right now
[13:20:51] (CR) Andrew Bogott: [C: 2] Added a labs-specific dns recursor. [puppet] - https://gerrit.wikimedia.org/r/211059 (owner: Andrew Bogott)
[13:27:38] * aude waits patiently for jenkins
[13:31:26] !log aude Synchronized php-1.26wmf8/extensions/Wikidata: css compatibility fixes for wmf8 (duration: 00m 24s)
[13:31:34] Logged the message, Master
[13:33:35] anomie: done
[13:34:30] gilles: Thanks!
[13:35:00] (PS1) Andrew Bogott: Default ip_alises to undef rather than {} [puppet] - https://gerrit.wikimedia.org/r/215017
[13:36:25] (PS2) Andrew Bogott: Default ip_aliases to undef rather than {} [puppet] - https://gerrit.wikimedia.org/r/215017
[13:37:26] (CR) Andrew Bogott: [C: 2] Default ip_aliases to undef rather than {} [puppet] - https://gerrit.wikimedia.org/r/215017 (owner: Andrew Bogott)
[13:37:57] operations, Analytics-Cluster, Fundraising Sprint Kraftwerk, Fundraising Sprint Lou Reed: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1325639 (Ottomata) > My only concern so far is a difference in the number of entries for that 15-minute period: the...
[13:40:36] (CR) Aude: [C: 2] Enable Wikibase arbitrary access on wikisource and itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/215012 (https://phabricator.wikimedia.org/T98756) (owner: Aude)
[13:42:49] (Merged) jenkins-bot: Enable "Other Projects Links" by default on ptwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/214894 (https://phabricator.wikimedia.org/T99901) (owner: Glaisher)
[13:42:51] (Merged) jenkins-bot: Enable Wikibase arbitrary access on wikisource and itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/215012 (https://phabricator.wikimedia.org/T98756) (owner: Aude)
[13:45:26] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable arbitrary access on wikisource and itwiki, and make other projects sidebar feature default for ptwiki (duration: 00m 15s)
[13:45:32] Logged the message, Master
[13:48:23] !log aude Synchronized wmf-config/InitialiseSettings.php: Enable arbitrary access on wikisource and itwiki, and make other projects sidebar feature default for ptwiki (for real) (duration: 00m 12s)
[14:07:28] (PS1) Andrew Bogott: Add a pdns recursor to holmium [puppet] - https://gerrit.wikimedia.org/r/215022
[14:10:15] (PS2) Ottomata: Add icinga check for Hadoop YARN NodeManager Node-State [puppet] - https://gerrit.wikimedia.org/r/213874
[14:11:30] (CR) Ottomata: [C: 2] "Discussed with Yuvi, I think it will work!"
[puppet] - https://gerrit.wikimedia.org/r/213874 (owner: Ottomata)
[14:13:00] (CR) Andrew Bogott: [C: 2] Add a pdns recursor to holmium [puppet] - https://gerrit.wikimedia.org/r/215022 (owner: Andrew Bogott)
[14:15:03] (PS1) Andrew Bogott: Rename ferm rules in the recursor [puppet] - https://gerrit.wikimedia.org/r/215024
[14:16:02] (CR) Andrew Bogott: [C: 2] Rename ferm rules in the recursor [puppet] - https://gerrit.wikimedia.org/r/215024 (owner: Andrew Bogott)
[14:17:02] PROBLEM - puppet last run on analytics1028 is CRITICAL Puppet has 1 failures
[14:17:12] PROBLEM - puppet last run on analytics1013 is CRITICAL Puppet has 1 failures
[14:17:33] PROBLEM - puppet last run on holmium is CRITICAL puppet fail
[14:19:13] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures
[14:19:53] PROBLEM - puppet last run on analytics1031 is CRITICAL Puppet has 1 failures
[14:23:12] PROBLEM - puppet last run on analytics1029 is CRITICAL Puppet has 1 failures
[14:23:32] PROBLEM - puppet last run on analytics1034 is CRITICAL Puppet has 1 failures
[14:23:34] PROBLEM - puppet last run on analytics1036 is CRITICAL Puppet has 1 failures
[14:24:18] ottomata: ^ is that from the check command merge?
[14:24:22] PROBLEM - puppet last run on analytics1015 is CRITICAL Puppet has 1 failures
[14:24:22] PROBLEM - puppet last run on analytics1039 is CRITICAL Puppet has 1 failures
[14:24:22] PROBLEM - puppet last run on analytics1019 is CRITICAL Puppet has 1 failures
[14:24:23] PROBLEM - Recursive DNS on 208.80.154.20 is CRITICAL - Plugin timed out while executing system call
[14:25:30] yeah
[14:25:32] on it...
[14:25:38] its a weird one
[14:25:40] /etc/icinga/commands/check_hadoop_yarn_node_state.cfg20150601-15783-muvpes.lock at 117:/etc/puppet/modules/nagios_common/manifests/check_command.pp
[14:25:45] wat
[14:25:52] no /etc/icinga???
[14:26:00] yeah, that might be a problem
[14:26:32] PROBLEM - puppet last run on analytics1033 is CRITICAL Puppet has 1 failures
[14:27:13] PROBLEM - puppet last run on analytics1020 is CRITICAL Puppet has 1 failures
[14:27:32] i guessi should include icinga?
[14:27:42] YuviPanda: shoudl I do that in nagios_common module?
[14:28:22] PROBLEM - puppet last run on analytics1017 is CRITICAL Puppet has 1 failures
[14:28:24] gm no
[14:28:36] that does a lot of stuff
[14:28:53] PROBLEM - puppet last run on analytics1041 is CRITICAL Puppet has 1 failures
[14:29:31] operations, Traffic, discovery-system, services-tooling: integrate (pybal|varnish)->varnish backend config/state with etcd or similar - https://phabricator.wikimedia.org/T97029#1231228 (Joe)
[14:29:34] Puppet, operations, Traffic, Patch-For-Review, and 2 others: Create a confd puppet module - https://phabricator.wikimedia.org/T97974#1325831 (Joe) Open>Resolved
[14:29:38] Hm, YuviPanda, i think I don't need the check_command config on those hosts
[14:29:42] PROBLEM - puppet last run on analytics1030 is CRITICAL Puppet has 1 failures
[14:29:47] and nagios_common::check_command always sets that up
[14:30:00] maybe I should jsut install the plugin file manually?
[14:30:04] and not use nagios_common?
[14:30:22] PROBLEM - puppet last run on analytics1040 is CRITICAL Puppet has 1 failures
[14:30:22] PROBLEM - puppet last run on analytics1035 is CRITICAL Puppet has 1 failures
[14:31:24] Did anything happen to the cluster or external storage on the 30th that might cause a whole bunch of failures in saving the image description page when uploading a file
[14:31:41] As there's a big uptick in files showing up in commons not having an image description page
[14:32:12] PROBLEM - puppet last run on analytics1016 is CRITICAL Puppet has 1 failures
[14:32:13] PROBLEM - puppet last run on analytics1038 is CRITICAL Puppet has 1 failures
[14:34:31] (PS1) Jcrespo: Depool pc1003 for issues with performance_schema [mediawiki-config] - https://gerrit.wikimedia.org/r/215025
[14:34:42] PROBLEM - puppet last run on analytics1032 is CRITICAL Puppet has 1 failures
[14:35:02] PROBLEM - puppet last run on analytics1037 is CRITICAL Puppet has 1 failures
[14:35:13] PROBLEM - puppet last run on analytics1014 is CRITICAL Puppet has 1 failures
[14:35:34] (CR) Jcrespo: [C: 2] Depool pc1003 for issues with performance_schema [mediawiki-config] - https://gerrit.wikimedia.org/r/215025 (owner: Jcrespo)
[14:35:59] (PS1) Andrew Bogott: Make forward_zones configurable in our recursor class. [puppet] - https://gerrit.wikimedia.org/r/215026
[14:37:13] PROBLEM - puppet last run on analytics1011 is CRITICAL Puppet has 1 failures
[14:37:38] ah yeha, YuviPanda, this is all wrong :/ fixing.
[14:38:41] !log jynus Synchronized wmf-config/db-eqiad.php: depool pc1003 (duration: 00m 12s)
[14:38:45] Logged the message, Master
[14:38:52] (CR) Yuvipanda: [C: -1] Make forward_zones configurable in our recursor class. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/215026 (owner: Andrew Bogott)
[14:39:21] (CR) Andrew Bogott: Make forward_zones configurable in our recursor class.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/215026 (owner: Andrew Bogott)
[14:39:54] (PS2) Andrew Bogott: Make forward_zones configurable in our recursor class. [puppet] - https://gerrit.wikimedia.org/r/215026
[14:41:23] yurik: did you see https://phabricator.wikimedia.org/T100699 ?
[14:41:37] * yurik looking
[14:42:18] bleh, not good
[14:42:28] but the title is great ! :)
[14:42:28] i have seen it briefly
[14:42:41] !log powering down analytics1028 to swap the bad DIMM
[14:42:41] lol, i just realized what it is :D
[14:42:45] Logged the message, Master
[14:43:21] anyhow yurik, in your spare time, please give it some attention please
[14:43:34] matanya, not sure what that is
[14:43:40] spare time
[14:43:50] i guessed :)
[14:44:08] i think i will push for SVG generation ... once csteipp_afk agrees its safe :)
[14:44:28] and possibly do a server-side svg->png
[14:44:54] * yurik is looking for graph volonteers
[14:45:53] (PS1) Ottomata: Fix for nrpe with nagios_common. This doesn't work! [puppet] - https://gerrit.wikimedia.org/r/215027
[14:46:29] (CR) jenkins-bot: [V: -1] Fix for nrpe with nagios_common. This doesn't work! [puppet] - https://gerrit.wikimedia.org/r/215027 (owner: Ottomata)
[14:46:43] PROBLEM - Host analytics1028 is DOWN: PING CRITICAL - Packet loss = 100%
[14:47:49] (PS3) Andrew Bogott: Make forward_zones configurable in our recursor class. [puppet] - https://gerrit.wikimedia.org/r/215026
[14:47:51] (PS2) Ottomata: Fix for nrpe with nagios_common. This doesn't work! [puppet] - https://gerrit.wikimedia.org/r/215027
[14:49:40] (CR) Ottomata: [C: 2] Fix for nrpe with nagios_common. This doesn't work!
[puppet] - https://gerrit.wikimedia.org/r/215027 (owner: Ottomata)
[14:49:40] operations, database: investigate performance_schema for wmf prod - https://phabricator.wikimedia.org/T99485#1325917 (jcrespo) This is a bug I hit on pc1003, but not on the other hosts: https://bugs.launchpad.net/percona-server/+bug/1329772 The procedure means that, just in case, when upgrading from 14->1...
[14:51:44] (PS1) Ottomata: Fix for check_hadoop_yarn_node_state source path [puppet] - https://gerrit.wikimedia.org/r/215028
[14:52:05] (CR) Ottomata: [C: 2 V: 2] Fix for check_hadoop_yarn_node_state source path [puppet] - https://gerrit.wikimedia.org/r/215028 (owner: Ottomata)
[14:53:30] (PS1) Ricordisamoa: Add task ids for d4d5b243640a5e99fc4f243a504de8b014c5c814 [mediawiki-config] - https://gerrit.wikimedia.org/r/215030
[14:54:12] (PS4) Andrew Bogott: Make forward_zones configurable in our recursor class. [puppet] - https://gerrit.wikimedia.org/r/215026
[14:54:53] (PS5) Andrew Bogott: Add additional_forward_zones arg to dnsrecursor class. [puppet] - https://gerrit.wikimedia.org/r/215026
[14:57:13] PROBLEM - puppet last run on mw2019 is CRITICAL puppet fail
[14:59:11] (CR) Andrew Bogott: [C: 2] Add additional_forward_zones arg to dnsrecursor class. [puppet] - https://gerrit.wikimedia.org/r/215026 (owner: Andrew Bogott)
[14:59:30] um… ottomata, unmerged patch?
[15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, gilles: Dear anthropoid, the time has come. Please deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150601T1500).
[15:00:17] * gilles is ready
[15:00:57] andrewbogott: ah
[15:01:05] i must have not types 'yes' correctly
[15:01:06] ok to merge.
[15:01:09] or, i merge yours?
[15:01:14] I’ll get ‘em
[15:02:12] gilles: ok, I'll SWAT this morning. Bunch of submodule bumps on wmf8. Anything that requires a full scap (l10nupdates)?
[15:02:53] RECOVERY - puppet last run on analytics1017 is OK Puppet is currently enabled, last run 12 seconds ago with 0 failures [15:02:53] thcipriani: no i18n in there [15:03:01] kk [15:03:37] ottomata: fun fact - I got tired of typing 'yes' and now it also accepts 'y' [15:03:50] I will repool a server in about an hour, please tell me when/if I can do that later [15:04:02] thcipriani, ^ [15:04:07] (03PS1) 10Andrew Bogott: s/heira/hiera/ [puppet] - 10https://gerrit.wikimedia.org/r/215035 [15:04:12] RECOVERY - puppet last run on analytics1040 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:04:13] RECOVERY - puppet last run on analytics1033 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:04:17] andrewbogott: haha, I missed that >_> [15:04:23] RECOVERY - puppet last run on analytics1041 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [15:04:29] hashar_: there’s a CI test for ‘common typos’ in puppet, right? Where is that test code? I want to add one. [15:04:31] e and i be my enemies 4eiva [15:04:33] RECOVERY - puppet last run on analytics1011 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [15:04:43] RECOVERY - puppet last run on analytics1020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:04:53] RECOVERY - puppet last run on analytics1013 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [15:05:23] RECOVERY - puppet last run on analytics1035 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:05:25] (03CR) 10Andrew Bogott: [C: 032] s/heira/hiera/ [puppet] - 10https://gerrit.wikimedia.org/r/215035 (owner: 10Andrew Bogott) [15:05:42] <_joe_> YuviPanda: what accepts 'y'? [15:05:45] <_joe_> puppet-merge? [15:05:51] yes [15:06:01] andrewbogott: let me check :-} [15:06:04] <_joe_> why did you change that? a good occasion for bikeshedding! 
[15:06:23] BAH [15:06:26] the job is https://integration.wikimedia.org/ci/job/operations-puppet-typos/33629/console [15:06:32] but there is no more file named 'typos' [15:06:35] hashar_: I think we used to reject any patch that said ‘pmpta’ or something like that [15:06:42] PROBLEM - puppet last run on holmium is CRITICAL puppet fail [15:06:44] * 9ac8d27 - Move misc. utilities to utils/; remove typos (2 months ago) [15:06:48] huh [15:07:03] hashar_: have time to revive that now, or shall I open a ticket? [15:07:08] and the Jenkins job does not assert the file actually exist hehe [15:07:13] equiad is something I will write sooner or later [15:07:18] yeah task please. going to fix it right now [15:07:32] _joe_: I changed it like a few months ago, because sometimes the 'yes' will run through to bash and it'll just start printing 'y' on the commandline continuously [15:08:11] * YuviPanda used to typo it as equiad all the time. Sounds quite horsey [15:08:32] RECOVERY - puppet last run on holmium is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [15:09:12] RECOVERY - puppet last run on analytics1030 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:09:26] also I do not know how to write "myself" (had to retype that), I always write "mysql" [15:10:06] hashar_: https://phabricator.wikimedia.org/T100989?workflow=create [15:10:13] RECOVERY - puppet last run on analytics1038 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:10:14] RECOVERY - puppet last run on analytics1016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:10:52] RECOVERY - puppet last run on analytics1032 is OK Puppet is currently enabled, last run 9 seconds ago with 0 failures [15:10:55] YuviPanda: ok, now I think the recursor works for private and public both. check my work, again? [15:11:07] andrewbogott: which one? 
[15:11:17] labs-recursor.wikimedia.org [15:11:33] RECOVERY - puppet last run on analytics1037 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:11:43] (03PS1) 10Hashar: Restore /typos file. Used by Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/215037 (https://phabricator.wikimedia.org/T100989) [15:11:47] andrewbogott: ^^^ [15:11:59] will find a fix to have the jenkins job fail whenever the file is missing [15:12:10] gilles: once merge happens, I'll pull down to tin, then sync one extension at a time via sync-dir then sync-file ResourceLoaderWikiModule.php, sound good? Is there any need to order anything differently? [15:12:28] andrewbogott: checking :) [15:12:55] aarggg, stupid internet. The UK is supposed to be 'first world'!! [15:13:23] When it comes to the internet, only SE Asia is the first world. [15:13:25] thcipriani: order shouldn't matter [15:13:33] <_joe_> andrewbogott: and holland. [15:13:34] RECOVERY - puppet last run on analytics1014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:34] kk, cool. [15:13:53] _joe_: true. And possible random points in the midwest US blessed by google fiber. [15:13:56] andrewbogott: beaches of south goa were also pretty good. [15:14:12] <_joe_> my current connection is not that bad [15:14:27] dig @labs-recursor.wikimedia.org quarry.wmflabs.org [15:14:31] dig: couldn't get address for 'labs-recursor.wikimedia.org': not found [15:14:34] andrewbogott: ^ [15:14:51] (03PS2) 10Andrew Bogott: Restore /typos file. Used by Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/215037 (https://phabricator.wikimedia.org/T100989) (owner: 10Hashar) [15:15:02] hashar_: countered [15:15:32] YuviPanda: oh, sorry, it’s labs-recursor0 [15:15:54] RECOVERY - puppet last run on mw2019 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [15:16:36] andrewbogott: seems to work. [15:17:09] YuviPanda: I’ll change bastion-restricted-pdns to use it. 
Are you using that as your labs proxy these days? [15:17:45] no, I'm just still on bastion-restrcited I think [15:17:53] RECOVERY - puppet last run on analytics1031 is OK Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:17:54] we should move all the bastions to debian, I think [15:18:01] away from precise, at least [15:18:01] (03CR) 10Hashar: [C: 031] "+1 :)" [puppet] - 10https://gerrit.wikimedia.org/r/215037 (https://phabricator.wikimedia.org/T100989) (owner: 10Hashar) [15:18:49] !log thcipriani Synchronized php-1.26wmf8/extensions/Gather: SWAT: Make ResourceLoaderWikiModule support custom position [[gerrit:214741]] (duration: 00m 13s) [15:18:52] Logged the message, Master [15:19:08] (03CR) 10Andrew Bogott: [C: 032] Restore /typos file. Used by Jenkins [puppet] - 10https://gerrit.wikimedia.org/r/215037 (https://phabricator.wikimedia.org/T100989) (owner: 10Hashar) [15:19:32] RECOVERY - puppet last run on analytics1029 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:19:43] RECOVERY - puppet last run on analytics1019 is OK Puppet is currently enabled, last run 29 seconds ago with 0 failures [15:20:22] !log thcipriani Synchronized php-1.26wmf8/extensions/MobileFrontend: SWAT: Make ResourceLoaderWikiModule support custom position [[gerrit:214741]] (duration: 00m 13s) [15:20:23] RECOVERY - puppet last run on analytics1034 is OK Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:20:26] Logged the message, Master [15:20:29] hashar_: want to query labs_recursor0.wikimedia.org a bit and make sure that it’s returning what you want when you do queries within beta and CI? [15:21:03] aren't underscore frowned upon in DNS ? 
[15:21:12] !log thcipriani Synchronized php-1.26wmf8/extensions/SyntaxHighlight_GeSHi: SWAT: Make ResourceLoaderWikiModule support custom position [[gerrit:214741]] (duration: 00m 14s) [15:21:17] Logged the message, Master [15:21:23] RECOVERY - puppet last run on analytics1039 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:21:32] RECOVERY - puppet last run on analytics1036 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [15:21:36] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1326023 (10JohnLewis) @Nemo_bis could you please stop adding the blocker if you are not going to provide a rationale on how that is a technical block for the proj... [15:21:41] (03PS1) 10Andrew Bogott: Alphabetize typos, add equiad [puppet] - 10https://gerrit.wikimedia.org/r/215038 [15:21:42] RECOVERY - puppet last run on analytics1015 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [15:21:51] jynus: anything else you’d like me to add to ^^ while I’m at it? 
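Editor's note on the underscore question above: under the hostname rules of RFC 952 and RFC 1123, each label of a host name may contain only letters, digits and hyphens, while other DNS names (service records, for instance) may carry underscores. A minimal sketch of that check follows; the function name `is_valid_hostname` is hypothetical and not part of any tool mentioned in this log.

```python
import re

# One hostname label per RFC 952 / RFC 1123: letters, digits and hyphens,
# 1 to 63 characters, and no hyphen at either end.
_HOST_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(name: str) -> bool:
    """Return True if every dot-separated label is a legal hostname label."""
    if name.endswith("."):  # tolerate the trailing dot of a fully qualified name
        name = name[:-1]
    if not name or len(name) > 253:
        return False
    return all(_HOST_LABEL.fullmatch(label) for label in name.split("."))


if __name__ == "__main__":
    # The two spellings that appear in the conversation.
    print(is_valid_hostname("labs-recursor0.wikimedia.org"))
    print(is_valid_hostname("labs_recursor0.wikimedia.org"))
```

The check is purely syntactic: it says nothing about whether a name resolves.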
[15:22:01] "Underscores allowed, except in host names" [15:22:23] !log thcipriani Synchronized php-1.26wmf8/extensions/VectorBeta: SWAT: Make ResourceLoaderWikiModule support custom position [[gerrit:214741]] (duration: 00m 15s) [15:22:27] Logged the message, Master [15:22:42] hashar_: that’s good because the name is actually labs-recursor0.wikimedia.org [15:22:45] ^it says it is a myth [15:23:17] !log thcipriani Synchronized php-1.26wmf8/extensions/WikiEditor: SWAT: Make ResourceLoaderWikiModule support custom position [[gerrit:214741]] (duration: 00m 13s) [15:23:21] Logged the message, Master [15:23:26] so do not listen to the quote [15:24:18] !log thcipriani Synchronized php-1.26wmf8/includes/resourceloader/ResourceLoaderWikiModule.php: SWAT: Make ResourceLoaderWikiModule support custom position [[gerrit:214741]] (duration: 00m 15s) [15:24:22] Logged the message, Master [15:24:26] ^ gilles that ought to do it [15:24:32] andrewbogott, one question, what if there was some legit use, can it be overridden on commit message? [15:24:46] sorry I haven't seen the script [15:24:48] (03PS3) 10Cenarium: Remove 'autoreview' usergroup from enwiki/testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203370 (https://phabricator.wikimedia.org/T91934) [15:24:58] (03CR) 10Cenarium: "rebased" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203370 (https://phabricator.wikimedia.org/T91934) (owner: 10Cenarium) [15:25:00] jynus: nope! Hopefully the test excludes itself, otherwise we will never be able to change it :) [15:25:18] :) [15:25:38] thcipriani: are you sure that the change to ResourceLoaderWikiModule.php made it through? [15:25:58] gilles: double checking [15:27:28] 6operations, 6Labs: New server for labs dns recursor - https://phabricator.wikimedia.org/T99133#1326044 (10Andrew) 5Open>3Resolved Done in place, no problems. [15:28:23] thcipriani: nevermind, there's an extension I forgot to backport. is there still time to add that to the SWAT pile? 
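Editor's note on the typo check discussed above: the sketch below shows one way such a check could behave, assuming a plain-text list with one typo per line. It is not the actual Jenkins job, whose code is not shown in this log. The function names, the whole-word matching and the symlink skipping are assumptions of this sketch; the two behaviours taken from the conversation are failing when the list file is missing and not scanning the list file itself.

```python
import re
from pathlib import Path


def load_typos(typos_file: Path) -> list[str]:
    """Read one typo per line; a missing list is an error, not a silent pass."""
    if not typos_file.is_file():
        raise FileNotFoundError(f"typos list not found: {typos_file}")
    lines = typos_file.read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def find_typos(root: Path, typos_file: Path) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, typo) for every hit under root."""
    typos = load_typos(typos_file)
    if not typos:
        return []
    # Whole-word matching (an assumption of this sketch): a listed typo is
    # only reported when it is not part of a longer word.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, typos)) + r")\b")
    skip = typos_file.resolve()
    hits = []
    for path in sorted(root.rglob("*")):
        # Skip symlinks and directories, and never scan the list itself.
        if path.is_symlink() or not path.is_file() or path.resolve() == skip:
            continue
        text = path.read_text(errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            for match in pattern.finditer(line):
                hits.append((str(path.relative_to(root)), number, match.group(1)))
    return hits


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "typos").write_text("heira\nequiad\n")
        (root / "site.pp").write_text("$dc = 'equiad'\n$site = 'eqiad'\n")
        for hit in find_typos(root, root / "typos"):
            print(hit)
```

A CI wrapper would exit non-zero when the returned list is non-empty, and would let the `FileNotFoundError` propagate so that a deleted list fails the job instead of passing it.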
[15:28:24] gilles: yes, that file has been updated and synced [15:28:41] gilles: sure, still time in the window [15:28:47] cool [15:33:49] andrewbogott: bah I refreshed the job [15:33:53] and now it is failing :-((( [15:34:25] hashar_: dang, the test rotted while the file wasn’t there? Or is typos scanning itself? [15:34:49] obviously not [15:34:53] but we list "ncsa" [15:34:55] and we have varnishncsa [15:34:59] hello [15:35:04] thcipriani: https://gerrit.wikimedia.org/r/215043 plz [15:35:14] need a smarter grep :-} [15:35:17] * thcipriani looks [15:36:58] (03PS1) 10Andrew Bogott: Use the labs-recursor0 for labs dns [puppet] - 10https://gerrit.wikimedia.org/r/215044 [15:37:39] (03CR) 10jenkins-bot: [V: 04-1] Use the labs-recursor0 for labs dns [puppet] - 10https://gerrit.wikimedia.org/r/215044 (owner: 10Andrew Bogott) [15:38:12] (03CR) 10Yuvipanda: [C: 04-1] Use the labs-recursor0 for labs dns (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/215044 (owner: 10Andrew Bogott) [15:39:20] (03PS1) 10Hashar: typos: remove ncsa (matches varnishncsa) [puppet] - 10https://gerrit.wikimedia.org/r/215046 (https://phabricator.wikimedia.org/T100989) [15:39:32] andrewbogott: https://gerrit.wikimedia.org/r/215046 removes the ncsa typo [15:39:36] will have to figure out a better plan [15:39:51] probably by using grep with perl regex [15:39:57] (03CR) 10jenkins-bot: [V: 04-1] typos: remove ncsa (matches varnishncsa) [puppet] - 10https://gerrit.wikimedia.org/r/215046 (https://phabricator.wikimedia.org/T100989) (owner: 10Hashar) [15:40:00] bah [15:40:14] fgrep: modules/admin/files/home/akosiaris/.my.cnf: Permission denied [15:40:17] fgrep: modules/admin/files/home/jynus/.my.cnf: Permission denied [15:40:49] (03CR) 10Andrew Bogott: [C: 032 V: 032] typos: remove ncsa (matches varnishncsa) [puppet] - 10https://gerrit.wikimedia.org/r/215046 (https://phabricator.wikimedia.org/T100989) (owner: 10Hashar) [15:41:39] they are just symbolic links, hashar_ [15:41:48] akosiaris, jynus, 
what’s with you two having those magic files that no one else has? [15:42:47] godog: https://gerrit.wikimedia.org/r/#/c/214651/ ? [15:43:19] cannot be owned by the user, or it would be a security problem [15:43:47] (actually it doesn't matter, the original files is already owned by root) [15:44:14] (03PS2) 10Andrew Bogott: Alphabetize typos, add equiad and ip_resolve [puppet] - 10https://gerrit.wikimedia.org/r/215038 [15:44:49] How can permissions on a file in puppet be a security problem? Everything there is public… [15:44:59] (03CR) 10jenkins-bot: [V: 04-1] Alphabetize typos, add equiad and ip_resolve [puppet] - 10https://gerrit.wikimedia.org/r/215038 (owner: 10Andrew Bogott) [15:45:47] um… what now? [15:45:51] andrewbogott, not really. Specially with that file, which does not really exist [15:46:08] …ok [15:46:43] (03PS2) 10Andrew Bogott: Use the labs-recursor0 for labs dns [puppet] - 10https://gerrit.wikimedia.org/r/215044 [15:47:18] !log thcipriani Synchronized php-1.26wmf8/extensions/CodeReview: SWAT: Backport CodeReview module position fix [[gerrit:215043]] (duration: 00m 13s) [15:47:22] Logged the message, Master [15:47:25] (03CR) 10jenkins-bot: [V: 04-1] Use the labs-recursor0 for labs dns [puppet] - 10https://gerrit.wikimedia.org/r/215044 (owner: 10Andrew Bogott) [15:47:41] hashar_: typo check still hates everything [15:48:04] yeah :( [15:48:10] andrewbogott, hashar_: yeah those files are symlinks to /root IIRC :( [15:48:12] gilles: can you test the codereview position fix? [15:48:19] I fixed that hopefully [15:48:27] using grep -r, it does not follow symlinks [15:48:28] irritates me every time I grep the puppet repo [15:48:31] ah [15:49:19] jynus: does that file make your life enough better that it’s worth it for everyone to be annoyed when grepping? 
(‘yes’ is an acceptable answer) [15:49:34] well, I can delete [15:49:39] it [15:49:52] you don’t necessarily need to do that… I just don’t understand what it’s for [15:50:50] it allows me to write mysql, instead of mysql --defaults-file=/root/... [15:50:55] thcipriani: looks great, the world has been fixed, thank you [15:51:04] aliasing it can be dangerous for me [15:51:11] but I can do that instead [15:51:20] let me fix it [15:51:34] gilles: yw, thanks for testing. [15:51:39] jynus: it’s only worth you deleting if alex deletes it too :) [15:51:47] SWAT complete. [15:51:53] hashar_: should I recheck, or are you still poking? [15:52:09] poking [15:52:13] 'k [15:52:17] * andrewbogott wanders off for a bit [15:52:58] should be good now [15:53:12] * andrewbogott wanders back [15:53:26] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/215038 (owner: 10Andrew Bogott) [15:53:38] (03CR) 10Andrew Bogott: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/215044 (owner: 10Andrew Bogott) [15:54:30] woo, tests passing again. Thanks hashar_ [15:54:56] (03CR) 10Andrew Bogott: [C: 032] Alphabetize typos, add equiad and ip_resolve [puppet] - 10https://gerrit.wikimedia.org/r/215038 (owner: 10Andrew Bogott) [15:55:28] (03CR) 10Andrew Bogott: [C: 032] Use the labs-recursor0 for labs dns [puppet] - 10https://gerrit.wikimedia.org/r/215044 (owner: 10Andrew Bogott) [15:55:40] (03PS1) 10Jcrespo: Removing symbolic link as it breaks other people's workflow [puppet] - 10https://gerrit.wikimedia.org/r/215048 [15:55:49] andrewbogott, hashar_ ^ [15:56:03] jynus: thanks :) [15:56:46] (03CR) 10Andrew Bogott: [C: 031] "I don't demand that this happen, it's just weird to have a few random files in my puppet repo that I'm not allowed to read.
Uglies up the" [puppet] - 10https://gerrit.wikimedia.org/r/215048 (owner: 10Jcrespo) [15:57:08] andrew: you can read then, so sudo on your own machine [15:57:16] :) [15:57:20] could [15:57:41] I know, I know [15:58:02] (03PS2) 10Jcrespo: Removing symbolic link as it breaks other people's workflow [puppet] - 10https://gerrit.wikimedia.org/r/215048 [15:58:52] (03CR) 10Ori.livneh: [C: 031] "godog, all yours" [puppet] - 10https://gerrit.wikimedia.org/r/214651 (owner: 10Ori.livneh) [15:59:02] (03CR) 10Jcrespo: [C: 032] Removing symbolic link as it breaks other people's workflow [puppet] - 10https://gerrit.wikimedia.org/r/215048 (owner: 10Jcrespo) [16:01:42] (03PS3) 10Ori.livneh: Make import group assignable on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214925 (https://phabricator.wikimedia.org/T100925) (owner: 10Odder) [16:02:22] (03CR) 10Ori.livneh: [C: 032] Make import group assignable on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214925 (https://phabricator.wikimedia.org/T100925) (owner: 10Odder) [16:02:48] (03PS5) 10Ori.livneh: Sysops to add users to import group on maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) (owner: 10Odder) [16:02:57] (03PS2) 10Ori.livneh: Provide static PNG logos for emlwiki and kgwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214981 (https://phabricator.wikimedia.org/T100953) (owner: 10Odder) [16:03:05] (03CR) 10Ori.livneh: [C: 032] Sysops to add users to import group on maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) (owner: 10Odder) [16:04:01] (03Merged) 10jenkins-bot: Make import group assignable on newiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/214925 (https://phabricator.wikimedia.org/T100925) (owner: 10Odder) [16:04:03] (03Merged) 10jenkins-bot: Sysops to add users to import group on maiwiki [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/214771 (https://phabricator.wikimedia.org/T99491) (owner: 10Odder) [16:05:53] !log ori Synchronized wmf-config/InitialiseSettings.php: T99491, T100925: Sysops to add users to import group on maiwiki, newiki (duration: 00m 14s) [16:06:05] Logged the message, Master [16:08:11] 6operations, 7HTTPS, 5Patch-For-Review: replace git's sha1 cert with sha256 - https://phabricator.wikimedia.org/T100827#1326163 (10RobH) Indeed, it is behind misc-web. I think we can indeed revoke this cert/keypair entirely. I'll keep it assigned to me and do so later today. Once done, I'll remove from ou... [16:08:54] (03PS1) 10Hashar: DO NOT SUBMIT: bunch of typos [puppet] - 10https://gerrit.wikimedia.org/r/215050 [16:09:16] andrewbogott: now with colors https://integration.wikimedia.org/ci/job/operations-puppet-typos/33650/console : -} [16:09:27] (03Abandoned) 10Hashar: DO NOT SUBMIT: bunch of typos [puppet] - 10https://gerrit.wikimedia.org/r/215050 (owner: 10Hashar) [16:09:57] hashar, nice! [16:10:08] grep --color=always [16:18:48] (03PS4) 10Ori.livneh: Log a 20s sample of memcached usage to a file once a day [puppet] - 10https://gerrit.wikimedia.org/r/214762 [16:26:30] andrewbogott: I see that patch has been merged. mind if I switch it on for tools-dev, the secondary bastion? [16:27:35] andrewbogott: could also turn it on for the labs proxy, it makes a bunch of DNS requests every day [16:28:41] YuviPanda: I’m seeing another issue, we should wait a bit [16:28:53] alright [16:29:02] * YuviPanda takes his rush-to-enable-things-hat off [16:29:37] https://phabricator.wikimedia.org/T101000?workflow=create [16:30:13] andrewbogott: hmm, so I created tools-redis-02, then had to delete it and recreate it [16:30:18] I guess whatever purges them didn't catch it? [16:30:32] seems like. 
I’m looking now [16:30:56] 6operations, 7discovery-system, 5services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1326261 (10Joe) I think roles in your view are an approximation of what I put in the `/services` hierarchy. That's a collection of (mostly immutable) da... [16:31:31] heh, looks like the ssh algorithm update broke connectivity with designate somehow. [16:31:33] * andrewbogott surprised [16:33:17] I'm considering turning the kex / mac updates off for all of labs [16:33:25] it's already been turned off for at least two projects (tools and quarry) [16:35:20] this is failing on a prod machine anyway [16:35:33] huh, does paramiko have its own ssh algorithm? Surely it just uses the openssh install... [16:35:56] moritzm: any idea re ^? https://phabricator.wikimedia.org/T101000 [16:36:03] (03PS1) 10Jcrespo: Repool pc1003 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215055 [16:36:46] andrewbogott: paramiko is still problematic, yes. that's why I turned it off for quarry [16:37:05] andrewbogott: why does designate need paramiko? [16:37:55] (03PS1) 10Ottomata: Make it possible to install multiple custom diamond collectors that use the same source [puppet] - 10https://gerrit.wikimedia.org/r/215056 [16:38:03] YuviPanda: it calls back to the puppetmaster to clean up puppet and salt keys on deletion. [16:38:10] aaaah [16:38:11] (03CR) 10Jcrespo: [C: 032] Repool pc1003 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215055 (owner: 10Jcrespo) [16:38:13] ok [16:38:13] (03CR) 10Mark Bergsma: [C: 031] "It should be straightforward to test in Labs or any other machine. You don't need IPVS actually working for this, in DryRun you can test i" (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/213223 (owner: 10Ori.livneh) [16:38:17] I guess I can just shell out… [16:38:28] andrewbogott: nah, I think you can turn it off on a per-host or per-role basis [16:38:37] ? 
[16:38:39] I'd rather have us use paramiko than shell out [16:38:47] Except it doesn’t work [16:38:58] (03PS2) 10Ottomata: Make it possible to install multiple custom diamond collectors that use the same source [puppet] - 10https://gerrit.wikimedia.org/r/215056 [16:38:58] andrewbogott: last two lines of https://wikitech.wikimedia.org/wiki/Hiera:Tools [16:39:23] hm… ugly [16:39:24] * andrewbogott tries [16:39:58] andrewbogott: i'd like to try this again, would appreciate a review: https://gerrit.wikimedia.org/r/#/c/215056/ [16:40:26] ottomata: ok. It’ll be a bit. [16:40:36] !log jynus Synchronized wmf-config/db-eqiad.php: repool pc1003 (duration: 00m 15s) [16:40:38] np [16:40:40] Logged the message, Master [16:41:56] (03PS1) 10Andrew Bogott: Allow a lax ssh policy on labs controllers. [puppet] - 10https://gerrit.wikimedia.org/r/215057 [16:42:06] YuviPanda: is that what you mean? [16:42:52] andrewbogott: yes, although we should use role based hiera lookup there, but I guess that requires we use the 'role' keyword for including our role which I don't think we do atm [16:42:57] andrewbogott: but yes, that should work [16:43:33] YuviPanda: does the role include have to happen in site.pp? [16:43:35] hm [16:44:09] YuviPanda: ‘role::puppet::server::labs’ is what I want those settings for [16:44:17] Can you show me how to do that? [16:44:41] andrewbogott: oh, hmm. I don't know if it can be used outside of site.pp [16:44:46] ok [16:44:48] andrewbogott: where are those included from? [16:45:03] oh, role::nova::controller [16:45:08] let’s just do it based on that, that’s close enough for now. [16:45:12] ok [16:45:24] 6operations, 7discovery-system, 5services-tooling: [RFC] Define the on-disk and live structure of etcd pool data - https://phabricator.wikimedia.org/T100793#1326300 (10Joe) Something that somehow escaped me in my previous comments: this is all thought for **static** configuration we get from a file on the f... 
[16:45:33] andrewbogott: want me to amend the patch or? [16:45:43] If you don’t mind, yes please [16:45:46] sure [16:46:32] * andrewbogott is creeped out that paramiko has a different ssh implementation than commandline ‘ssh’ [16:50:24] (03PS2) 10Yuvipanda: Allow a lax ssh policy on labs controllers. [puppet] - 10https://gerrit.wikimedia.org/r/215057 (owner: 10Andrew Bogott) [16:50:29] andrewbogott: ^ [16:51:08] (03CR) 10Ori.livneh: [UNTESTED] Use INotify to watch for configuration file changes (031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/213223 (owner: 10Ori.livneh) [16:51:28] YuviPanda: ok, thanks! Let’s see if that fixes instance deletion... [16:51:46] (03CR) 10Andrew Bogott: [C: 032] Allow a lax ssh policy on labs controllers. [puppet] - 10https://gerrit.wikimedia.org/r/215057 (owner: 10Andrew Bogott) [16:51:52] 6operations, 6Phabricator, 7database: Add Story points (from Sprint Extension) to the phabricator data dump - https://phabricator.wikimedia.org/T100846#1326329 (10chasemp) [16:54:09] YuviPanda: Error: Could not retrieve catalog from remote server: Error 400 on SERVER: ldapconfig is not a hash or array when accessing it with basedn at /etc/puppet/manifests/role/puppet.pp:10 on node virt1000.wikimedia.org [16:54:34] uh. [16:55:05] yeah [16:56:53] PROBLEM - puppet last run on virt1000 is CRITICAL puppet fail [16:57:25] andrewbogott: have a fix coming up [16:57:35] you’re way ahead of me, then [16:57:40] (03PS1) 10Yuvipanda: puppetmaster: Explicitly include config class before using it [puppet] - 10https://gerrit.wikimedia.org/r/215060 [16:57:42] andrewbogott: ^ [16:57:57] andrewbogott: the change in ordering messed it up - it was only working due to 'lucky' ordering before :) [16:58:01] ah, sure. 
[16:58:02] OK [16:58:03] PROBLEM - puppet last run on labcontrol1001 is CRITICAL puppet fail [16:58:12] (03CR) 10Andrew Bogott: [C: 032] puppetmaster: Explicitly include config class before using it [puppet] - 10https://gerrit.wikimedia.org/r/215060 (owner: 10Yuvipanda) [16:58:27] andrewbogott: might uncover other issues :) [16:59:44] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1326339 (10greg) @Dzahn: the list of blockers are all resolved, how do you want to proceed? cc @aklapper [17:00:23] RECOVERY - puppet last run on virt1000 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:06:21] YuviPanda: deletion is working again. I have to dig through designate and purge duplicates now, which will take a while… will do after the meeting. [17:06:50] andrewbogott: cool. I'll hold off on trigerhappying any new switchovers in the meantime. [17:07:04] I think I’ll send an announcement warning that bastion and default-for-new-instances dns will change tomorrow and that I’ll switch over existing instances next Monday. sound OK? [17:07:30] andrewbogott: yeah. [17:07:51] Aw, I can hear the disappointment over ‘next Monday’ all the way over here. [17:09:22] andrewbogott: :D let's do it on friday night!!!!! [17:09:24] :) [17:09:48] andrewbogott: I'll still switchover some stuff before that though. labs proxies, tools proxies, and tools gridengine / master, I think [17:10:13] ok… would be nice to warn people about such things [17:11:51] andrewbogott: proxies and master run only our code, and only people getting access to them are 'us' (admins) [17:11:52] so seems ok? 
[17:12:10] they also do a lot of lookups, and I just want to hand move them than get them caught up in a labs-wide move [17:12:23] RECOVERY - Host analytics1028 is UP: PING OK - Packet loss = 0%, RTA = 1.66 ms [17:12:50] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6WMF-NDA: On Beta Cluster, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1326373 (10greg) [17:14:49] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6WMF-NDA: On deployment-prep, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1326378 (10ori) [17:15:57] (03CR) 10Dzahn: [C: 032] people.wikimedia.org: HTTPS only [puppet] - 10https://gerrit.wikimedia.org/r/214773 (owner: 10Ori.livneh) [17:16:43] RECOVERY - puppet last run on labcontrol1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [17:19:49] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6WMF-NDA: On deployment-prep, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1326392 (10Joe) Proposal: ``` oo... [17:21:51] 6operations, 10Beta-Cluster, 6Labs, 10Labs-Infrastructure, 6WMF-NDA: On deployment-prep, add warning text + labs Term of Uses link to the motd files - https://phabricator.wikimedia.org/T100837#1326400 (10yuvipanda) Yes, and everytime someone breaks beta we can factually say they clubbed a baby seal to de... [17:29:36] YuviPanda: I’ll include a mention in the email.
It’s good to warn people even if we think something shouldn’t break :) [17:29:36] (03CR) 10Dzahn: [C: 032] [English Planet] Add Bluerasberry, Nimish Gautam [puppet] - 10https://gerrit.wikimedia.org/r/214916 (owner: 10Nemo bis) [17:37:19] (03PS1) 10Gage: strongswan module: don't install ipsec-tools [puppet] - 10https://gerrit.wikimedia.org/r/215067 [17:38:10] (03CR) 10Gage: [C: 032] strongswan module: don't install ipsec-tools [puppet] - 10https://gerrit.wikimedia.org/r/215067 (owner: 10Gage) [17:39:12] 6operations: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1326456 (10Andrew) [17:39:37] <_joe_> earldouglas: if you have profound insights on how to make eventlogging more performant, I'm sure ottomata will be happy to hear you out. [17:40:18] _joe_: re what? [17:41:13] <_joe_> ori: something said in a meeting, it seems the search team would need to send more events to eventlogging [17:41:27] oh [17:41:52] 6operations, 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-100: Migrate Labs NFS storage from RAID6 to RAID10 - https://phabricator.wikimedia.org/T96063#1326475 (10coren) [17:42:27] <_joe_> ori: if I got that correctly [17:43:23] _joe_: afaik, the main eventlogging bottleneck at the moment is mysql insertion [17:43:28] gwicke, mobrovac i'm about to depl another version on sca100x, any objections? [17:43:39] beyond that, it is parellization, which we may solve with kafka soon [17:43:50] <_joe_> ottomata: oh so it won't run 100x faster on a laptop? [17:44:02] ha, wha? [17:44:02] mongodb etc [17:44:25] <_joe_> ottomata: just being sarcastic [17:44:51] 6operations, 10Traffic, 10Wikimedia-DNS: Consider DNSSec - https://phabricator.wikimedia.org/T26413#1326484 (10BBlack) Our [[ http://gdnsd.org/ | current DNS server implementation (gdnsd) ]], which we like for a lot of its other unique features (and which I should incidentally disclaim that I'm the author of... 
[17:46:09] !log deployed graphoid service update - grafana logging cleanup [17:46:12] Logged the message, Master [17:46:17] _joe_: thumbsup [17:46:46] <_joe_> earldouglas: ottomata and ori can help you understand what the bottleneck is, and please suggest them improvements [17:46:50] 6operations, 10Analytics-Cluster, 10hardware-requests, 10procurement: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1326488 (10RobH) #hardware-requests, not #procurement for requesting hardware in phabricator, as outlined on https://wikitech.wikimedia.org/wiki/Operations... [17:47:31] 6operations, 10ops-esams, 10hardware-requests, 10procurement: Buy fiber patches - https://phabricator.wikimedia.org/T94846#1326493 (10RobH) #hardware-requests, not #procurement for requesting hardware in phabricator, as outlined on https://wikitech.wikimedia.org/wiki/Operations_requests#Hardware_Requests.... [17:47:39] * _joe_ bbiab [17:47:41] 6operations, 10ops-esams, 10hardware-requests: Buy fiber patches - https://phabricator.wikimedia.org/T94846#1326495 (10RobH) [17:47:45] 6operations, 10Analytics-Cluster, 10hardware-requests: Hadoop worker node procurement - 2015 - https://phabricator.wikimedia.org/T100442#1326496 (10RobH) [17:52:00] (03PS2) 10Florianschmidtwelzow: Enable alternate and canonical links for mobile/desktop pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/212022 (https://phabricator.wikimedia.org/T99587) [17:52:09] 6operations: pc100[123] maintenance and upgrade - https://phabricator.wikimedia.org/T100301#1326513 (10jcrespo) 5Open>3Resolved I think I can say this is done, I will reopen if I find another issue. Extra monitoring will be ticketed if needed on a separate issue. 
[17:56:14] PROBLEM - puppet last run on ms-be1018 is CRITICAL Puppet has 1 failures [17:57:46] 6operations, 5Patch-For-Review, 7database: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1326531 (10Dzahn) @jcrespo thank you very much. should be all done then. [17:57:51] 6operations, 5Patch-For-Review, 7database: contacts.wikimedia.org drupal unpuppetized / retire contacts - https://phabricator.wikimedia.org/T90679#1326532 (10Dzahn) 5Open>3Resolved [17:58:30] 6operations, 7database: Discuss enabling automatic buffer pool dumping on start/stop (puppet) for all servers - https://phabricator.wikimedia.org/T101009#1326533 (10jcrespo) 3NEW a:3jcrespo [17:59:25] 6operations, 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-100: Make a block-level copy of the codfw mirror of labstore1001 to eqiad - https://phabricator.wikimedia.org/T101010#1326544 (10coren) 3NEW [17:59:54] 6operations, 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-100: Make a block-level copy of the codfw mirror of labstore1001 to eqiad - https://phabricator.wikimedia.org/T101010#1326544 (10coren) a:3coren First part is having the destination ready, doing that. 
[18:00:20] 6operations, 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-100: Rsync live labstore filesystem to local eqiad copy - https://phabricator.wikimedia.org/T101011#1326556 (10coren) 3NEW [18:00:42] 6operations, 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-100: Make a block-level copy of the codfw mirror of labstore1001 to eqiad - https://phabricator.wikimedia.org/T101010#1326566 (10coren) [18:00:45] 6operations, 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-100: Rsync live labstore filesystem to local eqiad copy - https://phabricator.wikimedia.org/T101011#1326556 (10coren) [18:05:20] 7Puppet, 10Deployment-Systems: Trebuchet master should be separate from scap - https://phabricator.wikimedia.org/T96042#1326597 (10thcipriani) p:5Triage>3Low [18:05:50] 6operations, 10ops-esams, 10hardware-requests: Buy fiber patches - https://phabricator.wikimedia.org/T94846#1326602 (10RobH) (I realize this task was likely just for Mark as a reminder for himself, but I didn't want to not point out that we aren't using the #procurement project quite yet.) [18:07:38] _joe_: fair enough, 100x might have been overly optimistic. [18:08:01] But fwiw I just slapped together a Web app that inserts a row into mysql once per request, and it easily handles ~10k requests/sec on my laptop. [18:09:28] <_joe_> earldouglas: something tells me some part of the complexity of what is being done is a bit higher than that. But I'm sure ori can comment better on that. [18:09:53] No doubt. [18:10:32] But with a little CQRS, it shouldn't matter. For writes, just get the data in the door. For complex analysis and/or reads, let other machines handle the work. [18:10:57] ori: lemme know when you have some time to chat. [18:11:03] earldouglas: what's up? [18:11:11] I'm probably making some silly assumptions about the architecture. [18:11:30] I haven't been paying full attention. Could you recap? 
[18:11:43] RECOVERY - puppet last run on ms-be1018 is OK Puppet is currently enabled, last run 54 seconds ago with 0 failures [18:11:56] We're discussing EventLogging's current performance capabilities, and whether (and how) it could be improved. [18:12:25] what are you trying to do, and what bottleneck are you hitting? [18:13:40] Word on the street is that EL can only consume 300 events per second across all projects. [18:13:58] This is pretty low, and will be insufficient for S&D's needs. [18:14:19] 6operations, 10Analytics-Cluster, 3Fundraising Sprint Kraftwerk, 3Fundraising Sprint Lou Reed, 10Fundraising Tech Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1326616 (10atgo) [18:16:21] what are you trying to log? this is still a little vague [18:16:48] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "LGTM, but a small comment." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/215004 (https://phabricator.wikimedia.org/T78514) (owner: 10GWicke) [18:17:06] if you haven't already, please read https://www.mediawiki.org/wiki/Extension:EventLogging/Guide#What_is_EventLogging.3F for a concise explanation of what EventLogging does that goes beyond simply piping web requests into a database, and why it does it [18:17:29] (03CR) 10Legoktm: "Ping :)" [tools/scap] - 10https://gerrit.wikimedia.org/r/214288 (https://phabricator.wikimedia.org/T100600) (owner: 10Legoktm) [18:18:02] 6operations, 7database: Discuss enabling automatic buffer pool dumping on start/stop (puppet) for all servers - https://phabricator.wikimedia.org/T101009#1326621 (10Springle) Sounds good? What are the cons?
[18:18:25] 6operations, 10Beta-Cluster, 10Traffic: Upgrade beta-cluster caches to jessie - https://phabricator.wikimedia.org/T98758#1326622 (10thcipriani) [18:18:40] 6operations, 7database: Enabling automatic buffer pool dumping on start/stop (puppet) for all servers - https://phabricator.wikimedia.org/T101009#1326624 (10jcrespo) [18:18:59] 6operations, 7database: Enabling automatic buffer pool dumping on start/stop (puppet) for all servers - https://phabricator.wikimedia.org/T101009#1326533 (10jcrespo) Description updated :) [18:20:10] The approach was born out of experience with doing analytics at the wmf, and it has worked well for a large number of use-cases, but it is not the ultimate tool for every purpose. if the added overhead of ensuring data is accurate described by schema is not adding any value for you, then a more naive approach to handling events might be the way to go. [18:21:56] https://www.mediawiki.org/wiki/Extension:EventLogging/Guide#Using_EventLogging:_The_workflow is useful as well [18:23:24] ori: excellent, thanks! [18:23:24] (03CR) 10Ori.livneh: "It'd be better to have a dictionary which maps file extensions to validators, then have scap walk the file hierarchy that is to be synced," [tools/scap] - 10https://gerrit.wikimedia.org/r/214288 (https://phabricator.wikimedia.org/T100600) (owner: 10Legoktm) [18:28:01] (03CR) 10Ori.livneh: [C: 031] "(This is fine for now, though.)" [tools/scap] - 10https://gerrit.wikimedia.org/r/214288 (https://phabricator.wikimedia.org/T100600) (owner: 10Legoktm) [18:44:08] 6operations, 7network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1326725 (10BBlack) ^ The commit above was reverted, it was a test run of basically idea (2) from earlier. It's not practical in the real world because some hosts map the IP of their own hostname... 
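Note on the scap review comment above (a dictionary that maps file extensions to validators, with the tool walking the file hierarchy that is to be synced): the idea can be sketched roughly as follows. This is only an illustration under assumptions, not scap's actual code; the names `VALIDATORS`, `check_json`, and `validate_tree` are hypothetical, and JSON is used simply as an example of a file type that is easy to validate.

```python
import json
import os


def check_json(path):
    """Return an error string if the file is not valid JSON, else None."""
    try:
        with open(path, encoding="utf-8") as handle:
            json.load(handle)
    except (OSError, ValueError) as exc:
        return str(exc)
    return None


# Hypothetical registry: file extension -> validator callable.
VALIDATORS = {
    ".json": check_json,
}


def validate_tree(root):
    """Walk root and run the matching validator on every file.

    Returns a list of (path, error) tuples for files that failed.
    """
    failures = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            extension = os.path.splitext(name)[1].lower()
            validator = VALIDATORS.get(extension)
            if validator is None:
                continue  # no validator registered for this extension
            path = os.path.join(dirpath, name)
            error = validator(path)
            if error is not None:
                failures.append((path, error))
    return failures
```

Supporting another file type would then mean adding one entry to the registry rather than another special case in the sync code.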
[18:48:22] 6operations: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1326738 (10Andrew) Very happy to see this chugging along! This will be a Trusty box. Note also that the network setup for labnet1001 is weird, and labnet1002 needs the same setup. I believe that eth1,2,3 are bonded and eth0... [18:55:35] Deskana: Are you still on perma-product duty? [18:55:36] !log ori Synchronized php-1.26wmf8/extensions/SemanticForms/includes/SF_FormUtils.php: I7ed3996a1: Stop using StripState (duration: 00m 15s) [18:55:40] Logged the message, Master [18:55:50] !log ori Synchronized php-1.26wmf7/extensions/SemanticForms/includes/SF_FormUtils.php: I7ed3996a1: Stop using StripState (duration: 00m 13s) [18:55:50] Deskana: (Shouldn't Reading be the main operations contact? ;-)) [18:55:53] Logged the message, Master [18:55:56] James_F: Yes, although I'm not sure I've *ever* been asked a question in that role. :-p [18:56:33] 6operations, 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-100: Make a block-level copy of the codfw mirror of labstore1001 to eqiad - https://phabricator.wikimedia.org/T101010#1326763 (10coren) Destination volume on labstore1002 is set up `backup/backup` aka `/dev/mapper/backup-backup` It has 40T virtual s... 
[18:56:46] Deskana: :-) [18:58:38] 6operations, 6Labs, 10Labs-Infrastructure, 3Labs-Sprint-100: Make a block-level copy of the codfw mirror of labstore1001 to eqiad - https://phabricator.wikimedia.org/T101010#1326772 (10coren) a:5coren>3mark [19:02:06] see you later [19:03:58] 6operations, 7database: mysql user and group should be a system user/group - https://phabricator.wikimedia.org/T100501#1326791 (10chasemp) p:5Triage>3Normal [19:06:02] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1326808 (10hashar) [19:08:34] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1326811 (10ori) Note that /static/$VERSION/ resources are not used to serve JavaScript, unless Resour... [19:14:55] 6operations, 10Deployment-Systems, 10Traffic: Varnish cache busting desired for /static/$VERSION/ resources which change within the lifetime of a WMF release branch - https://phabricator.wikimedia.org/T99096#1326839 (10ori) Alternately ResourceLoader should use load.php to fetch resources even in debug mode. [19:22:14] aude: https://gerrit.wikimedia.org/r/#/c/215096/ [19:22:32] ori: :S [19:22:48] hoo: ? [19:22:55] Updating wikibase will just override that [19:23:12] updating wikibase will get rid of that line anyway [19:23:19] it is not present in wikibase's master [19:23:34] But in wikibase wmf8 probably [19:24:15] I5ee84362debadf97ea4bd256ae7db5fba663a876 [19:24:17] right [19:24:49] I5f73f0944e55ecdaa3342e5cacedb58cafaeb04a and I5ee84362debadf97ea4bd256ae7db5fba663a876 would need to be cherry-picked to wikibase's wmf8 [19:25:06] i just figured this was the least invasive way to do it. what would be preferrable? 
[19:25:30] do the same patch against wmf6 in Wikibase [19:25:35] probably [19:25:43] I think we wanted to backport other things anyway [19:26:22] wmf...6? [19:26:28] yeah [19:26:33] that's our latest [19:26:48] We will only branch again next weekt [19:26:50] * week [19:26:50] https://gerrit.wikimedia.org/r/#/c/212272/ doesn't cherry-pick cleanly [19:27:18] Just do the minimal patch against the branch? [19:27:21] It's test only, right? [19:27:26] yeah [19:27:30] makes sense [19:29:05] We don't set the defaultbranch in our .gitreview [19:29:10] so you have to do that per hand [19:29:15] gets me every time :P [19:29:23] heh [19:29:26] https://gerrit.wikimedia.org/r/#/c/215099/ [19:52:53] !log Repopulated gis.spatial_ref_sys on labsdb1004 with postgis 2.1 data, old contents backed up as spatial_ref_sys_bak [19:52:57] Logged the message, Master [20:01:11] jouncebot seems missing in action. [20:01:20] parsoid deploy time. [20:03:10] jouncebot_, next [20:03:10] In 2 hour(s) and 56 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150601T2300) [20:04:02] greg-g, is there a reason parsoid / services deploy don't show up in monday 3pm slot anymore or is that just an editing oversight? [20:05:12] !log restarted apache2 and phd on iridium (phabricator) [20:05:15] Logged the message, Master [20:16:30] (03PS1) 10Bmansurov: Disable WikiGrok in all production wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/215109 (https://phabricator.wikimedia.org/T101016) [20:17:39] greg-g, ^ [20:18:43] did someone delete operations/puppet/varnish.git ? phabricator is trying to sync it but it doesn't exist anymore in gerrit? [20:19:01] IIRC it was removed [20:19:05] bblack would know [20:19:43] well I mean, I just need to know what to do about phabricator? Delete it there too? 
[20:19:52] there are several repos in phab that seemingly don't exist in gerrit [20:26:49] !log deployed parsoid sha 73445bfd [20:26:52] Logged the message, Master [20:29:37] !log disabled several no-longer-existent repositories in phabricator which apparently have been deleted in gerrit [20:29:40] Logged the message, Master [20:32:15] (03PS3) 10Dzahn: Add all Release-Engineering team as Gerrit admins [puppet] - 10https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) (owner: 10Hashar) [20:33:49] (03CR) 10Dzahn: [C: 032] "approved in ops meeting per " "Gerrit admins" refer to the admin shell access group "gerrit-admin". So that is solely to give us shell ac" [puppet] - 10https://gerrit.wikimedia.org/r/214255 (https://phabricator.wikimedia.org/T100565) (owner: 10Hashar) [20:37:52] 10Ops-Access-Requests, 6operations, 6Release-Engineering, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: Add all Release-Engineering team as Gerrit admins - https://phabricator.wikimedia.org/T100565#1327058 (10Dzahn) a:3Dzahn [20:39:10] subbu: that was a mistake due to bad copy/paste from holiday week (where there was no monday) [20:39:16] 6operations, 5Patch-For-Review: LVM recipes broken for jessie, set up all remaining LVM space as swap - https://phabricator.wikimedia.org/T100636#1327060 (10fgiunchedi) a:3fgiunchedi picking this up, likely the solution will involve a fake LV [20:39:26] 10Ops-Access-Requests, 6operations, 6Release-Engineering, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: Add all Release-Engineering team as Gerrit admins - https://phabricator.wikimedia.org/T100565#1315740 (10Dzahn) Has been brought up in ops meeting today. No concerns have been raised. (for non-root acc... [20:39:31] greg-g, ok. i added it back for today and also finished deployment. 
[20:40:00] subbu: cool [20:40:43] !log ori Synchronized php-1.26wmf8/languages/LanguageConverter.php: 1d054ce6d3: Use a fixed marker prefix string in the Parser and MWTidy (duration: 00m 13s) [20:40:47] Logged the message, Master [20:45:15] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1327069 (10Dzahn) @greg I figured the next step is sending an announcement mail and committing to a date. @JohnLewis [20:47:54] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1327072 (10JohnLewis) Pretty much. I vote for a like a three week window so I'm proposing June 22nd? For a good time frame, after the ops meeting (1900 UTC seems... [20:48:28] (03PS4) 10Ori.livneh: mediawiki: tidy /tmp [puppet] - 10https://gerrit.wikimedia.org/r/168999 [20:48:53] 10Ops-Access-Requests, 6operations, 6Release-Engineering, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: Add all Release-Engineering team as Gerrit admins - https://phabricator.wikimedia.org/T100565#1327076 (10Dzahn) This created the users on node "antimony" (which includes role gitblit and role subversio... [20:49:52] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:53:40] 10Ops-Access-Requests, 6operations, 6Release-Engineering, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: Add all Release-Engineering team as Gerrit admins - https://phabricator.wikimedia.org/T100565#1327084 (10Dzahn) hiera data: role/common/gitblit.yaml: - gerrit-admin role/common/gerrit/production.yaml... 
[20:57:41] (03CR) 10Filippo Giunchedi: "@gabriel, add contact_group => 'services' and the relevant section in icinga config, an example is EL alerts in modules/eventlogging/manif" [puppet] - 10https://gerrit.wikimedia.org/r/215004 (https://phabricator.wikimedia.org/T78514) (owner: 10GWicke) [20:57:43] 10Ops-Access-Requests, 6operations, 6Release-Engineering, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: Add all Release-Engineering team as Gerrit admins - https://phabricator.wikimedia.org/T100565#1327087 (10Dzahn) 5Open>3Resolved it was just delayed somehow. now also on ytterbium: Notice: /Stage[... [20:59:09] (03PS1) 10Ori.livneh: HHVM canaries: set light_process_count to 5 [puppet] - 10https://gerrit.wikimedia.org/r/215187 [20:59:11] (03PS1) 10Ori.livneh: HHVM: set light_process_count to 5, light_process_file_prefix to /tmp/hhvm. [puppet] - 10https://gerrit.wikimedia.org/r/215188 [21:04:22] 10Ops-Access-Requests, 6operations, 6Release-Engineering, 5Patch-For-Review: Grant access for aklapper to phab-admins - https://phabricator.wikimedia.org/T97642#1327112 (10chasemp) @aklapper, are you back? can you generate some ssh keys for prod? [21:05:09] jouncebot: next [21:05:10] In 1 hour(s) and 54 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150601T2300) [21:07:06] 10Ops-Access-Requests, 6operations, 6Release-Engineering, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: Add all Release-Engineering team as Gerrit admins - https://phabricator.wikimedia.org/T100565#1327120 (10hashar) I can log on both hosts. 
Thank you @Dzahn [21:08:01] 6operations, 7network: Enable add_ip6_mapped functionality on all hosts - https://phabricator.wikimedia.org/T100690#1327138 (10chasemp) p:5Triage>3Normal [21:08:08] 6operations, 7Availability, 7Performance: Make redis/redisdb roles support multiple instances on the same servers - https://phabricator.wikimedia.org/T100714#1327139 (10chasemp) p:5Triage>3Normal [21:08:21] 6operations, 7network: asw2-a5-eqiad.mgmt.eqiad.wmnet xe-0/0/36 reporting errors - https://phabricator.wikimedia.org/T100820#1327140 (10chasemp) p:5Triage>3Normal [21:08:28] 6operations: Investigate the compatibility of our puppet tree with ruby2.1 and create a plan to upgrade - https://phabricator.wikimedia.org/T98129#1327143 (10Dzahn) regarding the issue with ruby 1.9 and UTF-8 characters, see T91453. I removed all the non-ASCII chars we had in .pp files and in .erb files they s... [21:08:43] 6operations, 10RESTBase-Cassandra: configure less aggressive cassandra log rotation - https://phabricator.wikimedia.org/T100970#1327145 (10chasemp) p:5Triage>3Normal [21:09:12] 6operations, 10RESTBase-Cassandra: configure less aggressive cassandra log rotation - https://phabricator.wikimedia.org/T100970#1325145 (10chasemp) Is it a better idea to include cassandra logs in logstash or some off-host central place if historical logs are important? [21:12:02] 6operations, 6Services: reinstall OCG servers - https://phabricator.wikimedia.org/T84723#1327152 (10chasemp) p:5High>3Normal [21:13:28] !log ori Synchronized php-1.26wmf7/languages/LanguageConverter.php: 1d054ce6d3: Use a fixed marker prefix string in the Parser and MWTidy (duration: 00m 14s) [21:13:33] Logged the message, Master [21:16:05] 6operations, 10wikitech.wikimedia.org: transient failures of wiki page saves - https://phabricator.wikimedia.org/T98084#1327173 (10chasemp) a:3greg >>! 
In T98084#1278103, @greg wrote: > (please don't close until we can confirm this stays working for more than a day) @greg what do you want to do here man? N... [21:16:21] 6operations, 10ops-esams, 10hardware-requests: Buy fiber patches - https://phabricator.wikimedia.org/T94846#1327175 (10chasemp) p:5High>3Normal [21:17:05] 6operations, 10wikitech.wikimedia.org: transient failures of wiki page saves - https://phabricator.wikimedia.org/T98084#1327180 (10greg) 5Open>3Resolved I've been good lately, so I guess we can close it. [21:20:32] (03PS1) 10Dzahn: add вікімедіа.укр (xn--80adgdym4pbd.xn--j1amh) [dns] - 10https://gerrit.wikimedia.org/r/215212 (https://phabricator.wikimedia.org/T95433) [21:20:42] (03CR) 10jenkins-bot: [V: 04-1] add вікімедіа.укр (xn--80adgdym4pbd.xn--j1amh) [dns] - 10https://gerrit.wikimedia.org/r/215212 (https://phabricator.wikimedia.org/T95433) (owner: 10Dzahn) [21:21:00] (03CR) 10Filippo Giunchedi: "LGTM overall, a nit on permissions" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/215056 (owner: 10Ottomata) [21:23:01] mutante: what are we doing with вікімедіа.укр anyways? [21:23:33] PROBLEM - puppet last run on ytterbium is CRITICAL Puppet has 1 failures [21:23:51] it bugs me when we add new domains as links to old ones. IMHO, they should either be empty, or they should include the provisioning of new SSL certs as applicable (going forward). 
[21:23:54] bblack: it was requested by the chapter https://phabricator.wikimedia.org/T95433 [21:24:10] (03PS1) 10Ori.livneh: HHVM APC: enable item expiration [puppet] - 10https://gerrit.wikimedia.org/r/215213 [21:24:16] or at least, we need to have a conversation about this stuff, and I don't even know who "we" is [21:24:35] about how we categorize and support the various classes of domains we host [21:24:56] bblack: see https://phabricator.wikimedia.org/T95433#1194112 it's a trademark thing [21:25:24] (03PS1) 10Mattflaschen: Add php5-xdebug to deployment-bastion for command-line debugging [puppet] - 10https://gerrit.wikimedia.org/r/215214 [21:25:44] sure, but if we're just parking for trademark, it could still not be a softlink to our main domain data + redirect to a wiki [21:26:02] it could be a dead domain that only works at the DNS but not the browser level. [21:26:22] or just go to a parking page that indicates we're holding it for trademark, and again not link to our primary stuff at all [21:26:57] my concern here is we don't *want* people using these alternate domainname entry points to wikipedia unless they have working SSL [21:27:00] bblack: in this case it is also " its management should be given to WMUA techcomm then" [21:27:19] because that means a capability to hijack/downgrade/whatever [21:27:22] the only reason i made it a symlink is because it used to be the way we did in the past [21:28:19] yeah [21:28:28] we have lots of legacy cases to sort out, too :/ [21:28:48] but the bottom line is, we don't want to be promoting insecure redirects, and we don't want users using them. [21:29:18] if it's intended that users will type $domain into a browser, or links to $domain will appear somewhere on the internet as a way of reaching us, then deciding to support $domain needs to include the purchase of appropriate SSL certs.
[21:29:20] Kind of moot to worry about such domains when millions users go to the squatted domain wikipedia.it [21:29:45] it's not moot, we do have a responsibility to at least fix the ones we do own. [21:30:02] so this one would be a chapter site i assumed [21:30:09] so not like a redirect to a project then [21:30:18] I have no idea [21:30:24] Nemo_bis, seems to redirect to it.wikipedia [21:30:33] but the point remains. if a domainname is an entrypoint into us, it needs TLS [21:30:49] if it happens to fall under one of the existing star certs like *.wikimedia.org, we get that for free [21:30:54] Platonides: it's still a squatted domain and could go anywhere any time [21:30:57] but when it's off in these alternate TLDs, not so much [21:31:01] bblack: should we have one template "deaddomain" and symlink to that for others that should exist only on DNS level [21:31:09] i have a pending patch for that kind of [21:31:17] Registrant [21:31:17] Organization: Associazione Wikipedia Italia [21:31:23] PROBLEM - puppet last run on antimony is CRITICAL Puppet has 1 failures [21:31:29] that's... funny [21:31:31] I think first we need to probably loop in a few people and make decisions about what categories of domains we can have and set the policy [21:31:33] using "wikiartpedia.biz" as the example for "dead" :) [21:31:36] Platonides: evil, rather [21:31:37] (rather than by fiat here in IRC) [21:31:47] I'm just noting that it's an issue [21:32:11] bblack: yes, we should get it on phab instead of IRC, ack [21:32:14] (and the issue grows every time we provision new live domains without TLS) [21:32:46] did WM-IT attempt to recover it? 
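Note on the "star certs" point above: a wildcard certificate such as *.wikimedia.org covers exactly one extra label to the left of the base name, which is why a domain under an alternate TLD gets no coverage "for free". A minimal sketch of that matching rule, assuming the common single-label interpretation of wildcards; the function name `covered_by_wildcard` is hypothetical and this is not a substitute for real certificate validation.

```python
def covered_by_wildcard(hostname, pattern):
    """Return True if hostname is covered by pattern.

    pattern is either an exact name or a leading-wildcard name such as
    "*.wikimedia.org". The wildcard is assumed to match exactly one
    label, so it covers neither the bare base domain nor deeper
    subdomains.
    """
    host = hostname.lower().rstrip(".")
    pattern = pattern.lower().rstrip(".")
    if not pattern.startswith("*."):
        return host == pattern
    suffix = pattern[1:]  # e.g. ".wikimedia.org"
    if not host.endswith(suffix):
        return False
    label = host[: -len(suffix)]
    return bool(label) and "." not in label
```

For example, contacts.wikimedia.org would be covered by *.wikimedia.org, while xn--80adgdym4pbd.xn--j1amh would not and would need its own certificate.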
[21:32:56] fwiw this was one i'm pretty sure should be in that category of "only in DNS" [21:32:59] https://gerrit.wikimedia.org/r/#/c/197361/ [21:33:12] Platonides: only WMF can do anything about it and they refuse to; the registration is formally valid [21:33:26] and assumes "empty" means looking like this https://gerrit.wikimedia.org/r/#/c/197361/2/templates/wikiartpedia.biz [21:33:36] did they create a Wikipedia Italia association? [21:33:55] (03CR) 10MaxSem: [C: 031] "+1 on Cyrillics." [dns] - 10https://gerrit.wikimedia.org/r/215212 (https://phabricator.wikimedia.org/T95433) (owner: 10Dzahn) [21:33:55] yep [21:34:04] https://it.wikipedia.org/wiki/Wikipedia:Sondaggi/Recupero_domini_a_nome_Wikipedia [21:34:06] otherwise, the reigistrar info is invalid... [21:34:24] then the organization is using the wikipedia name with no trademark agreement with wmf [21:34:33] sure [21:35:13] :( [21:35:32] we have a similar case in Spain, though [21:35:50] That is? [21:36:14] I'm doing some local testing on carbon for T100636 fwiw, thus puppet disabled [21:36:29] (03CR) 10Dzahn: [C: 04-1] "needs ticket discussion about planned usage of this domain." [dns] - 10https://gerrit.wikimedia.org/r/215212 (https://phabricator.wikimedia.org/T95433) (owner: 10Dzahn) [21:36:33] !log doing some local testing on carbon for T100636 fwiw, thus puppet disabled [21:36:37] Logged the message, Master [21:37:23] wikipedia.es was registered by another guy that registered the Wikipedia trademark [21:37:36] * Platonides realises that wikipedia.es is registered by the wmf [21:37:48] although the dns entries still seem the old ones [21:38:06] Platonides: Nemo_bis: https://phabricator.wikimedia.org/T88861 :p [21:38:34] mutante: restricted [21:38:40] Platonides: funny one [21:39:06] Of course we used to have proper tracking of that sort of stuff on internal.wikimedia.org but that stopped around 2009 IIRC [21:39:08] "You do not have permission to view this object." 
[21:39:36] mutante, maybe you could add us to the task if you expect us to view them? :) [21:39:41] (03CR) 10Yuvipanda: [C: 04-2] "Shouldn't add more beta special casing. Should be on prod hosts as well if it's useful." [puppet] - 10https://gerrit.wikimedia.org/r/215214 (owner: 10Mattflaschen) [21:39:51] I guess it's a "let's recover our domains task" [21:40:12] but a notification to the chapters would have been helpful [21:40:29] I doubt such a task exists [21:40:34] (03CR) 10Rush: [C: 031] "Based on https://github.com/facebook/hhvm/wiki/Runtime-options this seems like the right thing and we definitely need to pursue due our gr" [puppet] - 10https://gerrit.wikimedia.org/r/215213 (owner: 10Ori.livneh) [21:41:46] (03CR) 10Rush: "Based on https://github.com/facebook/hhvm/wiki/Runtime-options this seems like the right thing and we definitely need to pursue due our gr" [puppet] - 10https://gerrit.wikimedia.org/r/215213 (owner: 10Ori.livneh) [21:42:13] RECOVERY - puppet last run on ytterbium is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [21:42:46] I'll make up a meta task in phab to discuss this stuff. We may need input from Legal on whether we can hold a trademark domain in DNS only, or if it needs some kind of dead-end landing page, or has to be in functional use. [21:43:33] (only the last option implies we need to buy a cert for it if applicable) [21:43:59] Platonides: it's domains related at least [21:45:45] bblack: what about portals à la wikipedia.de [21:46:38] (03PS1) 10Andrew Bogott: Fix a typo that was causing the puppetsigner to crash. [puppet] - 10https://gerrit.wikimedia.org/r/215215 [21:47:27] (03CR) 10Andrew Bogott: [C: 032] Fix a typo that was causing the puppetsigner to crash. 
[puppet] - 10https://gerrit.wikimedia.org/r/215215 (owner: 10Andrew Bogott) [21:48:23] RECOVERY - puppet last run on antimony is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [21:48:53] (03CR) 10Ori.livneh: "* Live on Beta cluster." [puppet] - 10https://gerrit.wikimedia.org/r/215213 (owner: 10Ori.livneh) [21:49:06] Nemo_bis: well we don't own that, so there's nothing we can do about it. If we want to own it so we can bring it under our policy/technical control, that's a whole other issue. [21:49:18] and then it would get binned into the categories I'm trying to describe in the ticket. [21:49:22] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [21:49:49] bblack: I mean whether a search link like that qualifies as "dead-end landing page" [21:50:40] oh, no [21:51:19] by dead-end page, I really mean something that just has no useful function or content, just says something like "Hey WMF owns this domain, but it does nothing. You might want to go to https://elsewhere (with only a manual link)" [21:51:52] if it's doing something functional, users might decide to use/bookmark/link it, in which case we need to TLS protect it. [21:52:56] bblack: thanks for the meta task! [21:54:46] is it not possible to search pastes in phab? [21:54:54] 6operations, 6Phabricator, 7database: Missing data in Phab reporting dump - https://phabricator.wikimedia.org/T101038#1327263 (10JAufrecht) 3NEW a:3chasemp [22:01:46] mutante: didn't you have examples of existing dead-domains in our DNS? 
wikiartpedia.biz is actually a symlink too [22:02:20] bblack: https://gerrit.wikimedia.org/r/#/c/197361/2 was to change that [22:02:25] ah, just not merged [22:02:32] it would turn wikiartpedia.biz into an empty template [22:02:36] and then use it as link target [22:02:50] at least for the other wikiartpedia.* [22:03:01] ok I can link the diff as an example, anyways [22:03:14] then i have one more here: [22:03:21] https://gerrit.wikimedia.org/r/#/c/197362/ [22:03:30] "visualwikipedia", also quality link [22:03:51] hmm, that's wrong somehow i just saw [22:04:06] ah no, it just depends on the other one [22:14:39] (03PS1) 10Andrew Bogott: Allow .s in salt-key names for labs. [puppet] - 10https://gerrit.wikimedia.org/r/215217 [22:16:18] (03PS2) 10Andrew Bogott: Allow .s in salt-key names for labs. [puppet] - 10https://gerrit.wikimedia.org/r/215217 [22:17:42] (03CR) 10Andrew Bogott: [C: 032] Allow .s in salt-key names for labs. [puppet] - 10https://gerrit.wikimedia.org/r/215217 (owner: 10Andrew Bogott) [22:31:03] PROBLEM - puppet last run on db2065 is CRITICAL Puppet has 1 failures [22:46:42] RECOVERY - puppet last run on db2065 is OK Puppet is currently enabled, last run 42 seconds ago with 0 failures [22:49:55] YuviPanda, do you know what permissions are required to ssh to deployment-mediawiki01 (or any of them)? [22:49:56] I'm thinking if I could do that maybe I could debug the script using hhvm. 
[22:53:05] (03CR) 10Dzahn: [C: 04-2] "once https://gerrit.wikimedia.org/r/#/c/197341/ gets merged this will not be needed anymore" [puppet] - 10https://gerrit.wikimedia.org/r/194455 (https://phabricator.wikimedia.org/T84543) (owner: 10Dzahn) [22:53:10] (03Abandoned) 10Dzahn: ensure there is always a newline in chained certs [puppet] - 10https://gerrit.wikimedia.org/r/194455 (https://phabricator.wikimedia.org/T84543) (owner: 10Dzahn) [22:54:04] matt_flaschen: should be shell flag and membership in the deployment project [22:54:46] eh, adminship in the project too [22:55:53] PROBLEM - git.wikimedia.org on antimony is CRITICAL - Socket timeout after 10 seconds [22:56:55] mutante, thanks. I re-checked and I do have access. I was just connecting wrong before. [22:57:24] matt_flaschen: ok, cool [22:58:17] (03PS2) 10Andrew Bogott: Revert "Feed the puppet host IP directly to dnsmasq." [puppet] - 10https://gerrit.wikimedia.org/r/214077 [22:59:09] (03CR) 10Andrew Bogott: [C: 032] Revert "Feed the puppet host IP directly to dnsmasq." [puppet] - 10https://gerrit.wikimedia.org/r/214077 (owner: 10Andrew Bogott) [23:00:04] RoanKattouw, ^d, AaronSchulz: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150601T2300). 
[23:01:32] AaronSchulz: So, https://gerrit.wikimedia.org/r/#/c/208852/ is listed for SWAT but I don't know how to deploy it [23:02:19] RoanKattouw, (a) use git-deploy on /srv/deployment/jobrunner/jobrunner on tin, (b) use git-deploy service restart (which can be flakey) [23:02:28] (03Abandoned) 10Andrew Bogott: WIP: Install a pipe backend to handle private IPs for public DNS names in labs [puppet] - 10https://gerrit.wikimedia.org/r/211905 (owner: 10Andrew Bogott) [23:03:09] RoanKattouw, though (b) won't normally work here since only jobrunner restarts via git-deploy though [23:03:24] on the other hand the existing oom-restart loop will make it work once [23:03:44] (03PS2) 10Andrew Bogott: Don't hardcode dns listening IPs. [puppet] - 10https://gerrit.wikimedia.org/r/211904 [23:04:42] RoanKattouw, in any case, ori can always use salt to restart jobchron/jobrunner on mw1001-mw1016 [23:04:51] ori, are you ready? [23:05:04] sure [23:05:12] (03CR) 10Andrew Bogott: [C: 032] Don't hardcode dns listening IPs. [puppet] - 10https://gerrit.wikimedia.org/r/211904 (owner: 10Andrew Bogott) [23:05:33] * AaronSchulz can only restart the ghetto way of sudoing as www-data and killing and relying on upstart ;) [23:06:52] Oh OK [23:06:53] Ahm [23:06:59] Can you guys handle this yourselves? :D [23:07:24] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1327475 (10BBlack) 3NEW [23:07:39] RoanKattouw: yes [23:07:49] Thanks guys [23:08:03] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60548 bytes in 3.450 second response time [23:08:08] mutante: it's a big task, I don't think we should block current work on it, I guess, because it could take a while to resolve and the new domains are a drop in the bucket. [23:09:33] (03PS1) 10Andrew Bogott: Turn off recursing for labs pdns/mysql/designate. 
[puppet] - 10https://gerrit.wikimedia.org/r/215226 [23:10:30] AaronSchulz: let me know when [23:10:40] ori, you can deploy now :) [23:10:56] bblack: like.. the symlink would be ok _for now_ and then we find a general solution how to handle them all ? [23:11:34] no idea. I'm saying I guess "proceed as normal", because it's not really fair to just suddenly block normal workflow on huge and complicated questions. [23:11:46] there are already at least 140 problems, what's another handful? :P [23:14:05] 6operations, 10Analytics-Cluster, 3Fundraising Sprint Kraftwerk, 3Fundraising Sprint Lou Reed, 10Fundraising Tech Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1327485 (10AndyRussG) Thanks, interesting! >Sampling is done on the full incoming strea... [23:14:14] alright, fair [23:14:58] !log Deployed jobchron / jobrunner change Icab05090b and restarted jobchron / jobrunner on job queue runners. [23:15:00] AaronSchulz: ^ [23:15:03] Logged the message, Master [23:16:18] (03PS1) 10Dzahn: mailman: adjust monitoring thresholds [puppet] - 10https://gerrit.wikimedia.org/r/215228 (https://phabricator.wikimedia.org/T84150) [23:17:32] (03CR) 10Dzahn: [C: 032] "based on https://icinga.wikimedia.org/cgi-bin/icinga/history.cgi?host=sodium&service=mailman+I%2FO+stats if this doesnt cover it i'm gon" [puppet] - 10https://gerrit.wikimedia.org/r/215228 (https://phabricator.wikimedia.org/T84150) (owner: 10Dzahn) [23:18:51] AaronSchulz: ack? [23:19:14] ori, I see it [23:19:23] (03CR) 10Dzahn: "also removing mysql connect timeout?" [puppet] - 10https://gerrit.wikimedia.org/r/215187 (owner: 10Ori.livneh) [23:19:26] (03PS2) 10Andrew Bogott: Turn off recursing for labs pdns/mysql/designate. 
[puppet] - 10https://gerrit.wikimedia.org/r/215226 [23:19:28] (03PS1) 10Andrew Bogott: Use the labs dns server for reverse-dns in the labs range [puppet] - 10https://gerrit.wikimedia.org/r/215231 [23:19:52] (03CR) 10Ori.livneh: "the mysql timeout is already applied in hhvm/manifests/init.pp" [puppet] - 10https://gerrit.wikimedia.org/r/215187 (owner: 10Ori.livneh) [23:20:37] (03CR) 10Andrew Bogott: [C: 032] Use the labs dns server for reverse-dns in the labs range [puppet] - 10https://gerrit.wikimedia.org/r/215231 (owner: 10Andrew Bogott) [23:22:06] (03PS3) 10Paladox: Add link in gitblit for phabricator [puppet] - 10https://gerrit.wikimedia.org/r/214923 [23:23:42] (03CR) 10Paladox: "@Andrew Bogott would this way work to fix link in gitblit for phabricator." [puppet] - 10https://gerrit.wikimedia.org/r/214923 (owner: 10Paladox) [23:24:59] manybubbles, 'Fatal error: Cannot use string offset as an array in /srv/mediawiki-staging/php-1.26wmf7/extensions/CirrusSearch/includes/ElasticsearchIntermediary.php on line 188' btw is that known ? [23:25:18] (03PS3) 10Andrew Bogott: Turn off recursing for labs pdns/mysql/designate. [puppet] - 10https://gerrit.wikimedia.org/r/215226 [23:25:51] AaronSchulz: fixed I think too. is a php vs zend issue [23:26:07] is it all over the logs? [23:26:10] (03CR) 10Andrew Bogott: [C: 032] Turn off recursing for labs pdns/mysql/designate. [puppet] - 10https://gerrit.wikimedia.org/r/215226 (owner: 10Andrew Bogott) [23:26:50] https://gerrit.wikimedia.org/r/#/c/212500/1 [23:27:01] AaronSchulz: ^^^^ [23:27:55] haha, I remember that commit [23:27:57] (03CR) 10Filippo Giunchedi: "@Alex yeah that's true, both approaches can be confusing, particularly because the include_command shell script approach might have stoppe" [puppet] - 10https://gerrit.wikimedia.org/r/214377 (owner: 10Alexandros Kosiaris) [23:28:03] not spamming logs though [23:30:31] k. I think it comes up when running scripts and maybe jobs - those are still zend, right? 
[23:32:14] (03CR) 10Dzahn: [C: 032] "looks like it would work. one more gitblit restart won't hurt either :)" [puppet] - 10https://gerrit.wikimedia.org/r/214923 (owner: 10Paladox) [23:32:37] (03CR) 10Dzahn: [V: 032] "looks like it would work. one more gitblit restart won't hurt either :)" [puppet] - 10https://gerrit.wikimedia.org/r/214923 (owner: 10Paladox) [23:33:49] (03CR) 10Paladox: "Thanks for reviewing and thanks for merging." [puppet] - 10https://gerrit.wikimedia.org/r/214923 (owner: 10Paladox) [23:34:35] ori, not seeing anything adverse anywhere [23:36:14] !log restarted gitblit .. [23:36:18] Logged the message, Master [23:38:32] PROBLEM - git.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Temporarily Unavailable - 516 bytes in 0.012 second response time [23:38:56] grmbl [23:39:51] mutante: do we have network graphs public somewhere? [23:41:25] cajoel: there should be something on torrus. like http://torrus.wikimedia.org/torrus/Network?path=/Core_routers/cr1-eqiad.wikimedia.org/Interface_Counters/&view=overview-subleaves-html&OVS=traffic [23:41:49] ehm.. [23:42:13] looks like there might be no data in that [23:44:46] 6operations, 10Traffic: Refactor varnish puppet config - https://phabricator.wikimedia.org/T96847#1327611 (10BBlack) [23:44:46] cajoel: librenms but it's not public [23:46:17] cajoel: is that something rhenium would do? [23:46:57] (03PS1) 10Dzahn: Revert "Add link in gitblit for phabricator" [puppet] - 10https://gerrit.wikimedia.org/r/215238 [23:46:59] torrus looks broken [23:47:03] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 60548 bytes in 0.220 second response time [23:47:09] yes, it is [23:47:11] mutante: would you consider opening a phab about that? [23:47:16] and/or suggest we shut it down? [23:47:37] cajoel: https://phabricator.wikimedia.org/T87840 [23:47:53] bingo [23:47:56] thanks jgage! 
[23:48:12] chasemp: should be replaced by https://librenms.wikimedia.org/ [23:48:43] how about a public read-only login to librenms? [23:49:53] is librenms ops-only? [23:50:20] Krenair: not sure [23:50:25] I'd love to have access. [23:50:29] (03PS15) 10BBlack: sslcert: generate chained certs automatically [puppet] - 10https://gerrit.wikimedia.org/r/197341 (owner: 10Faidon Liambotis) [23:50:59] mutante: is your librenms login working? [23:53:35] cajoel: no, it does not [23:54:00] (the labs account that is) [23:54:49] looks on netmon [23:56:06] (03CR) 10Paladox: [C: 031] "Hum wonder why it is causing internal error" [puppet] - 10https://gerrit.wikimedia.org/r/215238 (owner: 10Dzahn) [23:56:17] (03PS2) 10Paladox: Revert "Add link in gitblit for phabricator" [puppet] - 10https://gerrit.wikimedia.org/r/215238 (owner: 10Dzahn) [23:57:47] (03CR) 10Paladox: "Hum Wonder why it is causing internal error." [puppet] - 10https://gerrit.wikimedia.org/r/214923 (owner: 10Paladox) [23:58:59] cajoel: Krenair: it's not hooked up to LDAP it looks, separate logins [23:59:35] mutante: can we set up a read-only public login? [23:59:46] à la the way people used torrus