[00:00:04] RoanKattouw, ^d: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150213T0000). Please do the needful. [00:04:58] o/ [00:10:33] \o [00:11:18] no, i'm not going to deploy, sry [00:11:57] that doesnt mean i'm against that patch [00:13:02] mutante, I'm not asking you to deploy, I'm asking for a permission to do so [00:13:43] MaxSem: are you volunteering to deploy for swat? :) [00:14:00] oh shi... [00:14:04] :) [00:14:27] what's MF? [00:14:34] MobileFrontend [00:14:37] <^d> Mother fu.... [00:14:43] (03CR) 10Dzahn: [C: 031] "wikitech also has an 'm' DNS name now" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190380 (owner: 10MaxSem) [00:15:39] I'm getting some lag to various things of ours, poking around at stats... [00:16:15] MaxSem: well, yes [00:16:26] maybe just me [00:16:42] wee [00:17:04] i wonder how many people use wikitech on their phone [00:17:30] maybe tablets [00:18:38] (03CR) 10Dzahn: "yes, and wikitech is not behind things:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190380 (owner: 10MaxSem) [00:19:05] pdf output sure is erratic [00:19:21] jzerebecki, are there submodule bumps for your commits? [00:19:21] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=PDF%20servers%20eqiad&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1423786711&g=network_report&z=large [00:19:26] bblack: oh? i disabled one of 3 ocg servers the other day [00:19:32] MaxSem: yes [00:19:48] bblack: but that looked erratic before as well [00:19:51] I just mean in general [00:19:51] where? [00:20:02] maybe there some bulk-users that pop up sporadically [00:20:06] MaxSem: https://gerrit.wikimedia.org/r/#/c/190377/ [00:20:19] MaxSem: https://gerrit.wikimedia.org/r/#/c/190374/ [00:20:26] cool, thx [00:21:44] bblack: if you look at ocg1003 specifically vs. the other 2, does it look to you like ocg1003 is disabled? because i was wondering. 
it felt like nothing changed after changing pybal config [00:21:54] !log maxsem Synchronized php-1.25wmf17/extensions/GlobalUserPage/: SWAT (duration: 00m 06s) [00:21:59] Logged the message, Master [00:22:04] legoktm ^^^ [00:22:07] * legoktm tests [00:24:39] wtf Feb 13 00:23:12 mw1017: #012Fatal error: Class undefined: GlobalUserPageCacheInvalidator in /srv/mediawiki/php-1.25wmf17/extensions/GlobalUserPage/GlobalUserPage.hooks.php on line 135 [00:25:13] ergh I don't even see it in fatalmonitor [00:25:19] mutante: the graphs do look similar between 100x like nothing changed for 1003. I can go double-check live pybal just in case. but anyways, perhaps it's just that local maintenance processes dominate the graphs and the real traffic doesn't show up well [00:25:33] MaxSem: can I fix that (this is only on testwikis right now) or do you want to revert? [00:25:38] (03CR) 10Dzahn: [C: 032] phab: direct_comments from wikimedia.org for 'domains' [puppet] - 10https://gerrit.wikimedia.org/r/190383 (owner: 10Dzahn) [00:25:45] it's just missing the $wgAutoloadClasses entry [00:26:04] legoktm, how serious is the breakage? [00:26:26] if you try deleting a global user page which is only on testwiki it'll fatal [00:26:36] bblack: i went to config-master.eqiad, then /srv/pybal-config/ and committed [00:26:44] but please check [00:26:46] pfft [00:26:53] go ahead and fix:) [00:27:03] jzerebecki, you're next [00:27:09] k [00:27:29] bblack: ocg1003 is supposed to be disabled so that i can reinstall it. the docs say to turn it off and wait for the jobs to get finished. 
but i'm confused because nothing happened [00:27:38] mutante: [00:27:38] 2015-02-13 00:27:07.885550 [ocg_8000 ProxyFetch] ocg1002.eqiad.wmnet (enabled/up/pooled): Fetch successful, 0.021 s [00:27:41] 2015-02-13 00:27:08.690954 [ocg_8000 ProxyFetch] ocg1003.eqiad.wmnet (disabled/up/not pooled): Fetch successful, 0.007 s [00:27:44] 2015-02-13 00:27:11.618436 [ocg_8000 ProxyFetch] ocg1001.eqiad.wmnet (enabled/up/pooled): Fetch successful, 0.006 s [00:27:47] pybal does know it's depooled, fwiw [00:28:04] hmm.. [00:28:24] I'll double-check the raw ipvs tables too [00:28:33] thank you [00:29:05] !log maxsem Synchronized php-1.25wmf17/extensions/Wikidata/: SWAT (duration: 00m 12s) [00:29:10] Logged the message, Master [00:29:33] jzerebecki, ^^. is it testable or should I proceed with wmf16? [00:29:43] MaxSem: i'll test [00:29:48] !log phab service restart for config change [00:29:51] Logged the message, Master [00:29:57] yeah there's no pybal bug either, the raw ipvs tables have just ocg100[12] IPs: [00:30:00] TCP 10.2.2.31:8000 wrr -> 10.64.32.151:8000 Route 10 0 564 -> 10.64.48.42:8000 Route 10 0 565 [00:30:41] uhm.. 
so i was trying to follow this [00:30:41] MaxSem: nothing changed, probably getting old JS [00:30:44] https://wikitech.wikimedia.org/wiki/OCG#Decommissioning_a_host [00:31:00] it says "First, remove the host from the round-robin DNS name " [00:31:24] that was the pybal change for me [00:31:56] !log maxsem Synchronized php-1.25wmf17/extensions/Wikidata/: touch (duration: 00m 18s) [00:32:00] Logged the message, Master [00:32:31] and "Once the DNS change has propagated and any existing jobs on that host were complete, you would run something like: " ., then the "delete cache entries from redis" thing [00:32:36] MaxSem: https://gerrit.wikimedia.org/r/#/c/190387/ has the fix [00:32:47] MaxSem: works [00:32:52] mutante: I donno, probably have to check with cscott on that [00:33:10] bblack: ok, will do [00:34:01] !log maxsem Synchronized php-1.25wmf17/extensions/GlobalUserPage/: fix SWAT (duration: 00m 06s) [00:34:05] Logged the message, Master [00:34:10] legoktm ^^ [00:34:20] * legoktm tries again [00:35:10] MaxSem: works now, thanks :) [00:35:46] !log maxsem Synchronized php-1.25wmf16/extensions/Wikidata/: SWAT (duration: 00m 12s) [00:35:50] Logged the message, Master [00:35:56] jzerebecki, ^^^ [00:36:03] * jzerebecki testing [00:36:46] (03CR) 10MaxSem: [C: 032] enable MobileFrontend on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190373 (owner: 10Dzahn) [00:37:06] MaxSem: works. thx. [00:38:09] PROBLEM - HHVM rendering on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:38:28] PROBLEM - Apache HTTP on mw1095 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:38:31] MaxSem: what do you know about WP:TSN on en since you're listed as a developer? [00:38:46] sorry, busy atm [00:38:54] mutante, ehm - File not found: /usr/local/apache/common/private/WikitechPrivateSettings.php [00:39:23] I didn't touch anything yet [00:40:07] MaxSem: sigh, the paths on wikitech are all different [00:40:17] ???
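(Editor's note: the fatal at 00:24:39, "Class undefined: GlobalUserPageCacheInvalidator", was diagnosed at 00:25:45 as a missing $wgAutoloadClasses entry and fixed by https://gerrit.wikimedia.org/r/#/c/190387/. A minimal sketch of that kind of one-line registration — the file path here is an illustrative assumption, not the actual patch content:

```php
// Hypothetical sketch, not the actual Gerrit change. MediaWiki's autoloader
// needs an entry for every class an extension references; without it,
// instantiating the class is a fatal error rather than a catchable failure.
$wgAutoloadClasses['GlobalUserPageCacheInvalidator'] =
	__DIR__ . '/GlobalUserPageCacheInvalidator.php'; // assumed filename
```

)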
[00:40:50] /srv/org/wikimedia/controller/ [00:40:57] hmm, wikitech still works for me [00:40:58] ./wikis/ [00:41:37] where are these errors coming from? [00:41:58] @silver:/usr/local/apache/common/private# file WikitechPrivateSettings.php [00:42:01] WikitechPrivateSettings.php: PHP script, ASCII text [00:42:03] ? [00:42:20] but it's there? [00:43:11] MaxSem: where did you see it? [00:43:19] in fatalmonitor [00:43:34] but can't see a wiki that's broken by it [00:43:56] so this file exists on silver but not on the cluster [00:44:02] and other wikis are now looking for it? [00:44:15] Feb 13 00:42:40 mw1005: #012Fatal error: File not found: /usr/local/apache/common/private/WikitechPrivateSettings.php in /srv/mediawiki/private/PrivateSettings.php on line 37 [00:45:04] if ( $wgDBname == "labswiki" ) { [00:45:04] require_once( '/usr/local/apache/common/private/WikitechPrivateSettings.php' ); [00:45:41] why is it being executed by random generic mw* hosts? [00:45:50] yea, so if mw1005 is looking for that, then [00:45:55] -bash: cd: /usr/local/apache: No such file or directory [00:45:56] indeed [00:46:03] no /usr/local/apache at all there [00:46:22] but on silver that is the correct path [00:46:49] mw1014, mw1015 and so on [00:47:23] are requests to wt being routed to general apache pool? [00:47:31] why is it working for me then? [00:47:50] PROBLEM - HHVM busy threads on mw1095 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [86.4] [00:48:05] no, wikitech is just the IP of silver [00:48:41] (03Merged) 10jenkins-bot: enable MobileFrontend on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190373 (owner: 10Dzahn) [00:48:50] ? [00:49:33] 12 minutes from your +2 until jenkins merged it now? 
[00:49:49] zuul is so zuul [00:50:10] i also dont see breakage though , wth [00:50:41] there is no jenkins, only zuul [00:50:46] http://gdash.wikimedia.org/dashboards/reqerror/ also looks sane [00:51:14] is it possible to find out the request URL that produced that error? [00:52:02] not that I know of [00:52:29] hmmm, the errors seem to be restricted to mw1001-16 [00:53:48] aha [00:53:49] # mw1001-1016 are jobrunners (precise) [00:53:52] lololol [00:54:19] need to stop routing job queue to the general redis pool [00:54:34] was it just using DB queue before? [00:54:58] who knows this, maybe Coren ? [00:55:11] or andrew or yuvi .. [00:55:28] PROBLEM - HHVM queue size on mw1095 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [80.0] [00:55:38] poke andrewbogott_afk [00:56:16] "file": "/srv/mediawiki/rpc/RunJobs.php", indeed [00:56:24] from hhvm error log [00:57:57] 'wmgUseClusterJobqueue' => array( [00:57:57] 'default' => true, [00:57:57] 'labswiki' => false, [00:58:00] aiiiieeeee [00:59:09] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). [00:59:30] ? i'm at a loss here then, doesn't that mean it's already configured not to do what it does now [01:00:21] ahem [01:01:01] what are you trying to do, and why? [01:01:19] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. 
[01:01:22] ori: there are errors in the hhvm log of poolcounter servers [01:01:34] they try to find in /usr/local/apache [01:01:38] !log maxsem Synchronized wmf-config/InitialiseSettings.php: Shutting the warning off (duration: 00m 06s) [01:01:46] Logged the message, Master [01:01:50] WikitechPrivateSettings.php [01:02:06] but they shouldn't do that, that only exists on silver [01:02:11] where wikitech is [01:02:18] that was mf, just stfu'ing icinga ^^^ [01:03:00] and this is only on poolcounters, from RunJobs.php [01:04:18] maxsem@tin:/srv/mediawiki-staging$ mwscript eval.php labswiki [01:04:18] > var_dump($wgJobTypeConf); [01:04:29] ["class"]=> [01:04:29] string(10) "JobQueueDB" [01:04:57] ori: for example mw1015 , tail hhvm error log [01:05:58] so wikitech is placing jobs that reference a silver-only path into the global job queue? [01:06:40] yea, sounds about the right summary [01:07:21] ok, looking [01:07:30] but also Max pasted above how wmgUseClusterJobqueue is false for labswiki [01:12:12] mutante, ori fixed path in PrivateSettings [01:12:46] errors look fixed now, but are jobrunners even supposed to run wikitech jobs? [01:12:57] MaxSem: ah, great [01:13:14] dunno [01:13:16] i dont know [01:15:10] well, i think we want it to be as much of a regular cluster wiki as possible [01:16:24] hmm, the error is not actually gone [01:16:33] just maybe a bit more rare [01:26:46] !log ori Synchronized private/PrivateSettings.php: Correct path reference (duration: 00m 06s) [01:26:54] Logged the message, Master [01:30:11] !log ori Synchronized private/PrivateSettings.php: Correct path reference, for real this time (duration: 00m 07s) [01:30:14] Logged the message, Master [01:30:38] i am rather amused by how the bots talk to each other here. 
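(Editor's note: the root cause above was a hard-coded silver-only path in private/PrivateSettings.php being require'd when jobrunners processed labswiki jobs. The deployed fix just corrected the path; a more defensive variant — a sketch only, not what was actually synced, and the corrected path is an assumption — would also guard on the file's existence:

```php
// private/PrivateSettings.php -- hedged sketch. The real fix only corrected
// the path string; the file_exists() guard is an extra assumption here so a
// host missing the file skips the include instead of fataling.
if ( $wgDBname == 'labswiki' ) {
	$wikitechPrivate = '/srv/mediawiki/private/WikitechPrivateSettings.php'; // assumed corrected path
	if ( file_exists( $wikitechPrivate ) ) {
		require_once $wikitechPrivate;
	}
}
```

)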
[01:31:09] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [01:32:33] !log restarting jobrunners [01:32:38] Logged the message, Master [01:33:48] mutante: doesn't that violate "wikitech needs to be fully functional if any random piece of prod goes down" [01:33:55] ? [01:36:23] well, it was functional [01:36:33] just jobs were crashing [01:36:41] !log Correcting path reference in private/PrivateSettings.php required restarting HHVM on job runners. StatCache bug? [01:36:45] Logged the message, Master [01:37:08] jzerebecki: no, as long as wikitech-static is updated regularly and we added the monitoring for that [01:37:29] i think.. [01:37:49] that's what wikitech-static was for, to still have all the docs in any case [01:38:43] ori: thanks! [01:38:56] np [01:39:15] (03CR) 10MaxSem: [C: 032] Enable PHP-based autodetection on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190380 (owner: 10MaxSem) [01:39:23] (03Merged) 10jenkins-bot: Enable PHP-based autodetection on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190380 (owner: 10MaxSem) [01:40:34] legoktm, MWException from line 354 of /srv/mediawiki/php-1.25wmf16/includes/jobqueue/JobQueue.php: Unrecognized job type 'LocalGlobalUserPageCacheUpdateJob'. [01:41:26] legoktm, I'm inclined to revert [01:42:07] MaxSem: but it has $wgJobClasses['LocalGlobalUserPageCacheUpdateJob'] = 'LocalGlobalUserPageCacheUpdateJob'; ? [01:42:26] dunno. exception logs are flooded with this [01:42:34] wtf [01:43:39] !log maxsem Synchronized wmf-config/: Let there be mobile on wikitech (duration: 00m 06s) [01:43:43] Logged the message, Master [01:44:01] MaxSem: oh ugh, it's because it's trying to submit jobs on other wikis where the class doesn't exist...yeah lets revert [01:46:04] are you doing it? 
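(Editor's note: the "Unrecognized job type" exceptions above show how the job queue resolves types: the type is looked up in $wgJobClasses on the wiki where the job lands, so registering the class only where the extension is loaded is not enough once jobs are submitted cross-wiki. The registration legoktm quotes at 01:42:07, in extension setup form:

```php
// Extension setup -- the mapping MediaWiki consults when a job is popped.
// Every wiki whose queue can receive this job type must have it loaded,
// otherwise JobQueue throws "Unrecognized job type", as seen at 01:40:34.
$wgJobClasses['LocalGlobalUserPageCacheUpdateJob'] = 'LocalGlobalUserPageCacheUpdateJob';
```

)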
[01:46:18] PROBLEM - Apache HTTP on mw1192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:46:32] yes [01:47:28] MaxSem: https://gerrit.wikimedia.org/r/#/c/190402/1 do you want me to deploy it as well? [01:47:37] go ahead [01:48:19] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [01:52:17] !log legoktm Synchronized php-1.25wmf17/extensions/GlobalUserPage: Revert GlobalUserPage updates (duration: 00m 06s) [01:52:26] Logged the message, Master [01:56:35] 3operations, Phabricator: enable email for tickets in domains project? - https://phabricator.wikimedia.org/T88842#1035919 (10Dzahn) reason was that 'Domains' was capitalized. but it's still unclear why that was considered a syntax error by operations-puppet-pplint-HEAD. anyways, project tag was renamed to 'doma... [02:00:50] (03PS1) 10Legoktm: Set $wgJobClasses['LocalGlobalUserPageCacheUpdateJob'] = 'NullJob' to clear queues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190403 [02:01:52] 3Ops-Access-Requests: Give Tyler Cipriani shell access (with access to CI systems as well) - https://phabricator.wikimedia.org/T89378#1035927 (10Dzahn) Hi @thcipriani! Welcome to Wikimedia! please make a new SSH key and provide us with the public part. Then we can make a patch for your shell access and upload i... 
[02:02:22] (03CR) 10Legoktm: [C: 032] Set $wgJobClasses['LocalGlobalUserPageCacheUpdateJob'] = 'NullJob' to clear queues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190403 (owner: 10Legoktm) [02:03:27] (03CR) 10Legoktm: [V: 032] Set $wgJobClasses['LocalGlobalUserPageCacheUpdateJob'] = 'NullJob' to clear queues [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190403 (owner: 10Legoktm) [02:04:37] !log legoktm Synchronized wmf-config/CommonSettings.php: Set ['LocalGlobalUserPageCacheUpdateJob'] = 'NullJob' to clear queues (duration: 00m 06s) [02:04:44] Logged the message, Master [02:07:08] 3Ops-Access-Requests: Give Tyler Cipriani shell access (with access to CI systems as well) - https://phabricator.wikimedia.org/T89378#1035931 (10thcipriani) Hiya @Dzahn! In anticipation of your imminent request, I generated a new ssh key: This. Very. Afternoon. Public key is available here: https://wikitech.wi... [02:09:46] ^ greg-g, I like this guy :P [02:17:24] !log l10nupdate Synchronized php-1.25wmf16/cache/l10n: (no message) (duration: 00m 01s) [02:17:32] Logged the message, Master [02:18:31] !log LocalisationUpdate completed (1.25wmf16) at 2015-02-13 02:17:27+00:00 [02:18:36] Logged the message, Master [02:18:38] (03PS1) 10Dzahn: create shell user for Marielle Volz [puppet] - 10https://gerrit.wikimedia.org/r/190405 (https://phabricator.wikimedia.org/T89057) [02:19:09] !log ran redis commands 'HDEL jobqueue:aggregator:h-queue-types:v2 LocalGlobalUserPageCacheUpdateJob/labswiki' and 'HDEL jobqueue:aggregator:h-queue-types:v2 LocalGlobalUserPageCacheUpdateJob' on rdb1001 [02:19:15] Logged the message, Master [02:20:47] 3Ops-Access-Requests: Give Tyler Cipriani shell access (with access to CI systems as well) - https://phabricator.wikimedia.org/T89378#1035962 (10ori) Yo Tyler! Welcome! If the shell key on wikitech is the same one you plan on using for labs, you should generate another one -- we keep production and labs separate... 
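(Editor's note: the cleanup combined two steps — mapping the orphaned job type to NullJob so any already-enqueued jobs get popped and discarded harmlessly, plus the redis HDEL commands !logged at 02:19:09 to drop the type from the job queue aggregator hash on rdb1001. The config half, as stated in the merged change https://gerrit.wikimedia.org/r/190403:

```php
// wmf-config/CommonSettings.php -- per the merged change: jobs of this type
// still sitting in the queue are popped as NullJob, which does nothing,
// so the backlog drains without further "Unrecognized job type" exceptions.
$wgJobClasses['LocalGlobalUserPageCacheUpdateJob'] = 'NullJob';
```

)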
[02:29:02] (03PS1) 10Ori.livneh: Don't set up the job queue for labswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190406 [02:29:36] ^ legoktm, looks sane? [02:30:46] ori: isn't that already guarded by: [02:30:47] if ( $wmgUseClusterJobqueue ) { [02:30:48] # Cluster-dependent files for job queue and job queue aggregator [02:30:48] require( getRealmSpecificFilename( "$wmfConfigDir/jobqueue.php" ) ); [02:30:48] } [02:31:31] !log l10nupdate Synchronized php-1.25wmf17/cache/l10n: (no message) (duration: 00m 01s) [02:31:39] Logged the message, Master [02:31:56] so it seems [02:32:38] !log LocalisationUpdate completed (1.25wmf17) at 2015-02-13 02:31:34+00:00 [02:32:42] Logged the message, Master [02:33:54] 3Ops-Access-Requests: Give Tyler Cipriani shell access (with access to CI systems as well) - https://phabricator.wikimedia.org/T89378#1035969 (10thcipriani) Good lookin' out Ori—thank you. New key available here: https://wikitech.wikimedia.org/wiki/User:Thcipriani#SSH_Production_Key [02:38:03] 3Ops-Access-Requests, operations, Citoid, Services: Give mvolz access to sha machine i.e. http://citoid.wikimedia.org/ - https://phabricator.wikimedia.org/T89057#1035985 (10Dzahn) @mvolz Hello, welcome to Wikimedia. I made a patch to add your ssh key and create a user. It is now in code review in gerrit, see th... [02:46:42] (03PS1) 10Dzahn: create shell account for Tyler Cipriani [puppet] - 10https://gerrit.wikimedia.org/r/190408 (https://phabricator.wikimedia.org/T89378) [02:49:34] (03Abandoned) 10Krinkle: contint: Allow ssh between labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/189132 (owner: 10Krinkle) [02:50:17] 3Ops-Access-Requests: Give Tyler Cipriani shell access (with access to CI systems as well) - https://phabricator.wikimedia.org/T89378#1035997 (10Dzahn) :) perfect, used the production key and uploaded a patch to create your user. 
the "on duty" person will follow-up soon [03:32:26] (03CR) 10Andrew Bogott: "There's a firewall on that box now, applied via the openstack::firewall class. Ferm changes should probably happen there so that everythi" [puppet] - 10https://gerrit.wikimedia.org/r/190147 (owner: 10Dzahn) [04:33:23] 3operations: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1036040 (10Mattflaschen) [04:36:27] 3operations: Our custom php packages need to create some conf.d links - https://phabricator.wikimedia.org/T89157#1036043 (10Krenair) Also needs to happen for /etc/php5/cli/conf.d to fix the jobs (which run from cli). [04:56:13] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Feb 13 04:55:10 UTC 2015 (duration 55m 9s) [04:56:21] Logged the message, Master [05:21:45] 3operations, ops-eqiad: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1036064 (10GWicke) Quick status update: Four of the six boxes are online already & are being imaged by @fgiunchedi. Racking of the last two ones depends on memcached servers being shuffled aroun... [05:34:43] ori: :) thanks for your assist there [06:10:33] 3operations: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1036074 (10Joe) For the record, we had a failover a couple of months ago and we had no big issues (apart from the jobqueue being briefly down). @aaron can you please explain to me how redundancy and HA wor... 
[06:28:08] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:49] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:49] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:59] PROBLEM - puppet last run on cp4003 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:58] PROBLEM - puppet last run on cp4008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:18] PROBLEM - puppet last run on cp3014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:21] (03PS21) 10KartikMistry: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 (https://phabricator.wikimedia.org/T88793) [06:37:28] RECOVERY - Disk space on einsteinium is OK: DISK OK [06:48:59] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [06:51:20] RECOVERY - puppet last run on cp4008 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [06:51:20] RECOVERY - puppet last run on cp4003 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [07:01:19] (03CR) 10BBlack: "If these are really gone, we could get rid of the non-ipmi DNS that some of them have as well, right? FWIW, none of the hostnames in your" [dns] - 10https://gerrit.wikimedia.org/r/190150 (owner: 10Dzahn) [07:05:30] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [07:06:19] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [07:07:38] RECOVERY - puppet last run on cp3014 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [07:16:39] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/3/2: down - Transit: ! 
NTT {#3475} (service ID 234630) [10Gbps]BR [07:23:18] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [07:45:19] (03PS3) 10Florianschmidtwelzow: mediawikiwiki: Allow sysop to add and remove themself from translationadmin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/187183 (https://phabricator.wikimedia.org/T87797) [08:03:19] (03Abandoned) 10Giuseppe Lavagetto: memcached: add mc1018 to the mediawiki pool as shard 18 [puppet] - 10https://gerrit.wikimedia.org/r/190000 (owner: 10Giuseppe Lavagetto) [08:41:41] [6~[6~[6~[6~[6~[6~[6~[6~[6~[6~[6~[6~[6~ �Q#�Q[6~[6~[6~[6~ [08:41:50] oops :/ [08:43:02] <_joe_> fix your readline kart_ [09:04:14] _joe_: I was reading backlog :) [09:23:09] greetings [09:23:22] (03PS1) 10Filippo Giunchedi: restbase: provision with jessie [puppet] - 10https://gerrit.wikimedia.org/r/190422 [09:23:41] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] restbase: provision with jessie [puppet] - 10https://gerrit.wikimedia.org/r/190422 (owner: 10Filippo Giunchedi) [09:25:11] (03CR) 10Alexandros Kosiaris: "It's better in this place IMHO. It's a firewall rule closely associated with a role class. We got this rule that role classes need to get " [puppet] - 10https://gerrit.wikimedia.org/r/190147 (owner: 10Dzahn) [09:25:49] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 932.239515348 [09:28:34] 3operations, Phabricator: enable email for tickets in domains project? - https://phabricator.wikimedia.org/T88842#1036244 (10Aklapper) >>! In T88842#1035919, @Dzahn wrote: > reason was that 'Domains' was capitalized. That was intentional, because //by default// projects are capitalized in Phab (except for when... 
[09:32:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Because then the module is no longer an "unit" with well defined entry points, but rather various other "stuff" has connections to the mod" [puppet] - 10https://gerrit.wikimedia.org/r/187087 (owner: 10Dzahn) [09:35:37] (03CR) 10Alexandros Kosiaris: [C: 032] mediawiki: add codfw monitoring groups [puppet] - 10https://gerrit.wikimedia.org/r/188895 (https://phabricator.wikimedia.org/T86894) (owner: 10Dzahn) [09:35:48] (03PS3) 10Alexandros Kosiaris: mediawiki: add codfw monitoring groups [puppet] - 10https://gerrit.wikimedia.org/r/188895 (https://phabricator.wikimedia.org/T86894) (owner: 10Dzahn) [09:36:55] (03CR) 10Alexandros Kosiaris: [C: 032] mediawiki: add codfw monitoring groups [puppet] - 10https://gerrit.wikimedia.org/r/188895 (https://phabricator.wikimedia.org/T86894) (owner: 10Dzahn) [09:42:53] 3operations: Move servermon.wikimedia.org behind misc-web - https://phabricator.wikimedia.org/T88427#1036265 (10akosiaris) Hey, thanks for taking care of this!!! As far as librenms goes, I 'd rather we didn't. It is way more vital as a monitoring tool than servermon (especially during network outages) and the l... 
[09:52:01] 3operations: Access request for stat1003 - https://phabricator.wikimedia.org/T89418#1036287 (10Aklapper) [09:53:38] PROBLEM - Host restbase1004 is DOWN: PING CRITICAL - Packet loss = 100% [10:02:28] RECOVERY - Host restbase1004 is UP: PING OK - Packet loss = 0%, RTA = 1.51 ms [10:02:36] PROBLEM - puppet last run on restbase1004 is CRITICAL: Connection refused by host [10:04:38] PROBLEM - dhclient process on restbase1004 is CRITICAL: Connection refused by host [10:04:57] PROBLEM - DPKG on restbase1004 is CRITICAL: Connection refused by host [10:05:07] PROBLEM - salt-minion processes on restbase1004 is CRITICAL: Connection refused by host [10:05:38] PROBLEM - Disk space on restbase1004 is CRITICAL: Connection refused by host [10:05:40] PROBLEM - configured eth on restbase1004 is CRITICAL: Connection refused by host [10:05:40] PROBLEM - RAID on restbase1004 is CRITICAL: Connection refused by host [10:11:08] PROBLEM - Host restbase1004 is DOWN: PING CRITICAL - Packet loss = 100% [10:12:09] RECOVERY - Host restbase1004 is UP: PING OK - Packet loss = 0%, RTA = 2.63 ms [10:23:13] 3operations: "pxe boot once" option for HP servers - https://phabricator.wikimedia.org/T89443#1036335 (10fgiunchedi) 3NEW [10:56:18] (03PS1) 10Filippo Giunchedi: restbase: provision restbase/cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/190426 (https://phabricator.wikimedia.org/T76986) [10:59:55] PROBLEM - Host restbase1006 is DOWN: PING CRITICAL - Packet loss = 100% [11:12:45] RECOVERY - Host restbase1006 is UP: PING OK - Packet loss = 0%, RTA = 0.73 ms [11:18:29] !log es-tool restart-fast on elastic1008 [11:18:33] Logged the message, Master [11:23:57] PROBLEM - HHVM busy threads on mw1192 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [115.2] [11:24:07] PROBLEM - HHVM queue size on mw1192 is CRITICAL: CRITICAL: 88.89% of data above the critical threshold [80.0] [11:24:36] PROBLEM - HHVM rendering on mw1192 is CRITICAL: CRITICAL - Socket timeout after 
10 seconds [11:24:38] that's me enabling silenced notifications [11:24:55] mw1095 & mw1192 seem to have crapped themselves about ~10h ago [11:26:06] service hhvm restart [11:26:07] RECOVERY - Apache HTTP on mw1095 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.065 second response time [11:26:26] RECOVERY - HHVM rendering on mw1095 is OK: HTTP OK: HTTP/1.1 200 OK - 69060 bytes in 0.228 second response time [11:26:37] RECOVERY - HHVM rendering on mw1192 is OK: HTTP OK: HTTP/1.1 200 OK - 69060 bytes in 0.230 second response time [11:26:45] !log mw1095/mw1192: service hhvm restart, alerts for 10h30/9h35 respectively [11:26:49] Logged the message, Master [11:26:57] RECOVERY - Apache HTTP on mw1192 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [11:27:07] logstash has crapped itself too; godog any ideas? [11:27:55] paravoid: besides a classy restart, not too many no [11:28:52] I'll kick it [11:31:16] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 91 threshold =0.1% breach: status: red, number_of_nodes: 3, unassigned_shards: 86, timed_out: False, active_primary_shards: 40, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 44, initializing_shards: 5, number_of_data_nodes: 3 [11:31:17] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 87 threshold =0.1% breach: status: red, number_of_nodes: 3, unassigned_shards: 81, timed_out: False, active_primary_shards: 44, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 48, initializing_shards: 6, number_of_data_nodes: 3 [11:32:26] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 46, cluster_name: 
production-logstash-eqiad, relocating_shards: 0, active_shards: 136, initializing_shards: 2, number_of_data_nodes: 3 [11:32:26] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 136, initializing_shards: 2, number_of_data_nodes: 3 [11:32:27] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 0, timed_out: False, active_primary_shards: 46, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 136, initializing_shards: 2, number_of_data_nodes: 3 [11:32:43] don't forget to !log [11:33:07] RECOVERY - HHVM queue size on mw1095 is OK: OK: Less than 30.00% above the threshold [10.0] [11:33:07] RECOVERY - HHVM busy threads on mw1095 is OK: OK: Less than 30.00% above the threshold [57.6] [11:34:18] sure [11:34:40] !log restart elasticsearch on logstash1001 logstash1002 logstash1003 [11:34:46] Logged the message, Master [11:34:47] RECOVERY - HHVM busy threads on mw1192 is OK: OK: Less than 30.00% above the threshold [76.8] [11:34:57] RECOVERY - HHVM queue size on mw1192 is OK: OK: Less than 30.00% above the threshold [10.0] [11:39:08] 3RESTBase, Services, operations, Scrum-of-Scrums: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1036388 (10mobrovac) [11:39:53] 3ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, ContentTranslation-Deployments: Separate config for Beta and Production for CXServer - https://phabricator.wikimedia.org/T88793#1036392 (10KartikMistry) p:5Normal>3High [11:54:02] 3RESTBase, Services, operations, Scrum-of-Scrums: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1036413 (10mobrovac) [11:56:04] (03PS22) 
10KartikMistry: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 (https://phabricator.wikimedia.org/T88793) [12:06:20] (03CR) 10Alexandros Kosiaris: [C: 032] restbase: provision restbase/cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/190426 (https://phabricator.wikimedia.org/T76986) (owner: 10Filippo Giunchedi) [12:07:39] (03PS5) 10ArielGlenn: add index.html pages for various directories on dataset hosts [puppet] - 10https://gerrit.wikimedia.org/r/144640 [12:11:49] (03CR) 10ArielGlenn: [C: 032] add index.html pages for various directories on dataset hosts [puppet] - 10https://gerrit.wikimedia.org/r/144640 (owner: 10ArielGlenn) [12:12:31] (03PS2) 10ArielGlenn: Add link to pagecounts-all-site dataset [puppet] - 10https://gerrit.wikimedia.org/r/168104 (owner: 10QChris) [12:26:11] (03PS1) 10ArielGlenn: dataset hosts: make the newly commited index html pages live [puppet] - 10https://gerrit.wikimedia.org/r/190438 [12:29:07] (03CR) 10ArielGlenn: [C: 032] dataset hosts: make the newly commited index html pages live [puppet] - 10https://gerrit.wikimedia.org/r/190438 (owner: 10ArielGlenn) [12:30:55] puppet on dataset1001 will whine shortly, ignore please, thanks [12:33:01] PROBLEM - puppet last run on dataset1001 is CRITICAL: CRITICAL: puppet fail [12:33:15] 3operations: "pxe boot once" option for HP servers - https://phabricator.wikimedia.org/T89443#1036441 (10akosiaris) So, iLO3 for sure supports boot once. At least via the RIBCL (an XML language for configuration). I am pretty sure of that as I have written the following code in servermon https://github.com/serv... 
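(Editor's note: for the "pxe boot once" task T89443, akosiaris points at iLO's RIBCL interface, which takes an XML request over HTTPS. Roughly the shape of a one-time network-boot request — a sketch from the RIBCL documentation, with placeholder credentials; exact tag support varies by iLO generation, so verify against HP's scripting guide:

```xml
<!-- Sketch of an iLO RIBCL one-time boot request; USER_LOGIN/PASSWORD are
     placeholders, and tag availability differs across iLO firmware versions. -->
<RIBCL VERSION="2.0">
  <LOGIN USER_LOGIN="admin" PASSWORD="secret">
    <SERVER_INFO MODE="write">
      <SET_ONE_TIME_BOOT value="NETWORK"/>
    </SERVER_INFO>
  </LOGIN>
</RIBCL>
```

)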
[12:34:31] (03PS1) 10ArielGlenn: datasets: fix up html index files manifest [puppet] - 10https://gerrit.wikimedia.org/r/190439 [12:36:16] (03CR) 10ArielGlenn: [C: 032] datasets: fix up html index files manifest [puppet] - 10https://gerrit.wikimedia.org/r/190439 (owner: 10ArielGlenn) [12:44:24] (03PS1) 10ArielGlenn: datasets: add the dirs vars manifest, never got committed, woops [puppet] - 10https://gerrit.wikimedia.org/r/190440 [12:47:29] (03CR) 10ArielGlenn: [C: 032] datasets: add the dirs vars manifest, never got committed, woops [puppet] - 10https://gerrit.wikimedia.org/r/190440 (owner: 10ArielGlenn) [12:53:33] (03PS1) 10ArielGlenn: datasets: fix embarrassing typo in dir name [puppet] - 10https://gerrit.wikimedia.org/r/190444 [12:54:39] (03CR) 10ArielGlenn: [C: 032] datasets: fix embarrassing typo in dir name [puppet] - 10https://gerrit.wikimedia.org/r/190444 (owner: 10ArielGlenn) [13:02:39] sorry for so much clutter, hopefully this next one is the last one and I learn to read someday [13:02:49] (03PS1) 10ArielGlenn: datasets: fix the other directory name, sheesh. [puppet] - 10https://gerrit.wikimedia.org/r/190445 [13:05:49] (03CR) 10ArielGlenn: [C: 032] datasets: fix the other directory name, sheesh. 
[puppet] - 10https://gerrit.wikimedia.org/r/190445 (owner: 10ArielGlenn) [13:07:21] RECOVERY - puppet last run on dataset1001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [13:25:49] !log Started rebuildItemsPerSite for wikidata on terbium [13:25:55] Logged the message, Master [13:33:20] PROBLEM - puppet last run on cp4011 is CRITICAL: CRITICAL: puppet fail [13:34:46] ^ that's me, nothing to see here :) [13:36:13] (03PS1) 10KartikMistry: CX: Publishing to Main namespace for idwiki and ptwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190450 (https://phabricator.wikimedia.org/T89450) [13:38:41] RECOVERY - puppet last run on cp4011 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [13:39:31] PROBLEM - puppet last run on cp1064 is CRITICAL: CRITICAL: Puppet last ran 1 day ago [13:49:41] PROBLEM - puppet last run on cp1058 is CRITICAL: CRITICAL: Puppet has 2 failures [13:51:40] (03CR) 10Alexandros Kosiaris: [C: 04-1] "That last change in the yaml (the addition of - characters) results in JSON that looks like" [puppet] - 10https://gerrit.wikimedia.org/r/188796 (https://phabricator.wikimedia.org/T88793) (owner: 10KartikMistry) [13:58:00] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: Puppet has 2 failures [14:04:57] blarg [14:06:42] akosiaris: should we have - 'Apertium' then?
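The YAML point in that code review (adding `- ` characters turns a scalar into a JSON list) can be illustrated with a small sketch. This is not the actual cxserver registry config; the `mt` key is hypothetical, and the `as_list` helper just shows the normalization a consumer might apply to accept either form:

```python
import json

def as_list(value):
    """Normalize a config value to a list, whether the YAML gave us a
    bare scalar (`mt: Apertium`) or a sequence (`mt:\n  - Apertium`)."""
    return value if isinstance(value, list) else [value]

scalar_form = {"mt": "Apertium"}    # what `mt: Apertium` parses to
list_form = {"mt": ["Apertium"]}    # what `mt:\n  - Apertium` parses to

# The two forms serialize to different JSON shapes...
print(json.dumps(scalar_form))  # {"mt": "Apertium"}
print(json.dumps(list_form))    # {"mt": ["Apertium"]}

# ...but normalize to the same thing.
assert as_list(scalar_form["mt"]) == as_list(list_form["mt"]) == ["Apertium"]
```

Whether the registry consumer expects a string or a list is exactly what the -1 review above was asking about.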
[14:10:00] PROBLEM - puppet last run on cp1045 is CRITICAL: CRITICAL: puppet fail [14:10:21] RECOVERY - puppet last run on cp1058 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [14:10:26] bblack: all those puppet disable is you, right [14:10:41] RECOVERY - puppet last run on cp1064 is OK: OK: Puppet is currently enabled, last run 29 minutes ago with 0 failures [14:11:01] RECOVERY - puppet last run on cp1045 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [14:11:51] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: puppet fail [14:12:04] paravoid: yes [14:15:10] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:16:24] why is MediaWiki running out of mediawiki-staging on terbium [14:16:31] and why is there even a mediawiki-staging [14:16:33] ? [14:16:46] Is it supposed to be a backup deploy server to tin? [14:17:24] ori: ^ [14:21:10] brb [14:23:29] (03PS23) 10KartikMistry: cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 (https://phabricator.wikimedia.org/T88793) [14:23:44] akosiaris: fixed. [14:25:22] (03PS3) 10KartikMistry: WIP: Give apertium-admins access to kartik [puppet] - 10https://gerrit.wikimedia.org/r/189915 [14:26:16] (03CR) 10KartikMistry: "Need to fix 'gid: TODO' after creating apertium-admins on sca*, thus WIP." 
[puppet] - 10https://gerrit.wikimedia.org/r/189915 (owner: 10KartikMistry) [14:26:23] (03PS4) 10KartikMistry: WIP: Give apertium-admins access to kartik [puppet] - 10https://gerrit.wikimedia.org/r/189915 [14:30:34] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 (https://phabricator.wikimedia.org/T88793) (owner: 10KartikMistry) [14:30:59] (03CR) 10Alexandros Kosiaris: [C: 032] cxserver: Use different registry for Beta and Production [puppet] - 10https://gerrit.wikimedia.org/r/188796 (https://phabricator.wikimedia.org/T88793) (owner: 10KartikMistry) [14:34:12] (03PS1) 10ArielGlenn: datasets: update pagecounts-ez index html [puppet] - 10https://gerrit.wikimedia.org/r/190457 [14:35:44] akosiaris: Thank you! [14:37:53] (03CR) 10ArielGlenn: [C: 032] datasets: update pagecounts-ez index html [puppet] - 10https://gerrit.wikimedia.org/r/190457 (owner: 10ArielGlenn) [14:40:16] apergos: there's also a previous patch to the pagecounts/ index page to deploy, IIRC [14:41:00] Or not. It already was at some point, the link is good. Nice. https://dumps.wikimedia.org/other/pagecounts-raw/ [14:41:03] !log es-tool restart-fast on elastic1009 [14:41:09] Logged the message, Master [14:41:23] 3ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, ContentTranslation-Deployments: Separate config for Beta and Production for CXServer - https://phabricator.wikimedia.org/T88793#1036684 (10akosiaris) 5Open>3Resolved Change merged, resolving [14:42:47] heh good [14:54:41] PROBLEM - puppet last run on rdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures [15:02:07] 3operations, ops-eqiad: rack and setup restbase production cluster in eqiad - https://phabricator.wikimedia.org/T88805#1036709 (10Cmjohnson) Update, if all goes according to plan the memcached boxes should be moved early next week and the remaining 2 will setup by the end of next week. 
[15:04:31] (03PS1) 10ArielGlenn: datasets: add and update legal notice [puppet] - 10https://gerrit.wikimedia.org/r/190460 [15:06:17] (03CR) 10ArielGlenn: [C: 032] datasets: add and update legal notice [puppet] - 10https://gerrit.wikimedia.org/r/190460 (owner: 10ArielGlenn) [15:08:10] !log es-tool restart-fast on elastic1019 [15:08:18] !log correction, elastic1010 [15:08:18] Logged the message, Master [15:08:22] Logged the message, Master [15:09:13] 3operations, Incident-20150205-SiteOutage, ops-eqiad: Split memcached in eqiad across multiple racks/rows - https://phabricator.wikimedia.org/T83551#1036716 (10Cmjohnson) DNS entries have been made and are sitting in gerrit for review. https://gerrit.wikimedia.org/r/#/c/190358/ Switch ports have been labeled,... [15:11:10] AndyRussG: hiya [15:11:25] ottomata: hi! :) [15:12:09] How's it going? [15:12:22] i hear you got some problems? someone told me to ask you what's up, and if you need help at the very least getting help. :) [15:13:02] RECOVERY - puppet last run on rdb1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:17] (03CR) 10Thcipriani: [C: 031] "Key verified. Group looks right." [puppet] - 10https://gerrit.wikimedia.org/r/190408 (https://phabricator.wikimedia.org/T89378) (owner: 10Dzahn) [15:14:41] ottomata: Hey, much appreciated! Yes we have a double mystery that I think has "ops" written on it :) [15:15:01] ottomata: https://phabricator.wikimedia.org/T89258 [15:15:32] If you have a sec, take a look at that Phab task, especially the last two posts. If it's not clear what the issues are, let me know... [15:16:54] tl;dr: there is dead code that shouldn't be run that is being run and is causing errors in the hhvm log.
After investigating we found some unexpected requests coming in that can account for some of that code being triggered, but they're not enough to account for the volume of errors we're seeing [15:18:00] So the two mysteries are: why are those requests even coming in (not such a big deal but still an important question to ask) and why is that code being run so much (this is the more serious one, I think) [15:18:07] Thanks so much!! Really appreciate it :) [15:18:37] ottomata: ^ [15:20:12] AndyRussG: cached js on clients could explain it, right? [15:20:16] as you noted? [15:20:28] ottomata: it could explain the requests, yes, that's the likely explanation for the first mystery [15:20:38] the second mystery is the missing lines? [15:20:46] webrequest lines [15:20:48] lgos* [15:20:50] logs* [15:21:06] (Though re: the requests, it's been a _really_ long time since we stopped serving that code and I think it's worth figuring out more details of what's going on there) [15:22:01] And yes, the second one is the missing webrequest lines... or some other explanation for the SpecialRecordImpression class to be called (like maybe some error in some other code that somehow calls it by mistake?) [15:22:14] maybe. maybe not though. sounds like it would be a lot of effort to figure that out.
if we don't see missing sequence numbers in varnishkafka logs, i think it is very very unlikely that those requests are actually going through varnish, and it is more likely that something else is retrying or initiating these requests internally [15:22:21] yeah [15:22:53] and, if the code is dead anyway, it might not be worth the effort to solve the mystery (in my opinion anyway, others might differ) [15:23:23] ^ btw above I wrote the wrong name of the special page, it's SpecialBannerRandom [15:23:57] ottomata: whether or not it's worth it is really ops's call, I think [15:24:07] We can pretty easily remove the code and deploy a fix [15:24:26] but I wouldn't want to do it without a green light from you guys [15:25:29] so we only have one stack trace, which does point to a call to Special:BannerRandom, but it is to be expected that at least some of the errors would indeed happen like that [15:25:37] But if we could get a greater sampling of stack traces... [15:26:00] that is not something I know much about...not sure how mediawiki error logs work [15:26:19] Also of note, nothing is really truly broken in a user-facing way, as far as we can tell. Just the annoying log errors [15:26:23] does that stack trace show a direct call to that page from a request? or something less direct? [15:26:47] Yes, but it's only one, and with the 30k/day requests, it's expected that at least some will be like that [15:27:07] If we could get a bigger sample, that'd be fun [15:27:22] it samples the stack traces from each of the errors? [15:27:38] any idea who knows about mediawiki error logging? [15:27:43] A sample of more stack traces from lots of these errors, yeah [15:28:22] No idea...
Yesterday I got some help from Chad and also qchris in analytics [15:28:29] Fundraising and community banners are displaying properly as far as we can tell, no known site breakage [15:28:46] So if you can live with the log messages, and want to take the time to investigate, I think that's also fine [15:29:25] Or we could try to think of a fix that would de-annoyify the log but not cover up whatever issue it is and let you folks continue to study it [15:29:59] haha, i'm happy to ask someone if they can figure out how to get some more stack traces, but I'm not personally worried about the issue. (there are bigger fish to fry :) ), i'd just remove the code and not think about it [15:30:01] PROBLEM - check_puppetrun on backup4001 is CRITICAL: CRITICAL: puppet fail [15:30:08] but maybe someone else would think otherwise. [15:30:08] hm [15:30:13] i'll email ops list and mention the ticket [15:30:26] Ah OK that sounds like a plan :) [15:30:36] Lemme subscribe there too then... thanks [15:31:09] Yeah I think FR-tech is happy to just remove the code, especially if someone from ops posts to the Phab page saying that's OK [15:32:44] <^d> +1 to just nuking the evil code [15:32:45] <^d> :) [15:32:54] (ottomata: or I guess that's a closed list... in any case, if it's appropriate please feel free to 'cc me and others in FR-tech, or also not if preferred) [15:33:04] frack puppet hiccups are expected, I just rolled out a puppet config change and some aren't restarting smoothly [15:33:42] (Jeff_Green: ?) [15:33:47] (unrelated?)
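The reasoning a few messages up (no missing sequence numbers in varnishkafka logs implies the requests likely never went through varnish) boils down to a gap check over per-host sequence numbers. A rough sketch of what that check might look like; the function name and sample values are hypothetical, not actual varnishkafka tooling:

```python
def missing_ranges(seqs):
    """Given per-host varnishkafka sequence numbers (possibly unordered),
    return inclusive (start, end) ranges of missing values, i.e. candidate
    lost webrequest lines. An empty result means no loss was observed."""
    gaps = []
    s = sorted(seqs)
    for prev, cur in zip(s, s[1:]):
        if cur > prev + 1:
            gaps.append((prev + 1, cur - 1))
    return gaps

# e.g. sequence numbers 4-6 and 9 never arrived:
print(missing_ranges([1, 2, 3, 7, 8, 10]))  # [(4, 6), (9, 9)]
print(missing_ranges([5, 6, 7]))            # []
```

If the second case is what the logs show, the extra SpecialBannerRandom hits are more plausibly internal retries than dropped webrequest lines.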
[15:34:10] ^d: hmmm :) [15:34:26] AndyRussG: gimme a few minutes to straighten out the puppet debacle [15:35:00] RECOVERY - check_puppetrun on backup4001 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [15:35:24] Jeff_Green: sorry I think it was just a misunderstanding (unless you were commenting on what we were talking about here) ;p [15:35:33] ok [15:36:05] AndyRussG: I CCed you, but asked for comments on the phab ticket [15:39:21] ottomata: fantastic, thanks much! [15:55:41] AndyRussG, fyi you can get exception logs from fluorine:/a/mw-log/exception.log [15:57:32] Krenair: hmmm, thanks! [15:59:00] Krenair: but I don't see any logs for my errors in hhvm.log (all relating to SpecialBannerRandom and its call to BannerChooser) [15:59:19] Feb 13 15:59:08 mw1214: #012Fatal error: Argument 1 passed to BannerChooser::__construct() must be an instance of AllocationContext, null given in /srv/mediawiki/php-1.25wmf16/extensions/CentralNotice/includes/BannerChooser.php on line 41 [15:59:19] Feb 13 15:59:08 mw1169: #012Notice: Undefined property: SpecialBannerRandom::$allocContext in /srv/mediawiki/php-1.25wmf16/extensions/CentralNotice/special/SpecialBannerRandom.php on line 27 [15:59:40] !log es-tool restart-fast on elastic1011 [15:59:45] Logged the message, Master [16:01:28] AndyRussG, you don't? That's almost all I see going through hhvm.log...
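The hhvm.log lines quoted above can be grouped by error signature to see which fatals dominate, which is roughly what fatalmonitor-style tallying does. A minimal sketch, assuming syslog-shaped lines where `#012` is the encoded newline before the actual error text (the `tally_errors` helper is hypothetical, not an existing tool):

```python
import re
from collections import Counter

def tally_errors(lines):
    """Group hhvm.log-style lines by the message after the #012 marker,
    ignoring host and timestamp, so the most frequent errors sort first."""
    counts = Counter()
    for line in lines:
        # e.g. "Feb 13 15:59:08 mw1214: #012Fatal error: ..."
        m = re.search(r'#012(.*)', line)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common()

log = [
    "Feb 13 15:59:08 mw1214: #012Fatal error: Argument 1 passed to BannerChooser::__construct() must be an instance of AllocationContext, null given",
    "Feb 13 15:59:09 mw1169: #012Fatal error: Argument 1 passed to BannerChooser::__construct() must be an instance of AllocationContext, null given",
    "Feb 13 15:59:10 mw1169: #012Notice: Undefined property: SpecialBannerRandom::$allocContext",
]
top = tally_errors(log)
assert top[0][1] == 2  # the BannerChooser fatal dominates this sample
```

A bigger tally like this would not replace the stack-trace sampling asked for above, but it does show at a glance whether one signature accounts for most of the 30k/day.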
[16:01:58] Krenair: right, but nothing about that in exception.log [16:02:10] (which is what you mentioned above ^) [16:02:26] right, it could be in hhvm.log instead of exception.log, sorry [16:02:34] Sorry I wasn't clear [16:02:36] Yeah [16:03:20] What I would like is either moar stack traces for the errors in hhvm.log, or an official green light from ops to not care [16:06:32] curious, I just rebased and merged https://gerrit.wikimedia.org/r/#/c/190426/ but no utterances from grrrit-wm [16:13:02] godog: Seems to be down [16:13:09] but I have no idea who to poke about it [16:17:02] hoo: the bot I presume? [16:17:08] Yeah [16:23:22] ok let's take a look [16:27:00] (03CR) 10ArielGlenn: [C: 032] datasets: bw limit the kiwix rsync [puppet] - 10https://gerrit.wikimedia.org/r/190470 (owner: 10ArielGlenn) [16:28:00] ... [16:29:23] hm that looks like a bot :-d [16:31:25] bah I didn't take any action, perhaps it lost ssh connection to gerrit [16:36:27] It's back, btw :) [16:38:26] <^d> apergos: We never got back to that utf_normal dl() call in wmf-config [16:42:59] apergos: Converting the snapshot hosts to hhvm could actually make a lot of sense [16:43:14] for long running tasks you can save quite some time with hhvm even on cgi [16:43:21] * cli [16:43:42] Just as an slightly related note :P [16:43:55] grrrit-wm isn't fully back btw, did some gerrit activity but it didn't show up [16:46:57] True :/ [16:53:06] ^d: ugh you are right [16:53:16] and I would have been own this morning but now of course it's [16:53:20] 7 pm on a friday :-/ [16:53:23] meeeehhhh [16:53:28] <^d> Yeah I was gonna say it's already friday [16:53:30] *would have been down [16:53:37] <^d> friday the 13th too [16:53:41] hahahaha [16:53:46] <^d> let's not tempt the irony gods [16:53:51] no kidding [16:54:35] I'll get toollabs access first, meanwhile I don't see anything particularly wrong with gerrit except taking up a lot of cpu [16:55:01] <^d> apergos: I'm going to file a task for this so we don't 
forget over the weekend [16:55:18] thanks [16:55:31] I've talked myself into believing it will be fine except that [16:55:39] nothing ever takes 5 minutes [16:55:40] <^d> Yeah [16:57:58] <^d> Filed T89466, assigned to myself [17:02:50] ok. did you add me? [17:05:36] legoktm_ ^d trying to debug grrrit-wm but I'm lacking access to lolrrit-wm in toollabs, a little help? :p [17:05:56] <^d> apergos: cc'd, yeah [17:06:02] <^d> godog: I know zilch about the bot. [17:06:04] ok great [17:06:06] * ^d supposedly has access [17:06:38] ah there it is, I just wasn't seeing it show up in my inbox [17:07:04] I rebooted it. [17:07:20] Someone should rebase something random to test it [17:07:32] Or upload something productive. That works too [17:07:51] thanks Krenair [17:08:02] hey godog, [17:08:16] i'm looking at a graphite threshold check for nuria, and i'm confused by something [17:08:21] oi ottomata [17:08:35] role::graphite::production includes several monitoring::graphite_threshold checks [17:08:40] but, I think those are meant to only be included on one host [17:08:53] and, the only reason they are there, is because it doesn't matter what single host they are included on [17:08:58] and previously, there was only one graphite host [17:10:25] (looking) [17:12:04] ottomata: yep I guess they are there lacking another place where to be, you are right tho it can get included multiple times now the role, not sure what happens to exported resources in that case [17:12:21] me neither [17:12:27] i'm grepping for them on neon, and i don't actually see them [17:12:30] or, lemme grep harder..
[17:12:46] oh, yes i do [17:12:51] sorry, after sudo was in wrong dir :p [17:13:17] i think it will just make multiple checks on different hosts with the same data [17:13:24] hehe [17:14:26] ^d: https://gerrit.wikimedia.org/r/#/c/190475/ if you want to generate some gerrit activity :p [17:15:48] * ^d left a comment [17:16:06] Krenair: ^ [17:16:20] sigh [17:16:54] I see events from gerrit via ssh using my user btw, so that part works fine [17:17:48] Probably a gerrit-to-redis issue? [17:18:11] Not something I can fix [17:19:23] marktraceur, ^ [17:19:25] you can fix it [17:19:28] probably [17:19:36] https://wikitech.wikimedia.org/wiki/Grrrit-wm#Debugging_stuck_stream [17:20:16] Maybe [17:21:04] hopefully [17:21:25] Done [17:22:09] Krenair: I filed https://phabricator.wikimedia.org/T89468 and poked some people. If it's not fixed soon poke all the roots you can find again. It's a 2 minute task. [17:22:31] marktraceur, commented on https://gerrit.wikimedia.org/r/#/c/190475/ - no update here though :( [17:22:56] bd808: I'll take a look [17:23:03] thanks godog [17:24:41] bd808: done! [17:24:58] thanks godog [17:25:04] Krenair: Ugh. [17:25:09] Maybe I need to restart the bot too [17:25:11] thank you godog [17:25:17] I tried that already [17:25:21] you could try again though [17:26:10] marktraceur: did you restart gerrit-to-redis? [17:26:14] Maybe zuul is fucked...but wikibugs isn't talking either, so maybe there's something crazier happening [17:26:17] legoktm: Yeah [17:26:26] wikibugs is dead too? [17:26:44] (fyi, antoine is moving today, so not online :) ) [17:26:56] if both of those are dead I'd blame redis [17:27:09] maybe redis is dead [17:27:12] Fantastic [17:27:13] marktraceur: Zuul monitor looks fine to me. 
[17:27:16] OK [17:27:28] I'd guess redis, and lucky for me, I have no idea how to fix that [17:27:32] tools-redis:6379> ping [17:27:32] PONG [17:27:38] And I need to fix file deletion on betacommons anyway [17:27:50] ok so redis is clearly not entirely dead [17:29:19] hmm [17:29:24] wikibugs is getting events [17:29:37] 2015-02-13 17:26:30,764 - wikibugs.wb2-phab - DEBUG - {'projects': {'Wiki-Release-Team': {'uri': '/tag/wiki-release-team/', 'shade': 'violet', 'disabled': False, 'tagtype': 'users'}}, 'user': 'MarkAHershberger', 'title': 'Take DumpHTML as a use case of 3rd party extension that MediaWiki maintainers cannot ignore', 'url': 'https://phabricator.wikimedia.org/T536#1036966', 'comment': 'greg writes:\n\n> I move to close this task as "decline" [17:29:37] and move forward more positively.\n\nI\'m fine with "decline" but I would like to understand what you mean by\n"move forward positively".'} [17:30:31] and it's responding to ctcp.. [17:33:35] The bots are both down, it can't be their fault [17:33:40] I restarted wikibugs [17:33:43] Maybe it's Freenode? [17:34:35] tools-redis:6379> RPUSH legotest "help" [17:34:35] (error) OOM command not allowed when used memory > 'maxmemory'. [17:34:36] nope [17:34:46] redis is just out of memory [17:34:51] * legoktm checks nagf [18:16:15] hrmm... why isnt gerrit outputting to here? [18:16:28] which bot was that again? [18:16:41] I think toollabs is having issues? [18:17:01] someone else on my team is dealing wiht it i hope? [18:17:20] its the silence of the bots! [18:17:22] looks like it [18:17:33] !log morebots, you doing yer thing? [18:17:38] but I'm in a meeting totally paying attention to it and not IRC [18:17:39] Logged the message, Master [18:17:46] well, i have morebots, its the only bot i need. 
(or like) [18:18:28] specifically I think redis was OOM so gerrit-to-redis wasn't too happy about that [18:21:15] godog: i'm working on graphite2001 fyi (i know ticket had stalled, sorry about that) [18:21:26] so hopefully will hand off later today =] [18:21:33] robh: sweet, thanks! [18:25:12] robh, I tried turning things off and back on again to fix the gerrit bot [18:25:16] marktraceur and legoktm also looked [18:25:25] no luck :/ [18:25:52] valhallasw is looking into it right now :P [18:26:31] ah yes, I see [18:26:53] <^d> Also, wikibugs? [18:26:57] <^d> I haven't seen him all morning [18:27:06] ^d: same issue, both are dependent upon redis [18:27:18] <^d> mmk [18:27:35] <^d> Good thing we don't use redis for anything important around here! [18:28:46] <_joe_> legoktm: what's up with redis? [18:28:47] 3operations, Scrum-of-Scrums, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1037178 (10GWicke) [18:29:23] _joe_: tools-redis ran out of memory, which broke the bots that use it [18:29:32] <_joe_> oh TOOLS redis [18:29:33] <_joe_> ok [18:29:38] :P [18:29:49] <_joe_> I was already about to go on a prolonged rant :P [18:29:52] 3operations, Scrum-of-Scrums, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#21146 (10GWicke) [18:30:16] 3operations, Scrum-of-Scrums, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#21147 (10GWicke) [18:32:23] who are the transit providers at ULSFO? [18:32:28] 3operations, OTRS: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#1037186 (10pajz) If I understand the documentation correctly, I would suggest, by the way, that, should we decide to implement this, we do not disable SessionCheckRemoteIP but only SessionDeleteIfNotRemoteID. As... [18:32:39] we just got approved for a big discount for ISP services from the CaPUC.
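The `OOM command not allowed when used memory > 'maxmemory'` error legoktm hit above fires when redis has a maxmemory limit and no eviction policy frees space for writes. A sketch of the headroom arithmetic, as a pure function over INFO-style values (the numbers are made up; with redis-py this would be roughly `r.info("memory")["used_memory"]` and `r.config_get("maxmemory")`):

```python
def memory_headroom(info, maxmemory):
    """Bytes remaining before writes start failing with
    "OOM command not allowed" on a redis with maxmemory set and
    noeviction policy. Zero means RPUSH and friends will be rejected."""
    return max(0, maxmemory - info["used_memory"])

# Hypothetical tools-redis-like numbers: 7.8 GB used of an 8 GB cap.
info = {"used_memory": 7_800_000_000}
print(memory_headroom(info, 8_000_000_000))  # 200000000
print(memory_headroom({"used_memory": 8_100_000_000}, 8_000_000_000))  # 0
```

That also explains why `PING` still answered PONG earlier: reads and non-growing commands keep working, only memory-consuming writes are refused.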
[18:35:38] cajoel: GTT/TiNet and Telia [18:36:03] jgage: do you have support/billing contacts for them? [18:36:17] hm no, i don't know where that info lives [18:37:03] https://office.wikimedia.org/wiki/Vendor_Contact_List#ISP [18:37:26] no Telia there [18:37:29] Telia? [18:37:34] never heard of em :) [18:38:13] paravoid is the best person to ask about this while mark is on vacation [18:38:32] I call him Faidon. :) [18:38:34] thx [18:38:37] :) [18:39:21] 3operations: Problem accessing services like Graphite and Logstash - https://phabricator.wikimedia.org/T89474#1037201 (10Aklapper) Works for me with my Gerrit / wikitech.wm.org password for http. Sure you double-checked the usual stuff (no typos in pw, correct address, browser accepts cookies)? http://graphite.... [18:39:47] 3operations: Cannot log into Graphite and Logstash with my wikitech password: Unauthorized error - https://phabricator.wikimedia.org/T89474#1037202 (10Aklapper) [18:53:03] 3operations, Scrum-of-Scrums, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1037240 (10GWicke) [18:53:37] 3operations, OTRS: Make OTRS sessions IP-address-agnostic - https://phabricator.wikimedia.org/T87217#1037242 (10csteipp) From just a quick look at their code, it looks to me like as long as SessionCheckRemoteIP is enabled, then when the user requests a page from the site and their IP has changed, they won't have... [18:55:26] 3operations, Scrum-of-Scrums, hardware-requests, RESTBase: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#1037247 (10RobH) a:5RobH>3None [18:56:26] 3operations, Scrum-of-Scrums, hardware-requests, RESTBase: RESTBase production hardware - https://phabricator.wikimedia.org/T76986#824247 (10RobH) As the actual hardware request via this ticket is done, I pulled myself off the assigned list. I cannot quite resolve it yet, since we have to rack and make the last... 
[18:56:38] 3operations, Scrum-of-Scrums, hardware-requests, RESTBase: RESTBase production hardware - 4 of 6 ready - https://phabricator.wikimedia.org/T76986#1037255 (10RobH) [18:57:15] 3operations, RESTBase-Cassandra: Make the cassandra module use hiera properly - https://phabricator.wikimedia.org/T76149#1037259 (10GWicke) [18:58:11] 3operations, RESTBase: Set up cassandra monitoring - https://phabricator.wikimedia.org/T78514#1037264 (10GWicke) [18:58:35] 3operations, Labs, hardware-requests, ops-eqiad: Can virt1000 take more ram? - https://phabricator.wikimedia.org/T89266#1037265 (10RobH) a:5Cmjohnson>3Andrew @andrew & @coren: When did you guys want to take virt1000 down to have more memory installed? It sound like you guys are not in agreement if this is... [18:58:51] 3operations, Labs, hardware-requests, ops-eqiad: virt1000 memory upgrade - https://phabricator.wikimedia.org/T89266#1037268 (10RobH) [18:59:10] i didnt miss you at all wikibuygs [18:59:13] wikibugs even. [18:59:31] * robh only turned off the ignore because folks kept doing robh: ^ that ticket [19:00:51] 3operations, RESTBase: Set up cassandra monitoring - https://phabricator.wikimedia.org/T78514#1037279 (10GWicke) [19:01:38] 3operations, RESTBase: Public entry point for RESTBase - https://phabricator.wikimedia.org/T78194#1037285 (10GWicke) [19:01:39] 3operations, Scrum-of-Scrums, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#21702 (10GWicke) [19:03:06] 3operations, Scrum-of-Scrums, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1037298 (10GWicke) [19:03:34] 3operations, RESTBase: Public entry point for RESTBase - https://phabricator.wikimedia.org/T78194#839029 (10GWicke) [19:03:50] 3operations: Cannot log into Graphite and Logstash with my wikitech password: Unauthorized error - https://phabricator.wikimedia.org/T89474#1037309 (10mforns) Thanks for the quick response :] Yes, I double-checked username and password, cookies are on. 
And I can access wikitech, sure. But when I try to access,... [19:03:57] 3operations, RESTBase: Public entry point for RESTBase - https://phabricator.wikimedia.org/T78194#1037311 (10GWicke) [19:03:58] 3operations, Scrum-of-Scrums, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#23528 (10GWicke) [19:06:14] 3operations, ops-eqiad: wipe lsearch machines - https://phabricator.wikimedia.org/T88352#1037328 (10Cmjohnson) Forgot to add that racktables has been updated and the servers were added to spare server wikitech page. [19:07:23] 3operations, RESTBase-Cassandra: Make the cassandra module use hiera properly - https://phabricator.wikimedia.org/T76149#1037338 (10GWicke) [19:07:38] 3operations: Cannot log into Graphite and Logstash with my wikitech password: Unauthorized error - https://phabricator.wikimedia.org/T89474#1037340 (10Legoktm) Are you in the wmf or nda ldap groups? logstash and graphite access are restricted to those groups. [19:10:20] 3operations, Scrum-of-Scrums, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1037348 (10GWicke) [19:11:35] 3operations, Scrum-of-Scrums, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#783172 (10GWicke) [19:12:06] 3operations: alternatives to racktables ? 
- https://phabricator.wikimedia.org/T84001#1037359 (10Dzahn) [19:12:33] (03PS1) 10Faidon Liambotis: trebuchet: fix provider to be init system agnostic [puppet] - 10https://gerrit.wikimedia.org/r/190496 [19:12:54] 3RESTBase, Ops-Access-Requests: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1037368 (10GWicke) [19:12:55] 3operations, Scrum-of-Scrums, Services, RESTBase: RESTbase deployment - https://phabricator.wikimedia.org/T1228#1037367 (10GWicke) [19:14:06] (03CR) 10Ori.livneh: [C: 031] trebuchet: fix provider to be init system agnostic [puppet] - 10https://gerrit.wikimedia.org/r/190496 (owner: 10Faidon Liambotis) [19:14:58] (03CR) 10Jhernandez: [C: 031] Enable gather extension on en beta labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/189863 (owner: 10Robmoen) [19:15:12] (03CR) 10Faidon Liambotis: [C: 032] trebuchet: fix provider to be init system agnostic [puppet] - 10https://gerrit.wikimedia.org/r/190496 (owner: 10Faidon Liambotis) [19:15:39] 3operations, Labs, hardware-requests, ops-eqiad: virt1000 memory upgrade - https://phabricator.wikimedia.org/T89266#1037376 (10coren) No, don't worry about it - it's definitely wanted, I'm just wondering wether it is //sufficient//. At any rate, since this may make labs clunky for a while we probably want to av... [19:21:05] (03CR) 10Dzahn: "well, the thing is that firewall that exists is not on silver, but now this role is on silver" [puppet] - 10https://gerrit.wikimedia.org/r/190147 (owner: 10Dzahn) [19:22:41] (03CR) 10Dzahn: "this changed since the virt1000/silver split" [puppet] - 10https://gerrit.wikimedia.org/r/190147 (owner: 10Dzahn) [19:23:34] 3RESTBase, Ops-Access-Requests: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1037393 (10GWicke) p:5Triage>3High [19:24:05] (03CR) 10Dzahn: "robh said he would prefer the mgmt entries to stay until these hosts are physically removed from the rack. 
my point is more that now you c" [dns] - 10https://gerrit.wikimedia.org/r/190150 (owner: 10Dzahn) [19:24:07] 3RESTBase, Ops-Access-Requests: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1034732 (10GWicke) Setting priority to high as this blocks us from starting testing on prod hardware & Jessie. [19:25:32] 3operations: Cannot log into Graphite and Logstash with my wikitech password: Unauthorized error - https://phabricator.wikimedia.org/T89474#1037401 (10Dzahn) What Legoktm said. The difference is the membership in these additional LDAP groups. [19:26:30] 3operations: Cannot log into Graphite and Logstash with my wikitech password: Unauthorized error - https://phabricator.wikimedia.org/T89474#1037405 (10Dzahn) a:3Dzahn [19:28:34] 3operations: Cannot log into Graphite and Logstash with my wikitech password: Unauthorized error - https://phabricator.wikimedia.org/T89474#1037408 (10Dzahn) @mforns Please try again now, i added you to the WMF group. ``` modify-ldap-group --addmembers=mforns wmf ldaplist -l group wmf | grep mforns member:... [19:29:37] 3operations: Cannot log into Graphite and Logstash with my wikitech password: Unauthorized error - https://phabricator.wikimedia.org/T89474#1037410 (10Dzahn) p:5Triage>3Normal [19:30:40] PROBLEM - puppet last run on rcs1001 is CRITICAL: CRITICAL: Puppet has 1 failures [19:31:41] RECOVERY - puppet last run on rcs1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [19:32:51] 3RESTBase, Ops-Access-Requests: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1037416 (10Dzahn) Checked data.yaml and we have an existing "cassandra-test-roots" 66 description: users with root on cassandra hosts 67 members: [gwicke, ssastry, jdouglas, mobrovac]... 
[19:33:33] 3RESTBase, Ops-Access-Requests: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1037417 (10Dzahn) I suppose we just need to make a similar one "restbase-roots" and apply that on the nodes. [19:33:58] 3operations: Cannot log into Graphite and Logstash with my wikitech password: Unauthorized error - https://phabricator.wikimedia.org/T89474#1037418 (10mforns) I'm not sure. I looked for these groups in (operations/puppet)./modules/admin/data/data.yaml and I didn't find them. Probably this is not the place to loo... [19:36:10] 3operations: Cannot log into Graphite and Logstash with my wikitech password: Unauthorized error - https://phabricator.wikimedia.org/T89474#1037424 (10mforns) Oh, sorry. I did not F5 the page for a long time and then send the late comment. So, now it works! Thank you very much! Marcel [19:36:28] 3operations: Cannot log into Graphite and Logstash with my wikitech password: Unauthorized error - https://phabricator.wikimedia.org/T89474#1037425 (10Dzahn) @mforns you can't find these groups in puppet, they are LDAP groups. shell access to terbium is required to list and modify them. But i just did that and a... [19:37:01] 3operations: Cannot log into Graphite and Logstash with my wikitech password: Unauthorized error - https://phabricator.wikimedia.org/T89474#1037427 (10Dzahn) 5Open>3Resolved ok, cool :) resolving this ticket then. [19:38:16] 3operations: Cannot log into Graphite and Logstash with my wikitech password: Unauthorized error - https://phabricator.wikimedia.org/T89474#1037431 (10mforns) Ok, thanks @dzahn! 
It's working now :] Cheers [19:45:49] (03PS1) 10Dzahn: create admin group restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/190500 (https://phabricator.wikimedia.org/T89366) [19:45:56] (03CR) 10RobH: "So if server X and server Y have the same entries, and server X and server Y have the same mgmt ip, removing the FQDN of one won't solve t" [dns] - 10https://gerrit.wikimedia.org/r/190150 (owner: 10Dzahn) [19:50:10] (03PS1) 10Ori.livneh: trebuchet: use salt to check on salt-minion [puppet] - 10https://gerrit.wikimedia.org/r/190501 [19:51:31] (03CR) 10Ori.livneh: trebuchet: use salt to check on salt-minion (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/190501 (owner: 10Ori.livneh) [19:53:26] 3operations: Check that the redis roles can be applied in codfw, set up puppet. - https://phabricator.wikimedia.org/T86898#1037459 (10RobH) [19:53:27] 3operations: Setup redis clusters in codfw - https://phabricator.wikimedia.org/T86887#1037460 (10RobH) [19:53:28] 3operations, ops-codfw, hardware-requests: Procure and setup rdb2001-2004 - https://phabricator.wikimedia.org/T86896#1037457 (10RobH) 5Open>3stalled The dell quote https://rt.wikimedia.org/Ticket/Display.html?id=9172 has been escalated to Faidon for purchase approval. 
[19:53:44] 3operations, ops-codfw, hardware-requests: Procure and setup rdb2001-2004 - in purchase approvals process - https://phabricator.wikimedia.org/T86896#1037461 (10RobH) [19:55:45] (03PS2) 10Dzahn: create admin group restbase-roots [puppet] - 10https://gerrit.wikimedia.org/r/190500 (https://phabricator.wikimedia.org/T89366) [19:59:09] (03CR) 10Dzahn: "fair, i'll wait until we resolved a ticket that is for physically removing these hosts" [dns] - 10https://gerrit.wikimedia.org/r/190150 (owner: 10Dzahn) [19:59:30] (03Abandoned) 10Dzahn: remove old ipmi entries in esams [dns] - 10https://gerrit.wikimedia.org/r/190150 (owner: 10Dzahn) [20:00:06] (03Abandoned) 10Dzahn: move all files/icinga to modules/icinga/files [puppet] - 10https://gerrit.wikimedia.org/r/187087 (owner: 10Dzahn) [20:00:33] (03Abandoned) 10Dzahn: add hue.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/180471 (owner: 10Dzahn) [20:00:58] Dear opsen, we're deprecating a Special page endpoint, and would like advice on which status code we should return, if any. Is 204 already included in our reporting? Would 30k responses/day at that code be spam-drowning other information, or is that fine? [20:01:05] AndyRussG: ^^ [20:03:53] 410 - GONE ? [20:04:04] Reedy: MaxSem: Jeff_Green: paravoid: ^ [20:04:17] mutante: thanks! [20:04:35] well, wait for more replies, but this seemed like it could fit "Indicates that the resource requested is no longer available and will not be available again. This should be used when a resource has been intentionally removed" [20:04:46] 3Labs, operations: Fix php5 cli conf.d symlinks on silver - https://phabricator.wikimedia.org/T89468#1037475 (10Krenair) [20:06:13] :)} [20:07:08] Oh, wikibugs is back, what was wrong with it? [20:08:59] marktraceur: uh, oh--and it's has an oddly stiff walk. Why is Max barking? [20:09:06] Wolfie's just fine, honey. [20:09:22] Who's Max? [20:09:34] (03CR) 10Ottomata: "Don't need to abandon! I'm doing the Hadoop upgrade on Monday!" 
[dns] - 10https://gerrit.wikimedia.org/r/180471 (owner: 10Dzahn) [20:09:51] aww, Terminator 2 reference [20:12:03] (03Restored) 10Dzahn: add hue.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/180471 (owner: 10Dzahn) [20:16:16] jzerebecki: re 184360 . is it really extensions/Wikidata/extensions/Wikibase/lib/maintenance/ AND extensions/Wikidata/extensions/Wikibase/repo/maintenance/ ? [20:17:02] mutante: Yeah [20:17:18] hoo: ok:) tx [20:17:19] 3Ops-Access-Requests, RESTBase: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1037540 (10GWicke) @dzahn, this is for the production cluster which we are just spinning up (T76986). Otherwise the structure is pretty much the same (only renamed cassandra-roots to cassandra-test-r... [20:17:22] :) [20:17:27] win 2 [20:17:31] hey mutante, i'm trying to fix this https://phabricator.wikimedia.org/T89447 [20:17:44] and I feel like i asked this question when we were talking about the hue ssl stuff [20:17:51] stats.wikimedia.org is now behind misc-web-lb [20:18:02] https is handled by some nginx instance somewhere (not even sure where) [20:18:16] but, i want to force https, especially for a certain part of the site [20:18:20] ottomata: yea, so you need this: [20:18:50] RewriteCond %{HTTP:X-Forwarded-Proto} !https [20:18:50] RewriteCond %{REQUEST_URI} !^/status$ [20:18:50] RewriteRule ^/(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,E=ProtoRedirect] [20:18:53] Header always merge Vary X-Forwarded-Proto env=ProtoRedirect [20:19:11] ah, well, in Apache [20:19:24] when it's behind misc-web and you want to enforce https [20:19:36] hm ok [20:19:54] nginx must be similar, you want a condition based on X-Forwarded-Proto NOT being https [20:20:25] the exemption for the status page you can skip [20:20:32] awight: re. status code, i dunno enough about our log collection to know why 2xx or 4xx would make a difference [20:20:46] yay! 
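(A minimal sketch of the nginx analog mutante describes above — redirect when the frontend's X-Forwarded-Proto header is not https. The server_name and the absence of the /status exemption are illustrative assumptions, not the production config:)

```nginx
# Behind misc-web-lb the frontend terminates TLS and forwards plain
# HTTP with an X-Forwarded-Proto header; redirect anything that did
# not arrive over HTTPS at the frontend.
server {
    listen 80;
    server_name stats.wikimedia.org;  # illustrative

    if ($http_x_forwarded_proto != "https") {
        return 301 https://$host$request_uri;
    }

    # Let caches distinguish the redirect from the real response,
    # mirroring the "Vary: X-Forwarded-Proto" header in the Apache rules.
    add_header Vary X-Forwarded-Proto;
}
```

(As in the Apache version, a health-check path like /status could be exempted with an extra condition; per mutante that exemption can be skipped here.)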
[20:20:48] that works [20:20:49] thank you [20:20:53] :) yw [20:21:59] Jeff_Green: I think we're going with 410, cos it means "no intention to ever serve this endpoint again" [20:22:27] this is an endpoint that's hit due to client side js? [20:23:07] (03PS1) 10Ottomata: Fix redirect loop in stats.wikimedia.org/geowiki-private [puppet] - 10https://gerrit.wikimedia.org/r/190504 (https://phabricator.wikimedia.org/T89447) [20:28:39] (03CR) 10Ottomata: [C: 032] Fix redirect loop in stats.wikimedia.org/geowiki-private [puppet] - 10https://gerrit.wikimedia.org/r/190504 (https://phabricator.wikimedia.org/T89447) (owner: 10Ottomata) [20:29:32] Jeff_Green: old, old client-side JS that no sane cache would ever still be hanging onto, but that still sends us 30k requests per day [20:32:02] AndyRussG: seems reasonable [20:32:16] Jeff_Green: cool thanks! [20:35:43] (03PS4) 10Dzahn: Enable wikibase change dispatcher and pruning for test.wikidata [puppet] - 10https://gerrit.wikimedia.org/r/184360 (https://phabricator.wikimedia.org/T87026) (owner: 10Aude) [20:37:00] (03CR) 10Dzahn: [C: 032] Enable wikibase change dispatcher and pruning for test.wikidata [puppet] - 10https://gerrit.wikimedia.org/r/184360 (https://phabricator.wikimedia.org/T87026) (owner: 10Aude) [20:40:37] (03CR) 10Dzahn: "same as existing jobs for prod.wikidata, just for test.wikidata" [puppet] - 10https://gerrit.wikimedia.org/r/184360 (https://phabricator.wikimedia.org/T87026) (owner: 10Aude) [20:42:43] (03CR) 10Dzahn: "bug was resolved, but per aude "btw, https://gerrit.wikimedia.org/r/#/c/184360/ is still needed for test.wikipedia and test2 to automatica" [puppet] - 10https://gerrit.wikimedia.org/r/184360 (https://phabricator.wikimedia.org/T87026) (owner: 10Aude) [20:45:40] 3Ops-Access-Requests: Requesting sudo access to vanadium for mforns - https://phabricator.wikimedia.org/T89471#1037605 (10Dzahn) p:5Triage>3Normal [20:46:04] 3operations, Ops-Access-Requests: Requesting sudo access to vanadium for mforns - 
https://phabricator.wikimedia.org/T89471#1036995 (10Dzahn) [20:47:32] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [20:52:39] 3operations, Ops-Access-Requests: Requesting sudo access to vanadium for mforns - https://phabricator.wikimedia.org/T89471#1037637 (10Dzahn) Hi, so access to vanadium would mean the admin group 'eventlogging-admins' ( description: Login access for EventLogging investigation) current members are: [nuria, milim... [20:54:13] 3operations, Deployment-Systems: trebuchet puppet provider broken on systems without upstart - https://phabricator.wikimedia.org/T89461#1037642 (10GWicke) @faidon reported what looks like a missing submodule checkout: ``` restbase1003 restbase[25378]: module.js:340 restbase1003 restbase[25378]: throw err;... [20:56:49] 3operations, Ops-Access-Requests: Requesting sudo access to vanadium for mforns - https://phabricator.wikimedia.org/T89471#1037646 (10Dzahn) or did you mean to request root on vanadium when you say sudo permits. it could mean a few specific commands or it could mean ALL (ALL). That would be a difference to us an... [20:59:38] 3Ops-Access-Requests, RESTBase: Access to restbase / cassandra cluster - https://phabricator.wikimedia.org/T89366#1037657 (10Dzahn) @gwicke alright, cassandra-roots vs. cassandra-test-roots makes sense now. 
[21:02:14] 3operations, Ops-Access-Requests: Give Tyler Cipriani shell access (with access to CI systems as well) - https://phabricator.wikimedia.org/T89378#1037664 (10Dzahn) [21:02:31] 3operations, Ops-Access-Requests: Requesting access to ANALYTICS RESOURCES for joal - https://phabricator.wikimedia.org/T89357#1037665 (10Dzahn) [21:02:43] 3operations, Ops-Access-Requests: Requesting sudo for hafnium for nuria - https://phabricator.wikimedia.org/T88988#1037667 (10Dzahn) [21:03:36] @seen spage [21:03:36] mutante: I have never seen spage [21:06:08] 3operations: decrease negative cache TTL for lookups from MTA to google (WMF google apps) - https://phabricator.wikimedia.org/T84600#929276 (10Dzahn) [21:06:43] wm-bot: you haven't lived yet! [21:07:49] greg-g: mutante: Jeff_Green: is sending a tirade into wfLogWarning any less spammy than hhvm.log? Or is it the same? Or nothing? [21:07:57] mutante, you made it public and then non-public again? it doesn't look very sensitive to me [21:08:06] Krenair: Hi! ^ [21:08:36] Also, are there restrictions on what info from web request headers I can send into a debug log? [21:08:44] (03PS1) 10Ori.livneh: xenon: drop minwidth arg from flamegraph.pl invocation [puppet] - 10https://gerrit.wikimedia.org/r/190513 [21:09:13] AndyRussG: simply moving the log message to another log isn't fixing it :) [21:09:44] greg-g: it'd be a different, more informative message, in a patch that we could deploy and then undeploy right away [21:10:17] Krenair: yea, it was a mistake, i noticed an issue with that right after doing that [21:10:19] ah, so it's just a live hack debug thing, basically? [21:10:41] greg-g: yep. Should we do as a live hack rather than merge and revert? [21:10:41] AndyRussG, throw an exception? that would give you a backtrace [21:10:54] merge/revert is better [21:10:59] ok [21:11:31] MaxSem: we're pretty certain about the code path, but want to know why we're getting multiple hhvm fatal errors for a single web request. 
[21:12:05] The idea was to dump all the request headers so we can see if the requests all come from the same back-end cache server. [21:12:21] even when it fatals, some termination handlers/destructors can still be called [21:14:35] MaxSem: maybe... Our logging could include PID or something. Is there an equivalent? [21:15:05] how a pid would help? [21:15:23] awight: MaxSem: I think confirmation of code path would be nice, though also referrer and original request IP (is that allowed and doable)? [21:15:31] MaxSem: to distinguish between multiple requests from varnish, or multiple executions for one request. [21:15:55] mmm [21:16:53] i wish there was a way to make puppet give more details when it says Detail: wrong number of arguments (0 for 1) [21:17:26] because my arg is there, and i get a different error when i remove it [21:18:34] 3operations, Ops-Access-Requests: Access request for stat1003 - https://phabricator.wikimedia.org/T89418#1037706 (10Dzahn) [21:19:36] 3operations, Ops-Access-Requests: Access request for stat1003 - https://phabricator.wikimedia.org/T89418#1037709 (10Dzahn) p:5Triage>3High raising priority per "timeline: 2/18 or earliest possible" [21:22:38] !log demon Synchronized php-1.25wmf16/includes/resourceloader/ResourceLoaderImage.php: Debug fun (duration: 00m 05s) [21:22:42] Logged the message, Master [21:24:10] !log demon Synchronized wmf-config/InitialiseSettings.php: Debug fun (duration: 00m 05s) [21:24:13] Logged the message, Master [21:33:14] 3operations, Ops-Access-Requests: access request for researcher to analytics-users in Hadoop - https://phabricator.wikimedia.org/T89264#1037717 (10Dzahn) [21:34:17] 3operations, Ops-Access-Requests: access request for researcher to analytics-users in Hadoop - https://phabricator.wikimedia.org/T89264#1037719 (10Dzahn) p:5Triage>3High raising priority because due date is Monday [21:35:43] 3operations: Add another redis jobqueue server master and slave - https://phabricator.wikimedia.org/T89400#1037730 
(10Dzahn) p:5Triage>3Normal [21:36:00] 3operations: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1037733 (10Dzahn) p:5Triage>3Normal [21:36:38] 3operations, Continuous-Integration: Create a Debian package for NodePool - https://phabricator.wikimedia.org/T89142#1037736 (10Dzahn) p:5Triage>3Normal [21:37:26] 3operations, MediaWiki-extensions-GWToolset, Multimedia: Can Commons support a mass upload of 14 million files (1.5 TB)? - https://phabricator.wikimedia.org/T88758#1037743 (10Dzahn) p:5Triage>3Normal [21:37:49] !log demon Synchronized php-1.25wmf16/includes/resourceloader/ResourceLoaderImage.php: debug time over (duration: 00m 05s) [21:37:54] Logged the message, Master [21:38:06] !log demon Synchronized wmf-config/InitialiseSettings.php: debug time over (duration: 00m 05s) [21:38:08] Logged the message, Master [21:38:15] manybubbles: is T86602 already done? [21:38:48] <^demon|busy> It's in progress [21:38:56] <^demon|busy> godog started it this week [21:39:04] mutante: I'm intentionally letting ops do it on their own! [21:40:41] ok, that's all i wanted to know (in progress) [21:40:45] tx [21:41:05] 3operations: Rolling restart for Elasticsearch to pick up new version of wikimedia-extra plugin - https://phabricator.wikimedia.org/T86602#1037793 (10Dzahn) p:5Triage>3Normal [21:42:52] awight mutante ^demon|busy Krenair: Jeff_Green: bblack: where does wfLogWarning go on production? anywhere? [21:43:02] * AndyRussG apologizes for pinging wildly [21:43:10] I'm not actually sure. [21:44:03] <^demon|busy> I think it goes in the general debug log? [21:44:12] <^demon|busy> docs seem to say so [21:44:27] ^demon|busy: I thought there was no debug logging on production? 
(Um, I see this does not contradict what u said :) [21:44:59] <^demon|busy> Which is the next thing I was going to say..."which we don't log in prod" [21:45:00] <^demon|busy> :) [21:45:02] hehe [21:45:11] <^demon|busy> wfDebugLog() is my favorite logging tool [21:45:15] GRR [21:45:24] <^demon|busy> wfDebugLog( "SomeLogName", "My log message goes here" ) [21:45:32] <^demon|busy> Then configure SomeLogName in InitialiseSettings [21:45:46] oh. Wait, are you saying we *could* scrape off some debug logging on production? [21:46:01] <^demon|busy> We do lots of debug logging in prod. [21:46:03] It's only the general log which is >/dev/null'ed? [21:46:07] Awesome. Thank you. [21:46:09] <^demon|busy> The general "everything else" log isn't logged though [21:46:13] * Send a warning as a PHP error and the debug log. This is intended for logging [21:46:13] * warnings in production. For logging development warnings, use wfWarn instead. [21:46:15] 3operations: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1037820 (10Dzahn) Hi, stat1002 and stat1003 have different roles and i don't see a common one. stat1002 has a bunch of role::analytics::* roles but stat1003 is just "role::statistics::cruncher". Is this really ne... [21:46:15] <^demon|busy> (because it'd be way too noisy to be useful) [21:46:19] (doc for wfLogWarning) [21:47:01] 3operations, Analytics-Cluster: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1037822 (10Dzahn) [21:47:11] 3operations, Analytics-Engineering, Analytics-Cluster: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1035650 (10Dzahn) [21:47:41] ^demon|busy: so for a quick merge-then-revert-to-log-some-loggy-goodness, I'll use wfDebugLog and look for the results... where?
[21:47:56] I think fluorine:/a/mw-log [21:48:23] <^demon|busy> Yeah, it'll end up in whatever you tell it to be called in InitialiseSettings [21:48:25] <^demon|busy> Lemme dig it up [21:48:49] <^demon|busy> All in /a/mw-log/ on fluorine [21:49:34] $wgDebugLogGroups, rather than $wgDebugLogFile, I guess? [21:49:42] <^demon|busy> Yep [21:49:43] yeah [21:49:44] k [21:49:49] AndyRussG, honestly we were just logging something without even going via gerrit [21:50:06] Krenair: ? [21:50:11] Ah right, I see [21:50:16] Hmmm [21:50:22] Krenair: I hear it's better to merge and revert [21:50:45] The nice part about that is it gives us an audit trail of when we broke everything :) [21:50:51] 3operations, Analytics-Engineering, Analytics-Cluster: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1037826 (10Dzahn) stat1003 is a Statistics general compute node (non private data) (role::statistics::cruncher) stat1002 is a analytics server (role::analytics) stat1002 is... [21:50:52] ^demon|busy: Krenair: also, is there anything forbidden about writing full request headers to a log? as in, json_encode( $this->getRequest()->getAllHeaders() ) [21:51:57] I'm not sure. [21:52:27] <^demon|busy> Definitely wouldn't want to hang onto those for long... [21:52:32] I wouldn't do it. [21:52:46] <^demon|busy> csteipp said ok but looked a little sad when he did [21:53:20] I think we only need the backend cache server hostname. AndyRussG ? [21:53:26] I think I said, "as long as it's temporary and you delete it afterward" [21:53:40] <^demon|busy> You still looked sad saying it :p [21:54:35] All private data logging makes me sad. Very sad... like https://twitter.com/nihilist_arbys sad. [21:54:40] ^demon|busy: csteipp: Krenair I have no idea how to delete it, so maybe I won't [21:55:04] awight: what is that header? [21:57:10] I don't know yet :( Externally, it's in X-Cache [21:57:36] csteipp: ^^ which you would not consider private data, true? 
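(Putting the wfDebugLog() recipe from ^demon|busy together in one place — a minimal sketch; the group name and file path reuse the examples from the discussion and are illustrative, not the actual patch:)

```php
// Extension code: write a message to a named debug-log group.
wfDebugLog( 'SomeLogName', 'My log message goes here' );

// Configuration (InitialiseSettings.php / LocalSettings.php): route the
// group to a file via $wgDebugLogGroups. On the cluster these land in
// /a/mw-log/ on fluorine; groups with no entry here fall into the
// general "everything else" log, which prod discards as too noisy.
$wgDebugLogGroups['SomeLogName'] = '/a/mw-log/SomeLogName.log';
```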
[21:59:01] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 1 failures [21:59:56] ^demon|busy: csteipp: Krenair: ^ I meant, maybe I won't do it... [22:00:04] awight: That data is not, as long as you're not tying it to an incoming request. [22:00:27] csteipp: okay, thanks [22:01:07] be careful with things such as actual client IPs, user agents, etc. [22:01:48] Krenair: yes, that sounds like asking for a bad case of subpoena or something [22:01:52] it's probably OK to log which cache server it came from, I'd imagine [22:04:46] 3operations, Phabricator: enable email for tickets in domains project? - https://phabricator.wikimedia.org/T88842#1037878 (10Dzahn) >>! In T88842#1036244, @Aklapper wrote: > That was intentional, because //by default// projects are capitalized in Phab (except for when there are good reasons, like package names).... [22:05:13] (03PS1) 10Awight: Debug logging group for CentralNotice issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190571 [22:07:58] AndyRussG: ^^ [22:09:18] awight: Hmm! Looks fun but I have no rights nor undersanding there ... [22:10:05] (03CR) 10Reedy: [C: 031] Debug logging group for CentralNotice issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190571 (owner: 10Awight) [22:15:11] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [22:22:49] 3operations, Datasets-General-or-Unknown: dumps.wikimedia.org seems super-slow right now - https://phabricator.wikimedia.org/T45647#1037911 (10Henrik) I still get ~100KB/sec with the current limits. At that speed, downloading an hours worth of views takes ~15 minutes, so downloading a day is thus ~6 hours. It's... [22:39:34] 3operations, Phabricator: enable email for tickets in domains project? - https://phabricator.wikimedia.org/T88842#1037974 (10Dzahn) >>! In T88842#1036244, @Aklapper wrote: > Might be worth to add some notice to the project description why it needs to be lower case? 
done [22:41:43] 3operations, Services: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1037981 (10Dzahn) p:5Triage>3Normal [22:41:44] (03CR) 10Awight: [C: 032] "Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190571 (owner: 10Awight) [22:44:59] (03Merged) 10jenkins-bot: Debug logging group for CentralNotice issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190571 (owner: 10Awight) [22:47:14] 3operations, Analytics-Engineering, Analytics-Cluster: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1037997 (10Halfak) So, both stat1002 and stat1003 have access to private data. I can't comment about puppet roles. I'm not sure what you are asking for WRT a "centralized... [22:47:57] (03PS2) 10Dzahn: static-bz: rewrite /show_bug.cgi to static HTML [puppet] - 10https://gerrit.wikimedia.org/r/190132 (https://phabricator.wikimedia.org/T85140) [22:49:16] 3operations, Analytics-Engineering, Analytics-Cluster: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1038004 (10Dzahn) I'm asking because to add the packages in puppet we need to decide which role to put it on. We install things in role classes which are applied to host nam... [22:50:25] 3operations, Analytics-Engineering, Analytics-Cluster: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1038013 (10Dzahn) a:3Ottomata [22:51:21] 3operations, Analytics-Engineering, Analytics-Cluster: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1035650 (10Dzahn) Given to @ottomata for his advice where the puppet code should be added. [22:52:03] !log awight Synchronized php-1.25wmf16/extensions/CentralNotice: CentralNotice fixes for T89258 and T45250 (duration: 00m 07s) [22:52:07] AndyRussG: ^^ ! 
[22:52:08] Logged the message, Master [22:52:24] wooo [22:53:22] !log awight Synchronized php-1.25wmf17/extensions/CentralNotice: CentralNotice fixes for T89258 and T45250 (duration: 00m 06s) [22:53:24] Logged the message, Master [22:53:34] AndyRussG: okay, those are done. Now I'll set up config for the logging patch [22:56:38] !log awight Synchronized wmf-config: Set up a new debug logging group for T89258 (duration: 00m 06s) [22:56:41] Logged the message, Master [22:56:51] ooh boy. [22:56:52] rsync warning: some files vanished before they could be transferred (code 24) at main.c(1655) [generator=3.1.0] [22:57:01] ok [22:57:02] file has vanished: "/wmf-config/.InitialiseSettings.php.e5YkUB" (in common) [22:57:05] nbd [22:57:15] although--not my editor? [22:57:29] awight: whoo! Everything looks fine for CN banners on both mobile and desktop [22:57:35] whew! [22:57:37] Re: the settings.. wut? [22:57:54] It's an evil twin of my swp file or something [22:58:03] I'll log into an app server to verify the sync [23:00:44] greg-g: there's a CA patch we need to deploy: https://gerrit.wikimedia.org/r/190579, it's not confirming peoples emails and we just sent out a bunch of emails asking people to confirm their emails :/ [23:02:01] legoktm: doit [23:02:49] awight: that could be due to the rsync --delay-update flag? it stashes the new files on the destination then when they're all there mv's them in place en masse [23:03:06] why it would disappear is worrisome though [23:03:11] awight: are you done deploying? [23:03:12] oh cool. Yeah that makes it actually worrisome. [23:03:21] legoktm: one more piece, should take 15 min [23:03:27] ok [23:03:45] legoktm: I'm only touching CentralNotice, feel free to coordinate with me, yours sounds like a rush [23:03:48] ? [23:03:50] oh yes. [23:04:01] legoktm: please go for it, I can finish when you're done. 
[23:04:06] awight: or maybe I'm misinterpreting it: maybe the file was on the source that it thought was there but then vanish, if that's the case, i wouldn't worry since it is a tmp file [23:04:28] greg-g: but wasn't that needed for moving to the new path? [23:04:39] Now I'm thinking some of the app servers did not receive the update? [23:04:56] I mean, if somehow someone made the .IS.php.asfd file with an editor then closed it while you were sync'ing [23:05:15] we'd probably notice a chunk not having initialisesettings [23:05:25] awight: I'm going to need a little while to make the submodule bumps so you might as well finish [23:05:34] I mean, we don't have a check for it, but failures would sky rocket for those servers [23:05:37] greg-g: well it would have an old settings [23:05:42] ah, maybe... [23:05:44] not none, aiui [23:05:48] right [23:05:53] resync, can't hurt [23:05:53] legoktm: ok sure [23:05:58] greg-g: will do :) [23:06:18] !log awight Synchronized wmf-config: Set up a new debug logging group for T89258 (take 2) (duration: 00m 06s) [23:06:22] Logged the message, Master [23:06:24] greg-g: ok that was simple :) [23:06:27] :) [23:07:23] legoktm: just holler when you're bumped, we can make this a race :) [23:07:33] 3operations, Services: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1038047 (10GWicke) >>! In T88585#1017016, @bd808 wrote: > Can basic testing setup be added to the template/skeleton project as well? > > At $DAYJOB-1 we made a skeleton project a... [23:08:58] that's not how I was imagining speeding up deployments.... [23:09:00] ;) [23:09:39] letting the devs duel? [23:11:18] 3operations, Services: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1038053 (10GWicke) >>! In T88585#1017567, @fgiunchedi wrote: > +1 also I think it'd make sense to capture what metadata we need in a (per-repo?) 
file so templates and other artifa... [23:12:15] 3operations, Services: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1038058 (10GWicke) [23:12:45] 3operations, Services: Create a standard service template / init / logging / package setup - https://phabricator.wikimedia.org/T88585#1038059 (10GWicke) p:5Normal>3High [23:13:03] * awight taps fingers loudly, waiting for useless "test" job to duplicate work of the gate-and-submit job. [23:14:48] we don't always .. [23:14:48] (03PS1) 10Dzahn: install gfortran and libs on stat1002/1003 nodes [puppet] - 10https://gerrit.wikimedia.org/r/190592 (https://phabricator.wikimedia.org/T89414) [23:15:49] !log awight Synchronized php-1.25wmf17/extensions/CentralNotice: CentralNotice debug logging for T89258 (duration: 00m 05s) [23:15:54] Logged the message, Master [23:16:52] (03CR) 10Dzahn: "this or any better ideas from analytics-ops ?" [puppet] - 10https://gerrit.wikimedia.org/r/190592 (https://phabricator.wikimedia.org/T89414) (owner: 10Dzahn) [23:18:55] 3operations: migrate graphite to new hardware - https://phabricator.wikimedia.org/T85909#1038082 (10RobH) [23:20:52] !log awight Synchronized php-1.25wmf16/extensions/CentralNotice: CentralNotice debug logging for T89258 (duration: 00m 07s) [23:20:56] Logged the message, Master [23:21:47] !log awight Synchronized php-1.25wmf17/extensions/CentralNotice: CentralNotice debug logging for T89258 (duration: 00m 08s) [23:21:50] Logged the message, Master [23:22:01] legoktm: ok, all yours. [23:22:16] Please return with a little gas left in the tank though, I have one more small thing to do today... [23:22:43] (03PS2) 10Dzahn: install gfortran and libs on stat1002/1003 nodes [puppet] - 10https://gerrit.wikimedia.org/r/190592 (https://phabricator.wikimedia.org/T89414) [23:23:00] awight: perfect timing, thanks :) [23:23:18] awesome. [23:23:19] legoktm: Perfect timing? 
Look at your calendar :D [23:23:50] * awight looks around at the post-its on every surface [23:23:51] hoo: >.> he pinged me just as jenkins merged the submodule update [23:24:15] (03CR) 10BryanDavis: "cherry-picked to deployment-salt for testing/validation" [puppet] - 10https://gerrit.wikimedia.org/r/190231 (https://phabricator.wikimedia.org/T88870) (owner: 10BryanDavis) [23:24:19] * awight slips Jenkins a tip for making me look on top of it for once [23:25:35] !log legoktm Synchronized php-1.25wmf17/extensions/CentralAuth/includes/CentralAuthUser.php: https://gerrit.wikimedia.org/r/#/c/190579/ (duration: 00m 06s) [23:25:37] 3operations, Analytics-Engineering, Analytics-Cluster: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1038087 (10Ottomata) Daniel, check out modules/statistics/manifests/compute.pp Put this there. [23:25:39] Logged the message, Master [23:26:45] !log legoktm Synchronized php-1.25wmf16/extensions/CentralAuth/includes/CentralAuthUser.php: https://gerrit.wikimedia.org/r/#/c/190579/ (duration: 00m 06s) [23:26:46] (03CR) 10Ottomata: "Both role::statistics::cruncher and role::statistics::private include the statistics::compute class. You can put this there." [puppet] - 10https://gerrit.wikimedia.org/r/190592 (https://phabricator.wikimedia.org/T89414) (owner: 10Dzahn) [23:26:48] Logged the message, Master [23:26:53] greg-g, awight: done [23:27:25] wowza [23:28:21] * awight tries to peek at ~legoktm/.bash_history for borrowed glory... 
[23:28:59] awight: https://www.mediawiki.org/wiki/User:Legoktm/deploy [23:29:16] haar [23:29:30] wtf [23:29:37] shamelessly taken from o.ri :) [23:29:42] * awight files in my "when I'm older" folder [23:30:22] 3operations, Analytics-Engineering, Analytics-Cluster: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1038091 (10Dzahn) a:5Ottomata>3Dzahn [23:30:31] My process looks like that but is mostly driven by soemt^R history lookups [23:30:58] what is "fetch"? It's not on my path on tin. [23:31:12] this is pseudocode? That would make me feel better. [23:31:30] and ^R is mapped to up-arrow via ~/.inputrc [23:31:39] awight: look at my .bash_profile [23:32:27] * awight looks nervously towards the catsup [23:32:53] we should deploy those tools in local/bin? [23:37:21] (03PS3) 10Dzahn: install gfortran and libs on stat1002/1003 nodes [puppet] - 10https://gerrit.wikimedia.org/r/190592 (https://phabricator.wikimedia.org/T89414) [23:38:05] 3operations, Analytics-Engineering, Analytics-Cluster: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1038114 (10Dzahn) @ottomata alright, amended: https://gerrit.wikimedia.org/r/#/c/190592/3/modules/statistics/manifests/compute.pp good to go? [23:38:08] (03CR) 10jenkins-bot: [V: 04-1] install gfortran and libs on stat1002/1003 nodes [puppet] - 10https://gerrit.wikimedia.org/r/190592 (https://phabricator.wikimedia.org/T89414) (owner: 10Dzahn) [23:39:51] (03CR) 10BryanDavis: "Dependent ops/puppet changes cherry picked into beta. This can go at any time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190246 (https://phabricator.wikimedia.org/T88870) (owner: 10BryanDavis) [23:40:12] (03PS4) 10Dzahn: install gfortran and libs on stat1002/1003 nodes [puppet] - 10https://gerrit.wikimedia.org/r/190592 (https://phabricator.wikimedia.org/T89414) [23:40:56] greg-g: I have a beta only config change that I'd like to make live. 
You can tell it's beta only because the only file it touches says -labs.php in the name. [23:41:33] this would enable MediaWiki to log via syslog datagrams to logstash, which is the replacement for redis lists [23:42:13] I'd like to let it bake in beta over the long weekend and then hopefully go to prod on Tuesday (at least for group0) [23:44:19] (03CR) 10Ottomata: [C: 031] install gfortran and libs on stat1002/1003 nodes [puppet] - 10https://gerrit.wikimedia.org/r/190592 (https://phabricator.wikimedia.org/T89414) (owner: 10Dzahn) [23:45:33] (03CR) 10Dzahn: [C: 032] install gfortran and libs on stat1002/1003 nodes [puppet] - 10https://gerrit.wikimedia.org/r/190592 (https://phabricator.wikimedia.org/T89414) (owner: 10Dzahn) [23:47:02] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:49:49] 3operations, Analytics-Engineering, Analytics-Cluster: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1038155 (10Dzahn) @halfak packages have been installed on both hosts. ``` Notice: /Stage[main]/Statistics::Compute/Package[libopenblas-dev]/ensure: ensure changed 'purged'... [23:50:00] 3operations, Analytics-Engineering, Analytics-Cluster: Install packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1038156 (10Dzahn) 5Open>3Resolved [23:52:03] 3operations, Analytics-Engineering, Analytics-Cluster: Install Fortran packages on stat1002 and stat1003 - https://phabricator.wikimedia.org/T89414#1038159 (10Dzahn) [23:54:48] awight: will I step on your toes if I sync a beta config change? [23:55:05] bd808: nope, go ahead!
[23:55:21] I'm only playing with MW at this point [23:55:34] cool beans [23:55:39] (03PS2) 10BryanDavis: beta: switch logstash transport from redis to syslog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190246 (https://phabricator.wikimedia.org/T88870) [23:55:44] (03CR) 10BryanDavis: [C: 032] beta: switch logstash transport from redis to syslog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190246 (https://phabricator.wikimedia.org/T88870) (owner: 10BryanDavis) [23:55:56] (03Merged) 10jenkins-bot: beta: switch logstash transport from redis to syslog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/190246 (https://phabricator.wikimedia.org/T88870) (owner: 10BryanDavis) [23:56:49] !log awight Synchronized php-1.25wmf16/extensions/CentralNotice: CentralNotice debug logging for T89258 (duration: 00m 06s) [23:56:53] Logged the message, Master [23:58:12] !log bd808 Synchronized wmf-config/logging-labs.php: Switch beta to syslog logging (d9dcccb) (duration: 00m 06s) [23:58:14] Logged the message, Master [23:59:26] !log awight Synchronized php-1.25wmf17/extensions/CentralNotice: CentralNotice debug logging for T89258 (duration: 00m 05s)