[00:01:36] addWiki isn't the only file to do that: in core, includes/installer/DatabaseUpdater.php too [00:02:51] * Dereckson files a bug. [00:04:40] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:05:58] https://phabricator.wikimedia.org/T147609 - Update legacy calls to Database::sourceFile [00:07:10] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [00:09:52] Dereckson: Looks like most calls don't pass any parameters [00:10:19] Okay I've hacked php-1.28.0-wmf.21/extensions/WikimediaMaintenance/addWiki.php on Tin, with null + commenting the CREATE DATABASE statement out [00:10:26] and then scap pull on Terbium [00:10:42] There's literally those 2 to fix [00:11:02] Will prepare changes for them after olo [00:11:13] I'm just doing them [00:11:16] :) [00:12:32] Works fine, addWiki is at Cirrus step [00:15:52] addWiki step done, without further error [00:17:45] !log dereckson@tin Synchronized dblists: Create olo.wikipedia.org (T146612) (duration: 00m 50s) [00:17:47] T146612: Create Livvi-Karelian Wikipedia at olo.wikipedia.org - https://phabricator.wikimedia.org/T146612 [00:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:18:43] !log dereckson@tin rebuilt wikiversions.php and synchronized wikiversions files: (no message) [00:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:19:57] Reedy: you've the first user [00:20:49] Configuration looks good to me on mw1099 [00:22:30] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Initial configuration for olo.wikipedia.org (T146612) (duration: 00m 50s) [00:25:02] Okay, addWiki is broken [00:25:08] !log dereckson@tin Synchronized langlist: +olo (duration: 00m 49s) [00:25:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:25:33] Have I ever mentioned that I don't think I've ever had that 
script run successfully the first time I've tried it for each wiki creation? [00:25:45] Yes, you've. [00:26:10] And yet I feel like I don't complain about it enough [00:27:15] A solution could be to prepare an integration test to add a new wiki to the labs [00:27:41] so we can run it periodically and detect issues before actual run to create a wiki in prod [00:28:22] that + make a modular system in steps, so we can restart the script at step 4 without having to comment out lines [00:28:28] maybe [00:28:35] let's open a task? [00:28:38] (not an issue this time, but for the previous one, it would have been useful) [00:28:50] * Dereckson nods [00:28:58] I'll make a note to file one tomorrow [00:29:53] I've got to be up in 5-6 hours so I'm going to bed [00:30:00] I'm sure Reedy can help in the case of any issues [00:30:45] Good night. Important steps are done, I'm updating interwiki cache. [00:31:41] blw, schoolswp update / pmid add too [00:32:05] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2698433 (Dzahn) [00:32:39] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2697753 (Dzahn) OS installed, added to puppet, signed salt-key, gave access to gerrit-roots, gerrit server role commented out until tomorrow...
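The 00:28 suggestion above (split the script into steps so a failed run can be restarted at, say, step 4 without commenting lines out) could look roughly like the sketch below. It is only an illustration: the step names, `run_steps`, and the `start_at` parameter are hypothetical and are not taken from addWiki.php.

```python
from typing import Callable, List, Sequence, Tuple

# A step is a (name, action) pair; the action receives the wiki database name.
Step = Tuple[str, Callable[[str], None]]


def run_steps(steps: Sequence[Step], wiki: str, start_at: int = 1) -> List[str]:
    """Run the numbered steps in order, skipping those before start_at.

    Returns the names of the steps that were executed.
    """
    if not 1 <= start_at <= len(steps):
        raise ValueError(f"start_at must be between 1 and {len(steps)}")
    executed: List[str] = []
    for number, (name, action) in enumerate(steps, start=1):
        if number < start_at:
            continue  # done in an earlier, partially successful run
        try:
            action(wiki)
        except Exception as exc:
            # Report which step to resume from instead of leaving the
            # operator to comment out the steps that already ran.
            raise RuntimeError(
                f"step {number} ({name}) failed for {wiki}; "
                f"fix the cause and rerun with start_at={number}"
            ) from exc
        executed.append(name)
    return executed


def _noop(wiki: str) -> None:
    """Placeholder action; a real step would do the actual work."""


# Hypothetical step names, loosely inspired by what the log mentions.
EXAMPLE_STEPS: List[Step] = [
    ("create_database", _noop),
    ("create_tables", _noop),
    ("create_search_index", _noop),
    ("update_dblists", _noop),
]
```

With this shape, a run that died at the third step would be resumed with `run_steps(EXAMPLE_STEPS, "olowiki", start_at=3)`, and the same list of steps could be exercised periodically against a test environment, as proposed at 00:27.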
[00:33:03] (PS1) Dereckson: Interwiki cache update [mediawiki-config] - https://gerrit.wikimedia.org/r/314637 [00:36:55] (CR) Dereckson: [C: 2] Interwiki cache update [mediawiki-config] - https://gerrit.wikimedia.org/r/314637 (owner: Dereckson) [00:37:26] (Merged) jenkins-bot: Interwiki cache update [mediawiki-config] - https://gerrit.wikimedia.org/r/314637 (owner: Dereckson) [00:38:23] Works on mw1099 [00:39:17] !log dereckson@tin Synchronized wmf-config/interwiki.php: Interwiki cache update for pmid, HTTPS links and olo.wikipedia.org (duration: 00m 50s) [00:39:21] (PS1) Dzahn: make cobalt a backup::host [puppet] - https://gerrit.wikimedia.org/r/314638 (https://phabricator.wikimedia.org/T147597) [00:40:16] (PS2) Dzahn: make cobalt a backup::host [puppet] - https://gerrit.wikimedia.org/r/314638 (https://phabricator.wikimedia.org/T147597) [00:40:25] (CR) Dzahn: [C: 2] make cobalt a backup::host [puppet] - https://gerrit.wikimedia.org/r/314638 (https://phabricator.wikimedia.org/T147597) (owner: Dzahn) [00:44:13] !log mwscript extensions/WikimediaMaintenance/filebackend/setZoneAccess.php olowiki --backend=local-multiwrite [00:44:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:46:08] For Parsoid, we need to run tools/fetch-sitematrix.js when we close a wiki too. [00:46:23] I'm updating wikitech instructions. [00:50:15] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2698449 (Dzahn) started bacula restore of lead data to cobalt /srv Run Restore job JobName: RestoreFiles Bootstrap: /var/lib/bacula/helium.eqiad.wmn...
[00:51:34] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2698453 (Dzahn) oops, since Where: is a prefix, this is restoring it as /srv/srv/gerrit but we can simply move it when done.. and then we'll rsync the diff tomorrow. [00:54:48] Okay, we're done. [00:54:55] !log https://olo.wikipedia.org has been successfully created (T146612). [00:54:56] T146612: Create Livvi-Karelian Wikipedia at olo.wikipedia.org - https://phabricator.wikimedia.org/T146612 [00:55:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:55:27] I've restored the genuine addWiki on terbium/tin [00:55:44] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2698458 (Dzahn) [00:55:50] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [00:56:46] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2697753 (Dzahn) [00:58:15] Operations, Gerrit, Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2698464 (Dzahn) [00:58:29] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [01:02:42] ACKNOWLEDGEMENT - Host lithium is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn disks https://phabricator.wikimedia.org/T143307 [01:03:38] ACKNOWLEDGEMENT - HTTPS on tegmen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn WIP [01:03:38] ACKNOWLEDGEMENT - ircecho_service_running on tegmen is CRITICAL: PROCS CRITICAL: 0 processes with args ircecho daniel_zahn WIP [01:07:06] (PS1) Dzahn: contint: allow ssh from cobalt, in addition to lead [puppet] -
https://gerrit.wikimedia.org/r/314641 (https://phabricator.wikimedia.org/T147597) [01:08:01] (CR) jenkins-bot: [V: -1] contint: allow ssh from cobalt, in addition to lead [puppet] - https://gerrit.wikimedia.org/r/314641 (https://phabricator.wikimedia.org/T147597) (owner: Dzahn) [01:09:09] (CR) Dzahn: [C: 1] Add ec.wikimedia.org to Apache configuration [puppet] - https://gerrit.wikimedia.org/r/314470 (https://phabricator.wikimedia.org/T135521) (owner: Dereckson) [01:09:14] (CR) Dzahn: [C: 1] Sort by alphabetical order wikimedia-chapter Apache sites [puppet] - https://gerrit.wikimedia.org/r/314469 (owner: Dereckson) [01:09:56] (PS2) Dzahn: contint: allow ssh from cobalt, in addition to lead [puppet] - https://gerrit.wikimedia.org/r/314641 (https://phabricator.wikimedia.org/T147597) [01:10:51] (CR) Dzahn: [C: 1] Gerrit: Update error.html message to include channel #wikimedia-operations [puppet] - https://gerrit.wikimedia.org/r/314608 (owner: Paladox) [01:14:24] (CR) Dzahn: "since we are reusing them we might as well set that in role/common rather than by hostname" [puppet] - https://gerrit.wikimedia.org/r/314628 (https://phabricator.wikimedia.org/T147597) (owner: Chad) [01:44:41] (PS1) Catrope: Enable $wgPageTriageNoIndexUnreviewedNewArticles on all wikis that have PageTriage [mediawiki-config] - https://gerrit.wikimedia.org/r/314643 (https://phabricator.wikimedia.org/T147544) [01:47:49] PROBLEM - puppet last run on mx2001 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [02:10:28] RECOVERY - puppet last run on mx2001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [02:36:11] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.21) (duration: 16m 03s) [02:36:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:41:54] !log l10nupdate@tin ResourceLoader cache refresh completed at Fri Oct 7 02:41:54 UTC 2016 (duration 5m 43s) [02:42:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:47:39] PROBLEM - puppet last run on cp3034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:12:48] RECOVERY - puppet last run on cp3034 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [03:23:37] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [03:36:19] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [04:35:28] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [04:37:58] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [04:42:31] PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:49:39] !log Update cxserver to 84fb704 (T147368) [04:49:41] T147368: Add olo.wikipedia to cxserver - https://phabricator.wikimedia.org/T147368 [04:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:58:33] PROBLEM - puppet last run on ms-be1025 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:07:46] RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:23:47] RECOVERY - puppet last run on ms-be1025 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [05:24:12] (PS1) KartikMistry: cxserver: Deploy Youdao MT service [puppet] - https://gerrit.wikimedia.org/r/314648 (https://phabricator.wikimedia.org/T146731) [05:55:48] !log Deploying schema change on S4 master commonswiki.revision table - T147113 [05:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:08:54] PROBLEM - puppet last run on oxygen is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[generate-kafkatee.pyconf] [06:09:58] RECOVERY - HHVM jobrunner on mw1161 is OK: HTTP OK: HTTP/1.1 200 OK - 222 bytes in 0.010 second response time [06:10:03] <_joe_> !log restarted hhvm, jobrunner on mw1161 [06:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:10:36] PROBLEM - puppet last run on labstore1003 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [06:22:03] Operations, Pybal, Services, Patch-For-Review, and 2 others: Depool / repool scripts execute successfully even when the host has not been (r|d)epooled - https://phabricator.wikimedia.org/T145518#2698657 (Joe) [06:26:09] (PS1) Marostegui: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/314650 (https://phabricator.wikimedia.org/T145533) [06:26:10] Operations, Prod-Kubernetes, vm-requests, Kubernetes-production-experiment, User-Joe: 6 small VMs for etcd clusters for kubernetes and its networking component - https://phabricator.wikimedia.org/T147620#2698659 (Joe) [06:27:58] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/314650 (https://phabricator.wikimedia.org/T145533) (owner: Marostegui) [06:28:25] (Merged) jenkins-bot: db-eqiad.php: Depool db1082 [mediawiki-config] - https://gerrit.wikimedia.org/r/314650 (https://phabricator.wikimedia.org/T145533) (owner: Marostegui) [06:31:15] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1082 to get its raid controller firmware upgraded - T145533 (duration: 00m 49s) [06:31:17] T145533: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533 [06:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:31:47] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [06:32:02] !log reimaging mw1216, mw1218, mw1219 to jessie [06:32:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:33:30] (CR) Florianschmidtwelzow: Add config for units on Wikidata (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/311206 (https://phabricator.wikimedia.org/T117032) (owner: Smalyshev) [06:34:49] RECOVERY - puppet last run on oxygen is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:36:27] RECOVERY - puppet last run on labstore1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:57] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [07:06:25] Operations, Labs, Tool-Labs, Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2698695 (doctaxon) Resolved>Open Hi, I think, a restart is needed again, there are too many 503 errors on several proxy servers like cp1053. A reasonable bot... [07:09:55] Operations, Labs, Tool-Labs, Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2698701 (doctaxon) If those errors occur again and again, a technical check of these proxies has to be done, I suppose. [07:17:34] !log reimaging mw1228 and mw1229 (api appservers) to Debian Jessie [07:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:22:55] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[07:25:34] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [07:37:57] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [50.0] [07:40:34] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 1.00% above the threshold [25.0] [07:43:15] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [07:52:11] Operations, Commons, Multimedia: Deploy a PHP and HHVM patch (Exif values retrieved incorrectly if they appear before IFD) - https://phabricator.wikimedia.org/T140419#2698802 (MoritzMuehlenhoff) Open>Resolved @matmarex : I'm currently preparing the next HHVM update, but when having a closer l... [07:52:53] Operations, Labs, Tool-Labs, Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2698809 (doctaxon) Firing with traffic (different API URLs) the error report occurs about every 1.5 minutes (!) (Sorry, but what is an unbreak now! error report, if... [07:59:29] (PS1) Giuseppe Lavagetto: Add the VMs for kubernetes/networking etcd [dns] - https://gerrit.wikimedia.org/r/314656 (https://phabricator.wikimedia.org/T147620) [08:04:05] PROBLEM - puppet last run on cp4013 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [08:05:02] (PS1) Giuseppe Lavagetto: kubernetes: add etcd clusters for k8s, networking [puppet] - https://gerrit.wikimedia.org/r/314657 (https://phabricator.wikimedia.org/T147620) [08:05:30] Giuseppe Lavagetto: see T146451 [08:05:31] T146451: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451 [08:08:56] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:10:35] PROBLEM - puppet last run on db1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:11:45] (CR) Alexandros Kosiaris: [C: -1] "minor inline comment about typo, otherwise LGTM" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/314648 (https://phabricator.wikimedia.org/T146731) (owner: KartikMistry) [08:15:39] <_joe_> doctaxon: yeah I'll take a look [08:16:23] <_joe_> doctaxon: wow, yes [08:16:54] I'm crying yet [08:17:04] <_joe_> something really bad happened tonight, apparently [08:17:05] nobody is there to help [08:18:01] Wikidata is slower than usual. [08:18:37] this night there has broken my bot script running on the grid [08:19:02] Wikidata is slow for a half month as I have seen. [08:19:32] Not as bad as today. [08:19:38] yah [08:19:55] but the proxy misery is more important now [08:20:20] <_joe_> doctaxon: yeah sorry I'm going to put an end to these issues quickly [08:20:48] yah, I very thank you [08:20:55] <_joe_> !log restarting hhvm on a few api appservers, due to memory leaks (T146451) [08:20:56] T146451: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451 [08:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:21:25] I think Wikidata is related though, as the only stuff that is slow is relying on the api.
[08:21:46] _joe_ it's necessary to technically check the proxies, if these errors occur again and again, as I have written into the task too [08:22:15] <_joe_> doctaxon: the cause of the problem is well known and hard to debug at the same time [08:22:36] I think so too [08:22:58] _joe_ is it happening on trusty and jessie apis right? (didn't get the hostnames) [08:23:02] <_joe_> it's a memory leak with our api code and hhvm, and 3.12 has some issues with memory profiling [08:23:05] <_joe_> elukey: both [08:27:06] _joe_ : is there a better way to contact you, if those bad errors occur again? A live mobile messenger or anything else [08:28:34] doctaxon: this channel should be enough, now that we know the root cause (most of) the opsen can rapidly check and do something [08:28:40] <_joe_> doctaxon: better than IRC? nope [08:29:00] <_joe_> also, this is causing 503s, but well below our detectability level :( [08:29:11] rapidly? [08:29:18] <_joe_> it's still less than 0.01% of our requests :/ [08:30:10] raise your detectability level and all is okay ?! [08:30:12] RECOVERY - puppet last run on cp4013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [08:31:09] _joe_ : it seems running better now [08:31:12] doctaxon: I meant that we could spot the hosts with leaks and restart hhvm accordingly. 
We should also look into a more permanent solution but as _joe_ mentioned these bugs are not trivial to repro [08:31:24] oh no, the next 503 [08:31:40] !log oblivian@puppetmaster1001 conftool action : set/weight=0; selector: cluster=api_appserver,dc=eqiad,name=mw123.* [08:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:33:44] !log oblivian@puppetmaster1001 conftool action : set/weight=20; selector: cluster=api_appserver,dc=eqiad,name=mw123.* [08:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:33:57] an idea: program a bot, that monitors the hosts and restarts it by itself [08:34:09] <_joe_> doctaxon: we already have a cron that can do that [08:34:17] oh cool [08:34:17] <_joe_> I just need to tweak it a bit and apply it [08:34:34] great [08:34:39] *thumbs_up* [08:34:42] RECOVERY - puppet last run on db1047 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [08:34:56] (CR) Hashar: [C: 1] contint: allow ssh from cobalt, in addition to lead [puppet] - https://gerrit.wikimedia.org/r/314641 (https://phabricator.wikimedia.org/T147597) (owner: Dzahn) [08:35:45] (CR) Hashar: [C: 1] Gerrit: Update error.html message to include channel #wikimedia-operations [puppet] - https://gerrit.wikimedia.org/r/314608 (owner: Paladox) [08:36:25] <_joe_> are we doing reimagings just now? [08:36:38] <_joe_> moritzm elukey ? [08:37:14] mine finished 30 mins about and I was about to start the next batch, should I wait?
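The 08:33 idea (something that monitors the hosts and restarts the leaking ones by itself) reduces to a selection rule like the sketch below. This is not the cron job _joe_ refers to; the function name, the 80% threshold, the batch limit, and the sample figures are all assumptions made for illustration.

```python
from typing import Dict, List


def hosts_to_restart(
    memory_used_pct: Dict[str, float],
    threshold_pct: float = 80.0,  # assumed cut-off, not a value from the log
    max_batch: int = 2,           # assumed cap so most of the pool stays up
) -> List[str]:
    """Pick the hosts whose memory use is at or above the threshold.

    The worst offenders come first, and at most max_batch hosts are
    returned so that a single pass never restarts the whole cluster.
    """
    over = [host for host, pct in memory_used_pct.items() if pct >= threshold_pct]
    over.sort(key=lambda host: memory_used_pct[host], reverse=True)
    return over[:max_batch]


# Illustrative readings only; the host names echo the mw123.* selector above.
SAMPLE_READINGS = {
    "mw1230": 91.0,
    "mw1231": 45.0,
    "mw1232": 97.5,
    "mw1233": 88.0,
}
```

Capping the batch matters here because, as noted at 08:57, 503s can still occur while servers are being restarted; restarting only a couple of hosts per pass keeps that window small.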
[08:37:36] _joe_ yes I am doing mw122[89] [08:37:55] (two apis) [08:39:57] <_joe_> yes please wait until the fire is extinguished :) [08:40:03] (PS2) Muehlenhoff: Re-enable HHVM Icinga checks for jessie [puppet] - https://gerrit.wikimedia.org/r/314507 [08:40:25] <_joe_> moritzm: I thought that was merged [08:40:33] _joe_ I am the only one that is doing reimages on API, but these two have already started so I can't stop them :) [08:41:44] PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:43:28] <_joe_> how nice, someone is hammering the API right now... [08:44:08] aaaaaa [08:44:14] sorry [08:44:33] <_joe_> wat? [08:44:42] <_joe_> is that you, godog? :P [08:45:15] hahaha yeah, and my crappy internet at home, went away and tried banging "aaaaaa" [08:45:38] <_joe_> eheh [08:47:41] Operations, Beta-Cluster-Infrastructure, Patch-For-Review, Prometheus-metrics-monitoring: deploy prometheus node_exporter and server to deployment-prep - https://phabricator.wikimedia.org/T144502#2698891 (fgiunchedi) [08:48:10] Operations, Beta-Cluster-Infrastructure, Patch-For-Review, Prometheus-metrics-monitoring: deploy prometheus node_exporter and server to deployment-prep - https://phabricator.wikimedia.org/T144502#2601885 (fgiunchedi) Prometheus for beta is available at https://beta-prometheus.wmflabs.org/beta/gra...
[08:48:49] <_joe_> I think this graph tells it all: https://ganglia.wikimedia.org/latest/stacked.php?m=ap_rps&c=API%20application%20servers%20eqiad&r=year&st=1475830086 [08:49:08] <_joe_> these are the requests per second to the API cluster during last year [08:49:18] <_joe_> they more than doubled [08:49:38] <_joe_> time to buy more API servers :) [08:50:19] _joe_ : it looks like running all okay now (for me) [08:50:37] _joe_: nope, merging now [08:50:41] no 503 any more [08:51:15] (CR) Muehlenhoff: [C: 2] Re-enable HHVM Icinga checks for jessie [puppet] - https://gerrit.wikimedia.org/r/314507 (owner: Muehlenhoff) [08:51:32] <_joe_> doctaxon: my salt command is cycling through the machines and restarting the ones that are in a bad shape :) [08:51:35] (PS1) Ema: cache_upload: remove varnish3 VCL compat [puppet] - https://gerrit.wikimedia.org/r/314658 (https://phabricator.wikimedia.org/T131502) [08:51:56] PROBLEM - puppet last run on cp3047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:51:57] fine! [08:52:07] https://grafana.wikimedia.org/dashboard/db/restbase?panelId=12&fullscreen - _joe_ might be related? [08:52:19] (Abandoned) Hashar: puppet_compiler: conftool settings are now class parameters [puppet] - https://gerrit.wikimedia.org/r/312600 (owner: Hashar) [08:52:28] there was a big jump ~20 mins ago [08:52:32] what a report is that: CRITICAL: Catalog fetch fail.
[08:52:57] (PS8) Jcrespo: phabricator: Create & configure a phabricator_stopwords table for innodb [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: Paladox) [08:53:00] doctaxon: it means that a server managed via puppet failed to fetch its configuration data [08:53:19] sounds not good [08:53:48] !log reimaging maps-test2003 - T147194 [08:53:49] T147194: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194 [08:53:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:53:55] but I got a 503 from cp1067 right now [08:54:19] if it happens rapidly to a large range of servers it's usually a sign of larger breakage, but it happens from time to time (without much negative effect) [08:55:27] Operations, Discovery, Maps, Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2698899 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2003.codfw.wmnet'] ``` The log can be found in `/va... [08:57:56] <_joe_> doctaxon: that might still happen while servers are restarted.
[09:00:50] (PS1) Alexandros Kosiaris: cxserver: Add youdao_api_key dummy stanza [labs/private] - https://gerrit.wikimedia.org/r/314660 [09:02:39] _joe_ do you state the end of the restarts in the task please - T146451 [09:02:39] T146451: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451 [09:03:02] <_joe_> doctaxon: I will, sure [09:04:46] (CR) Muehlenhoff: [C: 1] "Looks good to me" [puppet] - https://gerrit.wikimedia.org/r/314641 (https://phabricator.wikimedia.org/T147597) (owner: Dzahn) [09:06:20] (PS9) Jcrespo: phabricator: Create & configure a phabricator_stopwords table for innodb [puppet] - https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: Paladox) [09:06:57] Operations, Labs, Tool-Labs, Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2698915 (Joe) All the restarts finished right now, the cluster should be in a much better shape now. [09:07:24] RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [09:10:23] (CR) Alexandros Kosiaris: [C: 2 V: 2] cxserver: Add youdao_api_key dummy stanza [labs/private] - https://gerrit.wikimedia.org/r/314660 (owner: Alexandros Kosiaris) [09:14:37] Operations, Patch-For-Review, Prometheus-metrics-monitoring: deploy prometheus node_exporter for host monitoring - https://phabricator.wikimedia.org/T140646#2698943 (fgiunchedi) [09:14:41] Operations, Beta-Cluster-Infrastructure, Patch-For-Review, Prometheus-metrics-monitoring: deploy prometheus node_exporter and server to deployment-prep - https://phabricator.wikimedia.org/T144502#2698941 (fgiunchedi) Open>Resolved Dashboard for host overview: https://grafana-labs.wikimedi...
[09:14:48] Operations, Beta-Cluster-Infrastructure, Patch-For-Review, Prometheus-metrics-monitoring: deploy prometheus node_exporter and server to deployment-prep - https://phabricator.wikimedia.org/T144502#2698945 (fgiunchedi) [09:16:47] RECOVERY - puppet last run on cp3047 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:17:55] Operations, Discovery, Maps, Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2698946 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2003.codfw.wmnet'] ``` Those hosts were successful: ``` [] ``` [09:18:26] !log installing php security updates on precise hosts [09:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:20:08] !log reimaging maps-test2004 - T147194 [09:20:09] T147194: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194 [09:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:21:30] Operations, Discovery, Maps, Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2698950 (ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2004.codfw.wmnet'] ``` The log can be found in `/va...
[09:22:04] (PS2) KartikMistry: cxserver: Deploy Youdao MT service [puppet] - https://gerrit.wikimedia.org/r/314648 (https://phabricator.wikimedia.org/T146731) [09:24:14] RECOVERY - Disk space on maps-test2003 is OK: DISK OK [09:28:07] !log installing pillow/python-imaging security updates on Ubuntu systems [09:28:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:30:12] (CR) Filippo Giunchedi: [C: -1] "Some comments inline, looks generally good" (9 comments) [debs/bdsync] - https://gerrit.wikimedia.org/r/314591 (owner: Rush) [09:30:42] (CR) Filippo Giunchedi: bdsync debian directory (1 comment) [debs/bdsync] - https://gerrit.wikimedia.org/r/314591 (owner: Rush) [09:32:08] !log reimaging mw1220, mw1236, mw1237 to jessie [09:32:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:34:46] (PS7) Filippo Giunchedi: prometheus: add varnish_exporter [puppet] - https://gerrit.wikimedia.org/r/310557 (https://phabricator.wikimedia.org/T145659) [09:36:05] (PS3) Alexandros Kosiaris: cxserver: Deploy Youdao MT service [puppet] - https://gerrit.wikimedia.org/r/314648 (https://phabricator.wikimedia.org/T146731) (owner: KartikMistry) [09:36:25] (PS1) Elukey: Add extra compiler warnings to the Makefile [software/varnish/varnishkafka] - https://gerrit.wikimedia.org/r/314662 (https://phabricator.wikimedia.org/T147436) [09:38:26] (CR) Filippo Giunchedi: [C: 2] prometheus: add varnish_exporter [puppet] - https://gerrit.wikimedia.org/r/310557 (https://phabricator.wikimedia.org/T145659) (owner: Filippo Giunchedi) [09:39:38] (CR) Alexandros Kosiaris: [C: 1] Add the VMs for kubernetes/networking etcd [dns] - https://gerrit.wikimedia.org/r/314656 (https://phabricator.wikimedia.org/T147620) (owner: Giuseppe Lavagetto) [09:40:44] (CR) Giuseppe Lavagetto: [C: 2] Add the VMs for kubernetes/networking etcd [dns] -
https://gerrit.wikimedia.org/r/314656 (https://phabricator.wikimedia.org/T147620) (owner: Giuseppe Lavagetto) [09:42:59] (CR) Alexandros Kosiaris: [C: 1] "Should we merge now? Or is there some timing constraint ?" [puppet] - https://gerrit.wikimedia.org/r/314648 (https://phabricator.wikimedia.org/T146731) (owner: KartikMistry) [09:45:21] PROBLEM - salt-minion processes on puppetmaster1001 is CRITICAL: PROCS CRITICAL: 5 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:45:48] Operations, Labs, Tool-Labs, Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2698986 (Joe) Open>Resolved [09:46:00] (CR) Alexandros Kosiaris: [C: 2] openldap: Add a retry syncrepl parameter [puppet] - https://gerrit.wikimedia.org/r/310815 (owner: Alexandros Kosiaris) [09:46:04] (PS2) Alexandros Kosiaris: openldap: Add a retry syncrepl parameter [puppet] - https://gerrit.wikimedia.org/r/310815 [09:46:07] (CR) Alexandros Kosiaris: [V: 2] openldap: Add a retry syncrepl parameter [puppet] - https://gerrit.wikimedia.org/r/310815 (owner: Alexandros Kosiaris) [09:48:05] <_joe_> !log creating etcd100[1-6].eqiad.wmnet on ganeti, T147620 [09:48:06] T147620: 6 small VMs for etcd clusters for kubernetes and its networking component - https://phabricator.wikimedia.org/T147620 [09:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [09:48:12] RECOVERY - salt-minion processes on puppetmaster1001 is OK: PROCS OK: 4 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:48:31] RECOVERY - tilerator on maps-test2004 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.091 second response time [09:48:59] Operations, Discovery, Maps, Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2698990 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2004.codfw.wmnet'] ``` Those hosts
were successful: ``` [] ``` [09:48:59] (03CR) 10Muehlenhoff: [C: 04-1] Imply Pivot UI's puppetization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [09:49:42] RECOVERY - Disk space on maps-test2004 is OK: DISK OK [09:49:48] (03CR) 10KartikMistry: "Planned for next week, Wednesday." [puppet] - 10https://gerrit.wikimedia.org/r/314648 (https://phabricator.wikimedia.org/T146731) (owner: 10KartikMistry) [09:51:01] (03CR) 10Elukey: Imply Pivot UI's puppetization (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [09:52:39] (03CR) 10Paladox: [C: 031] "Hi thanks, may want to do a follow up patchthat reduces it to 1 instead of using 3?" [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [09:53:03] RECOVERY - kartotherian endpoints health on maps-test2003 is OK: All endpoints are healthy [09:53:04] RECOVERY - kartotherian endpoints health on maps-test2004 is OK: All endpoints are healthy [09:59:26] RECOVERY - cassandra CQL 10.192.16.35:9042 on maps-test2004 is OK: TCP OK - 0.036 second response time on port 9042 [10:00:18] <_joe_> !log updated conftool to 0.3.1 on all the cluster except caches, T147480 [10:00:19] T147480: Upgrade conftool to 0.3.1 - https://phabricator.wikimedia.org/T147480 [10:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:00:58] 06Operations, 10Traffic, 15User-Joe, 07discovery-system: Upgrade conftool to 0.3.1 - https://phabricator.wikimedia.org/T147480#2698995 (10Joe) a:05Joe>03ema [10:04:49] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Build etcd clusters to support Kubernetes and calico - https://phabricator.wikimedia.org/T147421#2698997 (10Joe) a:03Joe [10:11:17] (03PS8) 10Elukey: Imply Pivot UI's puppetization [puppet] - 
10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) [10:12:26] PROBLEM - puppet last run on cp4020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:13:20] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, a minor nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [10:13:40] 06Operations, 10Pybal, 06Services, 13Patch-For-Review, and 2 others: Depool / repool scripts execute successfully even when the host has not been (r|d)epooled - https://phabricator.wikimedia.org/T145518#2699001 (10Joe) In the larger context of restarting safely a service behind LVS without relying on the a... [10:14:11] (03PS1) 10Muehlenhoff: Update to 1.0.2j [debs/openssl] - 10https://gerrit.wikimedia.org/r/314667 [10:22:03] (03PS10) 10Elukey: Imply Pivot UI's puppetization [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) [10:27:41] !log mw122[89] back in live api server pool [10:27:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:28:07] !log reimaging mw123[01] to Debian Jessie [10:28:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:28:42] (03CR) 10Muehlenhoff: [C: 032] Update to 1.0.2j [debs/openssl] - 10https://gerrit.wikimedia.org/r/314667 (owner: 10Muehlenhoff) [10:30:36] (03PS1) 10Alexandros Kosiaris: udp2log: Remove the $monitor_packet_loss and $packet_loss_log parameters [puppet] - 10https://gerrit.wikimedia.org/r/314668 [10:30:38] (03PS1) 10Alexandros Kosiaris: monitoring: Kill check_ganglia [puppet] - 10https://gerrit.wikimedia.org/r/314669 [10:32:15] 06Operations: dubnium disk full - https://phabricator.wikimedia.org/T147173#2699008 (10akosiaris) 05stalled>03Resolved Backscatter has stopped, blocks removed for now. 
Resolving [10:33:13] 06Operations, 10ContentTranslation-CXserver, 10ContentTranslation-Deployments, 10MediaWiki-extensions-ContentTranslation, 13Patch-For-Review: Migrate apertium to SCB - https://phabricator.wikimedia.org/T147288#2699010 (10akosiaris) 05Open>03Resolved Migration happened yesterday successfully. Resolving. [10:34:36] RECOVERY - puppet last run on cp4020 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [10:37:57] 06Operations, 06Operations-Software-Development: Evaluation of automation/orchestration tools - https://phabricator.wikimedia.org/T143306#2699012 (10Volans) 05Open>03Resolved The discussion during the Operation team offsite lead to the decision of start testing and using ClusterShell as a parallel transpor... [10:42:47] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "no change in https://puppet-compiler.wmflabs.org/4238/fluorine.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/314668 (owner: 10Alexandros Kosiaris) [10:43:03] (03PS2) 10Alexandros Kosiaris: udp2log: Remove the $monitor_packet_loss and $packet_loss_log parameters [puppet] - 10https://gerrit.wikimedia.org/r/314668 [10:43:05] (03CR) 10Alexandros Kosiaris: [V: 032] udp2log: Remove the $monitor_packet_loss and $packet_loss_log parameters [puppet] - 10https://gerrit.wikimedia.org/r/314668 (owner: 10Alexandros Kosiaris) [10:52:52] !log reimaging mw1238, mw1239 to jessie [10:52:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:01:48] (03CR) 10Jcrespo: [C: 04-1] "Yes, maybe. The problem is the current patch doesn't work. See: https://puppet-compiler.wmflabs.org/4234/db1043.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [11:02:17] moritzm: last ones?? 
[11:03:24] yep, except the test app servers, but I'll send a brief pre-announcement before converting those [11:03:51] and the remaining jobrunners, that's up for Monday [11:07:59] (03PS1) 10Marostegui: db-eqiad.php: Repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314673 (https://phabricator.wikimedia.org/T145533) [11:08:25] \o/ [11:14:41] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314673 (https://phabricator.wikimedia.org/T145533) (owner: 10Marostegui) [11:15:07] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1082 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314673 (https://phabricator.wikimedia.org/T145533) (owner: 10Marostegui) [11:17:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 with a bit less weight than usual to start with - T145533 (duration: 00m 55s) [11:17:06] T145533: Investigate db1082 crash - https://phabricator.wikimedia.org/T145533 [11:17:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:31:50] PROBLEM - MariaDB Slave SQL: s2 on dbstore1001 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1146, Errmsg: Error Table bgwiki.hitcounter doesnt exist on query. Default database: bgwiki. Query: [snipped] [11:35:24] I will fix that [11:37:31] fixed [11:37:38] Memory tables are a pain :_( [11:39:06] RECOVERY - MariaDB Slave SQL: s2 on dbstore1001 is OK: OK slave_sql_state Slave_SQL_Running: Yes [11:42:20] marostegui: told ya yesterday you'll get this today ;) [11:42:41] volans: Yeah, I was expecting it :p [11:51:04] (03CR) 10Paladox: "Oh would you know why it doesn't work?"
[puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [11:56:29] (03PS2) 10Giuseppe Lavagetto: kubernetes: add etcd clusters for k8s, networking [puppet] - 10https://gerrit.wikimedia.org/r/314657 (https://phabricator.wikimedia.org/T147620) [11:56:32] (03PS1) 10Giuseppe Lavagetto: installserver: add dhcp and netboot data for etcd100[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/314677 [11:59:52] (03PS1) 10Mforns: Remove mobile and exdist reportupdater jobs [puppet] - 10https://gerrit.wikimedia.org/r/314678 (https://phabricator.wikimedia.org/T147000) [12:03:30] volans: https://phabricator.wikimedia.org/T132837#2698895 [12:04:28] (03CR) 10Paladox: "Ah I see now" [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [12:05:27] marostegui: yeah [12:05:30] (03CR) 10Giuseppe Lavagetto: [C: 032] installserver: add dhcp and netboot data for etcd100[1-6] [puppet] - 10https://gerrit.wikimedia.org/r/314677 (owner: 10Giuseppe Lavagetto) [12:06:13] (03PS1) 10Marostegui: db-eqiad.php: Restore db1082 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314680 (https://phabricator.wikimedia.org/T145533) [12:06:58] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Restore db1082 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314680 (https://phabricator.wikimedia.org/T145533) (owner: 10Marostegui) [12:07:25] (03Merged) 10jenkins-bot: db-eqiad.php: Restore db1082 original weight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314680 (https://phabricator.wikimedia.org/T145533) (owner: 10Marostegui) [12:08:27] RECOVERY - cassandra CQL 10.192.16.34:9042 on maps-test2003 is OK: TCP OK - 0.036 second response time on port 9042 [12:09:11] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1082 with its original weight - T145533 (duration: 00m 52s) [12:09:12] T145533: Investigate db1082 crash - 
https://phabricator.wikimedia.org/T145533 [12:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:22:06] (03PS10) 10Paladox: phabricator: Create & configure a phabricator_stopwords table for innodb [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) [12:22:43] (03CR) 10Paladox: "I believe I may have fixed the problem also see how they create the database and tables here https://phabricator.wikimedia.org/diffusion/P" [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [12:24:43] (03PS1) 10Gilles: Upgrade to 0.1.23 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/314683 [12:30:15] (03CR) 10DCausse: [C: 031] "these wikis are ready" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314257 (https://phabricator.wikimedia.org/T146208) (owner: 10DCausse) [12:39:47] !log reimaging mw123[23] to Debian Jessie [12:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:42:20] 06Operations, 10media-storage: Two recently uploaded files have disappeared (404) - https://phabricator.wikimedia.org/T147040#2699200 (10fgiunchedi) I took a look at the time but forgot to update this task: I couldn't find anything obvious in swift (not even the images themselves) so it might be that the image... [12:49:35] (03CR) 10Jcrespo: [C: 04-1] "No, that is not the way we want to get it fixed, hardcoding the database. Neither we want a new database." 
[puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [12:52:13] (03CR) 10Filippo Giunchedi: [C: 032] Upgrade to 0.1.23 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/314683 (owner: 10Gilles) [12:56:09] (03PS1) 10Marostegui: .bashrc: Add my own .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/314685 [12:56:18] (03CR) 10Muehlenhoff: [C: 031] Imply Pivot UI's puppetization [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [13:02:35] (03CR) 10Alexandros Kosiaris: [C: 04-1] "a few inline comments that caught my eye" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [13:03:31] thanks akosiaris! [13:03:57] rest LGTM more or less [13:04:37] more or less means that I should do my homework and retry? :) [13:04:54] so "Imply" in there is the name of the company releasing pivot [13:05:10] hehehe [13:05:12] maybe I should reword it, too confusing [13:05:15] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/314685 (owner: 10Marostegui) [13:05:16] yes please [13:05:38] (03CR) 10Marostegui: [C: 032] .bashrc: Add my own .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/314685 (owner: 10Marostegui) [13:05:42] I definitely though that was a verb [13:05:46] thought* [13:08:11] (03PS2) 10Rush: bdsync debian directory [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314591 [13:10:56] yes my bad, didn't think about it [13:11:02] (03PS11) 10Elukey: Introduce the Imply Pivot UI's module and role [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) [13:11:03] it looks indeed weird :) [13:14:44] PROBLEM - puppet last run on pollux is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:15:14] PROBLEM - puppet last run on dbproxy1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/home/marostegui/.bashrc] [13:15:29] marostegui: ^ [13:16:27] akosiaris: strange [13:16:30] Error: Could not set 'file' on ensure: Error 404 on SERVER: Not Found: Could not find file_content modules/admin/home/marostegui/.bashrc [13:16:42] hmm, it did work on a second run [13:16:54] did you by any chance force a puppet run before puppet-merge was done ? [13:17:06] akosiaris: no :| [13:17:16] Notice: /Stage[main]/Admin/Admin::Hashuser[marostegui]/Admin::User[marostegui]/File[/home/marostegui/.bashrc]/ensure: defined content as '{md5}18598b3e07cd0e7b392d5088bb98530e' [13:17:28] this has all the telling signs of a race [13:17:46] akosiaris: I did a normal puppet-merge (I had no idea I should have done something before) [13:17:48] I'm running puppet on a host where it has not yet run [13:17:50] to be sure [13:17:54] akosiaris: How can I fix it then? [13:17:55] RECOVERY - puppet last run on dbproxy1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:18:04] 06Operations, 10Cassandra: Change graphite aggregation function for cassandra 'count' metrics - https://phabricator.wikimedia.org/T121789#2699279 (10Dzahn) [13:18:32] marostegui: I don't think there's something to fix.. All I am saying is that puppet run on that host before the content was available on the puppetmaster [13:18:49] run fine on other hosts [13:18:52] akosiaris: Ah right, I see I see. Interesting race condition :) [13:19:04] (03PS3) 10Rush: bdsync debian directory [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314591 [13:19:36] akosiaris: will be possible to have in the logs of puppet on the hosts the puppetmaster it connected to for that run and timestamps? [13:19:39] * volans hides [13:20:13] volans: it can only connect on 1 puppetmaster.. 
the frontend of the corresponding DC [13:20:22] so puppetmaster1001.eqiad.wmnet in this case [13:20:25] on a more serious note, could it be possible to get the catalog from one puppetmaster and the file from another? [13:20:28] ok so no... :) [13:20:29] yes [13:20:35] ahhh [13:20:38] that's what I am thinking happened [13:20:52] maybe got the catalog from puppetmaster1001 and the file from puppetmaster1002 [13:21:01] but if you just said only 1, corresponding to the DC? :D [13:21:02] but that's easy to find out by grepping apache logs [13:21:06] you're fooling me [13:21:10] :-P [13:21:18] the 2 puppetmasters are proxied by the frontend [13:21:26] so the host knows nothing about that [13:21:31] only connects to 1 [13:21:32] the frontend [13:21:34] 06Operations, 06Release-Engineering-Team, 06Security-Team, 15User-greg: Determine a core set or a checklist of permissions for deployment purpose - https://phabricator.wikimedia.org/T140270#2699289 (10Dzahn) [13:21:42] ok [13:21:48] the frontend does the balancing to backends [13:21:53] itself and whatever else [13:22:00] it's a weighted round robin more or less [13:22:11] so what you say is the most probable scenario for this race [13:22:17] ok [13:22:26] maybe we can make this a bit more sticky [13:22:35] you mean transactional? :-P [13:22:37] so that a host is always served by the same backend [13:22:56] it might make sense to get the catalog and the files from the same backend for a given run [13:23:19] yes.. assuming it does not already do that. I need to have a look at the config [13:23:19] we have enough entropy between hosts and runs to do the round robin anyway [13:23:23] but I think it does not [13:23:25] (03PS4) 10Rush: bdsync debian directory [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314591 [13:23:39] yeah, probably worth solving at the proxy level [13:23:44] if it does, it could be a race in the middle of a git pull [13:23:53] harder to fix I guess [13:24:32] that should not happen though ..
informing the puppetmaster that files have changed is actually done by touching site.pp [13:24:48] so it could be thought of as being an atomic operations [13:24:55] s/s$// [13:26:30] (03CR) 10Dzahn: "duplicate of https://gerrit.wikimedia.org/r/#/c/310897/ which has been waiting for a while" [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [13:28:42] (03CR) 10Dzahn: "well, not really, since this uses require_package. fwiw, recently on a similar change for labs, chasemp said that require_package doesnt b" [puppet] - 10https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: 10Muehlenhoff) [13:28:57] not for this case but if we modify site.pp it gets touched also during the git pull and could be in the middle [13:30:17] (03CR) 10Alexandros Kosiaris: [C: 031] Introduce the Imply Pivot UI's module and role [puppet] - 10https://gerrit.wikimedia.org/r/312495 (https://phabricator.wikimedia.org/T138262) (owner: 10Elukey) [13:31:14] true [13:31:54] we should use .dont_touch :-P [13:33:11] I would expect much more race conditions in commits that change site.pp in this case, but I'm going off topic, sorry [13:33:49] it's possible yes [13:34:10] whether those race conditions actually exhibit something is a different story [13:34:31] it's actually impossible to tell [13:34:41] btw .dont_touch would not work. It's site.pp explicitly [13:35:05] well, the entry point anyway [13:36:12] ah ok so should be much more ugly, empty entry point with one line import site.pp :D [13:36:31] yes, but there is one caveat [13:36:35] import is deprecated :P [13:36:39] ofc , it's puppet :D [13:36:46] and to enable the future parser we need to actually kill it :P [13:36:57] at least IIRC [13:37:10] (03CR) 10Filippo Giunchedi: [C: 04-1] "LGTM, some other comments" (033 comments) [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314591 (owner: 10Rush) [13:37:23] hmm maybe not... 
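[editor's note] The "sticky" idea raised at 13:22 (a host is always served by the same backend, so the catalog and the files for one run come from the same place) can be illustrated with a minimal sketch. This is not the puppetmaster frontend's actual configuration, which the log does not show; the backend list and the `pick_backend` helper are hypothetical, and the weighting of the existing round robin is deliberately ignored.

```python
import hashlib

# Hypothetical backend list; the names come from the discussion above,
# the real proxy configuration may differ.
BACKENDS = ["puppetmaster1001", "puppetmaster1002"]


def pick_backend(client_host, backends=BACKENDS):
    """Deterministically map a client host to one backend.

    Hashing the client name means every request from the same host
    (catalog fetch, then file fetches) lands on the same backend,
    while different hosts still spread across all backends.
    """
    digest = hashlib.sha256(client_host.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

With this kind of mapping a single agent run cannot see a catalog from one backend and file content from another, though it does nothing about a backend that is itself mid-update.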
[13:37:36] I see puppet parser validate not whining on my puppet 4.x install [13:37:36] RECOVERY - puppet last run on pollux is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [13:44:54] (03PS3) 10Muehlenhoff: Update to 4.4.23 [debs/linux44] - 10https://gerrit.wikimedia.org/r/314002 [13:45:38] (03PS3) 10Andrew Bogott: couple more labs support host hiera cluster key cleanups [puppet] - 10https://gerrit.wikimedia.org/r/309690 (owner: 10Alex Monk) [13:45:46] (03CR) 10Rush: bdsync debian directory (031 comment) [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314591 (owner: 10Rush) [13:46:08] (03PS5) 10Rush: bdsync debian directory [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314591 [13:47:59] (03CR) 10Rush: bdsync debian directory (031 comment) [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314591 (owner: 10Rush) [13:48:02] (03CR) 10Andrew Bogott: [C: 032] couple more labs support host hiera cluster key cleanups [puppet] - 10https://gerrit.wikimedia.org/r/309690 (owner: 10Alex Monk) [13:49:13] (03PS2) 10Andrew Bogott: more labs regex hiera cluster key fix for wikimedia.org hosts [puppet] - 10https://gerrit.wikimedia.org/r/309691 (owner: 10Alex Monk) [13:50:36] (03PS1) 10Alexandros Kosiaris: Use the distribution's check_nrpe binary and stop shipping our own [puppet] - 10https://gerrit.wikimedia.org/r/314689 [13:52:22] (03CR) 10Andrew Bogott: [C: 032] more labs regex hiera cluster key fix for wikimedia.org hosts [puppet] - 10https://gerrit.wikimedia.org/r/309691 (owner: 10Alex Monk) [13:53:32] (03PS3) 10Andrew Bogott: set labtest up in ganglia like labs [puppet] - 10https://gerrit.wikimedia.org/r/309692 (owner: 10Alex Monk) [13:53:39] (03PS6) 10Rush: bdsync debian directory [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314591 [13:55:04] (03CR) 10Andrew Bogott: [C: 032] set labtest up in ganglia like labs [puppet] - 10https://gerrit.wikimedia.org/r/309692 (owner: 10Alex Monk) [13:55:50] (03PS3) 10Andrew Bogott: remove extra cluster: 
labvirt hieradata [puppet] - 10https://gerrit.wikimedia.org/r/309693 (owner: 10Alex Monk) [13:56:05] !log reimaging mw123[45] to Debian Jessie (last two api appservers) [13:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:58:22] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, nice!" [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314591 (owner: 10Rush) [14:00:21] let's add debian glue jenkins job to that repo :D [14:01:10] https://gerrit.wikimedia.org/r/314691 Debian glue for ops/debs/bdsync [14:02:11] (03CR) 10Hashar: "recheck" [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314591 (owner: 10Rush) [14:02:22] thanks hashar [14:02:38] which is most probably going to be broken [14:02:45] but it is not voting so will always Verified+1 :] [14:02:55] (03CR) 10Thiemo Mättig (WMDE): "Code is fine, but I would like to test this manually and track this properly in our next Sprint before we merge it." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/312944 (https://phabricator.wikimedia.org/T146707) (owner: 10Jon Harald Søby) [14:03:02] (03CR) 10Muehlenhoff: [C: 032] Update to 4.4.23 [debs/linux44] - 10https://gerrit.wikimedia.org/r/314002 (owner: 10Muehlenhoff) [14:03:34] 06Operations, 10Cassandra: Change graphite aggregation function for cassandra 'count' metrics - https://phabricator.wikimedia.org/T121789#2699380 (10fgiunchedi) a:03fgiunchedi Taking this, I'll be changing graphite's storage_aggregation to account for cassandra's `count` metrics [14:06:10] chasemp: besides lintian pedantic issues, looks like the Jenkins job builds bdsync all fine !
https://integration.wikimedia.org/ci/job/debian-glue-non-voting/292/testReport/ ;] [14:06:26] (03PS4) 10Andrew Bogott: Add $use_ssl switch to role::labs::novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/314441 [14:07:12] (03CR) 10Rush: [C: 032] bdsync debian directory [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314591 (owner: 10Rush) [14:12:33] (03CR) 10Rush: labstore: Add monitoring for secondary HA cluster health (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [14:12:38] (03PS7) 10Rush: labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [14:13:34] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "https://puppet-compiler.wmflabs.org/4240/ says PCC is happy, merging" [puppet] - 10https://gerrit.wikimedia.org/r/314669 (owner: 10Alexandros Kosiaris) [14:13:40] (03PS2) 10Alexandros Kosiaris: monitoring: Kill check_ganglia [puppet] - 10https://gerrit.wikimedia.org/r/314669 [14:13:42] (03CR) 10Alexandros Kosiaris: [V: 032] monitoring: Kill check_ganglia [puppet] - 10https://gerrit.wikimedia.org/r/314669 (owner: 10Alexandros Kosiaris) [14:13:50] (03PS2) 10Rush: toolschecker: remove all references to labsdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/313896 (https://phabricator.wikimedia.org/T146455) (owner: 10Jcrespo) [14:13:58] (03CR) 10Rush: [C: 032 V: 032] toolschecker: remove all references to labsdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/313896 (https://phabricator.wikimedia.org/T146455) (owner: 10Jcrespo) [14:14:12] (03PS2) 10Alexandros Kosiaris: Use the distribution's check_nrpe binary and stop shipping our own [puppet] - 10https://gerrit.wikimedia.org/r/314689 [14:14:18] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Use the distribution's check_nrpe binary and stop shipping our own [puppet] - 10https://gerrit.wikimedia.org/r/314689 (owner: 
10Alexandros Kosiaris) [14:14:45] (03PS3) 10Rush: toolschecker: remove all references to labsdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/313896 (https://phabricator.wikimedia.org/T146455) (owner: 10Jcrespo) [14:14:47] (03CR) 10Rush: [V: 032] toolschecker: remove all references to labsdb1002 [puppet] - 10https://gerrit.wikimedia.org/r/313896 (https://phabricator.wikimedia.org/T146455) (owner: 10Jcrespo) [14:17:56] (03CR) 10Paladox: "Oh, how should we do it since there isent a database we can use to store the table in?" [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [14:18:41] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451#2699434 (10BBlack) [14:19:12] (03PS3) 10Giuseppe Lavagetto: kubernetes: add etcd clusters for k8s, networking [puppet] - 10https://gerrit.wikimedia.org/r/314657 (https://phabricator.wikimedia.org/T147620) [14:20:49] (03CR) 10Giuseppe Lavagetto: [C: 032] kubernetes: add etcd clusters for k8s, networking [puppet] - 10https://gerrit.wikimedia.org/r/314657 (https://phabricator.wikimedia.org/T147620) (owner: 10Giuseppe Lavagetto) [14:22:04] 06Operations, 06Labs, 10Tool-Labs, 10Traffic: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451#2661551 (10BBlack) (took the cache host out of the title to prevent confusion in future Phab searches for problems on specific cache hosts, since it didn't turn out to be relevant). [14:23:06] 06Operations, 10Mail, 10OTRS: OTRS spam classification methods and systems - https://phabricator.wikimedia.org/T146968#2699443 (10grin) >>! In T146968#2696232, @pajz wrote: > Now, I can't say anything definite given the relevant servers are operated by the WMF, so I suppose only they'd be able to provide per... 
[14:23:58] (03PS1) 10Giuseppe Lavagetto: etcd::auth: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/314694 [14:24:12] wikibugs_, yeah, that's the one. So, dear robot, do you know who do the email stuff coordination at WMF lately? [14:24:17] <_joe_> or, finding a typo in your code that has been live for 1 year... [14:25:25] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] etcd::auth: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/314694 (owner: 10Giuseppe Lavagetto) [14:25:35] (03PS8) 10Madhuvishy: labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) [14:26:19] humans are welcome to reply as well. :-) [14:26:43] (03CR) 10jenkins-bot: [V: 04-1] labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [14:27:26] (03CR) 10Andrew Bogott: [C: 032] Add $use_ssl switch to role::labs::novaproxy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/314441 (owner: 10Andrew Bogott) [14:27:33] (03PS5) 10Andrew Bogott: Add $use_ssl switch to role::labs::novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/314441 [14:28:26] (03PS1) 10Muehlenhoff: Add es-tool also on jessie [puppet] - 10https://gerrit.wikimedia.org/r/314695 [14:28:49] PROBLEM - puppet last run on etcd1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 1 minute ago with 2 failures. 
Failed resources (up to 3 shown): Exec[Etcd disable auth],File[/etc/etcd/local/],Etcd_user[root] [14:29:18] _joe_: ^^^ [14:29:32] <_joe_> I know [14:29:34] (03CR) 10jenkins-bot: [V: 04-1] Add es-tool also on jessie [puppet] - 10https://gerrit.wikimedia.org/r/314695 (owner: 10Muehlenhoff) [14:29:43] <_joe_> I'm bringing those machines up :) [14:29:58] just in case ;) [14:30:26] (03CR) 10Andrew Bogott: [V: 032] Add $use_ssl switch to role::labs::novaproxy [puppet] - 10https://gerrit.wikimedia.org/r/314441 (owner: 10Andrew Bogott) [14:31:00] (03PS9) 10Madhuvishy: labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) [14:31:09] (03PS2) 10Muehlenhoff: Add es-tool also on jessie [puppet] - 10https://gerrit.wikimedia.org/r/314695 [14:34:10] (03PS1) 10Muehlenhoff: videoscalers: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/314699 [14:37:13] (03PS1) 10Gehel: maps-test - change partman config to use a scheme with /srv [puppet] - 10https://gerrit.wikimedia.org/r/314703 [14:40:26] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [14:41:04] !log Update cxserver to fa2f715 (T147552) [14:41:04] !log uploaded openssl 1.0.2j for jessie-wikimedia to carbon [14:41:05] T147552: Add machine translation support for Papiamento (pap) - https://phabricator.wikimedia.org/T147552 [14:41:08] known ^ [14:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:41:33] 06Operations, 10Traffic, 10media-storage: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648#2699497 (10BBlack) [14:42:10] ^Thank you for the report, bblack [14:43:05] (03PS11) 10Paladox: phabricator: Create & configure a phabricator_stopwords table for innodb [puppet] - 
10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) [14:43:08] indeed, thanks! [14:43:10] (03PS12) 10Paladox: phabricator: Create & configure a phabricator_stopwords table for innodb [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) [14:43:36] (03CR) 10Paladox: "@Jcrespo oh I see now, we want to use phabricator_search to create the stopwords table." [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) (owner: 10Paladox) [14:43:55] jynus ^^ is that the way you want it? [14:44:24] I dropped the config since it was not working, and instead just added phabricator_search in the places where it was db_name/table [14:44:28] 06Operations, 06Commons, 06Multimedia: Deploy a PHP and HHVM patch (Exif values retrieved incorrectly if they appear before IFD) - https://phabricator.wikimedia.org/T140419#2699513 (10matmarex) Nice surprise. Thanks! [14:45:05] paladox, that is the idea, but we cannot sacrifice the variable in case the database changes [14:45:14] Oh [14:45:22] I think i may know why it dosen't work [14:45:24] 06Operations, 10Traffic, 10media-storage: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648#2699515 (10BBlack) [14:45:36] paladox, yes, it is probably a silly mistake [14:45:54] since we used a scope.lookupvar [14:46:06] jynus ^^ i will amend now to re add the config [14:46:08] without using scope.lookupvar [14:46:26] 06Operations, 10Traffic, 10media-storage: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648#2699516 (10fgiunchedi) Could be related to thumbor deployment, partial deployment was enabled on Sept 7th with https://gerrit.wikimedia.org/r/#/c/308746/ on Sept 13th on small wikis h... 
[14:47:09] I have to check also /etc/mysql/phabricator-init.sql [14:47:23] as according to reports, that may not be working [14:47:46] (03PS13) 10Paladox: phabricator: Create & configure a phabricator_stopwords table for innodb [puppet] - 10https://gerrit.wikimedia.org/r/314286 (https://phabricator.wikimedia.org/T146673) [14:48:13] jynus ok, does ^^ that look better? Could you also run that puppet test on the patch above please? [14:48:14] PROBLEM - puppet last run on etcd1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 14 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[Etcd disable auth],File[/etc/etcd/local/],Etcd_user[root] [14:48:58] jynus that init sql file you're talking about only has [14:48:59] set global ft_boolean_syntax = ' |-><()~*:""&^'; [14:49:07] Which we can drop since innodb does not support that [14:49:09] (03CR) 10Elukey: [C: 031] videoscalers: Restrict to domain networks [puppet] - 10https://gerrit.wikimedia.org/r/314699 (owner: 10Muehlenhoff) [14:49:15] that's myisam only [14:49:24] Want to drop it too? [14:50:12] is it? [14:50:53] It is a bug: https://bugs.mysql.com/bug.php?id=71551 [14:51:10] Yes [14:51:18] Let me check the link you gave [14:51:23] Yeah [14:51:30] the problem is we have to get that to work [14:51:35] jynus it's also described in the docs for full text search [14:51:43] jynus it won't work [14:51:47] it is not supported [14:51:52] if not on mysql, on phabricator [14:51:55] We will need to create a patch [14:52:13] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [14:52:21] (03PS3) 10Dzahn: contint: allow ssh from cobalt, in addition to lead [puppet] - 10https://gerrit.wikimedia.org/r/314641 (https://phabricator.wikimedia.org/T147597) [14:52:24] like adding "+" to all search terms [14:52:29] jynus maybe a better thing to do is create a bug on mariadb?
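[editor's note] The workaround floated at 14:52:24, adding "+" to all search terms, relies on the fact that in MySQL boolean-mode full-text search a leading `+` marks a word as required. A rough sketch of that rewriting follows; it is not Phabricator's code, `require_all_terms` is a hypothetical helper, and it assumes the default operator characters rather than a customised `ft_boolean_syntax`.

```python
import re

# Characters that already act as leading boolean-mode operators
# under the default syntax; such terms are left untouched.
_LEADING_OPERATORS = "+-~<>"

# A token is either a double-quoted phrase or a run of non-space characters.
_TOKEN = re.compile(r'"[^"]*"|\S+')


def require_all_terms(query):
    """Prefix every bare term or quoted phrase with '+' so it becomes required."""
    rewritten = []
    for token in _TOKEN.findall(query):
        if token[0] in _LEADING_OPERATORS:
            rewritten.append(token)
        else:
            rewritten.append("+" + token)
    return " ".join(rewritten)
```

The result would be passed as the search string of a `MATCH ... AGAINST (... IN BOOLEAN MODE)` query; whether that is an acceptable fix for the Phabricator search is exactly what the discussion leaves open.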
[14:52:36] PROBLEM - etcd service on etcd1002 is CRITICAL: CRITICAL - Expecting active but unit etcd is activating [14:52:37] Since mysql is getting slow with [14:52:39] their updates [14:52:44] Mariadb is the new mysql [14:52:53] "Mariadb is the new mysql"? [14:53:14] Yes, oracle doesn't really want mysql, but it was part of the agreement if they bought sun [14:53:16] years ago [14:53:21] really? [14:54:00] yes [14:54:01] haha [14:54:07] I find that strange, because oracle has 3 releases (5.6, 5.7, and 8.0) with interesting new features [14:54:22] jynus that agreement ended i think a year or two ago [14:54:24] while mariadb has been advocating for the use of closed source [14:54:57] I read that somewhere, but can't seem to find it (Can't remember what i typed in google now) [14:55:01] http://www.infoworld.com/article/3109213/open-source-tools/open-source-uproar-as-mariadb-goes-commercial.html [14:55:11] good read^ [14:55:29] (CR) Dzahn: [C: 2] contint: allow ssh from cobalt, in addition to lead [puppet] - https://gerrit.wikimedia.org/r/314641 (https://phabricator.wikimedia.org/T147597) (owner: Dzahn) [14:55:33] so yes, I agree, MariaDB (corporation) is the new Oracle :-) [14:56:16] Oh [14:56:22] I guess mariadb is finished then [14:56:40] well, don't be so drastic [14:56:43] someone needs to create a fork before it gets closed [14:56:57] oh [14:56:58] there is a lot of open source in the mysql/mariadb community [14:57:14] Yep, but why would they want to close off the code for 2.* [14:58:31] But now it is difficult, oracle is still supporting mysql but can drop support at any time, because the agreement said they had to support mysql for at least 10 or 15 years (one of those) [14:58:35] but i think it has now finished [14:58:56] hello. could i get someone to run two small SQL UPDATE queries on commonswiki master? 
[14:58:57] update image set img_metadata='' where img_name = "Jena_-_Hummelsberg_05.jpg"; -- T145953 [14:58:57] T145953: File displaying invalid metadata in Commons, when the original Exif seems fine. - https://phabricator.wikimedia.org/T145953 [14:59:00] update image set img_metadata='' where img_name = "20160927_St_George's_Church_(The_Winery)_Mohegan_Lake_2.jpg"; -- T147015 [14:59:00] T147015: Exif orientation problem (no metadata extracted due to empty JPEG segment) - https://phabricator.wikimedia.org/T147015 [14:59:46] this is to purge the incorrect image metadata we have for those files, and to verify that the respective issues are fixed, before i try to make T32961 happen to do this for all files. [14:59:47] T32961: Run refreshImageMetadata.php --force - https://phabricator.wikimedia.org/T32961 [15:00:03] MatmaRex, let me help [15:00:27] but as you may know, queries on the server are rare, I need to get context and make backups [15:00:43] (CR) RobH: [C: 1] "after some review with Guillaume, it turns out these test systems use the H200 controller. As they are test systems, we'll leave them han" [puppet] - https://gerrit.wikimedia.org/r/314703 (owner: Gehel) [15:00:58] MatmaRex, please include me on the ticket [15:01:17] jynus: yeah, sure, i know this isn't standard :D which ticket? [15:01:21] ah, i'll add you to all of them [15:01:31] are there many? ok :-) [15:01:38] jynus on their website https://mariadb.org/ it says "Guaranteed to stay open source." [15:01:42] False advertising [15:02:05] paladox, do not confuse the business with the foundation [15:02:13] but I do not blame you :-) [15:02:14] oh [15:02:15] mariadb server is remaining open source, only maxscale isn't [15:02:20] Oh [15:02:28] didn't know there was maxscale [15:02:32] What is that? 
[15:02:34] moritzm, funnily, because oracle requires it [15:02:59] as it is GPL :-) [15:03:33] RECOVERY - etcd service on etcd1002 is OK: OK - etcd is active [15:04:18] MatmaRex, so there are a bunch of files that created wrong metadata that needs to be reparsed? [15:04:18] jynus https://techcrunch.com/2012/08/18/oracle-makes-more-moves-to-kill-open-source-mysql/ [15:04:44] paladox, exactly what we did, in 2012-2013 [15:04:48] 06Operations, 10Gerrit, 10hardware-requests: Allocate spare misc box in eqiad for gerrit replacement - https://phabricator.wikimedia.org/T147596#2699555 (10mark) 05Open>03Resolved Approved. [15:04:50] Oh [15:05:46] jynus: yeah, like hundreds of thousands of them probably, but i can't tell you the exact number before we reparse them all and see which changed :D [15:06:03] for now i'd just like to do it for those two i mentioned [15:06:14] oh, so this is like a test for a proper job [15:06:15] ok [15:06:21] thanks, I needed the context [15:07:20] jynus anyways Innodb supports boolean support [15:07:41] 06Operations, 10Traffic, 10media-storage: Unexplained increase in thumbnail 500s - https://phabricator.wikimedia.org/T147648#2699558 (10fgiunchedi) Another data point, scrolling now the list of 500s I see a good chunk with size '0px' coming from `user_agent": "Wikipedia/942 CFNetwork/808.0.2 Darwin/16.0.0"`... 
[15:07:51] go to http://dev.mysql.com/doc/refman/5.7/en/fulltext-fine-tuning.html [15:07:59] Then to section Modifying Boolean Full-Text Search Operators [15:08:19] "InnoDB does not have an equivalent setting" [15:08:19] See also http://dev.mysql.com/doc/refman/5.7/en/fulltext-boolean.html [15:08:22] Yep [15:08:42] the point is, if innodb does not support it, we still have to have the functionality [15:08:49] Yep [15:09:00] But the only way to do that is if we contribute a patch [15:09:02] (PS2) Gehel: maps-test - change partman config to use a scheme with /srv [puppet] - https://gerrit.wikimedia.org/r/314703 [15:09:07] that supports this for innodb :) [15:09:13] paladox, not necessarily [15:09:25] Oh, but how would we get it supported? [15:09:44] we could hack phab to automatically add + to search terms when they are not already prefixed with - or + [15:09:50] that would be easier [15:09:54] Oh [15:10:10] and does not require a new package or upgrade [15:10:15] just an idea [15:10:18] jynus if you know how we can do that, we could possibly test on phab-01 but i have switched that to elasticsearch [15:10:26] or could use phab-05? [15:10:55] well, that is the point, suggest that and then wait, because maintaining code is not an easy task [15:11:06] Ok [15:11:30] (CR) Gehel: [C: 2] maps-test - change partman config to use a scheme with /srv [puppet] - https://gerrit.wikimedia.org/r/314703 (owner: Gehel) [15:12:15] \o/ [15:12:19] yay standardization! [15:12:22] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 11 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[Etcd disable auth],File[/etc/etcd/local/],Etcd_user[root] [15:12:39] gehel: something is wrong with me that actually makes me happy ;] [15:13:11] (03PS1) 10Madhuvishy: tools proxy: Add health check and icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/314707 (https://phabricator.wikimedia.org/T143638) [15:13:23] robh: :] /me is happy to make you happy :) [15:13:24] someday all our servers will store our wmf specific crap in a similar directory structure. [15:13:43] robh: nah, we will change the standard at that point... [15:13:49] i'll resist the urge to salt a search to find out how many /a hosts we still have [15:13:54] that used to be where we put all our shit.... [15:14:10] when we didnt just dump it over normal system files. [15:14:19] (03PS1) 10Giuseppe Lavagetto: etcd: bootstrap the networking cluster; the k8s cluster is now bootstrapped [puppet] - 10https://gerrit.wikimedia.org/r/314708 [15:14:21] (03CR) 10jenkins-bot: [V: 04-1] tools proxy: Add health check and icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/314707 (https://phabricator.wikimedia.org/T143638) (owner: 10Madhuvishy) [15:15:04] (03PS2) 10Elukey: Remove mobile and exdist reportupdater jobs [puppet] - 10https://gerrit.wikimedia.org/r/314678 (https://phabricator.wikimedia.org/T147000) (owner: 10Mforns) [15:15:39] (03CR) 10Giuseppe Lavagetto: [C: 032] etcd: bootstrap the networking cluster; the k8s cluster is now bootstrapped [puppet] - 10https://gerrit.wikimedia.org/r/314708 (owner: 10Giuseppe Lavagetto) [15:16:19] (03CR) 10Elukey: [C: 032] Remove mobile and exdist reportupdater jobs [puppet] - 10https://gerrit.wikimedia.org/r/314678 (https://phabricator.wikimedia.org/T147000) (owner: 10Mforns) [15:16:23] (03PS3) 10Elukey: Remove mobile and exdist reportupdater jobs [puppet] - 10https://gerrit.wikimedia.org/r/314678 (https://phabricator.wikimedia.org/T147000) (owner: 10Mforns) [15:16:51] (03CR) 10Elukey: [V: 032] Remove mobile and 
exdist reportupdater jobs [puppet] - 10https://gerrit.wikimedia.org/r/314678 (https://phabricator.wikimedia.org/T147000) (owner: 10Mforns) [15:17:00] (03PS2) 10Madhuvishy: tools proxy: Add health check and icinga monitoring [puppet] - 10https://gerrit.wikimedia.org/r/314707 (https://phabricator.wikimedia.org/T143638) [15:18:43] PROBLEM - puppet last run on cp3021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:19:26] jynus i filled https://jira.mariadb.org/browse/MDEV-10978 :) [15:19:37] Please fill in more info if i didn't include it all [15:27:05] jynus i think it's this https://github.com/wikimedia/phabricator/blob/bf75469a3427f7b9bab9628f6c6a62ec8f7e7f1f/src/applications/search/fulltextstorage/PhabricatorMySQLFulltextStorageEngine.php file your looking for for phab? [15:28:06] (03PS1) 10Gilles: Upgrade to 0.1.24 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/314709 [15:28:22] https://github.com/wikimedia/phabricator/blob/bf75469a3427f7b9bab9628f6c6a62ec8f7e7f1f/src/applications/search/fulltextstorage/PhabricatorMySQLFulltextStorageEngine.php#L177 [15:28:38] paladox, please take some time- I proposed a solution, but there is a different group "owning" phabricator, it is for them to decide if my proposal is useful [15:28:49] Ok [15:28:53] they may decide a different option [15:28:58] Ok [15:29:17] if you are looking for work, I have several tickets that you could help with :-) [15:29:25] if you like mysql [15:29:34] jynus upstream are looking at innodb https://secure.phabricator.com/T11741 and oh [15:29:40] What would these task be? 
[15:30:08] So on my board: https://phabricator.wikimedia.org/project/view/1060/ [15:30:33] There is a "backlog (help welcome)" [15:30:41] oh [15:30:42] where I put tickets I cannot do because of lack of time [15:30:54] That's a lot of tasks LOL [15:31:01] mainly helping tool users [15:31:10] which I cannot do for every single user [15:31:26] Ok [15:31:26] but I suggest those for contributors [15:31:41] it is not that I do not welcome help on the others [15:31:43] I do [15:31:58] but those are marked as "easy for people to start" [15:32:22] others sometimes have lots of blockers [15:32:28] again, it is only a suggestion [15:33:05] other things I need help with, if you have labs access [15:33:11] oh [15:33:16] Operations, ops-codfw: update/audit serial of EX4300-spare2-codfw - https://phabricator.wikimedia.org/T147592#2699618 (Papaul) a:Papaul>RobH PE3715320310 [15:33:19] I don't have labs access [15:33:22] only a few instances [15:33:24] sorry, tools? [15:33:33] do you have tools access? 
[15:33:34] Oh, i only have access to grrrit-wm [15:33:59] I was going to suggest help verify missing data from labsdbs [15:34:02] nope no tools access except for ^^ [15:34:03] oh [15:34:29] (CR) Filippo Giunchedi: [C: 2] Upgrade to 0.1.24 [debs/python-thumbor-wikimedia] - https://gerrit.wikimedia.org/r/314709 (owner: Gilles) [15:34:31] sometimes things change and it is difficult to check if tickets have been already fixed or check the exact problems [15:34:43] Oh [15:34:48] basically the idea is to avoid more than one person working at the same time on the same task [15:35:06] Yep [15:35:35] thanks for the mariadb ticket, BTW [15:35:44] You're welcome :) [15:35:50] They seem to be more active up there [15:35:52] than mysql [15:36:08] in bugs, I would agree [15:36:12] Yep [15:36:14] in features, not so much [15:36:19] Yep [15:36:50] Internet Explorer is using 97% cpu usage [15:36:57] On a 1.8ghz pc [15:37:16] and 1gb of ram [15:40:01] jynus i remember earlier this year when i was trying mariadb on Ubuntu on windows bash [15:40:06] It broke mysql [15:40:15] But luckily one build fixed the issue [15:40:30] I am now using ubuntu 16.04 as they recently added support for it. [15:40:46] I found native ubuntu on windows faster than a vm [15:40:54] PROBLEM - puppet last run on etcd1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 19 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[Etcd disable auth],File[/etc/etcd/local/],Etcd_user[root] [15:40:59] https://msdn.microsoft.com/en-gb/commandline/wsl/about [15:46:12] RECOVERY - puppet last run on cp3021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:52:01] (03PS1) 10Giuseppe Lavagetto: etcd::networking: cluster is bootstrapped [puppet] - 10https://gerrit.wikimedia.org/r/314711 [15:53:18] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] etcd::networking: cluster is bootstrapped [puppet] - 10https://gerrit.wikimedia.org/r/314711 (owner: 10Giuseppe Lavagetto) [15:53:25] (03PS2) 10Giuseppe Lavagetto: etcd::networking: cluster is bootstrapped [puppet] - 10https://gerrit.wikimedia.org/r/314711 [15:53:27] (03CR) 10Giuseppe Lavagetto: [V: 032] etcd::networking: cluster is bootstrapped [puppet] - 10https://gerrit.wikimedia.org/r/314711 (owner: 10Giuseppe Lavagetto) [15:54:32] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Build etcd clusters to support Kubernetes and calico - https://phabricator.wikimedia.org/T147421#2699641 (10Joe) 05Open>03Resolved p:05Triage>03Normal [15:57:47] jynus: so, could you run those two UPDATEs (or would it be okay for me to get someone else to do it? i don't have prod access myself)? or are you preparing/backuping/something? [15:58:16] MatmaRex, I will, but I need to prepare backups [15:58:22] in a meeting now, give me some time [15:58:48] hm, it's just two rows, and the current data is wrong. 
but alright, thanks [15:58:54] MatmaRex, hopefuly I can do those after it finishes [15:59:10] "it is only 2 rows" are good last words [15:59:13] :-) [15:59:32] backups only take a second, I only need to do them carefully [16:00:33] 06Operations, 10Prod-Kubernetes, 05Kubernetes-production-experiment, 15User-Joe: Build etcd clusters to support Kubernetes and calico - https://phabricator.wikimedia.org/T147421#2699650 (10Joe) [16:00:35] 06Operations, 10Prod-Kubernetes, 10vm-requests, 05Kubernetes-production-experiment, and 2 others: 6 small VMs for etcd clusters for kubernetes and its networking component - https://phabricator.wikimedia.org/T147620#2699649 (10Joe) 05Open>03Resolved [16:05:26] (03CR) 10Eevans: "PC output: http://puppet-compiler.wmflabs.org/4242" [puppet] - 10https://gerrit.wikimedia.org/r/314603 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [16:05:33] (03PS3) 10Eevans: Add time-window compaction strategy jar to classpath [puppet] - 10https://gerrit.wikimedia.org/r/314603 (https://phabricator.wikimedia.org/T133395) [16:06:02] !log updated hhvm package for jessie to 3.12.9 [16:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:23] PROBLEM - puppet last run on etcd1005 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 28 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[Etcd disable auth],File[/etc/etcd/local/],Etcd_user[root] [16:13:04] PROBLEM - puppet last run on etcd1006 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Exec[Etcd disable auth],File[/etc/etcd/local/],Etcd_user[root] [16:14:09] !log build python-irclib for jessie and upload it to apt.wikimedia.org jessie-wikimedia/main [16:14:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:40] (03PS1) 10Ema: cache_text backend VCL: use bereq in misspass_mangle [puppet] - 10https://gerrit.wikimedia.org/r/314715 (https://phabricator.wikimedia.org/T131503) [16:14:42] (03PS1) 10Ema: WIP: Text VCL forward-port to Varnish 4 [puppet] - 10https://gerrit.wikimedia.org/r/314716 (https://phabricator.wikimedia.org/T131503) [16:25:04] !log reimage maps-test2001 - T147194 [16:25:05] T147194: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194 [16:25:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:26:21] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2699694 (10ops-monitoring-bot) Script wmf_auto_reimage was launched by gehel on neodymium.eqiad.wmnet for hosts: ``` ['maps-test2001.codfw.wmnet'] ``` The log can be found in `/va... [16:30:04] 06Operations, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2699697 (10elukey) >>! In T109226#2696811, @BBlack wrote: > On your repro attempts: I think the original... [16:33:55] 06Operations, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2699698 (10BBlack) hmm I'm pretty sure we were able to repro reliably at one point in the past, but I'd... 
[16:35:17] 06Operations, 10Traffic, 10Wikimedia-Apache-configuration, 13Patch-For-Review: Sometimes apache error 503s redirect to /503.html and this redirect gets cached - https://phabricator.wikimedia.org/T109226#2699700 (10BBlack) Maybe there are different ways in which `HHVM is busted`, and having hhvm be down isn... [16:39:47] MatmaRex, could you update the ticket with the specific row to update [16:40:16] jynus: yeah [16:40:21] update image set img_metadata='' where img_name = "Jena_-_Hummelsberg_05.jpg"; -- T145953 [16:40:22] T145953: File displaying invalid metadata in Commons, when the original Exif seems fine. - https://phabricator.wikimedia.org/T145953 [16:40:27] update image set img_metadata='' where img_name = "20160927_St_George's_Church_(The_Winery)_Mohegan_Lake_2.jpg"; -- T147015 [16:40:27] T147015: Exif orientation problem (no metadata extracted due to empty JPEG segment) - https://phabricator.wikimedia.org/T147015 [16:40:32] i'll put these on the tasks [16:40:46] yes please, more permanent [16:40:48] there [16:41:16] (e.g. imagine someone else breaks because of that, even if unlikely) [16:41:41] also this is commons, which was not 100% clear to me [16:41:47] i imagine they'll get !log-ged [16:41:47] at first [16:41:52] yes [16:41:58] but anyway [16:42:02] sorry, yes. i think i mentioned it earlier, but the tasks might be unclear [16:42:32] for me this is just one of 900 databases, so I have to ask [16:43:59] also, adding me to a closed ticket was not very clear [16:44:20] at first [16:44:47] however, blanking the metadata may now work [16:45:02] if it didn't work for a regular purge [16:45:13] but we can try anyway [16:51:17] Ok I have made a backup now [16:53:31] !log testing img_metadata nuking for T145953 and T147015 (backups on neodymium) [16:53:32] T145953: File displaying invalid metadata in Commons, when the original Exif seems fine. 
- https://phabricator.wikimedia.org/T145953 [16:53:33] T147015: Exif orientation problem (no metadata extracted due to empty JPEG segment) - https://phabricator.wikimedia.org/T147015 [16:53:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:54:27] MatmaRex, done [16:54:35] I will update the tickets [16:55:59] jynus: thanks. and it looks like the field got re-generated as expected (and with correct data) after i viewed the file pages. [16:56:10] oh, interesting [16:56:42] I would have thought it would have required a more complex step (force reparsing) [16:57:37] there's some magic code to detect invalid values and update them, and '' happens to be an invalid value. but there's no way to force this to happen from the user interface [16:57:54] nice [16:58:08] but metadata is now cut? [16:58:16] is that normal? [16:58:22] hm? [16:58:50] jynus: cut? [16:58:54] oh, sorry [16:58:57] it was less [16:59:00] client side [16:59:17] I do not normally read db content [16:59:35] there's some weird binary data in the metadata for one of the files [16:59:38] everything is fien [16:59:42] *fine [16:59:55] just my client was ignoring long content [16:59:59] i suppose it could be cut off if it exceeded the field length. but it doesn't in this case :) [17:00:00] ok :) [17:00:35] so you will create a maintenance task now? [17:01:23] that's https://phabricator.wikimedia.org/T32961 [17:01:29] we should work on exif/metadata a bit- it was recently causing issues [17:01:38] because 3MB metadata lookups [17:02:12] yeah, i heard someone wants to put it into a separate table, rather than a serialized blob [17:02:16] ostriches: sooo.. 
i restored data from lead to cobalt via bacula, you can try getting on cobalt and see /srv/gerrit/ there [17:02:28] ostriches: thing is, it's just 2.2G and not 20G [17:02:29] actually, InnoDB does that internally [17:02:50] the issue is some bots asking for all that data on all images [17:02:51] ostriches: is the a compression thing or is it missing things? it's not immediately obvious to me.. [17:03:16] Hmmm, lemme look [17:03:38] we definitelly want to do at some point for revision + rev_comment, specially to allow longer comments [17:03:49] Weird. [17:05:08] for example i randomly picked xowa.git [17:05:35] on lead 55M , on cobalt 1.8M [17:05:40] and both 477 subdirs [17:05:49] and the restore is supposed to be from latest [17:07:02] s/subdirs/files (find . | wc -l) [17:08:13] We can just rsync it [17:09:41] what's with srv/gerrit2 on lead now [17:10:04] ok, setting up rsyncd/ferm via puppet, 3 min [17:13:33] 06Operations, 10Cassandra, 10hardware-requests, 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2699842 (10Eevans) From https://phabricator.wikimedia.org/T139961#2541110 re: the available AMS nodes: > To summarize a conversation with @mark on IRC... 
[17:14:25] 06Operations, 10Cassandra, 06Services, 10hardware-requests, 07Wikimedia-Incident: Staging / Test environment(s) for RESTBase - https://phabricator.wikimedia.org/T136340#2699847 (10Eevans) [17:15:45] (03CR) 10Mobrovac: [C: 031] Add time-window compaction strategy jar to classpath [puppet] - 10https://gerrit.wikimedia.org/r/314603 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [17:16:32] (03PS1) 10Dzahn: gerrit: add rsyncd on cobalt for migrating data [puppet] - 10https://gerrit.wikimedia.org/r/314726 (https://phabricator.wikimedia.org/T147597) [17:18:03] (03CR) 10Dzahn: [C: 032] "only touches new server not in prod yet" [puppet] - 10https://gerrit.wikimedia.org/r/314726 (https://phabricator.wikimedia.org/T147597) (owner: 10Dzahn) [17:21:16] ostriches: rsync is running.. [17:21:23] Mmk [17:21:39] it's in a screen on lead [17:21:40] (03CR) 10Eevans: [C: 04-1] "Merging this, and the subsequent restart of Cassandra nodes should be deferred until Monday (2016-10-10). So: I am -1'ing this now Just I" [puppet] - 10https://gerrit.wikimedia.org/r/314603 (https://phabricator.wikimedia.org/T133395) (owner: 10Eevans) [17:21:50] i moved the bacula data to gerrit-bacula [17:23:01] Hi could we disable phd on phabricator [17:23:13] to prevent it from git cloning and causing high load? [17:23:42] i was expecting we finish the first run, then we stop things and then we rsync the diff [17:24:11] Should we also put gerrit in read only mode? 
[17:24:52] 06Operations, 06Labs: Move maps share to labstore1003 - https://phabricator.wikimedia.org/T147657#2699887 (10madhuvishy) [17:24:57] I think you have to, to get a clean sync at the end [17:25:25] I don't know if whatever that readonly mode is, is reliable, either [17:25:28] (03PS1) 10Madhuvishy: maps: Mount maps share on labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/314727 (https://phabricator.wikimedia.org/T147657) [17:25:47] maybe finish up a live rsync, then go readonly and rsync again, then stop the gerrit service and rsync (hopefully 0 bytes xfer) one last time [17:26:03] I bet at least some metadata changes in that final rsync [17:26:32] oh [17:26:55] But should we stop phab from trying to clone, since it will fail when gerrit goes down? [17:27:04] yeah probably [17:27:13] what bblack said, yes [17:27:18] stopping it to prevent its load impact on the rsyncing is nice anyways [17:27:31] Yep [17:27:54] I'm semi-offline for a bit now unfortunately, but call/text if you need me for anything [17:28:00] I can get back in a few mins [17:28:10] thank you [17:28:24] it's currently working on the gazillion small objects in mediawiki core repo [17:28:51] !log rsyncing gerrit data from lead to cobalt [17:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:35:04] We need to land https://gerrit.wikimedia.org/r/#/c/314628/ before we migrate. 
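Editor's note: the staged sync suggested above (finish a live rsync, go read-only and rsync again, then stop the service and rsync one last time) can be written down as an ordered plan. The Python sketch below only builds the command lists and runs nothing. The destination "cobalt::gerrit/" is an assumed rsync daemon module name, and the read-only and service-stop steps are left as placeholders because the log does not say how they were done.

```python
from typing import List, Optional, Tuple

Step = Tuple[str, Optional[List[str]]]


def rsync_cmd(src: str, dest: str, dry_run: bool = False) -> List[str]:
    """Build an rsync argument list; --delete keeps the copy an exact mirror."""
    cmd = ["rsync", "--archive", "--delete"]
    if dry_run:
        # Lists what would change without copying anything.
        cmd += ["--dry-run", "--itemize-changes"]
    return cmd + [src, dest]


def migration_plan(src: str, dest: str) -> List[Step]:
    """Return the staged copy as (description, command) pairs.

    A command of None marks a site-specific step that is not modelled here.
    """
    return [
        ("live copy while the service is still up", rsync_cmd(src, dest)),
        ("switch the service to read-only (site-specific)", None),
        ("second copy to pick up the delta", rsync_cmd(src, dest)),
        ("stop the service (site-specific)", None),
        ("final copy, expected to move almost nothing", rsync_cmd(src, dest)),
        ("dry-run check of what still differs", rsync_cmd(src, dest, dry_run=True)),
    ]


if __name__ == "__main__":
    # "/srv/gerrit/" is the path mentioned in the log; the destination is assumed.
    for description, command in migration_plan("/srv/gerrit/", "cobalt::gerrit/"):
        print(description, "->", " ".join(command) if command else "(manual step)")
```

The closing dry run is an addition for illustration: if it lists no changes, the two trees match at that moment.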
[17:37:25] (03PS1) 10Chad: Gerrit: Put lead into maintenance mode [puppet] - 10https://gerrit.wikimedia.org/r/314729 [17:41:49] (03CR) 10Paladox: [C: 031] Gerrit: Put lead into maintenance mode [puppet] - 10https://gerrit.wikimedia.org/r/314729 (owner: 10Chad) [17:58:27] 06Operations, 06Discovery, 06Maps, 03Interactive-Sprint: reimage maps-test* servers - https://phabricator.wikimedia.org/T147194#2699986 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['maps-test2001.codfw.wmnet'] ``` Those hosts were successful: ``` ['maps-test2001.codfw.wmnet'] ``` [18:10:54] PROBLEM - puppet last run on restbase1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:11:10] (03PS2) 10Andrew Bogott: Follow-up I695dab22: Fix style [puppet] - 10https://gerrit.wikimedia.org/r/312280 (owner: 10Alex Monk) [18:12:05] Would someone mind doing a quick IP whitelist for 98.174.142.53 - it's the Old Globe Theatre at Balboa Park, where Wikiconference North America is currently happening [18:12:19] Edit-a-thon is just starting, some have had problems already, but hopefully not too much of an issue [18:13:20] Folks are doing IP edits, https://en.wikipedia.org/wiki/Special:Contributions/98.174.142.53 [18:14:32] ostriches: rsync is done [18:14:41] (03CR) 10Rush: [C: 032] maps: Mount maps share on labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/314727 (https://phabricator.wikimedia.org/T147657) (owner: 10Madhuvishy) [18:14:46] (03PS2) 10Rush: maps: Mount maps share on labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/314727 (https://phabricator.wikimedia.org/T147657) (owner: 10Madhuvishy) [18:15:06] (03CR) 10Andrew Bogott: [C: 032] Follow-up I695dab22: Fix style [puppet] - 10https://gerrit.wikimedia.org/r/312280 (owner: 10Alex Monk) [18:15:27] (03CR) 10Madhuvishy: [V: 032] maps: Mount maps share on labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/314727 (https://phabricator.wikimedia.org/T147657) (owner: 
10Madhuvishy) [18:15:37] mutante: awesome. let's land maint mode then do the rest [18:15:40] (03PS3) 10Madhuvishy: maps: Mount maps share on labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/314727 (https://phabricator.wikimedia.org/T147657) [18:15:45] (03CR) 10Madhuvishy: [V: 032] maps: Mount maps share on labstore1003 [puppet] - 10https://gerrit.wikimedia.org/r/314727 (https://phabricator.wikimedia.org/T147657) (owner: 10Madhuvishy) [18:16:13] ostriches: just, how do i merge after maint mode ? [18:16:38] hmm good question lol [18:16:51] the "slave" change i could do before [18:17:02] let's do slave first [18:17:10] I'll amend the ip one [18:17:15] and we'll do that [18:17:17] then maint [18:17:37] err need a cobalt role patch then mainy [18:17:39] (03CR) 10Dzahn: [C: 032] "compiled http://puppet-compiler.wmflabs.org/4243/lead.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/314623 (owner: 10Chad) [18:17:41] *maint [18:18:12] (03PS5) 10Dzahn: Gerrit: provide a way to specify slave mode [puppet] - 10https://gerrit.wikimedia.org/r/314623 (owner: 10Chad) [18:19:00] 06Operations, 10ops-eqiad: many items in rack 'z1' are lacking info - https://phabricator.wikimedia.org/T145158#2700061 (10Cmjohnson) 05Open>03Resolved All info in that rack has been updated. [18:19:52] (03PS1) 10Dzahn: gerrit: activate gerrit::server role on cobalt [puppet] - 10https://gerrit.wikimedia.org/r/314734 (https://phabricator.wikimedia.org/T147597) [18:20:00] slave change is submitted [18:20:02] 06Operations, 10ops-eqiad: investigate spare ex4500 serial number - https://phabricator.wikimedia.org/T147590#2700065 (10Cmjohnson) 05Open>03Resolved updated in racktables GG0210396122 [18:21:05] ostriches: first one applied on lead ... right now, restarts service [18:21:24] puppet does [18:21:30] Yah restart for no net change, yippie [18:21:39] done [18:22:15] how did we disable maintenance mode last time? 
[18:23:24] Probably hacked around it [18:23:41] Should probably rewrite how that works anyway [18:23:43] But not now [18:23:48] of course the bot died ... [18:24:04] RECOVERY - puppet last run on restbase1013 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:24:17] paladox: would you mind [18:24:28] grrrit-wm [18:25:11] ostriches: so https://gerrit.wikimedia.org/r/#/c/314734/ now? [18:25:44] Oh [18:25:48] Ok [18:25:54] :) thank you [18:25:58] Your welcome [18:27:14] https://gerrit.wikimedia.org/r/#/c/314628/should be goodl now [18:27:16] mutante ^^ done [18:28:05] paladox: cool [18:28:11] ostriches: yes, on it [18:28:13] (03CR) 10Paladox: [C: 031] Gerrit: Specify public IPs for eqiad, we're not changing them [puppet] - 10https://gerrit.wikimedia.org/r/314628 (https://phabricator.wikimedia.org/T147597) (owner: 10Chad) [18:28:15] (03CR) 10Chad: [C: 031] gerrit: activate gerrit::server role on cobalt [puppet] - 10https://gerrit.wikimedia.org/r/314734 (https://phabricator.wikimedia.org/T147597) (owner: 10Dzahn) [18:28:46] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/4244/" [puppet] - 10https://gerrit.wikimedia.org/r/314628 (https://phabricator.wikimedia.org/T147597) (owner: 10Chad) [18:29:42] (03PS3) 10Dzahn: gerrit: activate gerrit::server role on cobalt [puppet] - 10https://gerrit.wikimedia.org/r/314734 (https://phabricator.wikimedia.org/T147597) [18:30:26] (03PS4) 10Rush: labstore: align tools drbd with current prod [puppet] - 10https://gerrit.wikimedia.org/r/314028 [18:30:38] (03PS5) 10Rush: labstore: drbd resource setup sanity [puppet] - 10https://gerrit.wikimedia.org/r/312023 [18:30:52] (03PS5) 10Rush: labstore: align tools drbd with current prod [puppet] - 10https://gerrit.wikimedia.org/r/314028 [18:30:55] (03CR) 10jenkins-bot: [V: 04-1] labstore: align tools drbd with current prod [puppet] - 10https://gerrit.wikimedia.org/r/314028 (owner: 10Rush) [18:30:56] 06Operations, 10Ops-Access-Requests: Requesting 
access to terbium for smalyshev - https://phabricator.wikimedia.org/T147666#2700159 (10Smalyshev) [18:30:59] (03PS1) 10Dzahn: gerrit: remove cobalt.yaml from hiera, uses role now [puppet] - 10https://gerrit.wikimedia.org/r/314735 [18:31:16] Is this https://gerrit.wikimedia.org/r/#/c/314729/ ready to be merged? or arn't we there yet [18:31:28] (03CR) 10Dzahn: [C: 032] gerrit: activate gerrit::server role on cobalt [puppet] - 10https://gerrit.wikimedia.org/r/314734 (https://phabricator.wikimedia.org/T147597) (owner: 10Dzahn) [18:32:11] ostriches: wanna run puppet on cobalt to see how it goes? [18:32:22] submitted on master [18:32:48] (03PS2) 10Dzahn: gerrit: remove cobalt.yaml from hiera, uses role now [puppet] - 10https://gerrit.wikimedia.org/r/314735 [18:32:49] times on SAL are GMT (judging by last entry) correct? [18:32:54] (03CR) 10Rush: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/314028 (owner: 10Rush) [18:33:10] paladox: i think one more [18:33:14] Ok [18:33:15] thanks [18:33:17] https://gerrit.wikimedia.org/r/#/c/314735/ [18:33:27] (03PS6) 10Rush: labstore: align tools drbd with current prod [puppet] - 10https://gerrit.wikimedia.org/r/314028 [18:33:28] thanks [18:33:39] (03CR) 10Rush: [C: 032] labstore: drbd resource setup sanity [puppet] - 10https://gerrit.wikimedia.org/r/312023 (owner: 10Rush) [18:33:41] (03CR) 10Paladox: [C: 031] gerrit: remove cobalt.yaml from hiera, uses role now [puppet] - 10https://gerrit.wikimedia.org/r/314735 (owner: 10Dzahn) [18:33:45] (03PS6) 10Rush: labstore: drbd resource setup sanity [puppet] - 10https://gerrit.wikimedia.org/r/312023 [18:33:50] (03CR) 10Rush: [V: 032] labstore: drbd resource setup sanity [puppet] - 10https://gerrit.wikimedia.org/r/312023 (owner: 10Rush) [18:35:07] (03CR) 10Rush: [C: 032] labstore: align tools drbd with current prod [puppet] - 10https://gerrit.wikimedia.org/r/314028 (owner: 10Rush) [18:37:03] mutante just wondering did we switch phd off iridium? 
[18:37:36] paladox: not yet [18:37:42] (PS7) Rush: labstore: align tools drbd with current prod [puppet] - https://gerrit.wikimedia.org/r/314028 [18:38:17] Ok [18:39:18] (CR) Rush: [C: 2] labstore: align tools drbd with current prod [puppet] - https://gerrit.wikimedia.org/r/314028 (owner: Rush) [18:42:48] (Abandoned) Chad: Gerrit: Put lead into maintenance mode [puppet] - https://gerrit.wikimedia.org/r/314729 (owner: Chad) [18:43:50] ^ Maint mode is dumb the way it's written, we'll just let the 503s serve properly if we kill the process [18:44:54] (PS1) Dzahn: gerrit/cobalt: fix duplicate role usage [puppet] - https://gerrit.wikimedia.org/r/314738 [18:45:14] - $formatter = $factory->newSnakFormatterForLanguage( [18:45:14] + $formatter = $factory->newEscapedPlainTextSnakFormatter( [18:45:17] whoops [18:45:26] Is there a way I can view zero.wikipedia.org ? I'd like to see what exactly is being sent to text only users for my IEG Alt Text grant [18:45:37] (PS2) Dzahn: gerrit/cobalt: fix duplicate role usage [puppet] - https://gerrit.wikimedia.org/r/314738 [18:45:40] We should probably try and add a script that reboots grrrit-wm when gerrit.wikimedia.org goes down :) [18:45:40] (CR) jenkins-bot: [V: -1] gerrit/cobalt: fix duplicate role usage [puppet] - https://gerrit.wikimedia.org/r/314738 (owner: Dzahn) [18:45:45] not now [18:45:50] But i mean feature wise [18:46:27] paladox: that would be great but it would be a lot easier if the bot was running in prod [18:46:41] Yeh [18:46:54] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [18:46:57] maybe some day we can put the most important bots on a special machine [18:47:01] not sure [18:47:06] Yep [18:47:13] Or maybe have a yaml file in puppet [18:47:31] (CR) Dzahn: [C: 2] gerrit/cobalt: fix duplicate role usage [puppet] - https://gerrit.wikimedia.org/r/314738 (owner: Dzahn) [18:47:33] that we just enable an auto restart every 10 mins to give us enough time to revert [18:47:35] :) [18:47:40] I've said for months I'll pay someone $50 to make grrrit-wm behave better on restarts. [18:47:47] (PS3) Dzahn: gerrit: remove cobalt.yaml from hiera, uses role now [puppet] - https://gerrit.wikimedia.org/r/314735 [18:47:54] (CR) Dzahn: [C: 2] gerrit: remove cobalt.yaml from hiera, uses role now [puppet] - https://gerrit.wikimedia.org/r/314735 (owner: Dzahn) [18:48:53] LOL [18:48:56] (CR) Chad: [C: 1] gerrit: remove cobalt.yaml from hiera, uses role now [puppet] - https://gerrit.wikimedia.org/r/314735 (owner: Dzahn) [18:49:09] (PS3) Dzahn: gerrit/cobalt: fix duplicate role usage [puppet] - https://gerrit.wikimedia.org/r/314738 [18:49:30] I did improve grrrit-wm though, uptime is getting better [18:49:39] but will restart several times during the day [18:50:54] Gerrit had an uptime of 188 days until the hardware issue yesterday ;-) [18:51:12] :) [18:52:00] paladox: it's all about a way to tell the labs instance to restart it, but without the prod server having to login there [18:52:48] Yep, im just figuring that we could possibly do it with gerrit, ie a special config when enabled for maint it will auto restart the bot [18:53:02] then we revert the patch once the maint is over [18:53:17] but then again that won't work for if gerrit goes down without any notice [18:53:26] puppet ran on cobalt now [18:53:33] (CR) Andrew Bogott: [C: 2] Remove wikitech references from ldapconfig [puppet] - https://gerrit.wikimedia.org/r/309705 (owner: Alex Monk) [18:53:36] and
finished the first run after we applied the role [19:00:07] mutante gerrit's gone offline? [19:00:18] hello [19:00:22] hi [19:02:39] !log cobalt, disabled puppet, removed service IP from interface [19:02:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:02:49] paladox: works again? [19:03:13] iirc we will have to manually accept the new ssh host key [19:03:21] on zuul [19:03:32] yep [19:03:36] mutante nope [19:03:44] The ssh key and IP did not change [19:03:47] does for me [19:03:51] Also we haven't finished migrating [19:03:57] ostriches but the hostname is [19:04:00] Zuul should be ok [19:04:14] applying the puppet role on the new host made it add the service IP [19:04:22] right now i removed that again [19:04:22] yeah zuul all fine [19:04:27] so we can merge the maint mode change [19:06:19] nevermind, we dont do the maintenance mode change [19:06:26] see ostriche's comment above [19:06:35] So I think kill puppet + gerrit on lead [19:06:38] i'll rsync again once gerrit is stopped [19:06:39] rsync one last time [19:06:48] Then swap dns and master mode [19:07:38] !log stopping gerrit on lead [19:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:07:49] !log stopped puppet on lead [19:07:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:08:27] Now we rsync, swap dns, then bring back up [19:08:38] !log db1065 swapping failed disk slot 9 T147396 [19:08:39] T147396: db1065: Degraded RAID - https://phabricator.wikimedia.org/T147396 [19:08:40] it's stopped [19:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:09:15] !log rsyncing gerrit data one more time from lead to cobalt [19:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:09:33] done [19:09:54] * hashar waits for dns [19:10:42] PROBLEM - HTTPS on cobalt is CRITICAL: SSL CRITICAL - failed to connect or SSL 
handshake:IO::Socket::SSL: connect: Connection refused [19:11:56] hm [19:12:01] my phone must be in the other room [19:12:25] ACKNOWLEDGEMENT - HTTPS on cobalt is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn migration [19:12:44] not paged in any case [19:13:26] mutante were getting Gerrit is down. We're working on bringing it back as soon as possible. now :) [19:14:12] PROBLEM - gerrit process on lead is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [19:14:32] PROBLEM - SSH access on lead is CRITICAL: Connection refused [19:15:37] mutante: it's literally "gdnsd reload-zones" on the commandline as root [19:15:58] or I can do it since I'm logged in there anyways [19:16:26] what's the new IP? [19:16:41] bblack it's the same ip i think [19:16:44] ostriches ^^ [19:16:54] bblack: we want to move 208.80.154.85 from lead to cobalt [19:17:15] it's gerrit.wm.org and the puppet role puts it on the interface [19:17:15] oh that's not a DNS change then, right? [19:17:34] PROBLEM - puppet last run on ms-be1018 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:17:57] hieradata/hosts/lead.yaml:role::gerrit::server::ipv4: '208.80.154.85' [19:18:00] hieradata/hosts/lead.yaml:role::gerrit::server::ipv6: '2620:0:861:3:208:80:154:85' [19:18:03] but my puppet copy is out of date [19:18:10] yes, sorry, that is it [19:18:17] if we remove it from the interface on lead [19:18:24] and then re-enable puppet on cobalt [19:18:34] then that IP would be added to eth0 there [19:19:16] ostriches: ^ i'll do that? 
[19:19:17] I think that's all you need to do really [19:19:30] I guess hieradata already changed since my last sync [19:19:38] Operations, Ops-Access-Requests: Requesting access to terbium for smalyshev - https://phabricator.wikimedia.org/T147666#2700159 (AlexMonk-WMF) This would be either `restricted` or the much more powerful `deployment`. But probably not `ldap-admins` [19:19:42] we moved it to a common file [19:19:50] away from the hostname.yaml [19:19:53] ok [19:20:03] so yeah, we can manually remove the IP on lead [19:20:04] but that caused it to be on both of them [19:20:14] doing that now [19:20:14] PROBLEM - puppet last run on kafka1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:21:10] !log removed gerrit IP from lead's interface, v4 and v6 [19:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:21:18] !log re-enabling puppet on cobalt [19:21:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:22:01] puppet added the IP there now [19:22:30] no apache running there [19:22:33] mutante: Fine by me... [19:22:45] PROBLEM - puppet last run on cobalt is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 45 seconds ago with 1 failures. Failed resources (up to 3 shown): Exec[reindex_gerrit_jetty] [19:23:11] the puppet fail is about it not being able to start gerrit service [19:23:26] ostriches: reindex? [19:23:38] it's still in slave mode [19:23:42] that's why that part fails [19:24:02] why isn't puppet trying to start apache for the gerrit web UI? [19:24:03] PROBLEM - puppet last run on kafka2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:24:35] oh I see, all related [19:24:45] it won't start the web service until the reindex works, which refuses to work in slave mode [19:24:53] PROBLEM - puppet last run on bromine is CRITICAL: CRITICAL: Puppet has 4 failures. Last run 6 minutes ago with 4 failures. Failed resources (up to 3 shown): Exec[git_pull_wikimedia/TransparencyReport-private],Exec[git_pull_wikimedia/TransparencyReport],Exec[git_pull_wikimedia/annualreport],Exec[git_pull_wikimedia/endowment] [19:25:01] Yeah I just saw that [19:25:02] fatal: Cannot run reindex in slave mode [19:25:13] how do you take it out of slave mode? [19:25:13] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui] [19:25:37] gerrit::jetty::slave [19:25:44] without gerrit commits :) [19:25:45] But since we're down, I'll hack it [19:25:50] ;-) [19:26:05] PROBLEM - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 6 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_geowiki-scripts],Exec[git_pull_analytics.wikimedia.org] [19:26:19] should i still copy /var/lib/gerrit2/review_site/index ? [19:26:45] Running reindex now [19:26:45] mutante: that sounds like something that, if we're to do it all, should happen before puppet enabled and gerrit commands trying to start up... [19:26:51] but I bet reindex rebuilds it all [19:27:13] bblack yeh reindex should re build it all [19:27:15] disabled puppet on lead (before it re-adds the IP, narf) [19:27:15] bblack: It's supposed to run reindex, but I've never quite gotten puppet + gerrit init to work 100% nicely [19:27:31] ok @ reindex [19:27:42] org.h2.jdbc.JdbcSQLException: Database may be already in use: "Locked by another process". 
Possible solutions: close all other connection(s); use the server mode [90020-176] [19:27:44] wtf? [19:27:46] Operations, Ops-Access-Requests: Requesting access to terbium for smalyshev - https://phabricator.wikimedia.org/T147666#2700343 (RobH) a:Smalyshev @smalyshev: What exactly do you need to do with your access to terbium? (This will let us know which group to add you do.) As Alex already commented on,... [19:27:53] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 7 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_refinery_source],Exec[git_pull_analytics/discovery-stats],Exec[git_pull_aggregator_code],Exec[git_pull_analytics/reportupdater] [19:27:53] heh [19:27:59] I think this is an ordering problem.... [19:28:19] disable puppet, manually delete the new gerrit.wm.o IPs so external things stop talking to it [19:28:23] I really really hate gerrit's lucene index. [19:28:26] then fix slave mode -> reindex -> startup [19:28:31] then turn on puppet + IPs [19:29:13] ACKNOWLEDGEMENT - puppet last run on stat1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 9 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_geowiki-scripts],Exec[git_pull_analytics.wikimedia.org] daniel_zahn caused by gerrit maint. [19:29:13] ACKNOWLEDGEMENT - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 8 failures. Last run 7 minutes ago with 8 failures. Failed resources (up to 3 shown): Exec[git_pull_refinery_source],Exec[git_pull_analytics/discovery-stats],Exec[git_pull_aggregator_code],Exec[git_pull_analytics/reportupdater] daniel_zahn caused by gerrit maint. [19:29:13] ACKNOWLEDGEMENT - puppet last run on tungsten is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 8 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/software/xhprof],Exec[git_pull_operations/software/xhgui] daniel_zahn caused by gerrit maint.
[19:29:40] !log disabled puppet on lead and cobalt [19:29:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:29:50] robh, I think Stas already provided the info needed.. [19:30:00] !log removed gerrit IPs from cobalt interfaces [19:30:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:30:16] ^ first part done [19:30:34] ostriches: now again? [19:30:48] should be able to reindex now with IPs dead + gerrit dead? [19:30:50] Krenair: true enough, i guess deployment then.... [19:31:09] robh, hm? I was expecting you to go with restricted [19:31:26] deployment can log into all mw servers and deploy new code to them [19:32:03] im guessing the script for reindexing isnt just on www-data user locally [19:32:08] but im not certain [19:32:12] I thought it was [19:32:17] SMalyshev: you about? [19:32:31] reindex really don't like me [19:32:33] It's an MW maintenance script isn't it? [19:32:40] ostriches: what did it say this time? [19:32:58] Complained a bunch about not being able to read cache/* files [19:33:03] Which are all owned by gerrit2 [19:33:13] can I take a stab? [19:33:17] Krenair: restricted is for www-data locally on terbium/fluorine/etc [19:33:46] not on all the mw hosts, imu [19:33:47] yep [19:33:49] yes [19:33:49] i rsynced with -avp, fwiw [19:33:51] so how would restricted work? [19:33:56] he wants to run a maintenance script [19:34:01] but gerrit2 has different UID on lead and cobalt [19:34:05] that hits multiple hosts (not just fluorine's www-data) [19:34:12] I don't want to step on others running non-read-only stuff [19:34:14] Operations, Ops-Access-Requests: Requesting access to terbium for smalyshev - https://phabricator.wikimedia.org/T147666#2700366 (Smalyshev) @RobH: for now mainly run scripts, e.g. reindexing Elasticsearch after deploying various changes to mappings/indexing, like these: https://wikitech.wikimedia.org/wik...
[19:34:15] not deploy new code to all the apaches serving user traffic [19:34:17] bblack: I killed all the cache/* files, still no bueno [19:34:20] no? [19:34:36] can I go run the command and debug what's happening? [19:34:45] it sounds like the script they run touches more than fluorine, so im not sure how giving access to only fluorine's www-data would help [19:35:01] [2016-10-07 19:33:52,116] [Index-Batch-5] ERROR com.google.gerrit.server.change.MergeabilityCacheImpl : Error checking mergeability of 420a0afdae202aed37f4825847bf4b2828b7cf54 into 41557f9b798e8bc828b7ae4dc4b8a54ca0190dc6 (MERGE_IF_NECESSARY) [19:35:01] com.google.gerrit.server.git.IntegrationException: Cannot merge 420a0afdae202aed37f4825847bf4b2828b7cf54 [19:35:07] eg ^ [19:35:20] Something to do with MERGE_IF_NECESSARY [19:35:20] Operations, Ops-Access-Requests: Requesting access to terbium for smalyshev - https://phabricator.wikimedia.org/T147666#2700368 (AlexMonk-WMF) >>! In T147666#2700366, @Smalyshev wrote: > @RobH: for now mainly run scripts, e.g. reindexing Elasticsearch after deploying various changes to mappings/indexing,... [19:35:42] I doubt the problem runs that deep if things were cleanly shut off before the final sync of everything to the new machine [19:35:43] robh, I'm not sure fluorine has a www-data, but this is not just access to fluorine [19:35:57] i can always run the rsync one more time, just in case [19:35:59] the restricted group provides access to several servers [19:36:01] i dont think SMalyshev realized that both of the groups listed are sudo groups [19:36:03] one of which is the requested server [19:36:13] Operations, Ops-Access-Requests: Requesting access to terbium for smalyshev - https://phabricator.wikimedia.org/T147666#2700369 (Smalyshev) a:Smalyshev>RobH [19:36:28] what directories did we sync with what commands?
[19:36:46] bblack: /srv/gerrit/ to /srv/gerrit/ with -avp [19:36:47] Krenair: i didnt realize restricted allowed access to the server they want access to [19:36:52] Operations, Ops-Access-Requests: Requesting access to terbium for smalyshev - https://phabricator.wikimedia.org/T147666#2700159 (Smalyshev) @AlexMonk-WMF OK, I didn't know that :) Then I do need the sudo which allows to use mwscript. [19:36:58] Operations, ops-eqiad: Rack/Setup new memcache servers mc1019-36 - https://phabricator.wikimedia.org/T137345#2700371 (Cmjohnson) @faidon ping [19:37:01] ok [19:37:21] is there any reason not to wipe + resync /var/lib/gerrit2/ ? [19:37:57] ostriches https://groups.google.com/forum/#!topic/repo-discuss/QnsjqixL4oo [19:38:57] .... [19:39:03] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 8 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:39:10] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 10 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:39:40] PROBLEM - SSH access on cobalt is CRITICAL: Connection refused [19:39:52] PROBLEM - puppet last run on graphite1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_performance/docroot] [19:40:02] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures.
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:40:23] bblack: We can [19:40:32] (basically if things were all cleanly stopped on lead, they should cleanly start on cobalt if copied) [19:40:32] PROBLEM - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [19:40:45] if we get a clean copy of all related data brought over [19:41:03] mutante: do you ahve the rsync stuff up to do it? just grab all of /var/lib/gerrit2/ from lead->cobalt [19:41:39] https://gerrit-review.googlesource.com/#/c/80065/ could be related? [19:41:42] PROBLEM - puppet last run on neon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:41:43] ACKNOWLEDGEMENT - SSH access on cobalt is CRITICAL: Connection refused daniel_zahn gerrit migration [19:41:44] ACKNOWLEDGEMENT - gerrit process on cobalt is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daniel_zahn gerrit migration [19:41:44] ACKNOWLEDGEMENT - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 2 minutes ago with 3 failures. Failed resources (up to 3 shown): Exec[git_pull_analytics/reportupdater],Exec[git_pull_geowiki-scripts],Exec[git_pull_statistics_mediawiki] daniel_zahn gerrit migration [19:41:44] ACKNOWLEDGEMENT - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 13 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] daniel_zahn gerrit migration [19:42:11] we don't really need to reindex anyways, though [19:42:15] bblack: kind of, i did the setup via puppet, i'll look how to hack it [19:42:20] we just need to copy over all the relevant data and restart the process [19:42:25] the setup? 
[19:42:31] yea, rsyncd setup and ferm [19:42:33] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [19:42:40] oh, ok [19:42:57] I'm so over freenode today [19:43:01] RECOVERY - puppet last run on ms-be1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:43:03] aww [19:43:05] I can set it up manually too if we that's what we need [19:43:30] (assume I've lost all scrollback) [19:43:59] bblack: just added to /etc/rsyncd.conf [19:44:39] still no nice way to copy a bunch of crap from one server in prod to another. [19:44:50] something to go with orchestration maybe [19:45:01] there is, if you have gerrit to merge puppet changes [19:45:07] starting sync now [19:45:25] !log rsyncing /var/lib/gerrit2 from lead to cobalt [19:45:30] delete old one first? [19:45:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:45:39] ok [19:45:50] delete current contents on cobalt, I mean [19:46:03] "old one" is not clear in all of this context heh [19:46:10] :-) [19:46:12] yes [19:46:24] !log deleted old /var/lib/gerrit2/ data on cobalt, syncing from lead [19:46:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:46:38] done [19:46:43] ostriches: back to you? [19:46:49] sync is done? [19:46:52] yes [19:46:57] that would be demon_imposter I suppose [19:46:59] ok [19:47:02] so.... [19:47:09] it's all owned by 444 [19:47:10] IMHO, next step is just re-enable and run puppet [19:47:15] ls 444 right? [19:47:35] great, different uids for gerrit2 heh [19:47:36] it was on lead, but on cobalt gerrit2 is 114/119 [19:47:40] of course [19:48:24] one day ... [19:48:24] grrr seriously? 
[19:48:29] well demon_imposter I dunno what they wanted you to do [19:48:35] I got split->rejoin [19:48:37] may have missed some lines [19:48:37] here they come [19:48:41] you missed a lot of netsplit lines [19:48:43] man, everytime we have the UID issue [19:48:44] and now the freenode [19:48:44] I think that might be because we removed grrrit2 from wikitech? [19:48:45] mutante ^^ [19:48:45] bblack ^^ [19:48:45] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:48:55] i can fix it with find -exec [19:48:55] bblack I think that might be because we removed grrrit2 from wikitech? [19:48:57] I fidxed the perms [19:48:59] unless you already are [19:49:03] I would love it if we would set the UIDS as fixed someday [19:49:10] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:49:11] apergos: wikitech page UID [19:49:11] I think freenode sent a service wide message [19:49:14] root@cobalt:/var/lib/gerrit2# chown -R gerrit2:gerrit2 . [19:49:16] saying they are doing maint [19:49:21] so, looking at the puppetization [19:49:44] the puppetization handles: turning the IPs back on, not running reindex because the index file already exists, and starting the actual gerrit server (jetty and otherwise) [19:49:50] seems like we just run puppet at this point, no? [19:50:11] unless there's possibly more missing data we need to rsync back [19:50:22] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [19:50:25] fuck freenode [19:50:56] bblack: Srsly [19:51:18] anyways, I'm going to enable and run puppet. I think it will *not* try to reindex (we copied in the index), and it should re-addr the IPs and start the gerrit services, and maybe it Just Works [19:51:38] Copied index should be ok [19:51:57] bblack: yes, please [19:52:05] heh it turned slave mode back on [19:52:10] wtf... 
[19:52:26] It most likely did, since we re sync [19:52:41] yeah I mean why is slave_mode = true puppetized? [19:52:56] I think demon_imposter did that? [19:53:08] I was trying to make it easier to spin up secondary r/o instances. [19:53:19] It's proving more annoying than anything tho rn [19:53:28] well it is easier I guess.. than spinning up primary r/w ones :-D [19:53:51] RECOVERY - puppet last run on cobalt is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:54:10] so [19:54:11] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. [19:54:21] regardless of re-disabling puppet and turning slave back off [19:54:37] gerrit doesn't start [19:54:54] Dammit [19:54:56] where does it log how it fails to start up? [19:55:09] bblack logs/error.log [19:55:11] I look at /etc/init.d/gerrit and I see a whole lot of redirecting output to /dev/null [19:55:16] logs/error.log within where? [19:55:23] the review_site folder [19:55:28] Operations, Ops-Access-Requests: Requesting access to terbium for smalyshev - https://phabricator.wikimedia.org/T147666#2700385 (RobH) As @AlexMonk-WMF points out, you likely need to run your scripts as www-data, not as yourself. Hence being added to 'restricted' group to allow that access/sudo level....
[19:55:38] ie /var/lib/gerrit2/review_site/logs/error.log [19:55:40] bblack ^^ [19:56:11] RECOVERY - gerrit process on cobalt is OK: PROCS OK: 1 process with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war [19:56:12] nothing there [19:56:17] it's dying fast and early [19:56:37] Oh [19:56:53] bblack theres also /var/log/apache2/gerrit_error.lgo [19:56:55] as in, the service silently fails to start, quickly [19:56:56] lgo = log [19:57:13] bblack: bin/gerrit.sh run works, but `service gerrit start` does not [19:57:25] there is no /var/log/apache2/, nor any apache puppetized AFAIk [19:57:29] use the classic /etc/init.d/gerrit [19:57:41] demon_imposter: is that "normal"? [19:57:43] Oh if there is no /var/log/apache2 then apache is not installed [19:57:54] wt... [19:57:57] bblack: Not really, both should work [19:58:04] yeah puppet isn't even trying to start apache, or install it [19:58:13] but one problem at a time [19:58:39] GerritCodeReview up now, from "/etc/init.d/gerrit start" [19:58:44] don't ask me why the systemd unit fails :P [19:59:06] Heh, doesn't work when I did it but works on it own [19:59:23] may depend on sudo state too, if you're doing single-command sudo and I'm not [19:59:38] so GerritCodeReview is seemingly-online [19:59:52] we now have two problems: [20:00:00] 1) No apache puppetized into this host/role at all? [20:00:16] bblack: It is, we just need to turn off slave mode in hiera. [20:00:17] 2) being able to run puppet to fix that without having it try to make cobalt a slave again, probably without merging more new puppet commits [20:00:30] oh that's slave-mode too? 
[20:00:32] RECOVERY - SSH access on cobalt is OK: SSH OK - GerritCodeReview_2.12.3 (SSHD-CORE-0.14.0) (protocol 2.0) [20:01:10] for future reference / if-we-could-rewind-time: disable agent on both hosts and fix puppetization for cobalt to be normal (not slave) [20:01:20] before taking down gerrit for further merges :P [20:01:29] yes, the systemd unit tries to "auto translate" [20:01:36] since there is no unit file [20:01:38] bblack: Yeah, I disabled the whole proxy stuff in slave mode because the LE stuff will fail [20:01:40] and it doesnt work [20:01:46] but the oldschool init script does [20:02:06] demon_imposter: we can copy /etc/acme/ over with rsync to get past LE bootstrap in a situation like this, too [20:02:36] so, we need to change hieradata now, without going through gerrit [20:02:47] I didn't think of that. I was kinda thinking ahead for running a warm slave in codfw for failover. [20:03:02] Two birds, one stone, didn't quite work out :( [20:03:20] I'm guessing we need it changed on both eqiad masters and that's enough? [20:03:31] it will break puppet-merge till we do fixups on the repo/master states after [20:04:02] so let's see. if we live hack it on the two master repos [20:04:10] right [20:04:12] then push that out [20:04:13] I'm doing that now [20:04:21] then once it's running... [20:04:26] considering how mission critical gerrit is, I'd say we should put a bit more effort into making it fault tolerant ;) [20:04:26] nobody touch puppet-merge, etc, even after things seem to work, please [20:04:26] here's where I'm stuck [20:04:48] it's not that mission critical [20:04:56] we can still deploy emergency fixes without it if necessary [20:05:01] s/fault tolerant/easy to redeploy/, at least [20:05:45] Krenair: I'd consider it pretty mission critical. 
Case in point: apparently opsen have no easy way to make changes to puppet when gerrit is offline [20:05:47] disable puppet on gerrits after this push, undo the live hack, add puppet change in gerrit, merge, [20:05:59] -kloeri- [Server Notice] Hiya, the server you're currently connected to (orwell.freenode.net) will be rebooted in a few minutes. We recommend reconnecting to chat.freenode.net. [20:06:01] Oh great [20:06:06] puppet-merge, re-enable puppet on gerrit boxen? [20:06:20] one step at a time [20:06:24] might be missing something there [20:06:39] yeah just thinking aloud [20:06:51] Operations, Labs, Patch-For-Review: Set up monitoring for secondary labstore HA cluster - https://phabricator.wikimedia.org/T144633#2700396 (chasemp) OK things we don't monitor yet: * DRBD service state (and add it to the role to start post all resources) * A check which validates that nfs-kernel-se... [20:06:51] i have to step back for about half an hour, family obligations [20:06:55] I have the manual hieradata hack to turn off slave mode for cobalt on the 2x eqiad puppetmasters [20:07:05] and cobalt is running puppet now and starting up master-ish things [20:07:39] please don't merge changes through gerrit yet, anyone (not that everyone that could/would even looks here) [20:08:56] !log disabled phd + puppet on iridium and scheduled downtime in icinga to silence alerts [20:10:06] error: unpack failed: error Permission denied [20:10:06] fatal: Unpack error, check server log [20:10:20] oh blerg [20:10:20] (on trying to push up an ops/puppet change to undo the hacks) [20:11:20] ok I got this [20:11:25] 444 on /srv/gerrit/...
stuff too [20:11:29] ah meh [20:11:37] Yeah just gotta reown [20:12:04] (I was gonna hardcode the UID/GID in the package but was told not to ;-)) [20:12:31] tar translates uids in these cases given the right options [20:12:39] I bet rsync can too, but I don't know offhand if/how [20:12:48] yeah hardcoding single UIDs is not the solution, we need to fix it up generally [20:13:01] (PS1) BBlack: cobalt: gerrit master [puppet] - https://gerrit.wikimedia.org/r/314740 [20:13:09] (CR) BBlack: [C: 2 V: 2] cobalt: gerrit master [puppet] - https://gerrit.wikimedia.org/r/314740 (owner: BBlack) [20:14:41] RECOVERY - puppet last run on kafka1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:15:33] ok ops/puppet state is sane on masters, agent is enabled and runs fine on cobalt, etc [20:15:36] i believe loading gerrit.wikimedia.org is faster than when we first used lead for gerrit.wikimedia.org [20:15:39] I think we're basically "up" [20:15:43] omg [20:16:09] dare someone test it? [20:16:16] awesome! [20:16:18] well I just did, I pushed a change through :) [20:16:24] (the un-hacking change) [20:16:26] well I mean [20:16:29] bblack if i hear explosion do we want us to assume thats good :P [20:16:29] someone else :-D [20:16:31] We need to fix zuul's ferm rules now [20:16:53] that can be done with normal gerrit workflow though [20:17:09] just maybe need to ignore lack of CI and "manually" C+2/V+2 [20:17:20] RECOVERY - puppet last run on kafka2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:17:25] bblack thats recipe for disaster just saying [20:17:29] Er, is it ok? [20:17:35] Zppix: what is?
[20:17:41] turning off CI [20:18:01] Zppix: we're talking about merging changes to re-enable currently broken CI infrastructure [20:18:03] bblack need to restart zuul [20:18:08] and things should start working [20:18:10] hashar ^^ [20:18:10] (03PS10) 10Rush: labstore: Add monitoring for secondary HA cluster health [puppet] - 10https://gerrit.wikimedia.org/r/311723 (https://phabricator.wikimedia.org/T144633) (owner: 10Madhuvishy) [20:18:15] ah, my bad i kinda came in here in the middle of everything bblack [20:18:15] could you restart zuul please? [20:18:25] Zuul should be ok [20:18:31] srange => '@resolve((lead.wikimedia.org cobalt.wikimedia.org gerrit.wikimedia.org))', [20:18:34] zuul is just being zuul :P [20:18:37] ah ok [20:18:39] It was already adjusted [20:18:41] But it needs restarting when ssh is disconnected [20:18:44] We can drop lead later [20:18:58] I misread "we can drop dead later". a bit shocking [20:18:59] Oh it's working https://integration.wikimedia.org/zuul/ [20:18:59] now [20:19:03] oh good [20:19:24] as soon as we're pretty confident things are stable for all purposes, we should probably switch lead to a vanilla role and puppet it, to keep it from trying to interfere with anything [20:19:42] RECOVERY - puppet last run on bromine is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:48] (03PS1) 10Reedy: Remove wikimania2013wiki specific translate config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314742 [20:19:51] RECOVERY - puppet last run on stat1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:19:56] even now, I can't be sure it doesn't ahve cronjobs that touch the network or something [20:20:28] no kidding [20:20:30] bblack: gerrit::crons only do local stuff, it won't harm anything [20:20:34] ok [20:20:38] and manually turn off all cron over there, removing the role won't probably do that [20:20:46] good to know [20:20:52] should we stop ci ? 
[20:20:54] or is gerrit fine [20:20:55] the clean thing to do would be reinstall it with a fresh image [20:21:06] but I don't want to lose our last-known-good copy of live gerrit data too soon, either :) [20:21:14] cronjobs are server-side only unless specified otherwise i believe [20:21:30] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:31] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:21:42] I vote for keeping it around untouched until we are sure [20:21:52] hashar: gerrit's supposed to be fine [20:21:56] and maybe trigger a backup of lead? [20:21:58] we want CI to start working again [20:22:01] worst thing thing that coudl happen is that we would need to resubmit changes [20:22:39] well CI looks all happy [20:22:42] ok [20:22:59] 06Operations, 10Ops-Access-Requests: Requesting access to terbium for smalyshev - https://phabricator.wikimedia.org/T147666#2700412 (10Smalyshev) restricted looks good then. IIRC elasticsearch will take care of the replication etc. Pinging @EBernhardson just in case I miss something, but looks good to me. [20:23:23] so aside from lead preservation/cleanup, is anything still borked or turned off? [20:23:26] phab stuff? [20:23:45] !log restarted phd on iridium [20:24:01] bblack: I just took care of phab [20:24:01] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:25:40] cobalt seems to have half as many cores, but newer/faster [20:25:45] I wonder if that's HT [20:25:58] It's jvm, we only need ram [20:26:07] who cares about cores long as they're doing >200mhz? [20:26:10] what's cobalt got compared to led? 
[20:26:13] *lead [20:26:22] bblack it's 3ghz whereas lead is 2.50ghz [20:26:24] i think [20:26:30] ram I mean [20:26:31] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:26:37] (03PS1) 10Reedy: wfLoadExtension for 8 more extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314746 (https://phabricator.wikimedia.org/T140852) [20:26:41] yeah it's HT [20:26:47] cobalt doesn't have HT turned on, lead does [20:26:47] sweet [20:26:52] apergos: I think they are both 32G [20:26:56] that's why cobalt seems to have half the CPUs [20:27:00] hm so not that [20:27:14] 06Operations, 10Ops-Access-Requests: Requesting access to terbium for smalyshev - https://phabricator.wikimedia.org/T147666#2700452 (10EBernhardson) being able to run scripts from terbium is the only important part, sounds like the restricted group should cover our bases. [20:27:14] nice to have proof that ht really makes a difference [20:27:34] we should probably reboot it and turn HT on, but it can wait [20:27:46] bblack bc cobalt is better dude its more durable and stuff :P [20:28:25] its prob a priorty issue within the server infrastructure, however i have no clue how you guys set you your servers so i may be completely wrong [20:29:29] I stopped apache2 on lead, and switched the puppet disable to have the message DO-NOT-RE-ENABLE [20:29:50] did you slap your name on it? [20:30:03] well an email will take care of that if needed, or a ! log [20:30:23] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:30:31] !log lead.wikimedia.org: replaced by cobalt functionally, please leave it untouched for now with puppet disabled! [20:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:31:00] you folks really pulled victory from the jaws of disaster on this one [20:31:08] Ok couple of take-aways.... 
[20:31:16] A) slave mode sucks cuz it doesn't do indexing [20:31:32] PROBLEM - puppet last run on labservices1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[/usr/local/bin/labs-ip-alias-dump.py] [20:31:40] B) We need to have a master we can talk to at all times, working around lack of gerrit sucks [20:31:58] well (B) is probably-impossible in cases like these with any sanity, right? [20:32:22] bblack: We can fail better, I think [20:32:25] right now there is one master. we could have one master in codfw [20:32:30] as well, as a fall back [20:32:34] phabricator supports master-master now... :) [20:32:38] Anyway, finally, (C) Other things I'm sure but it's friday [20:32:40] it wouldn't be horrible [20:32:46] ah yeah. that [20:32:48] it wouldn't have the same data, though :) [20:33:00] Oh, and yeah +1990128912891298 on spare in codfw [20:33:00] RECOVERY - puppet last run on graphite1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:33:06] Talked to mutante about that earlier [20:33:11] We should have a warm-ish slave [20:33:14] my main takeaway is puppetization and procedures need to be rock-solid, especially for such a critical service [20:34:44] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:35:51] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [20:36:00] PROBLEM - HTTPS on lead is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [20:39:54] Holy **** [20:40:00] 06Operations, 06Labs, 13Patch-For-Review, 07Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2700497 (10chasemp) [20:40:28] (resend cause net split) so C) it's friday, really we had to do this on a friday? 
[20:41:05] apergos: we wanted to give engineers a 4 day weekend in the US (monday is Indigenous People's Day) ;) [20:41:07] whats is up with all the netsplits [20:41:08] for the 'editorial we', since I didn't actually do jack. but anyways [20:41:24] greg-g: I would have deferred to next working day [20:41:50] Zppix: it's Freenode doing some server restarts, you should have messages from the Freenode admins where ever they show up in your client [20:42:07] (03PS2) 10Reedy: wfLoadExtension for many more extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314746 (https://phabricator.wikimedia.org/T140852) [20:42:21] apergos: mostly it's "hardware issues are scary, this should be done asap or fear a forced migration at a REALLY bad time" [20:42:30] im in canada server land rn i think they got all of canada already [20:42:36] I hear you [20:42:51] Wikibugs is called <7YUAAAAUM> [20:42:52] LOL [20:43:00] paladox? [20:43:09] LMFAO [20:43:15] GG wikibugs gg [20:43:22] In -labs wikibugs is being called <7YUAAAAUM> [20:43:29] here too [20:43:31] <7YUAAAAUM> Labs: Change the way manage-nfs-volumes is monitored - https://phabricator.wikimedia.org/T91806#2700496 (chasemp) Open>Invalid [20:44:10] its doing in til just now in -debv [20:44:11] huh. morebots is still here, no idea how functional [20:44:12] its doing in til just now in -dev* [20:44:17] (03PS1) 10Hashar: (WIP) contint: Sonatype Nexus (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/314751 (https://phabricator.wikimedia.org/T147635) [20:44:47] i wonder how well Tools labs is handling all the netsplits for irc bots [20:45:03] I'm once again out for a bit, I think things are stable [20:45:09] go go [20:45:10] feel free to call me if not! [20:45:16] I'll stick around until mutante comes back [20:45:16] thank you bblack :) [20:45:20] and others! 
[20:45:26] if stuff explodes i will get us donuts :P [20:45:26] (03CR) 10Reedy: [C: 04-1] "+1 for the idea, -1 because it depends on other patches" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314748 (https://phabricator.wikimedia.org/T147234) (owner: 10Dereckson) [20:45:48] I'm so over freenode today [20:45:54] <7YUAAAAUM> 06Operations, 10Ops-Access-Requests: Requesting access to terbium for smalyshev - https://phabricator.wikimedia.org/T147666#2700525 (10RobH) 05Open>03stalled Sounds good. As noted previously, this uses sudo for www-data, so it'll have to be approved in the Operations meeting next week. The meeting typica... [20:45:58] can we just host our own IRC servers xD [20:46:06] wikibugs go home your drunk [20:46:13] I have one! [20:46:25] LOL [20:46:37] Zppix: I'm mildly drunk :P [20:46:42] Zppix we used too but we shut it down [20:46:43] lol [20:46:51] demon_imposter: heh, security updates... [20:46:56] paladox to much security risks? [20:47:17] Not sure [20:47:29] mutante may know when he comes back [20:47:31] (03PS2) 10Hashar: (WIP) contint: Sonatype Nexus (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/314751 (https://phabricator.wikimedia.org/T147635) [20:47:32] or even robh [20:47:43] paladox: heh, but it is a freenode server, not a network [20:47:52] I have ine server, which is a network too :P [20:47:58] so we had an irc server on the network [20:47:59] Oh yep [20:48:05] Zppix ^^ [20:48:06] but it was taken offline during the freenode security issue [20:48:18] and now its just offline since no one in ops works on it and it took too much upkeep. 
[20:48:31] really it was core_n's baby [20:48:45] Karatkievich.freenode.net seems to be stable ive been on it for 1-2 hrs now without netsplit [20:48:45] back, what's up [20:48:51] gerrit [20:48:53] ci [20:48:54] phab [20:48:55] all up [20:48:57] Yes [20:48:59] ie: it has to be a base image install, not our typical netboot image [20:49:01] thank bbla ck mostly [20:49:09] so no more freenode server by wmf. [20:49:15] apergos: :)) [20:49:19] he left for awhile [20:49:29] I said I'd stick around til you got back, which was almost instantly [20:49:32] awesome [20:49:38] thank you [20:49:45] robh freenode should give us a private server for us to use for volunteers to connect to freenode so we dont have to spam a channel with netsplit msgs [20:49:49] (03PS1) 10Reedy: Remove spurious transcoding-labs.org usage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/314753 [20:49:49] xD [20:50:18] so when we had one (dickson) there were a few opsen who thought it was a bad idea for all of wmf to connect to it specifically [20:50:46] I did never connected to it by name [20:50:51] heres an idea dont use the same info for everthing :P [20:50:55] if irc.freenode.net gave me that host, then fine [20:50:59] how do u connect to a certain server anyways? [20:50:59] so not sure why our hosting one is any benefit to us [20:51:14] by listing it rather than the pool url. [20:51:16] it's a benefit in the sense that it makes freenode a little stronger [20:51:29] so yeah, all of that is why we dont bother to run a freenode server anymore. 
[20:51:39] theres like 20 wmf volunteers to 4 non volunteers in total [20:52:09] Zppix: i am not sure what you mean [20:52:13] correction 1 not 4 :P [20:52:35] we use alot of bandwith just wmf volunteers in all [20:52:37] xD [20:52:56] I have unlimited bandwith anyways [20:53:05] "unlimited" [20:53:07] freenode does not [20:53:07] Really unlimited no caps and no throttle [20:53:14] BT [20:53:21] oh [20:53:26] Throttle applied by exchange/cab contention [20:53:41] Oh, that's virgin, but BT dosent with me [20:54:00] BT are known to do it [20:54:01] I probaly use like 1-5tb a week or even more, since the router is shared [20:54:06] When kit is oversubscribed [20:54:15] Reedy ? [20:54:19] Oh now i get it [20:54:26] yeh, but never happends in our area [20:54:50] (03PS1) 10Rush: bdsync examples [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314764 [20:55:12] RECOVERY - puppet last run on labservices1001 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:55:35] so it's midnight, I'ma gonna make dinner [20:55:58] Oh wow late dinner? [20:56:06] can someone cr +2 https://gerrit.wikimedia.org/r/#/c/314713/5 please? [20:56:07] stuff happens [20:56:43] (03CR) 10Rush: [C: 032 V: 032] bdsync examples [debs/bdsync] - 10https://gerrit.wikimedia.org/r/314764 (owner: 10Rush) [21:00:31] robh hi could you update the topic to say staus up please? [21:00:35] Instead of gerrit maint [21:02:28] (03PS3) 10Hashar: (WIP) contint: Sonatype Nexus (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/314751 (https://phabricator.wikimedia.org/T147635) [21:03:42] mutante: or ostriches: wanna send an email about the migration finishing? 
[21:03:49] ostriches idea, what about create a website on grrrit-wm with a restart button?, we will have a login page for only autherwised users to restart the bot [21:03:53] to prevent abuse [21:04:25] (03PS4) 10Hashar: (WIP) contint: Sonatype Nexus (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/314751 (https://phabricator.wikimedia.org/T147635) [21:07:12] tldr: nexus is a beat [21:07:48] minimal email sent [21:07:51] What is nexus? [21:08:22] thanks mutante [21:09:40] <7YUAAAAUM> 06Operations, 06Labs: revise/fix labstore replicate backup jobs - https://phabricator.wikimedia.org/T127567#2700580 (10chasemp) a:03madhuvishy A few notes on where this is at for madhu to take over. We have been testing backup schemes and have settled for now on something like what is described in https://p... [21:11:22] 06Operations, 10Gerrit, 13Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2700582 (10Dzahn) [21:12:00] <7YUAAAAUM> 06Operations, 10Gerrit, 13Patch-For-Review: setup/deploy cobalt as gerrit warm standby/replacement - https://phabricator.wikimedia.org/T147597#2697753 (10Dzahn) 20:30 bblack: lead.wikimedia.org: replaced by cobalt functionally, please leave it untouched for now with puppet disabled! 19:46 mutante: deleted o... [21:12:44] Why is there two wikibus [21:12:48] wikibugs [21:12:50] (03PS5) 10Hashar: (WIP) contint: Sonatype Nexus (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/314751 (https://phabricator.wikimedia.org/T147635) [21:12:55] wikibugs> and <7YUAAAAUM> [21:13:32] tools prob glitched out with the netsplits and stuf [21:13:49] Oh [21:13:55] idk tho [21:15:46] paladox: kill one? 
[21:15:46] (03PS1) 10Dzahn: gerrit: remove backup::host include from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/314767 (https://phabricator.wikimedia.org/T147597) [21:15:55] mutante i carn't [21:16:28] paladox and myself arent actually in wikimedia-ops team i dont think ik i am not xD [21:16:35] im here becauses always comical [21:16:35] (03PS6) 10Hashar: (WIP) contint: Sonatype Nexus (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/314751 (https://phabricator.wikimedia.org/T147635) [21:17:02] (03PS2) 10Dzahn: gerrit: remove backup::host, rsyncd include from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/314767 (https://phabricator.wikimedia.org/T147597) [21:17:05] I am not on ops either [21:17:41] how is that related to wikibugs though? [21:18:10] mutante not realted, anyways, i doint have access to wikibugs [21:18:19] but twentyafterfour could you restart it please? [21:20:55] Zippix if you want to be the bot's maintainer you can ask -labs :) [21:25:30] paladox: I don't know anything about wikibugs, it doesn't seem to respond to the commands I run [21:27:05] PROBLEM - puppet last run on wtp1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:27:44] (03CR) 10Paladox: [C: 031] gerrit: remove backup::host, rsyncd include from cobalt [puppet] - 10https://gerrit.wikimedia.org/r/314767 (https://phabricator.wikimedia.org/T147597) (owner: 10Dzahn) [21:30:15] Dereckson, so, Flow on wikitech? 
[21:35:28] (PS1) Dzahn: gerrit: mv standard incl to role, rm duplicate firewall [puppet] - https://gerrit.wikimedia.org/r/314768 [21:36:27] Oh, thanks [21:37:06] (CR) Dzahn: "i know we could combine the nodes with a regex, but did not do that on purpose" [puppet] - https://gerrit.wikimedia.org/r/314768 (owner: Dzahn) [21:39:37] (PS1) Krinkle: robots.php: Use WikiPage instead of Article class [mediawiki-config] - https://gerrit.wikimedia.org/r/314769 [21:40:09] (PS2) Krinkle: robots.php: Use WikiPage instead of Article class [mediawiki-config] - https://gerrit.wikimedia.org/r/314769 [21:41:05] (CR) Dzahn: [C: 1] Stop using package=>latest for standard packages [puppet] - https://gerrit.wikimedia.org/r/314270 (https://phabricator.wikimedia.org/T115348) (owner: Muehlenhoff) [21:41:28] (Abandoned) Dzahn: base: don't use 'latest' for standard package installs [puppet] - https://gerrit.wikimedia.org/r/310897 (https://phabricator.wikimedia.org/T115348) (owner: Dzahn) [21:44:42] (PS1) Eevans: Update firewall to allow terbium access to elasticsearch [puppet] - https://gerrit.wikimedia.org/r/314772 (https://phabricator.wikimedia.org/T147366) [21:44:51] (PS1) Rush: labsdb: puppetize maintenance scripts [puppet] - https://gerrit.wikimedia.org/r/314773 [21:45:51] (CR) 20after4: "or just make this work on trusty and jessie, worry about precise later..." [puppet] - https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) (owner: Dereckson) [21:46:38] (CR) Dzahn: [C: -1] "you most likely also want "wasat", the equivalent of terbiium in codfw, see how above there is srange => '$DEPLOYMENT_HOSTS', i think ther" [puppet] - https://gerrit.wikimedia.org/r/314772 (https://phabricator.wikimedia.org/T147366) (owner: Eevans) [21:46:40] (CR) 20after4: "in other words, just merge this and don't worry about precise right now."
[puppet] - https://gerrit.wikimedia.org/r/297975 (https://phabricator.wikimedia.org/T139738) (owner: Dereckson) [21:46:57] (CR) jenkins-bot: [V: -1] labsdb: puppetize maintenance scripts [puppet] - https://gerrit.wikimedia.org/r/314773 (owner: Rush) [21:47:22] (CR) Eevans: "PC output for logstash1001: http://puppet-compiler.wmflabs.org/4245/" [puppet] - https://gerrit.wikimedia.org/r/314772 (https://phabricator.wikimedia.org/T147366) (owner: Eevans) [21:47:45] mutante: wow, you are fast. [21:49:07] urandom: hmm. maybe there is no $MAINTENANCE_HOSTS yet, but i was trying to do something like here https://gerrit.wikimedia.org/r/#/c/302774/ [21:49:27] urandom: so maybe not, but still both terbium and wasat in some way [21:49:40] kk [21:49:45] or maybe we can add it where DEPLOYMENT_HOSTS is defined [21:50:11] yea, i think it's that change above [21:51:01] RECOVERY - puppet last run on wtp1013 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [21:51:12] no, it's modules/network/manifests/constants.pp [21:51:20] it's confusing because it's always in flux [21:51:32] i think we wanted to move more to hiera [21:51:59] is that how $DEPLOYMENT_HOSTS is set? [21:52:08] via modules/network/manifests/constants.pp [21:52:09] ? [21:52:10] yea, as of now it's there [21:52:18] yeesh [21:52:50] so we could add terbium and wasat there as maintenance hosts, and then we'll wonder about IPv6 and that brings us to [21:52:57] https://gerrit.wikimedia.org/r/#/c/302649/ :p [21:54:08] mapped?
[21:54:25] (CR) Krinkle: [C: 2] robots.php: Use WikiPage instead of Article class [mediawiki-config] - https://gerrit.wikimedia.org/r/314769 (owner: Krinkle) [21:54:31] for example when the v4 IP is: [21:54:40] 10.64.0.196 [21:54:41] oooh [21:54:43] and the v6 is: [21:54:45] i see [21:54:49] 2620:0:861:101:10:64:0:196 [21:54:51] (Merged) jenkins-bot: robots.php: Use WikiPage instead of Article class [mediawiki-config] - https://gerrit.wikimedia.org/r/314769 (owner: Krinkle) [21:54:58] nice [21:56:03] (PS2) Eevans: add mapped v6 IPs for terbium and wasat [puppet] - https://gerrit.wikimedia.org/r/302649 (owner: Dzahn) [21:56:23] (CR) Eevans: [C: 1] add mapped v6 IPs for terbium and wasat [puppet] - https://gerrit.wikimedia.org/r/302649 (owner: Dzahn) [21:56:31] fwiw [21:56:36] thanks [21:57:16] (PS1) Dzahn: network/constants: add maintenance hosts (v4) [puppet] - https://gerrit.wikimedia.org/r/314778 [21:57:43] (CR) Eevans: [C: 1] network/constants: add maintenance hosts (v4) [puppet] - https://gerrit.wikimedia.org/r/314778 (owner: Dzahn) [21:57:54] (CR) Dzahn: "the v6 IPs would be added after 302649 gets merged" [puppet] - https://gerrit.wikimedia.org/r/314778 (owner: Dzahn) [22:00:06] (CR) Dzahn: "ideally first https://gerrit.wikimedia.org/r/#/c/302649/ , then amend to https://gerrit.wikimedia.org/r/#/c/314778/ to add v6, then use th" [puppet] - https://gerrit.wikimedia.org/r/314772 (https://phabricator.wikimedia.org/T147366) (owner: Eevans) [22:05:29] urandom: i'll try to get those merged next week, hope you are not immediately blocked [22:05:45] wanted to get some reviews whether it should be there or in hirea [22:05:45] no, that works [22:05:48] cool [22:05:57] thanks!
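The "mapped" v6 convention explained at 21:54 (v4 10.64.0.196 becoming v6 2620:0:861:101:10:64:0:196) can be sketched roughly as below. This is only an illustration, assuming the rule is to reuse the four decimal octets verbatim as the last four groups under the subnet's prefix; the function name mapped_v6 and the idea of passing the prefix as a string are hypothetical and not taken from the puppet repo.

```python
import ipaddress


def mapped_v6(prefix: str, v4: str) -> str:
    """Build a 'mapped' IPv6 address by reusing the IPv4 octets as the last four groups.

    prefix: first four groups of the subnet, e.g. "2620:0:861:101" (assumed input format)
    v4:     dotted-quad IPv4 address, e.g. "10.64.0.196"
    """
    # Validate the IPv4 address; raises ValueError on malformed input.
    octets = str(ipaddress.IPv4Address(v4)).split(".")
    # The decimal digits are copied as-is, so octet 196 becomes the group "196".
    candidate = prefix + ":" + ":".join(octets)
    # Validate that the result parses as an IPv6 address.
    ipaddress.IPv6Address(candidate)
    return candidate


print(mapped_v6("2620:0:861:101", "10.64.0.196"))
```

The appeal of this scheme is that the v6 address of a host can be read off from its v4 address by eye, which is presumably why it came up when adding terbium and wasat as maintenance hosts.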
[22:06:01] yw [22:11:58] (PS3) Dzahn: Gerrit: Update error.html message to include channel #wikimedia-operations [puppet] - https://gerrit.wikimedia.org/r/314608 (owner: Paladox) [22:12:06] (CR) Dzahn: [C: 2] Gerrit: Update error.html message to include channel #wikimedia-operations [puppet] - https://gerrit.wikimedia.org/r/314608 (owner: Paladox) [22:17:05] (PS2) Dzahn: network/constants: add maintenance hosts (v4) [puppet] - https://gerrit.wikimedia.org/r/314778 [22:18:18] (PS2) Dzahn: add deployment, maintenance servers to hieradata common [puppet] - https://gerrit.wikimedia.org/r/302774 [22:18:46] (CR) jenkins-bot: [V: -1] add deployment, maintenance servers to hieradata common [puppet] - https://gerrit.wikimedia.org/r/302774 (owner: Dzahn) [22:20:55] (CR) Dzahn: "yea, needs manual rebase, i'll fix it but first wanted to know if hiera is the right place or i should put it in network/constants.pp (my " [puppet] - https://gerrit.wikimedia.org/r/302774 (owner: Dzahn) [22:21:48] ACKNOWLEDGEMENT - HTTPS on lead is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused daniel_zahn gerrit migrated [22:21:49] ACKNOWLEDGEMENT - SSH access on lead is CRITICAL: Connection refused daniel_zahn gerrit migrated [22:21:49] ACKNOWLEDGEMENT - gerrit process on lead is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^GerritCodeReview .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daniel_zahn gerrit migrated [22:22:07] rip [22:22:28] !log etcd servers have puppet issue with Etcd_user[root] [22:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:23:02] mutante how long is it suppose to normally take to clone core.git [22:23:03] ?
[22:23:27] depends on your connection, it's a big repo [22:23:57] it taking longer then i remember (i fudged up and deleted my remote orgin so i have to reclone now :D <3 my self sometimes lol [22:24:03] but its prob just my end [22:26:08] Zppix: actually the new server is faster than than the old one [22:26:21] it will be even faster after a reboot soonish [22:26:32] i dont know how mw/core would normally take though [22:26:38] the web ui feels faster to me [22:26:52] its the .git folder that killed its speeds it rocketting into the nearby universe rm [22:26:53] rn* [22:27:25] well, what i can say is that rsyncing the data was also like "FAST... mw/core forever... FAST again" [22:27:38] because there are soooo many small files [22:28:47] du -h of my (not totally up to date) checkout: 1.1g [22:28:50] of mw core [22:29:46] 06Operations, 06Discovery, 06Discovery-Analysis: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2700737 (10mpopov) [22:31:16] 06Operations, 10DBA, 10MediaWiki-General-or-Unknown, 13Patch-For-Review: img_metadata queries for PDF files saturates s4 slaves - https://phabricator.wikimedia.org/T147296#2687819 (10brion) I strongly recommend investing in T32906 -- storing the text blobs and such for DjVu and PDF in a structured way inst... [22:35:24] 06Operations, 06Discovery, 06Discovery-Analysis: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2700737 (10Dzahn) r-base and r-cran-mysql are installed by puppet via class statistics::compute on stat1003 86 ensure_packages([ .. 89... 
[22:37:07] 06Operations, 06Discovery, 06Discovery-Analysis: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2700737 (10Dzahn) adding analytics ops [22:39:07] question im just curious, if jenkins were just stop functioning would we have to manually do CI or would a slave take over? [22:39:47] Zppix: we'd have to bring up a new jenkins master node. [22:40:24] Figured but eh you never know some people like hard way better [22:40:28] which, in theory, should be just some puppet changes and restoring a backup [22:40:54] bd808 i was going to say you prob could just get latest jenkins repo couldn't ya? [22:41:36] bd808: we did that dance in May :) [22:42:05] we'll be migrating to the new server the week after our offiste (on tuesday oct 25th). [22:42:06] yeah, and we could recreate the jobs from the jjb build files if needed. There is some stuff that I think is only managed in the xml files that the web ui maintains [22:42:29] 06Operations, 10Ops-Access-Requests: Requesting access to terbium for smalyshev - https://phabricator.wikimedia.org/T147666#2700159 (10Dzahn) @Smalyshev Btw, it's not just terbium it's also wasat, the equivalent of terbium in codfw. So if we switch over at some point it will be that to do the same thing. You... [22:42:48] greg-g: I hope your offsite is in Chicago so you can watch the Cubs clench the pennant :) [22:43:10] and, with what we learned from gerrit this week, I'm going to put "setup contint2001 as warm spare for contint1001" as a goal for Q2 ;) [22:43:19] bd808: nope! D.C. [22:43:23] atleast cardinals are full grown unlike the cubs (ok thats a bad joke p.s. i dont watch sports so pls dont kill meh) [22:44:41] greg-g: seems like a reasonable idea. A bit of rsync would make like easier when the hardware eventually decides to be uncooperative. 
[22:44:45] *life [22:44:48] yup [22:45:11] I don't like "oh shit" moments, too much [22:45:17] 06Operations, 06Discovery, 06Discovery-Analysis: Can't install R package Boom (& bsts) on stat1002 (but can on stat1003) - https://phabricator.wikimedia.org/T147682#2700807 (10mpopov) [22:45:23] "Then why'd you take the Release Manager gig?" you might ask... [22:46:00] (to hopefully get us to a place where we have fewer) [22:46:05] because you didn't have much of a choice? ;) [22:46:06] deletes cron spam from 2013 [22:46:23] and oh wonder, i have only 40% disk usage, not 95% anymore [22:48:07] imagine being forced to share ALL of the bots on tools lab storage with gerrit repos (ik some bots use wikimedia's gerrit but most of em use github) that would suck [22:48:33] manually hacking the DNS zone files to switch Gerrit would have been annoying too, so i'm glad we ended up not needing that [22:49:11] and just moved the service IP instead, that puppet adds to interface [22:49:38] 06Operations, 10media-storage: Two recently uploaded files have disappeared (404) - https://phabricator.wikimedia.org/T147040#2700817 (10greg) p:05Unbreak!>03Normal Yup. [22:49:51] so ci is using 2 servers now or just 1 i didnt catch what we upgraded servers for [22:51:14] today it was only Gerrit, not CI [22:51:15] I think there a 2 production network boxes running Jenkins and Zuul and then a number of static Labs VMs and a dynamic Labs VM pool as well [22:52:13] gallium/contint1001/labnodepool1001? 
[22:52:15] the only real slowdown with jenkins is those damn jessie jobs [22:52:28] gallium->contint1001 is Oct 25 or so [22:53:04] (thanks, btw, mutante, for offering to help with that) [22:53:55] yw [22:54:31] that mutante guy is pretty nice :) [22:55:07] aaw [22:57:53] if CI is down we'll still be able to merge with Gerrit, overriding jenkins [22:59:15] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [22:59:47] something something centralization [23:00:54] something something reading about hdfs at 2 am [23:02:16] apergos: ewwww, dumps? [23:02:28] workflow manager evals [23:02:38] this haddop-yarn-hdfs-oozie one is killing meeeeeee [23:02:42] *hadoop [23:03:01] ugh [23:03:37] go back and forth between reading two books, the apache docs, the cdh5 docs, various other docs and the examples, and testing 'does this work? hm. not exactly.' [23:03:40] groan [23:03:47] anyways at 2am it's all going to be painful [23:05:13] uh, fatalmonitor in logstash does not look happy what so eve [23:05:15] r [23:05:28] https://logstash.wikimedia.org/goto/d9f8d9aad5b397feb09998ca6927a7c1 [23:05:48] 30k of Notice: Undefined variable: title in /srv/mediawiki/w/robots.php on line 6 [23:06:12] same number of: Catchable fatal error: Argument 1 passed to WikiPage::factory() must be an instance of Title, null given in /srv/mediawiki/php-1.28.0-wmf.21/includes/page/WikiPage.php on line 136 [23:06:29] thcipriani: around? ^ [23:06:48] sorry for the 5pm ping, but, this doesn't look healthy [23:07:05] oh good. [23:07:13] * thcipriani looks [23:07:17] some wiki missing Mediawiki:robots.txt ? [23:07:58] all clustered on 4 hosts: mw1267, mw1268, mw1261, mw1275 [23:08:10] maybe out of sync? 
[23:08:12] not totally [23:08:22] https://logstash.wikimedia.org/goto/51790df71daf7d084968220817656175 [23:08:38] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [23:08:43] oh, you're right I didn't see the scrollbar [23:08:44] started around 21:55 UTC [23:10:01] It's got to have something to do with the wikidb or there would be way more I think [23:10:32] blerg, no https://en.wikipedia.org/robots.txt is an erro page [23:11:22] that line hasn't changed for 4+ years [23:11:47] there was a change recently that removed some empty line [23:11:57] heh [23:12:15] robots.php definitely has a modification time that is fairly recent [23:12:34] https://gerrit.wikimedia.org/r/#/c/314769/ [23:12:38] bblack, another 500 showing up as 503 - gzip issues again? [23:12:51] (the robots.txt link bd808 posted above) [23:12:56] https://gerrit.wikimedia.org/r/#/c/314769/ [23:13:00] ooh, too slow [23:13:14] thcipriani: you want to take the revert honors [23:13:17] yup [23:13:24] and trout krinkle [23:13:44] probably a three character fix but it's a self-merge [23:14:05] on in mediawiki-config, right [23:14:16] yup, reproducible everywhere [23:15:02] !log thcipriani@tin Synchronized w/robots.php: Revert change to robots.php (duration: 00m 49s) [23:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:15:09] thanks all [23:15:33] was the deployment not on sal? [23:15:43] no sal for the prior sync. maybe during a netsplit? 
[23:15:45] and where is grrrit-wm [23:15:51] * Krenair kicks some bots [23:16:04] (PS1) Thcipriani: Revert "robots.php: Use WikiPage instead of Article class" [mediawiki-config] - https://gerrit.wikimedia.org/r/314786 [23:16:05] Krenair it's probably lost in cyberspace [23:16:07] !log test morebots [23:16:11] also not a good change to self merge on a Friday just as a matter of practice [23:16:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:32] (CR) Thcipriani: [C: 2] Revert "robots.php: Use WikiPage instead of Article class" [mediawiki-config] - https://gerrit.wikimedia.org/r/314786 (owner: Thcipriani) [23:16:59] (Merged) jenkins-bot: Revert "robots.php: Use WikiPage instead of Article class" [mediawiki-config] - https://gerrit.wikimedia.org/r/314786 (owner: Thcipriani) [23:17:10] all clean [23:17:17] error rate looking more normal, too [23:17:25] thanks for paying attention greg-g :) [23:17:30] ^ [23:17:39] (CR) Greg Grossmeier: "This was trivially broken in both Beta and production (simply loading robots.txt gave you an error) causing a huge number of 500s in the f" [mediawiki-config] - https://gerrit.wikimedia.org/r/314769 (owner: Krinkle) [23:17:52] greg-g: you can have an I broke WP sticker from the pile that you should have somewhere :) [23:18:15] bd808: wait, I do? are they at the office? :) [23:18:43] I gave some to Roan to give to you when we were in Jerusalem. [23:19:05] ahhh, I didn't see them on my desk when i was there last month, I'll check at all hands :) [23:19:32] lol. in the office once a quarter whether you need it or not [23:22:57] I need to get the scap canary check hooked up to hhvm fatals still not just mediawiki errors. worried about false positives from things beyond deployers control though :\ [23:23:29] yeah, I ran across that in my gerrit review backlog again today [23:23:52] I should...-1 that for the time being.
[23:24:27] want to filter out db/redis things before that's a thing [23:24:51] (CR) Thcipriani: [C: -1] "Want to avoid creating false positives from db and redis timeouts, needs some tweaks." [puppet] - https://gerrit.wikimedia.org/r/304327 (https://phabricator.wikimedia.org/T142784) (owner: Thcipriani) [23:28:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [23:29:37] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:30:56] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:37:30] ohai Krinkle [23:37:57] did you see the 500 spike from the broken robots.txt? [23:38:08] don't worry, we fixed it :) [23:39:55] I did not. IRCCloud/freenode has been buggy [23:40:30] https://logstash.wikimedia.org/goto/540498592edcb0f2f7d167d32a1ef267 [23:40:33] that's what it looked like [23:40:56] Krinkle: please no more self merges and deploys on Friday [23:41:14] especially without even loading the page that was affected (robots.txt) [23:43:00] Aye, yeah, I totally should've seen that. My bad. (IRCCloud went away for me right after the sync and got distracted). I have no reason for why I didn't test it on mw1017 first. Especially after advocating that practice so much. [23:43:57] Krinkle: word, thanks. :) Enjoy the rest of your weekend. [23:45:03] * Krinkle leaves the revised commit for review on Monday [23:45:39] (PS1) Krinkle: robots.php: Use WikiPage instead of Article class [mediawiki-config] - https://gerrit.wikimedia.org/r/314790 [23:56:59] Hi