[00:00:09] jouncebot: next
[00:00:09] In 82 hour(s) and 59 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180604T1100)
[00:00:12] jouncebot: now
[00:00:12] No deployments scheduled for the next 82 hour(s) and 59 minute(s)
[00:02:00] Reedy: thinking of something? :)
[00:02:10] greg-g: The request in -tech
[00:03:50] !log pnorman@tin Finished deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error (duration: 04m 26s)
[00:03:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:06:30] Urgency in its case seems to be POV
[00:09:11] (PS1) Dzahn: install_server: use raid1-gpt partman recipe for phab1002 [puppet] - https://gerrit.wikimedia.org/r/436702 (https://phabricator.wikimedia.org/T196019)
[00:10:39] (CR) Dzahn: [C: 2] install_server: use raid1-gpt partman recipe for phab1002 [puppet] - https://gerrit.wikimedia.org/r/436702 (https://phabricator.wikimedia.org/T196019) (owner: Dzahn)
[00:10:41] (PS2) Dzahn: install_server: use raid1-gpt partman recipe for phab1002 [puppet] - https://gerrit.wikimedia.org/r/436702 (https://phabricator.wikimedia.org/T196019)
[00:11:28] * mutante schedules a de-deployment of tin for tomorrow
[00:13:14] greg-g: ^ i was about to actually do that so it shows up in jouncebot but the entire day Friday isnt in that cal and that seems tricky to not mess up?
[00:18:58] mutante: add an entry with the time you plan to do it and it'll create the right entry for it
[00:24:49] greg-g: thanks! https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1793439&oldid=1793398 ?
[00:25:10] ah, the window name is already wrong of course
[00:30:46] jouncebot: next
[00:30:46] In 82 hour(s) and 29 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180604T1100)
[00:30:51] jouncebot: refresh
[00:30:52] I refreshed my knowledge about deployments.
[00:30:54] jouncebot: next
[00:30:54] In 82 hour(s) and 29 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180604T1100)
[00:32:08] mutante: put it in the right section... I'll move it
[00:33:56] https://wikitech.wikimedia.org/w/index.php?title=Deployments&type=revision&diff=1793440&oldid=1793439
[00:33:59] jouncebot: refresh
[00:34:01] I refreshed my knowledge about deployments.
[00:34:03] jouncebot: next
[00:34:03] In 14 hour(s) and 25 minute(s): Deployment server switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180601T1500)
[00:34:37] Operations, ops-eqiad, Patch-For-Review: WMF4727 hardware issue - disks dont detect in installer - https://phabricator.wikimedia.org/T189804#4247617 (Dzahn) Open>Resolved problem is gone with raid1-gpt partman recipe which supports disks over 2 TB (thanks Papaul for pointing it out and the di...
[00:34:39] greg-g: thanks :)
[00:35:50] Operations, ops-eqiad, Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4247630 (Dzahn) OS has been installed (with wmf-auto-reimage-host) after using raid1-gpt partman. Debian GNU/Linux 9 phab1002 ttyS1...
[00:36:16] mutante: Do you know if the switch from tin to deploy1001 also switches the default PHP? For app servers we made the distro switch in such a way that doesn't affect PHP used , e.g. stayed on HHVM.
[00:36:32] Operations, ops-eqiad, Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4247631 (Dzahn)
[00:36:36] I would expect the same to happen for this as well.
[00:37:35] Krinkle: it should stay on HHVM.. and also when terbium switches to mwmaint1001
[00:38:48] mutante: thx for confirming.
I ask because initially there was some push to also switch to PHP7
[00:38:58] But Aaron just uncovered another memcached incompatibility
[00:39:07] which should block the switch in prod
[00:39:32] Yeah, this is just newer base os, with php 7... but still using hhvm for now
[00:39:39] I guess dumps has sailed, but I guess it's either not yet caused noticable disruption or might be lucky to not hit that particular behaviour.
[00:40:38] Does dumps use memcached? :P
[00:40:47] thanks for double confirming that, heh
[00:45:04] !log pnorman@tin Started deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error
[00:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:48:15] !log pnorman@tin Finished deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error (duration: 03m 11s)
[00:48:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:49:57] (PS1) Dmaza: Enable $wgCookieSetOnIpBlock on itwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/436707 (https://phabricator.wikimedia.org/T196121)
[00:51:02] Reedy: dumps use a MediaWiki maintenance script.
I don't think anything gets done that involves Setup.php without touching memcached in some way
[00:51:13] and given it fetches revision text, yes, it definitely touches memc
[00:51:20] (sorry, wasn't sure whether you were kidding or not, anyway)
[00:51:29] !log pnorman@tin Started deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error
[00:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:51:43] Doesn't mean it actually uses anything from memcached (ie anything is retrieved it needed)
[00:53:54] (PS1) Dzahn: site: add phab1002 with spare role and mapped IPv6 [puppet] - https://gerrit.wikimedia.org/r/436708 (https://phabricator.wikimedia.org/T196019)
[00:53:55] !log pnorman@tin Finished deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error (duration: 02m 26s)
[00:53:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:54:32] (CR) jerkins-bot: [V: -1] site: add phab1002 with spare role and mapped IPv6 [puppet] - https://gerrit.wikimedia.org/r/436708 (https://phabricator.wikimedia.org/T196019) (owner: Dzahn)
[00:56:49] !log pnorman@tin Started deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error
[00:56:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:15] (PS2) Dzahn: site: add phab1002 with spare role and mapped IPv6 [puppet] - https://gerrit.wikimedia.org/r/436708 (https://phabricator.wikimedia.org/T196019)
[00:58:20] that -1 is kind of a bug
[00:58:56] you cant add that snippet in role either
[00:59:36] (PS3) Dzahn: site: add phab1002 with spare role and mapped IPv6 [puppet] - https://gerrit.wikimedia.org/r/436708 (https://phabricator.wikimedia.org/T196019)
[00:59:41] !log pnorman@tin Finished deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error (duration: 02m 52s)
[00:59:44] Logged the
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:00:36] !log pnorman@tin Started deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error
[01:00:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:00] !log pnorman@tin Finished deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error (duration: 02m 24s)
[01:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:03:35] (CR) Dzahn: [C: 2] site: add phab1002 with spare role and mapped IPv6 [puppet] - https://gerrit.wikimedia.org/r/436708 (https://phabricator.wikimedia.org/T196019) (owner: Dzahn)
[01:08:45] Operations, Scap, Patch-For-Review, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921#4247675 (Reedy) ``` reedy@mwmaint1001:~$ time PHP='hhvm -vEval.Jit=1' mwscript rebuildLocalis...
[01:10:27] (PS1) Dzahn: add IPv6 records for phab1002 [dns] - https://gerrit.wikimedia.org/r/436709 (https://phabricator.wikimedia.org/T196019)
[01:10:59] Operations, MediaWiki-Platform-Team, Performance-Team, PHP 7.0 support: php-memcached 3.0 (PHP 7) incompatible with BagOStuff - https://phabricator.wikimedia.org/T196125#4247677 (Krinkle) p:Triage>Normal
[01:11:17] (CR) Dzahn: [C: 2] add IPv6 records for phab1002 [dns] - https://gerrit.wikimedia.org/r/436709 (https://phabricator.wikimedia.org/T196019) (owner: Dzahn)
[01:17:19] (PS1) Dzahn: add aphlict and vcs service IPs for phab1002 [dns] - https://gerrit.wikimedia.org/r/436710 (https://phabricator.wikimedia.org/T196019)
[01:19:00] (PS2) Dzahn: add aphlict and vcs service IPs for phab1002 [dns] - https://gerrit.wikimedia.org/r/436710 (https://phabricator.wikimedia.org/T196019)
[01:24:24] (CR) Dzahn: [C: 2] add aphlict and vcs service IPs for phab1002 [dns] - https://gerrit.wikimedia.org/r/436710 (https://phabricator.wikimedia.org/T196019) (owner: Dzahn)
[01:24:59] !log pnorman@tin Started deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error
[01:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:26:03] Operations, ops-eqiad, Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4247694 (Dzahn)
[01:26:12] (PS6) Krinkle: webperf: Add navtiming and coal to scap::sources [puppet] - https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314)
[01:26:41] Operations, ops-eqiad, Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4244341 (Dzahn)
[01:27:33] !log pnorman@tin Finished deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error (duration: 02m 34s)
[01:27:37] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:28:23] Operations, ops-eqiad, Patch-For-Review: setup/install phab1002(WMF4727) - https://phabricator.wikimedia.org/T196019#4247696 (Dzahn)
[01:28:29] (PS7) Krinkle: webperf: Add navtiming and coal to scap::sources [puppet] - https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314)
[01:33:04] Krinkle: is that ready?
[01:33:13] it seems we want to merge that before tomorrow morning.. i can
[01:33:32] mutante: indeed
[01:33:38] I'm testing it in beta now
[01:33:41] ok
[01:33:46] !log pnorman@tin Started deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error
[01:33:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:20] !log pnorman@tin Finished deploy [tilerator/deploy@709ca69] (cleartables): Redeploy to 2004 to try to reproduce error (duration: 00m 33s)
[01:34:21] mutante: actually, beta was my main motivation for this. I didn't realise until a few minutes ago the switch was happening tomorrow
[01:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:34:33] e.g. to have it work in beta without cloning it manually
[01:35:02] yea, i saw the comments on deployment-deploy
[01:43:44] Operations, Scap, Patch-For-Review, Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921#4247698 (Reedy) The previous was with an unclean outdir... This one is clean, dunno if it ac...
[01:52:11] Krinkle: i gotta go afk for a bit, but i am ready to merge that before i do deploy1001 tomorrow my morning
[01:56:39] Operations, MediaWiki-Platform-Team, HHVM, TechCom-RFC (TechCom-Approved), User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4247721 (Krinkle) >>!
In T176370#4168342, @Joe wrote: > Dumps are already partially running on php 7 and have been thoroug...
[01:58:31] (PS8) Krinkle: webperf: Add statsv, navtiming and coal to scap::sources [puppet] - https://gerrit.wikimedia.org/r/436601 (https://phabricator.wikimedia.org/T195314)
[02:48:37] mutante: I'd recommend not merging for now.
[02:49:01] mutante: It seems either webperf repos are configured in a way that doesn't bootstrap well on new hosts, or scap itself has issues.
[02:49:18] Either way, we don't deploy very often so I suppose it's fine to leave as-is for now.
[02:49:29] We'll figure it out after the switch
[02:50:11] Afaik that should not cause webperf1001 prod puppet to fail given it is already set up correctly (ensure present is already fulfilled), but if it does end up failing, then the above patch should work fine there.
[03:15:55] (PS2) Reedy: Collapse PHP_SAPI conditionals down into one [mediawiki-config] - https://gerrit.wikimedia.org/r/393355
[03:37:44] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 919.44 seconds
[03:50:21] Krinkle: ok. ACK.
thank you
[03:53:31] (CR) Aaron Schulz: [C: 1] mcrouter: fix hiera labels, install on mwdebug servers [puppet] - https://gerrit.wikimedia.org/r/436532 (https://phabricator.wikimedia.org/T192771) (owner: Giuseppe Lavagetto)
[04:30:54] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 234.09 seconds
[04:36:12] (CR) Ayounsi: [C: 1] Enable base::service_auto_restart for librenms-syslog [puppet] - https://gerrit.wikimedia.org/r/436571 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[05:20:57] (PS1) Marostegui: db-eqiad.php: Depool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/436722 (https://phabricator.wikimedia.org/T191316)
[05:24:09] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/436722 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:25:31] (Merged) jenkins-bot: db-eqiad.php: Depool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/436722 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:26:52] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1113:3315 for alter table (duration: 00m 57s)
[05:26:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:39] !log Deploy schema change on db1113:3315 - T191316 T192926 T89737 T195193
[05:27:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:27:46] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737
[05:27:46] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926
[05:27:46] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193
[05:27:46] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316
[05:29:21]
(CR) jenkins-bot: db-eqiad.php: Depool db1113:3315 [mediawiki-config] - https://gerrit.wikimedia.org/r/436722 (https://phabricator.wikimedia.org/T191316) (owner: Marostegui)
[05:31:27] (PS1) Marostegui: db-codfw.php: Repool db2092 and db2062 [mediawiki-config] - https://gerrit.wikimedia.org/r/436724 (https://phabricator.wikimedia.org/T190704)
[05:33:23] (CR) Marostegui: [C: 2] db-codfw.php: Repool db2092 and db2062 [mediawiki-config] - https://gerrit.wikimedia.org/r/436724 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:34:40] (Merged) jenkins-bot: db-codfw.php: Repool db2092 and db2062 [mediawiki-config] - https://gerrit.wikimedia.org/r/436724 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:34:55] (CR) jenkins-bot: db-codfw.php: Repool db2092 and db2062 [mediawiki-config] - https://gerrit.wikimedia.org/r/436724 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:36:03] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Repool db2092 and db2062 in s1 (duration: 00m 59s)
[05:36:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:42:18] (PS1) Marostegui: mariadb: Move db2075 back to s5 [puppet] - https://gerrit.wikimedia.org/r/436725 (https://phabricator.wikimedia.org/T190704)
[05:44:11] (CR) Marostegui: [C: 2] mariadb: Move db2075 back to s5 [puppet] - https://gerrit.wikimedia.org/r/436725 (https://phabricator.wikimedia.org/T190704) (owner: Marostegui)
[05:45:51] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4247884 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2...
[05:47:57] (PS1) Marostegui: sX.hosts: Add db2075 back to s5 [software] - https://gerrit.wikimedia.org/r/436726
[05:49:23] (CR) Marostegui: [C: 2] sX.hosts: Add db2075 back to s5 [software] - https://gerrit.wikimedia.org/r/436726 (owner: Marostegui)
[05:50:18] (Merged) jenkins-bot: sX.hosts: Add db2075 back to s5 [software] - https://gerrit.wikimedia.org/r/436726 (owner: Marostegui)
[06:06:17] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4247888 (ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2075.codfw.wmnet'] ``` and were **ALL** successful.
[06:08:55] (PS1) Marostegui: db-eqiad.php: Depool db2059 [mediawiki-config] - https://gerrit.wikimedia.org/r/436727
[06:12:35] (CR) Marostegui: [C: 2] db-eqiad.php: Depool db2059 [mediawiki-config] - https://gerrit.wikimedia.org/r/436727 (owner: Marostegui)
[06:13:57] (Merged) jenkins-bot: db-eqiad.php: Depool db2059 [mediawiki-config] - https://gerrit.wikimedia.org/r/436727 (owner: Marostegui)
[06:15:10] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2059 (duration: 00m 56s)
[06:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:38] !log Stop MySQL on db2059 to clone db2075 - T190704
[06:15:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:15:42] T190704: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704
[06:19:26] (CR) jenkins-bot: db-eqiad.php: Depool db2059 [mediawiki-config] - https://gerrit.wikimedia.org/r/436727 (owner: Marostegui)
[06:19:37] Operations, Traffic, media-storage, Patch-For-Review: Remove unnecessary response headers - https://phabricator.wikimedia.org/T194814#4209672 (TheDJ) ``` X-Varnish 521726689 533337780, 225083667
220092282, 525815818 515121340 Server mw1238.eqiad.wmnet ``` Especially these two can be handy. In the...
[06:30:39] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/apparmor.d/abstractions/ssl_certs]
[06:37:09] (PS4) Ema: prometheus: export intel-microcode information via node_exporter [puppet] - https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825)
[06:37:45] (PS2) Ema: vcl: strip away unnecessary response headers set by Thumbor [puppet] - https://gerrit.wikimedia.org/r/433573 (https://phabricator.wikimedia.org/T194814)
[06:38:23] (CR) Ema: [C: 2] vcl: strip away unnecessary response headers set by Thumbor [puppet] - https://gerrit.wikimedia.org/r/433573 (https://phabricator.wikimedia.org/T194814) (owner: Ema)
[06:55:49] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[06:58:14] (PS3) Muehlenhoff: Remove now obsolete os conditional [puppet] - https://gerrit.wikimedia.org/r/436241
[06:58:54] (CR) Muehlenhoff: [C: 2] Remove now obsolete os conditional [puppet] - https://gerrit.wikimedia.org/r/436241 (owner: Muehlenhoff)
[07:00:41] (PS4) Elukey: Remove pivot from puppet [puppet] - https://gerrit.wikimedia.org/r/436503
[07:01:11] (PS2) Muehlenhoff: Enable base::service_auto_restart for librenms-syslog [puppet] - https://gerrit.wikimedia.org/r/436571 (https://phabricator.wikimedia.org/T135991)
[07:01:37] (CR) Elukey: [C: 2] Remove pivot from puppet [puppet] - https://gerrit.wikimedia.org/r/436503 (owner: Elukey)
[07:02:12] bye bye pivot
[07:04:19] (CR) Muehlenhoff: [C: 2] Enable base::service_auto_restart for librenms-syslog [puppet] - https://gerrit.wikimedia.org/r/436571 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[07:04:23] (PS3) Muehlenhoff: Enable
base::service_auto_restart for librenms-syslog [puppet] - https://gerrit.wikimedia.org/r/436571 (https://phabricator.wikimedia.org/T135991)
[07:11:53] is the prometheus issue in T196137 known?
[07:11:54] T196137: toolforge: prometheus issue is filling up email queue - https://phabricator.wikimedia.org/T196137
[07:12:15] https://www.irccloud.com/pastebin/r5t3A9cW/
[07:17:13] (PS4) Volans: Create a custom mysql backend and use it [software/debmonitor] - https://gerrit.wikimedia.org/r/436592 (https://phabricator.wikimedia.org/T167504)
[07:19:26] arturo: never seen that, but one of the two (the perm issue) might be resolvable with a simple puppet patch?
[07:23:55] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4248014 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2...
[07:26:08] PROBLEM - MariaDB Slave Lag: s7 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 18323.51 seconds
[07:26:38] ^ probably the maintenance script hit s7 now
[07:27:20] UPDATE /* PopulateExternallinksIndex60::doDBUpdates www-data@terbiu
[07:27:24] yep, that is the script
[07:30:28] (PS3) Giuseppe Lavagetto: profile::mediawiki::mcrouter_wancache: update the ssl paths [puppet] - https://gerrit.wikimedia.org/r/436531 (https://phabricator.wikimedia.org/T192771)
[07:30:30] (PS3) Giuseppe Lavagetto: mcrouter: fix hiera labels, install on mwdebug servers [puppet] - https://gerrit.wikimedia.org/r/436532 (https://phabricator.wikimedia.org/T192771)
[07:30:43] (CR) Giuseppe Lavagetto: profile::mediawiki::mcrouter_wancache: update the ssl paths (1 comment) [puppet] - https://gerrit.wikimedia.org/r/436531 (https://phabricator.wikimedia.org/T192771) (owner: Giuseppe Lavagetto)
[07:31:31] (CR) Giuseppe Lavagetto: [C: 2] profile::mediawiki::mcrouter_wancache:
update the ssl paths [puppet] - https://gerrit.wikimedia.org/r/436531 (https://phabricator.wikimedia.org/T192771) (owner: Giuseppe Lavagetto)
[07:34:58] (PS1) Marostegui: install_server: Allow reimage db2059 [puppet] - https://gerrit.wikimedia.org/r/436730
[07:35:57] (PS2) Marostegui: install_server: Allow reimage db2059 [puppet] - https://gerrit.wikimedia.org/r/436730
[07:40:40] (CR) Marostegui: [C: 2] install_server: Allow reimage db2059 [puppet] - https://gerrit.wikimedia.org/r/436730 (owner: Marostegui)
[07:47:02] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4248025 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2...
[07:48:24] (PS10) Jcrespo: mariadb: Add extra_port on port + 20 for multiinstance hosts [puppet] - https://gerrit.wikimedia.org/r/435751
[07:48:26] (PS1) Jcrespo: mariadb: Allow reimage of db1083 to stretch [puppet] - https://gerrit.wikimedia.org/r/436731
[08:03:50] (PS2) Jcrespo: mariadb: Allow reimage of db1083 to stretch [puppet] - https://gerrit.wikimedia.org/r/436731
[08:03:53] (PS3) Volans: debmonitor: specify MySQL connection options [puppet] - https://gerrit.wikimedia.org/r/436286 (https://phabricator.wikimedia.org/T191299)
[08:03:55] (PS2) Volans: debmonitor: add cache misc controller [puppet] - https://gerrit.wikimedia.org/r/436504 (https://phabricator.wikimedia.org/T191299)
[08:03:57] (PS3) Volans: debmonitor: add basic HTTP Icinga check [puppet] - https://gerrit.wikimedia.org/r/436509 (https://phabricator.wikimedia.org/T191299)
[08:04:29] (CR) Volans: "Dropped the isolation_level as 'read committed' is the default in Django mysql backend."
[puppet] - https://gerrit.wikimedia.org/r/436286 (https://phabricator.wikimedia.org/T191299) (owner: Volans)
[08:05:23] (PS1) Jcrespo: mariadb: Depool db1083 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/436736
[08:06:21] (PS1) Volans: Documentation: remove example setting [software/debmonitor] - https://gerrit.wikimedia.org/r/436737 (https://phabricator.wikimedia.org/T167504)
[08:06:30] (PS1) Giuseppe Lavagetto: Add mock CA infrastructure for mcrouter [labs/private] - https://gerrit.wikimedia.org/r/436738
[08:06:35] (CR) Jcrespo: [C: 2] mariadb: Allow reimage of db1083 to stretch [puppet] - https://gerrit.wikimedia.org/r/436731 (owner: Jcrespo)
[08:07:13] (CR) Giuseppe Lavagetto: [V: 2 C: 2] Add mock CA infrastructure for mcrouter [labs/private] - https://gerrit.wikimedia.org/r/436738 (owner: Giuseppe Lavagetto)
[08:10:06] Hi ops team - Just a ping to let you know I deploy hadoop jobs repo with elukey
[08:10:17] !log joal@tin Started deploy [analytics/refinery@7a72241]: Regular weekly deploy
[08:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:10] (CR) Giuseppe Lavagetto: [C: 2] "https://puppet-compiler.wmflabs.org/compiler02/11333/mwdebug1002.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/436532 (https://phabricator.wikimedia.org/T192771) (owner: Giuseppe Lavagetto)
[08:11:17] (PS4) Giuseppe Lavagetto: mcrouter: fix hiera labels, install on mwdebug servers [puppet] - https://gerrit.wikimedia.org/r/436532 (https://phabricator.wikimedia.org/T192771)
[08:13:03] (CR) Jcrespo: [C: 2] mariadb: Depool db1083 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/436736 (owner: Jcrespo)
[08:13:43] Operations, DBA, Goal, Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4248039 (ops-monitoring-bot) Script wmf-auto-reimage
was launched by marostegui on neodymium.eqiad.wmnet for hosts: ``` ['db2...
[08:14:25] (Merged) jenkins-bot: mariadb: Depool db1083 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/436736 (owner: Jcrespo)
[08:16:55] PROBLEM - Check systemd state on mwdebug1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:17:39] <_joe_> that's me ^^
[08:17:55] RECOVERY - Check systemd state on mwdebug1002 is OK: OK - running: The system is fully operational
[08:18:01] <_joe_> damn erbs and whitespaces
[08:19:24] (CR) jenkins-bot: mariadb: Depool db1083 for reimage [mediawiki-config] - https://gerrit.wikimedia.org/r/436736 (owner: Jcrespo)
[08:20:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1083 (duration: 01m 05s)
[08:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:05] PROBLEM - Check systemd state on mwdebug2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:21:44] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:22:45] PROBLEM - puppet last run on mwdebug2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/mcrouter/ssl]
[08:22:48] !log joal@tin Finished deploy [analytics/refinery@7a72241]: Regular weekly deploy (duration: 12m 31s)
[08:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:14] PROBLEM - puppet last run on mwdebug1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures.
Failed resources (up to 3 shown): File[/etc/mcrouter/ssl]
[08:23:43] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/436740
[08:23:50] (PS1) Giuseppe Lavagetto: mcrouter: fix ssl options [puppet] - https://gerrit.wikimedia.org/r/436741
[08:23:59] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/436740
[08:24:41] !log temporarily reducing s7-codfw-master consistency to aliviate lag (binlog_sync, flush_log)
[08:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:24] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:25:35] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:25:54] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation=compareAndSwap https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:26:24] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:26:35] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:26:53] (PS1) Marostegui: Revert "install_server: Allow reimage db2059" [puppet] - https://gerrit.wikimedia.org/r/436742
[08:26:54] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:27:06] (PS2) Marostegui: Revert "install_server: Allow reimage db2059" [puppet] - https://gerrit.wikimedia.org/r/436742
[08:27:21] (CR) Marostegui: [C: 2] Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/436740 (owner: Marostegui)
[08:27:24] PROBLEM - Check systemd state on mwdebug2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[08:27:52] (CR) Giuseppe Lavagetto: [C: 2] mcrouter: fix ssl options [puppet] - https://gerrit.wikimedia.org/r/436741 (owner: Giuseppe Lavagetto)
[08:28:09] (PS3) Marostegui: Revert "install_server: Allow reimage db2059" [puppet] - https://gerrit.wikimedia.org/r/436742
[08:28:24] PROBLEM - puppet last run on mwdebug2002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/mcrouter/ssl]
[08:28:43] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/436740 (owner: Marostegui)
[08:29:17] (CR) Marostegui: [C: 2] Revert "install_server: Allow reimage db2059" [puppet] - https://gerrit.wikimedia.org/r/436742 (owner: Marostegui)
[08:30:05] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1113:3315 after alter table (duration: 01m 03s)
[08:30:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:35] !log reimage db1083
[08:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:27] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1113:3315" [mediawiki-config] - https://gerrit.wikimedia.org/r/436740 (owner: Marostegui)
[08:31:47] (PS1) Giuseppe Lavagetto: mcrouter: add missing continuation line [puppet] - https://gerrit.wikimedia.org/r/436743
[08:32:33] (CR) Giuseppe Lavagetto: [C: 2] mcrouter: add missing
continuation line [puppet] - 10https://gerrit.wikimedia.org/r/436743 (owner: 10Giuseppe Lavagetto) [08:33:14] RECOVERY - puppet last run on mwdebug1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:33:46] !log pnorman@tin Started deploy [tilerator/deploy@709ca69] (cleartables): reenable v3view on 2004 [08:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:34] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.078 second response time [08:35:14] RECOVERY - Check systemd state on mwdebug1001 is OK: OK - running: The system is fully operational [08:37:23] PROBLEM - mcrouter process on mwdebug2001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter [08:38:39] !log pnorman@tin Finished deploy [tilerator/deploy@709ca69] (cleartables): reenable v3view on 2004 (duration: 04m 53s) [08:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:04] PROBLEM - mcrouter process on mwdebug1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter [08:40:43] I've filed an unbreak now, marostegui check if it corresponds to some ongoing alter table [08:40:43] PROBLEM - mcrouter process on mwdebug2002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter [08:41:16] jynus: there are no ongoing alters for the revision table [08:41:18] I am checking [08:42:24] it seems to be heavy on dewiki [08:42:29] but happens on others, too [08:42:45] I am trying to find a task where it was dropped [08:43:18] yeah, that would help [08:43:19] when did it start happening? 
[08:43:25] I am looking at it [08:43:28] it seems train-related [08:45:38] That index isn't on tables.sql: https://github.com/wikimedia/mediawiki/blob/master/maintenance/tables.sql [08:45:40] there is a FORCE INDEX (rev_user_timestamp) [08:45:58] https://github.com/wikimedia/mediawiki/blob/master/maintenance/tables.sql#L357 [08:46:28] most likely a misspelling of user_timestamp [08:46:30] ? [08:46:50] however, FORCES are not to be added lightly [08:47:23] PROBLEM - mcrouter process on mwdebug1002 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (mcrouter), command name mcrouter [08:48:03] RECOVERY - Check systemd state on mwdebug2001 is OK: OK - running: The system is fully operational [08:49:33] (03CR) 10Muehlenhoff: [C: 04-1] prometheus: export intel-microcode information via node_exporter (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/436553 (https://phabricator.wikimedia.org/T127825) (owner: 10Ema) [08:50:56] I cannot find a task where that index was removed no [08:52:53] RECOVERY - puppet last run on mwdebug2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [08:54:35] 10Operations, 10MediaWiki-Platform-Team, 10HHVM, 10TechCom-RFC (TechCom-Approved), 10User-ArielGlenn: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370#4248093 (10MoritzMuehlenhoff) >>! In T176370#4247721, @Krinkle wrote: > For unattended scripts I don't think the cost is rea... 
[08:55:23] RECOVERY - Check systemd state on mwdebug2002 is OK: OK - running: The system is fully operational [08:58:31] RECOVERY - puppet last run on mwdebug2002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [08:58:37] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921#4248100 (10Nikerabbit) I wonder why are we testing `rebuildLocalisationCache.php` with `--threa... [09:01:27] 10Puppet, 10Analytics, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure: deployment-eventlog05 puppet error about missing mysql heartbeat.heartbeat table - https://phabricator.wikimedia.org/T191109#4248105 (10elukey) [09:01:53] 10Puppet, 10Analytics, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure: deployment-eventlog05 puppet error about missing mysql heartbeat.heartbeat table - https://phabricator.wikimedia.org/T191109#4093870 (10elukey) Thanks for the report! I added the following to the heartbeat database, and puppet now ru... [09:09:27] (03PS1) 10Jcrespo: mariadb: Repool db1083 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436747 [09:12:57] (03PS1) 10Elukey: profile::mariadb::misc::eventlogging:db: remove unused heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/436748 (https://phabricator.wikimedia.org/T191109) [09:13:39] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#3366199 (10MoritzMuehlenhoff) There's quite a bit of cron spam by planet2001: ``` http_proxy="http://url-downloader.codfw.wikimedia.org:8080" https_proxy=... 
[09:14:27] (03CR) 10Jcrespo: [C: 031] "I honestly do not have anything here to review- if it works, deploy it" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436592 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [09:18:25] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [09:21:19] 10Operations, 10DBA, 10Goal, 10Patch-For-Review: Convert all sanitarium hosts to multi-instance and increase its reliability/redundancy - https://phabricator.wikimedia.org/T190704#4248146 (10ops-monitoring-bot) Completed auto-reimage of hosts: ``` ['db2059.codfw.wmnet'] ``` Of which those **FAILED**: ```... [09:21:34] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:24:45] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1953 bytes in 0.088 second response time [09:25:34] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1083 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436747 (owner: 10Jcrespo) [09:26:58] (03Merged) 10jenkins-bot: mariadb: Repool db1083 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436747 (owner: 10Jcrespo) [09:28:38] (03PS1) 10Marostegui: db-eqiad.php: Depool db1113:3315,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436750 [09:28:55] (03PS2) 10Marostegui: db-eqiad.php: Depool db1113:3315,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436750 [09:29:34] (03CR) 10jenkins-bot: mariadb: Repool db1083 with low load after reimage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436747 (owner: 10Jcrespo) [09:29:47] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler02/11335/" [puppet] - 10https://gerrit.wikimedia.org/r/436748 (https://phabricator.wikimedia.org/T191109) (owner: 10Elukey) [09:30:27] !log jynus@tin Synchronized 
wmf-config/db-eqiad.php: Repool db1083 with low load (duration: 01m 03s) [09:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:11] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1113:3315,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436750 (owner: 10Marostegui) [09:33:25] (03PS1) 10Marostegui: db2075: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/436751 [09:33:34] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1113:3315,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436750 (owner: 10Marostegui) [09:34:59] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1113:3315,db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436750 (owner: 10Marostegui) [09:35:16] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1113:3315 db1096:3315 (duration: 01m 03s) [09:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:32] (03CR) 10Volans: [C: 032] "From all tests in labs everything seems fine with this patch. Merging to get unblocked. 
@akosiaris let me know if you have any concern and" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436592 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [09:36:06] (03CR) 10Marostegui: [C: 032] db2075: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/436751 (owner: 10Marostegui) [09:36:26] (03PS4) 10Volans: debmonitor: specify MySQL connection options [puppet] - 10https://gerrit.wikimedia.org/r/436286 (https://phabricator.wikimedia.org/T191299) [09:36:44] (03Merged) 10jenkins-bot: Create a custom mysql backend and use it [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436592 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [09:36:54] (03CR) 10Volans: [C: 032] Documentation: remove example setting [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436737 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [09:37:41] (03CR) 10Volans: [C: 032] debmonitor: specify MySQL connection options [puppet] - 10https://gerrit.wikimedia.org/r/436286 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [09:37:54] !log Stop replication in sync on db1113:3315 and db1096:3315 for data checks [09:37:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:01] (03Merged) 10jenkins-bot: Documentation: remove example setting [software/debmonitor] - 10https://gerrit.wikimedia.org/r/436737 (https://phabricator.wikimedia.org/T167504) (owner: 10Volans) [09:41:34] (03PS1) 10Arturo Borrero Gonzalez: toollabs: add /etc/aliases file for tools-mail server [puppet] - 10https://gerrit.wikimedia.org/r/436752 (https://phabricator.wikimedia.org/T196137) [09:41:59] (03PS1) 10Volans: Updated Debmonitor submodule to v0.1.1 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/436753 (https://phabricator.wikimedia.org/T191299) [09:42:45] (03CR) 10Volans: [V: 032 C: 032] Updated Debmonitor submodule to v0.1.1 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/436753 
(https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [09:47:46] (03PS1) 10Elukey: turnilo: update config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/436754 [09:50:34] (03PS1) 10Volans: Built wheels for Debmonitor v0.1.1 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/436755 (https://phabricator.wikimedia.org/T191299) [09:51:09] (03CR) 10Volans: [V: 032 C: 032] Built wheels for Debmonitor v0.1.1 [software/debmonitor/deploy] - 10https://gerrit.wikimedia.org/r/436755 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [09:53:16] !log volans@tin Started deploy [debmonitor/deploy@fe8df6e]: Release v0.1.1 [09:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:53:50] !log volans@tin Finished deploy [debmonitor/deploy@fe8df6e]: Release v0.1.1 (duration: 00m 33s) [09:53:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:55] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1963 bytes in 0.085 second response time [10:04:04] PROBLEM - MariaDB Slave Lag: m2 on db2078 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 641.39 seconds [10:04:22] !log ladsgroup@terbium:~$ foreachwikiindblist medium deleteAutoPatrolLogs.php --sleep 2 --check-old [10:04:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:11] (03PS1) 10Giuseppe Lavagetto: mcrouter: add rsyslog/logrotate configuration [puppet] - 10https://gerrit.wikimedia.org/r/436756 [10:09:13] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::common: add prometheus exporter for mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436757 [10:09:15] (03PS1) 10Giuseppe Lavagetto: role::prometheus::ops: collect mcrouter metrics [puppet] - 10https://gerrit.wikimedia.org/r/436758 [10:09:17] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::common: install mcrouter everywhere [puppet] - 
10https://gerrit.wikimedia.org/r/436759 [10:09:19] (03PS1) 10Giuseppe Lavagetto: role::deployment_server: add mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436760 [10:11:43] PROBLEM - Device not healthy -SMART- on db2059 is CRITICAL: cluster=mysql device=cciss,11 instance=db2059:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2059&var-datasource=codfw%2520prometheus%252Fops [10:11:50] (03CR) 10Giuseppe Lavagetto: [C: 032] mcrouter: add rsyslog/logrotate configuration [puppet] - 10https://gerrit.wikimedia.org/r/436756 (owner: 10Giuseppe Lavagetto) [10:12:14] RECOVERY - MariaDB Slave Lag: m2 on db2078 is OK: OK slave_sql_lag Replication lag: 0.48 seconds [10:15:44] (03PS1) 10Giuseppe Lavagetto: mcrouter: fix logrotate rule [puppet] - 10https://gerrit.wikimedia.org/r/436762 [10:15:54] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1113:3315,db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436763 [10:16:04] (03CR) 10Giuseppe Lavagetto: [C: 032] mcrouter: fix logrotate rule [puppet] - 10https://gerrit.wikimedia.org/r/436762 (owner: 10Giuseppe Lavagetto) [10:16:29] (03PS1) 10Jcrespo: mariadb: Repool db1083 fully [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436764 [10:19:38] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1113:3315,db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436763 (owner: 10Marostegui) [10:21:23] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3315,db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436763 (owner: 10Marostegui) [10:21:39] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1113:3315,db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436763 (owner: 10Marostegui) [10:21:49] (03PS1) 10Giuseppe Lavagetto: mcrouter: run as user mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436765 [10:22:11] (03CR) 10Giuseppe Lavagetto: [C: 032] mcrouter: run as user mcrouter [puppet] - 
10https://gerrit.wikimedia.org/r/436765 (owner: 10Giuseppe Lavagetto) [10:22:46] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1113:3315 db1096:3315 (duration: 01m 02s) [10:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:24:26] (03PS1) 10Mark Bergsma: Move NaiveBGPPeeringTestCase to test_peering [debs/pybal] - 10https://gerrit.wikimedia.org/r/436766 [10:25:34] RECOVERY - mcrouter process on mwdebug1002 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter [10:25:34] (03CR) 10jerkins-bot: [V: 04-1] Move NaiveBGPPeeringTestCase to test_peering [debs/pybal] - 10https://gerrit.wikimedia.org/r/436766 (owner: 10Mark Bergsma) [10:26:21] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::common: add prometheus exporter for mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436757 [10:26:23] RECOVERY - mcrouter process on mwdebug2002 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter [10:27:04] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1969 bytes in 0.073 second response time [10:29:05] (03PS1) 10Muehlenhoff: Remove obsolete partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/436767 (https://phabricator.wikimedia.org/T106381) [10:29:56] (03PS2) 10Muehlenhoff: Remove obsolete partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/436767 (https://phabricator.wikimedia.org/T156944) [10:30:43] (03PS1) 10Marostegui: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436768 (https://phabricator.wikimedia.org/T191316) [10:30:46] (03PS2) 10Mark Bergsma: Move NaiveBGPPeeringTestCase to test_peering [debs/pybal] - 10https://gerrit.wikimedia.org/r/436766 [10:32:18] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436768 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [10:32:40] 
(03PS3) 10Giuseppe Lavagetto: role::mediawiki::common: add prometheus exporter for mcrouter [puppet] - 10https://gerrit.wikimedia.org/r/436757 [10:33:38] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436768 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [10:33:53] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1096:3315 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436768 (https://phabricator.wikimedia.org/T191316) (owner: 10Marostegui) [10:34:13] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1971 bytes in 0.081 second response time [10:34:57] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1096:3315 (duration: 01m 03s) [10:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:03] !log Deploy schema change on db1096:3315 - T191316 T192926 T89737 T195193 [10:35:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:09] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [10:35:10] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 [10:35:10] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [10:35:10] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [10:35:22] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11337/mwdebug1002.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/436757 (owner: 10Giuseppe Lavagetto) [10:37:39] (03PS1) 10Mark Bergsma: Cleanup NaiveBGPPeeringTestCase [debs/pybal] - 10https://gerrit.wikimedia.org/r/436769 [10:41:44] (03PS2) 10Giuseppe Lavagetto: role::prometheus::ops: collect mcrouter metrics [puppet] 
- 10https://gerrit.wikimedia.org/r/436758 [10:42:52] (03PS1) 10Muehlenhoff: Add bmansurov to restricted group instead of maintenance-log-readers [puppet] - 10https://gerrit.wikimedia.org/r/436773 (https://phabricator.wikimedia.org/T189285) [10:43:36] (03PS3) 10Giuseppe Lavagetto: role::prometheus::ops: collect mcrouter metrics [puppet] - 10https://gerrit.wikimedia.org/r/436758 [10:43:56] (03CR) 10Mark Bergsma: [C: 031] debmonitor: add cache misc controller [puppet] - 10https://gerrit.wikimedia.org/r/436504 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [10:44:55] (03CR) 10Mark Bergsma: [C: 031] Add debmonitor endpoints [dns] - 10https://gerrit.wikimedia.org/r/436505 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [10:46:13] (03PS4) 10Giuseppe Lavagetto: role::prometheus::ops: collect mcrouter metrics [puppet] - 10https://gerrit.wikimedia.org/r/436758 [10:48:34] RECOVERY - mcrouter process on mwdebug2001 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter [10:48:42] (03CR) 10Elukey: [C: 032] turnilo: update config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/436754 (owner: 10Elukey) [10:48:48] (03PS2) 10Elukey: turnilo: update config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/436754 [10:48:51] (03CR) 10Elukey: [V: 032 C: 032] turnilo: update config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/436754 (owner: 10Elukey) [10:48:54] RECOVERY - mcrouter process on mwdebug1001 is OK: PROCS OK: 1 process with UID = 114 (mcrouter), command name mcrouter [10:52:07] (03CR) 10Giuseppe Lavagetto: [C: 032] role::prometheus::ops: collect mcrouter metrics [puppet] - 10https://gerrit.wikimedia.org/r/436758 (owner: 10Giuseppe Lavagetto) [10:52:13] (03PS5) 10Giuseppe Lavagetto: role::prometheus::ops: collect mcrouter metrics [puppet] - 10https://gerrit.wikimedia.org/r/436758 [10:57:40] (03PS2) 10Giuseppe Lavagetto: role::mediawiki::common: install mcrouter everywhere [puppet] - 10https://gerrit.wikimedia.org/r/436759 
[10:59:05] <_joe_> !log disabling puppet on all hosts with role::mediawiki::common while installing mcrouter everywhere [10:59:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:38] (03CR) 10Giuseppe Lavagetto: [C: 032] role::mediawiki::common: install mcrouter everywhere [puppet] - 10https://gerrit.wikimedia.org/r/436759 (owner: 10Giuseppe Lavagetto) [10:59:46] <_joe_> alea iacta est [11:04:23] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1954 bytes in 0.112 second response time [11:06:30] (03PS1) 10Elukey: turnilo: update config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/436776 [11:07:12] _joe_ \o/ [11:07:35] (03CR) 10Elukey: [C: 032] turnilo: update config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/436776 (owner: 10Elukey) [11:08:29] (03PS2) 10Muehlenhoff: Add bmansurov to restricted group instead of maintenance-log-readers [puppet] - 10https://gerrit.wikimedia.org/r/436773 (https://phabricator.wikimedia.org/T189285) [11:09:39] (03CR) 10Muehlenhoff: [C: 032] Add bmansurov to restricted group instead of maintenance-log-readers [puppet] - 10https://gerrit.wikimedia.org/r/436773 (https://phabricator.wikimedia.org/T189285) (owner: 10Muehlenhoff) [11:10:49] _joe_: can you please ping the channel when puppet is enabled again? 
[11:15:46] <_joe_> moritzm: yes sure [11:15:52] <_joe_> I'm doing some final testing [11:16:24] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1962 bytes in 0.068 second response time [11:20:40] (03CR) 10Volans: "Compiler results: https://puppet-compiler.wmflabs.org/compiler02/11340/" [puppet] - 10https://gerrit.wikimedia.org/r/436504 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [11:22:13] (03PS1) 10Elukey: turnilo: update config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/436777 [11:22:51] (03CR) 10Elukey: [C: 032] turnilo: update config.yaml [puppet] - 10https://gerrit.wikimedia.org/r/436777 (owner: 10Elukey) [11:22:56] <_joe_> oh ok, I'm a moron [11:25:02] <_joe_> but at least I can reenable puppet I would say [11:25:41] lol [11:28:27] _joe_: puppet is failing on terbium as it can't install prometheus-mcrouter-exporter (probably only available for stretch) [11:29:03] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Package[prometheus-mcrouter-exporter] [11:29:03] <_joe_> moritzm: hah, yeah [11:29:16] <_joe_> ok lemme add an if guard then [11:29:35] ack, we can drop it next week hopefully [11:29:40] <_joe_> yeah [11:30:29] (03PS1) 10Giuseppe Lavagetto: mcrouter_wancache: expose the TLS-protected port [puppet] - 10https://gerrit.wikimedia.org/r/436779 [11:31:01] 10Operations, 10Ops-Access-Reviews, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access for bmansurov to run mwscript in terbium - https://phabricator.wikimedia.org/T189285#4248449 (10MoritzMuehlenhoff) 05Open>03Resolved a:03MoritzMuehlenhoff @bmansurov I changed your group membership, pleas... 
[11:36:33] (03PS1) 10Giuseppe Lavagetto: role::mediawiki::common: temporarily add an os_version guard [puppet] - 10https://gerrit.wikimedia.org/r/436780 [11:37:09] <_joe_> moritzm: fixing terbium and other jessies ^^ [11:37:22] (03CR) 10Giuseppe Lavagetto: [C: 032] role::mediawiki::common: temporarily add an os_version guard [puppet] - 10https://gerrit.wikimedia.org/r/436780 (owner: 10Giuseppe Lavagetto) [11:44:02] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:44:21] 10Operations: Update prometheus-varnish-exporter on debian to 1.4 - https://phabricator.wikimedia.org/T195252#4248485 (10MoritzMuehlenhoff) p:05Triage>03Normal [12:00:18] (03CR) 10Muehlenhoff: [C: 031] "Looks good, we could probably even skip notrack (our standard memcached servers don't use it either), but it's also fine." [puppet] - 10https://gerrit.wikimedia.org/r/436779 (owner: 10Giuseppe Lavagetto) [12:02:56] (03PS2) 10Giuseppe Lavagetto: mcrouter_wancache: expose the TLS-protected port [puppet] - 10https://gerrit.wikimedia.org/r/436779 [12:03:53] (03CR) 10Giuseppe Lavagetto: [C: 032] mcrouter_wancache: expose the TLS-protected port [puppet] - 10https://gerrit.wikimedia.org/r/436779 (owner: 10Giuseppe Lavagetto) [12:05:11] <_joe_> ouch I messed up [12:06:40] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::mcrouter_wancache: use ferm service, not rule [puppet] - 10https://gerrit.wikimedia.org/r/436781 [12:06:56] (03CR) 10Giuseppe Lavagetto: [V: 032 C: 032] profile::mediawiki::mcrouter_wancache: use ferm service, not rule [puppet] - 10https://gerrit.wikimedia.org/r/436781 (owner: 10Giuseppe Lavagetto) [12:07:02] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:07:11] PROBLEM - puppet last run on mw2262 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:07:12] PROBLEM - puppet last run on mw2286 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:07:22] <_joe_> that [12:07:23] <_joe_> s me [12:07:41] PROBLEM - puppet last run on mw1245 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:08:21] <_joe_> it should recover by itself [12:08:41] PROBLEM - puppet last run on mw2259 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:09:52] PROBLEM - puppet last run on mw2236 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:10:22] PROBLEM - puppet last run on mw1277 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:10:22] PROBLEM - puppet last run on mw2193 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:10:51] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:10:52] PROBLEM - puppet last run on mw2146 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:10:52] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:11:01] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:11:01] PROBLEM - puppet last run on mw1221 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:11:02] PROBLEM - puppet last run on mw2153 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:11:02] PROBLEM - puppet last run on mw2226 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:11:22] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:11:25] <_joe_> sigh [12:11:34] _joe_: need help? [12:11:48] <_joe_> volans: just remind me how to run puppet only on failed hosts :P [12:11:51] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1952 bytes in 0.105 second response time [12:12:01] _joe_: https://wikitech.wikimedia.org/wiki/Cumin#Run_Puppet_only_if_last_run_failed [12:12:02] PROBLEM - puppet last run on mw2250 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:07] <_joe_> because in the meanwhile I confirmed mcrouter replication works cross-dc \o/ [12:12:12] PROBLEM - puppet last run on mw2283 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:12:14] great [12:12:35] <_joe_> set a key in eqiad, read it in codfw, deleted it in codfw, confirmed in eqiad [12:13:32] PROBLEM - puppet last run on mw2203 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:14:34] <_joe_> now the last thing I need to do is to create a prometheus dashboard for mcrouter, then we can move on to do the mediawiki perf testing and rollout [12:15:22] RECOVERY - puppet last run on mw1277 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [12:15:22] RECOVERY - puppet last run on mw2193 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [12:15:52] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:16:01] RECOVERY - puppet last run on mw2146 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:16:01] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:16:02] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:16:02] RECOVERY - puppet last run on mw1221 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [12:16:02] RECOVERY - puppet last run on mw2153 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:16:02] RECOVERY - puppet last run on mw2226 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [12:16:22] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:17:11] RECOVERY - puppet last run on mw2250 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:17:11] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:17:12] RECOVERY - puppet last run on mw2262 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:17:12] RECOVERY - puppet last run on mw2283 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures 
[12:17:21] RECOVERY - puppet last run on mw2286 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:17:42] RECOVERY - puppet last run on mw1245 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:18:31] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [12:18:41] RECOVERY - puppet last run on mw2203 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:18:42] RECOVERY - puppet last run on mw2259 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:20:01] RECOVERY - puppet last run on mw2236 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:21:42] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:23:00] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), 10User-Joe: mcrouter production architecture - https://phabricator.wikimedia.org/T192771#4248539 (10Joe) Mcrouter is now installed across the fleet (minus the deployment servers), and I confirmed that replic... 
[12:23:16] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-mcrouter-exporter [puppet] - 10https://gerrit.wikimedia.org/r/436782 (https://phabricator.wikimedia.org/T135991) [12:23:19] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar): Deploy mcrouter to production as a wancache backend - https://phabricator.wikimedia.org/T192370#4248541 (10Joe) [12:23:22] 10Operations, 10Availability (MediaWiki-MultiDC), 10Patch-For-Review, 10Performance-Team (Radar), 10User-Joe: mcrouter production architecture - https://phabricator.wikimedia.org/T192771#4248540 (10Joe) 05Open>03Resolved [12:23:50] (03CR) 10jerkins-bot: [V: 04-1] Enable base::service_auto_restart for prometheus-mcrouter-exporter [puppet] - 10https://gerrit.wikimedia.org/r/436782 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [12:24:23] (03PS3) 10Volans: debmonitor: add cache misc controller [puppet] - 10https://gerrit.wikimedia.org/r/436504 (https://phabricator.wikimedia.org/T191299) [12:24:39] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for prometheus-mcrouter-exporter [puppet] - 10https://gerrit.wikimedia.org/r/436782 (https://phabricator.wikimedia.org/T135991) [12:26:16] (03CR) 10Volans: [C: 032] debmonitor: add cache misc controller [puppet] - 10https://gerrit.wikimedia.org/r/436504 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:31:48] (03PS4) 10Volans: debmonitor: add basic HTTP Icinga check [puppet] - 10https://gerrit.wikimedia.org/r/436509 (https://phabricator.wikimedia.org/T191299) [12:31:50] (03PS1) 10Volans: cumin: fix debmonitor alias [puppet] - 10https://gerrit.wikimedia.org/r/436784 (https://phabricator.wikimedia.org/T191299) [12:33:06] (03CR) 10Volans: [C: 032] cumin: fix debmonitor alias [puppet] - 10https://gerrit.wikimedia.org/r/436784 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:39:12] PROBLEM - wikidata.org dispatch lag is higher than 300s on 
www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1961 bytes in 0.088 second response time [12:52:50] 10Operations, 10Wikimedia-Planet, 10Patch-For-Review: upgrade planet instances to stretch - https://phabricator.wikimedia.org/T168490#4248581 (10Dzahn) Fixed ownership of the "en" logfile. It was owned by root:root, all others by planet:planet. It was from manually running update command. [12:56:04] (03PS3) 10Volans: Add debmonitor endpoints [dns] - 10https://gerrit.wikimedia.org/r/436505 (https://phabricator.wikimedia.org/T191299) [12:57:08] (03CR) 10Volans: [C: 032] Add debmonitor endpoints [dns] - 10https://gerrit.wikimedia.org/r/436505 (https://phabricator.wikimedia.org/T191299) (owner: 10Volans) [12:58:37] (03PS2) 10Jcrespo: mariadb: Repool db1083 fully [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436764 [13:02:48] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4248605 (10Vgutierrez) it would be nice to be able to use X25519 curve here, OpenSSL provides support for X25519 since version 1.1.0. Regarding... 
[13:03:49] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [13:04:07] (03PS1) 10Muehlenhoff: Fix cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/436786 [13:04:41] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1943 bytes in 0.078 second response time [13:04:51] (03PS2) 10Muehlenhoff: Fix cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/436786 [13:06:51] (03CR) 10Muehlenhoff: [C: 032] Fix cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/436786 (owner: 10Muehlenhoff) [13:08:32] (03Draft1) 10Paladox: Planet: Update global.erb rawdog config [puppet] - 10https://gerrit.wikimedia.org/r/436790 [13:08:34] (03CR) 10Dbarratt: [C: 031] Enable $wgCookieSetOnIpBlock on itwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436707 (https://phabricator.wikimedia.org/T196121) (owner: 10Dmaza) [13:08:36] (03PS2) 10Paladox: Planet: Update global.erb rawdog config [puppet] - 10https://gerrit.wikimedia.org/r/436790 [13:09:13] (03PS3) 10Paladox: Planet: Update global.erb rawdog config [puppet] - 10https://gerrit.wikimedia.org/r/436790 [13:10:11] (03CR) 10Paladox: "As i wrote in another task this requires a cookie notice due to the GDPR." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/436707 (https://phabricator.wikimedia.org/T196121) (owner: 10Dmaza) [13:10:50] (03PS4) 10Elukey: Enable base::service_auto_restart for Burrow Prometheus exporters [puppet] - 10https://gerrit.wikimedia.org/r/434934 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:10:54] (03CR) 10Elukey: [C: 031] Enable base::service_auto_restart for Burrow Prometheus exporters [puppet] - 10https://gerrit.wikimedia.org/r/434934 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:17:01] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1949 bytes in 0.057 second response time [13:21:27] (03PS1) 10Marostegui: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436791 [13:23:52] (03CR) 10Marostegui: [C: 032] Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436791 (owner: 10Marostegui) [13:25:24] (03Merged) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436791 (owner: 10Marostegui) [13:26:42] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1096:3315 (duration: 01m 03s) [13:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:40] (03CR) 10jenkins-bot: Revert "db-eqiad.php: Depool db1096:3315" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436791 (owner: 10Marostegui) [13:32:16] !log Deploy schema change on dbstore1002:s5 - T191316 T192926 T89737 T195193 [13:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:24] T89737: Make several mediawiki table fields unsigned ints on wmf databases - https://phabricator.wikimedia.org/T89737 [13:32:24] T192926: Schema change to drop archive.ar_text and archive.ar_flags - https://phabricator.wikimedia.org/T192926 
[13:32:24] T195193: Schema change for ct_tag_id field to change_tag - https://phabricator.wikimedia.org/T195193 [13:32:25] T191316: Schema change to make archive.ar_rev_id NOT NULL - https://phabricator.wikimedia.org/T191316 [13:44:32] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609#4248691 (10ema) [13:46:12] 10Operations, 10Traffic, 10Patch-For-Review: Merge cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609#3239728 (10ema) [13:53:52] 10Operations, 10Analytics-Cluster, 10Analytics-Kanban, 10Traffic, and 2 others: TLS security review of the Kafka stack - https://phabricator.wikimedia.org/T182993#4248709 (10Ottomata) Hm, ya, sounds like a way off before we get that in Debian then, ya? Is that something that would block removal of IPSec? [13:56:31] (03CR) 10Andrew Bogott: [C: 031] toollabs: add /etc/aliases file for tools-mail server [puppet] - 10https://gerrit.wikimedia.org/r/436752 (https://phabricator.wikimedia.org/T196137) (owner: 10Arturo Borrero Gonzalez) [13:57:32] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1956 bytes in 0.069 second response time [14:08:01] (03CR) 10Jcrespo: [C: 031] profile::mariadb::misc::eventlogging:db: remove unused heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/436748 (https://phabricator.wikimedia.org/T191109) (owner: 10Elukey) [14:08:16] !log reedy@tin Synchronized php-1.32.0-wmf.6/extensions/FlaggedRevs: T196139 (duration: 01m 08s) [14:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:21] T196139: Key 'rev_user_timestamp' doesn't exist in table 'revision' - https://phabricator.wikimedia.org/T196139 [14:09:17] (03CR) 10Jcrespo: [C: 031] profile::mariadb::misc::eventlogging:db: remove unused heartbeat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436748 
(https://phabricator.wikimedia.org/T191109) (owner: 10Elukey) [14:15:01] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1968 bytes in 0.077 second response time [14:16:28] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1083 fully [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436764 (owner: 10Jcrespo) [14:16:35] (03PS3) 10Jcrespo: mariadb: Repool db1083 fully [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436764 [14:18:21] (03CR) 10Elukey: profile::mariadb::misc::eventlogging:db: remove unused heartbeat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/436748 (https://phabricator.wikimedia.org/T191109) (owner: 10Elukey) [14:19:12] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [14:20:01] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1954 bytes in 0.087 second response time [14:20:04] (03PS1) 10Mark Bergsma: Extend NaiveBGPPeering unit testing [debs/pybal] - 10https://gerrit.wikimedia.org/r/436807 [14:20:07] (03PS1) 10Mark Bergsma: Test UPDATE generation of the NaiveBGPPeering [debs/pybal] - 10https://gerrit.wikimedia.org/r/436808 [14:20:09] (03PS1) 10Mark Bergsma: Fix handling of withdrawals for Inet Unicast [debs/pybal] - 10https://gerrit.wikimedia.org/r/436809 [14:20:14] (03PS2) 10Elukey: profile::mariadb::misc::eventlogging:db: remove unused heartbeat [puppet] - 10https://gerrit.wikimedia.org/r/436748 (https://phabricator.wikimedia.org/T191109) [14:22:31] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[14:22:43] (03CR) 10jenkins-bot: mariadb: Repool db1083 fully [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436764 (owner: 10Jcrespo) [14:23:13] 10Operations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Kanban): mwscript rebuildLocalisationCache.php takes 40 minutes on HHVM (rather than ~5 on PHP 5) - https://phabricator.wikimedia.org/T191921#4248843 (10Reedy) More threads means more I/O contention... Not really unexpected. It's finding... [14:24:33] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1083 fully (duration: 01m 02s) [14:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:54] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/11343/" [puppet] - 10https://gerrit.wikimedia.org/r/436748 (https://phabricator.wikimedia.org/T191109) (owner: 10Elukey) [14:29:40] 10Operations, 10Ops-Access-Reviews, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access for bmansurov to run mwscript in terbium - https://phabricator.wikimedia.org/T189285#4248874 (10bmansurov) @MoritzMuehlenhoff thank you, all good now. 
[14:30:55] !log killed pt-heartbear-wikimedia after https://gerrit.wikimedia.org/r/436748 on db1107 [14:30:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:07] argh heartbeat [14:31:56] (03PS1) 10Muehlenhoff: Implement paged LDAP searches in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/436812 [14:32:36] (03CR) 10jerkins-bot: [V: 04-1] Implement paged LDAP searches in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/436812 (owner: 10Muehlenhoff) [14:35:08] jouncebot: next [14:35:08] In 0 hour(s) and 24 minute(s): Deployment server switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180601T1500) [14:35:33] (03PS2) 10Muehlenhoff: Implement paged LDAP searches in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/436812 [14:36:11] (03CR) 10jerkins-bot: [V: 04-1] Implement paged LDAP searches in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/436812 (owner: 10Muehlenhoff) [14:39:02] (03PS2) 10Rush: openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 [14:39:04] (03PS1) 10Rush: openstack: additional notes and docs for nova and l2pop issues [puppet] - 10https://gerrit.wikimedia.org/r/436813 [14:39:45] (03PS3) 10Dzahn: Revert "Revert "switch deployment server from tin to deploy1001"" [puppet] - 10https://gerrit.wikimedia.org/r/422632 [14:39:48] (03CR) 10jerkins-bot: [V: 04-1] openstack: eqiad1 deployment (neutron in eqiad) [puppet] - 10https://gerrit.wikimedia.org/r/436337 (owner: 10Rush) [14:40:37] (03CR) 10Rush: [C: 032] openstack: additional notes and docs for nova and l2pop issues [puppet] - 10https://gerrit.wikimedia.org/r/436813 (owner: 10Rush) [14:42:31] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1961 bytes in 0.067 second response time [14:43:26] 10Operations, 10Traffic, 10Patch-For-Review: Merge 
cache_misc into cache_text functionally - https://phabricator.wikimedia.org/T164609#4248902 (10ema) [14:47:32] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1974 bytes in 0.077 second response time [14:49:27] (03PS1) 10Dzahn: scap/dsh: add deploy1001 to scap masters [puppet] - 10https://gerrit.wikimedia.org/r/436814 (https://phabricator.wikimedia.org/T175288) [14:50:41] (03CR) 10Dzahn: [C: 032] scap/dsh: add deploy1001 to scap masters [puppet] - 10https://gerrit.wikimedia.org/r/436814 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [14:52:43] !log deploy1001 - scap pull [14:52:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:56] <_joe_> mutante: uh we're moving on with the switch? [14:56:00] <_joe_> \o/ [14:56:21] jouncebot: next [14:56:21] In 0 hour(s) and 3 minute(s): Deployment server switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180601T1500) [14:56:34] _joe_: ^:) scheduled yesterday on calendar, heh [15:00:03] Is it a good idea to do it on a Friday? [15:00:04] mutante: That opportune time is upon us again. Time for a Deployment server switchover deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180601T1500). 
[15:00:44] <_joe_> marostegui: actually yes, it's the day when no deploys should happen [15:00:53] <_joe_> they have more time to do some test deployments [15:01:07] Sure [15:01:38] <_joe_> and they can always rollback to tin in case tragedy strikes :P [15:01:44] yes, that :) [15:02:10] BTW, not sure if you know that, you can lock deployments by touching a file [15:02:23] in case you need to be sure nothing is ongoing [15:04:47] <_joe_> that's just for mediawiki, and that's besides the point: you want to do a non-automatic transition during the day when something is supposedly unused [15:05:10] !log rsyncing /srv/mediawiki-staging to /srv/mediawiki-staging-before-backup/ on tin as a backup [15:05:10] what? [15:05:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:45] I am not advocating to do it another day, I think friday is the best day [15:06:11] I was just saying a random fact that may be useful to you if you didn't know it [15:06:28] it was helpful to me when I learned it [15:08:31] thanks jynus [15:08:40] so currently deploy1001 is scap pulling from tin [15:08:48] then it will scap pull-master from tin [15:09:02] then we can switch the "scap::deployment_server" (i assume) [15:09:17] after that the big switch will be down to 2 files [15:09:23] global deploy lock on tin might not be a bad idea: umask 022 && echo 'switching deploy servers' > /var/lock/scap-global-lock [15:09:54] hierdata/common.yaml and the default value in profile::mediawiki::deployment::server [15:10:18] thcipriani: thanks [15:10:43] did you already run that? cool [15:11:06] _joe_: it's also for services, but yeah :) [15:12:44] !log tin umask 022 && echo 'switching deploy servers' > /var/lock/scap-global-lock [15:12:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:01] thcipriani: scap pull is done. scap-pull-master wants arguments. 
"scap-pull-master tin.eqiad.wmnet" right [15:16:14] scap pull-master [15:16:42] mutante: yes [15:16:56] side note: cannot delete non-empty directory: php-1.31.0-wmf.28/cache/l10n [15:17:05] ok,thx [15:17:15] !log [deploy1001:~] $ scap pull-master tin.eqiad.wmnet [15:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:02] (03PS1) 10Dzahn: scap: switch scap::deployment server to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/436816 (https://phabricator.wikimedia.org/T175288) [15:18:15] the cannot delete non-empty directory thing is because we're ignoring l10n files since those get generated on the target, but we also use --delete. It's a fun rsync problem. [15:18:50] ah, that seems vaguely familiar.ok [15:20:07] harmless except for taking up space tracked in T157030 [15:20:08] T157030: cannot delete non-empty directory: php-1.29.0-wmf.3 messages on 'scap sync' on mwdebug1002 - https://phabricator.wikimedia.org/T157030 [15:20:12] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4249024 (10Dzahn) ``` --- /etc/dsh/group/scap-masters 2018-05-24 14:25:47.608760286 +0000 +deploy1001.eqiad.wmnet .. [deploy10... [15:24:08] The only good way to fix that is redo scap clean [15:24:21] Delete excluded is dangerous! 
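[editor's note] The "cannot delete non-empty directory" message discussed above comes from combining an rsync exclude pattern with --delete. The following is a minimal sketch that reproduces it in a scratch directory. It is not scap's actual rsync invocation: the directory names and the `*.cdb` pattern are made up for illustration, and it assumes rsync is installed (the demo is skipped otherwise).

```shell
#!/bin/sh
# Hypothetical scratch layout; not the real /srv/mediawiki-staging tree.
work=$(mktemp -d)
mkdir -p "$work/src/php-new" "$work/dst/php-old/cache/l10n"
echo code > "$work/src/php-new/index.php"
echo l10n > "$work/dst/php-old/cache/l10n/l10n_cache-en.cdb"

if command -v rsync >/dev/null 2>&1; then
    # php-old no longer exists on the sender, so --delete wants to remove
    # it on the receiver. The exclude rule protects the *.cdb file from
    # deletion, so the directory holding it cannot be removed and rsync
    # reports "cannot delete non-empty directory".
    rsync -a --delete --exclude='*.cdb' "$work/src/" "$work/dst/" 2>&1 || true
fi
```

Adding --delete-excluded would remove the protected files as well, which is the option called dangerous in the channel: it also deletes excluded files that the target generated for itself.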
[15:25:10] (03PS6) 10Andrew Bogott: keystonehooks: Add any new project member to bastion [puppet] - 10https://gerrit.wikimedia.org/r/436570 (https://phabricator.wikimedia.org/T165337) [15:25:28] (03PS1) 10Dzahn: switch deployment_server from tin to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/436818 (https://phabricator.wikimedia.org/T175288) [15:25:54] (03CR) 10Andrew Bogott: [C: 032] keystonehooks: Add any new project member to bastion [puppet] - 10https://gerrit.wikimedia.org/r/436570 (https://phabricator.wikimedia.org/T165337) (owner: 10Andrew Bogott) [15:28:18] just waiting for rsync [15:29:21] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging [15:29:26] heh [15:34:35] hrm, the uid for mwdeploy is different on deploy1001 vs tin. lots of files owned by uids not tied to a user. [15:34:58] ok, not the first time we run into this issue when migrating things [15:35:03] can fix with find -exec chown [15:35:16] after it's done [15:35:41] scap pull-master still running [15:36:55] once added the UID to https://wikitech.wikimedia.org/wiki/UID .. should have known [15:37:01] and changed that before [15:37:43] 15:37:29 Finished rsync master (duration: 20m 21s) [15:37:43] 15:37:29 Started rebuild CDB staging files [15:39:31] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is OK: Files ownership is ok. [15:39:42] ^ no manual fix by me [15:40:04] 15:39:56 Updated 409 CDB files(s) in /srv/mediawiki-staging/php-1.32.0-wmf.6/cache/l10n [15:40:07] 15:39:56 Finished rebuild CDB staging files (duration: 02m 26s) [15:41:43] !log root@deploy1001:/srv/mediawiki-staging# find . -uid 996 -exec chown mwdeploy {} \; [15:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:40] (03CR) 10Giuseppe Lavagetto: "The code is overall very clean, I have one doubt, but if that's wrong, LGTM." 
(031 comment) [debs/pybal] - 10https://gerrit.wikimedia.org/r/434328 (https://phabricator.wikimedia.org/T192437) (owner: 10Vgutierrez) [15:43:35] tin has /srv/redis and deploy1001 does not. but that's also an empty dir on tin [15:44:31] PROBLEM - Host labtestneutron2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:35] may have something to do with trebuchet which relied on redis, but is no longer in use [15:45:20] btw trebuchet.. we haven't renamed that user yet [15:45:30] that was UID 997 [15:46:25] or not, I guess: https://gerrit.wikimedia.org/r/#/c/278836/ [15:46:31] RECOVERY - Host labtestneutron2002 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [15:47:08] mutante: ah, yeah, mind need a find exec in /srv/deployment for trebuchet as well :\ [15:47:11] !log @deploy1001:/srv/deployment# find . -uid 997 -exec chown trebuchet {} \; [15:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:41] PROBLEM - Host labtestneutron2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:50:43] thcipriani: done. permissions better now? [15:50:56] lgtm [15:51:06] great. shall we switch the "scap::deployment_server" [15:51:19] !log elasticsearch cluster restart on codfw completed - T193734 [15:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:24] T193734: Move Serbian language wikis from extra-analysis to extra-analysis-serbian plugin - https://phabricator.wikimedia.org/T193734 [15:51:26] mutante: +1 [15:51:51] RECOVERY - Host labtestneutron2001 is UP: PING OK - Packet loss = 0%, RTA = 36.11 ms [15:52:11] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4249098 (10Dzahn) 11:41 < mutante> !log root@deploy1001:/srv/mediawiki-staging# find . -uid 996 -exec chown mwdeploy {} \; 11:4... 
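[editor's note] Before a bulk ownership fix like the `find . -uid 996 -exec chown ...` runs logged above, it can help to list what would be touched. This is a minimal sketch, not what was run on deploy1001: the helper names `list_owned_by` and `list_orphans` are made up here, and the demo uses a scratch directory and the current user's uid instead of 996 or 997.

```shell
#!/bin/sh
# list_owned_by DIR UID: print every path under DIR owned by numeric UID.
# find does not follow symlinks by default, so a symlink is matched on
# its own owner, not on the owner of its target.
list_owned_by() {
    find "$1" -uid "$2" -print
}

# list_orphans DIR: print paths whose uid has no entry in the user database.
list_orphans() {
    find "$1" -nouser -print
}

# Demo on a scratch directory owned by whoever runs the script.
demo=$(mktemp -d)
touch "$demo/a" "$demo/b"
ln -s a "$demo/link"
count=$(list_owned_by "$demo" "$(id -u)" | wc -l | tr -d ' ')
echo "paths owned by uid $(id -u): $count"
```

On a migrated host, `list_orphans /srv` would be the quick way to see whether any files still belong to a uid that no longer maps to a user.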
[15:52:13] (03PS2) 10Dzahn: scap: switch scap::deployment server to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/436816 (https://phabricator.wikimedia.org/T175288) [15:52:31] RECOVERY - Check systemd state on labtestneutron2001 is OK: OK - running: The system is fully operational [15:53:15] (03PS5) 10Herron: logstash: add tcp tls input for syslogs [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) [15:53:17] (03CR) 10Dzahn: [C: 032] scap: switch scap::deployment server to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/436816 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [15:54:35] (03PS6) 10Herron: logstash: add tcp tls input for syslogs [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) [15:54:51] PROBLEM - Host labtestneutron2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:55:11] (03CR) 10Herron: [C: 032] logstash: add tcp tls input for syslogs [puppet] - 10https://gerrit.wikimedia.org/r/431830 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [15:55:12] RECOVERY - Host labtestneutron2002 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [15:56:08] <_joe_> mutante, thcipriani please let's try to keep the trebuchet user on the same UID across servers [15:56:36] <_joe_> and mwdeploy as well [15:56:40] <_joe_> or it's going to be a mess [15:56:50] so. 
the UID page has this "possibly outdated" comment right next to trebuchet UID [15:57:00] 997:998 [15:57:17] <_joe_> trebuchet is 499:499 on deploy2001 [15:57:52] <_joe_> and the same on 1001, ok, cool [15:57:57] the status on tin matches the wiki page [15:58:02] PROBLEM - Host labtestneutron2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:58:03] (03PS3) 10Muehlenhoff: Implement paged LDAP searches in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/436812 [15:58:07] which is different from deploy1001/2001 [15:58:22] <_joe_> it's ok, the important part is having the same UIDs on the new machines [15:58:26] should we just edit the wiki page then ? [15:58:30] yes, ok [15:58:32] RECOVERY - Host labtestneutron2002 is UP: PING OK - Packet loss = 0%, RTA = 36.25 ms [15:58:33] (03CR) 10jerkins-bot: [V: 04-1] Implement paged LDAP searches in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/436812 (owner: 10Muehlenhoff) [15:58:43] i will edit that and remove the "outdated" thing [15:59:00] <_joe_> yeah, with time, no need to do it now :) [15:59:38] a few files in mediawiki-staging dont get fixed like all the others [15:59:48] and are still owned by uid 996 [16:00:01] PROBLEM - logstash syslog TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 10514: Connection refused [16:00:19] <_joe_> ? [16:00:22] PROBLEM - logstash JSON linesTCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 11514: Connection refused [16:00:33] <_joe_> mutante: did you disable the rsync? 
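[editor's note] One way a few paths can survive a `find -uid N -exec chown USER {} \;` pass is if they are symlinks: find matches the link's own owner, but chown without -h changes the link's target instead. The sketch below shows the difference with a dangling symlink in a scratch directory, so it needs no root. It assumes a chown that supports -h (GNU coreutils and the BSDs do).

```shell
#!/bin/sh
tmp=$(mktemp -d)
# A dangling symlink makes the difference visible without root.
ln -s "$tmp/missing-target" "$tmp/link"

# Without -h, chown dereferences the link; the target does not exist,
# so the call fails and the link itself is never touched.
if chown "$(id -u)" "$tmp/link" 2>/dev/null; then
    plain=ok
else
    plain=failed
fi

# With -h, chown operates on the symlink itself.
if chown -h "$(id -u)" "$tmp/link" 2>/dev/null; then
    nofollow=ok
else
    nofollow=failed
fi

echo "plain chown: $plain, chown -h: $nofollow"
```

For a bulk fix this suggests using `chown -h` inside the find command, so that links and regular files are both handled in one pass.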
[16:00:51] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1972 bytes in 0.064 second response time [16:01:01] these files are links [16:01:35] (03PS4) 10Muehlenhoff: Implement paged LDAP searches in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/436812 [16:02:09] (03CR) 10jerkins-bot: [V: 04-1] Implement paged LDAP searches in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/436812 (owner: 10Muehlenhoff) [16:02:15] <_joe_> mutante: chown -h then? [16:02:34] _joe_: didn't disable the rsync, the next merge would switch the rsync direction though [16:02:55] <_joe_> mutante: anyways, for symlinks you want chown -h [16:03:13] <_joe_> else you chown the original file, not the symlink [16:03:13] yes, that's right, it fixed it :) [16:03:52] (03PS1) 10Herron: logstash: set exposed puppet cert ownerships to logstash:logstash [puppet] - 10https://gerrit.wikimedia.org/r/436821 (https://phabricator.wikimedia.org/T193766) [16:05:41] PROBLEM - logstash log4j TCP port on logstash1007 is CRITICAL: connect to address 127.0.0.1 and port 4560: Connection refused [16:05:52] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1948 bytes in 0.070 second response time [16:06:32] (03CR) 10Herron: [C: 032] logstash: set exposed puppet cert ownerships to logstash:logstash [puppet] - 10https://gerrit.wikimedia.org/r/436821 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [16:06:44] oh good, the cron'd rsync of /srv/deployment just ran so the the exec chown for /srv/deployment will probably need to be re-run. We can do this at the end. [16:07:16] i did that to fix it once and for all [16:07:27] (03PS1) 10Mark Bergsma: Split off BGP factory/peering classes into a separate module [debs/pybal] - 10https://gerrit.wikimedia.org/r/436822 [16:07:30] stopped puppet, commented cron. 
ran command from cron, ran find [16:07:44] <_joe_> thcipriani: that was exactly my point asking about the rsync [16:08:01] RECOVERY - logstash JSON linesTCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11514 [16:08:41] RECOVERY - logstash syslog TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 10514 [16:08:52] RECOVERY - logstash log4j TCP port on logstash1007 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 4560 [16:09:22] yea, i made sure it won't run automatically, then ran it once manually, fixing the owner [16:09:23] logstash alerts were my bad. permission mismatch on the cert files used for new tls tcp listener. fixed now [16:10:05] (03PS5) 10Muehlenhoff: Implement paged LDAP searches in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/436812 [16:10:19] (03PS6) 10Muehlenhoff: Implement paged LDAP searches in offboarding script [puppet] - 10https://gerrit.wikimedia.org/r/436812 [16:11:50] thcipriani: _joe_: ok, no more files in entire /srv owned by either 996 or 997. clean now [16:11:59] wiki updated [16:12:15] i have one more thing to merge. 
just the hiera switch [16:12:35] https://gerrit.wikimedia.org/r/#/c/436818/ [16:12:53] (besides things to remove tin afterwards ) [16:13:18] !log enabled new logstash tcp input with TLS enabled for syslogs on port 16514 T193766 [16:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:23] T193766: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766 [16:15:40] (03CR) 10Dzahn: [C: 032] switch deployment_server from tin to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/436818 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [16:15:50] (03PS2) 10Dzahn: switch deployment_server from tin to deploy1001 [puppet] - 10https://gerrit.wikimedia.org/r/436818 (https://phabricator.wikimedia.org/T175288) [16:17:00] /srv/deployment/mediawiki is not owned by trebuchet, but that is same on tin [16:19:03] that one is fine, it's not currently used. There are some things behind a feature flag in scap that could use it, but that flag is off for the time being [16:20:01] i also set the scap global lock on deploy1001 [16:20:06] i merged the master change [16:21:53] !log deployment server has switched away from tin to deploy1001. set global scap lock on deploy1001, re-enabled puppet and ran puppet, disabled tin as deployment server (T175288) [16:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:58] T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288 [16:22:34] tin now has the "do not use this" motd :) [16:22:59] updated global lock message as well [16:23:26] do we need to force puppet run on all scap proxies next [16:24:03] doing that [16:25:55] yea, no. 
that's not a puppet change [16:26:04] only on masters [16:26:13] I think we pass most of the info they need from the masters [16:27:04] (03PS4) 10Dzahn: Revert "Revert "switch deployment server from tin to deploy1001"" [puppet] - 10https://gerrit.wikimedia.org/r/422632 [16:27:06] although master_rsync in /etc/scap.cfg should change [16:27:23] (03CR) 10Dzahn: "rebased into nothing - done on multiple other changes" [puppet] - 10https://gerrit.wikimedia.org/r/422632 (owner: 10Dzahn) [16:27:30] (03Abandoned) 10Dzahn: Revert "Revert "switch deployment server from tin to deploy1001"" [puppet] - 10https://gerrit.wikimedia.org/r/422632 (owner: 10Dzahn) [16:28:12] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1965 bytes in 0.076 second response time [16:28:38] the rsync cron on deploy2001 got updated by puppet to pull from deploy1001 [16:30:01] so, what should we do with that trebuchet user [16:30:17] does it need to be a new user name [16:30:29] or can we use mwdeploy for all [16:30:59] just "deploy" because not everything is mwdeploy? 
[16:31:58] deploy probably makes more sense, but I haven't really thought about it and it probably needs to be updated all over the place :\ [16:32:17] 10Operations, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4249152 (10Dzahn) [16:32:30] fwiw, i had this but marked as WIP https://gerrit.wikimedia.org/r/#/c/433516/ [16:32:34] it was more of a reminder [16:32:53] to ask if that is a bad idea [16:33:04] and i already think it probably is :) [16:33:12] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1967 bytes in 0.072 second response time [16:34:36] heh, well the wikidev group is pretty intentional, but the trebuchet user is something that would be nice to update, but it's got its hooks in lots of places. Beta would definitely need some love prior to that change-over :) [16:35:50] <_joe_> let's think about that later, shall we? [16:36:02] +1 [16:36:17] ok :) [16:38:12] PROBLEM - Logstash syslog TLS listener on port 16514 on logstash1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:SSL connect attempt failed because of handshake problems [16:39:36] !log deploy2001 - also fixing file permissions. 
files owned by 996 -> mwdeploy, files owned by 997 -> trebuchet [16:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:38] (03PS1) 10Dzahn: scap: remove tin from scap masters and hosts [puppet] - 10https://gerrit.wikimedia.org/r/436827 (https://phabricator.wikimedia.org/T175288) [16:46:37] oh, deploy1001 should also be a regular scap host of course [16:47:49] (03PS2) 10Dzahn: scap: rm tin from masters,hosts, add deploy1001 to hosts [puppet] - 10https://gerrit.wikimedia.org/r/436827 (https://phabricator.wikimedia.org/T175288) [16:48:55] (03CR) 10Dzahn: [C: 032] scap: rm tin from masters,hosts, add deploy1001 to hosts [puppet] - 10https://gerrit.wikimedia.org/r/436827 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [16:52:12] (03PS1) 10Dzahn: remove tin from hosts kubernetes master is accessible to [puppet] - 10https://gerrit.wikimedia.org/r/436830 (https://phabricator.wikimedia.org/T175288) [16:54:43] (03CR) 10Dzahn: [C: 032] remove tin from hosts kubernetes master is accessible to [puppet] - 10https://gerrit.wikimedia.org/r/436830 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [16:58:24] (03PS1) 10Dzahn: install/cumin/scap: update/remove tin-related comments [puppet] - 10https://gerrit.wikimedia.org/r/436831 (https://phabricator.wikimedia.org/T175288) [16:58:42] i'm just removing a few things but not the part that would make it hard to revert .. just yet [16:58:52] that will be seperate "decom tin" task [16:59:08] (03CR) 10jerkins-bot: [V: 04-1] install/cumin/scap: update/remove tin-related comments [puppet] - 10https://gerrit.wikimedia.org/r/436831 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [17:00:22] let me know when you're ready for me to try a noop sync from deploy1001 [17:00:33] thcipriani: ready :) [17:00:40] oh! 
:) [17:00:46] 16:58:51 ERROR: pep8: commands failed on wmf_auto_reimage_lib.py' [17:01:30] ah, i'm just making the line too long.nvm [17:02:33] oh pep8 [17:02:38] (03PS2) 10Dzahn: install/cumin/scap: update/remove tin-related comments [puppet] - 10https://gerrit.wikimedia.org/r/436831 (https://phabricator.wikimedia.org/T175288) [17:04:06] alright, everything looks clean in /srv/mediawiki-staging so here goes [17:04:18] (03CR) 10Dzahn: [C: 032] install/cumin/scap: update/remove tin-related comments [puppet] - 10https://gerrit.wikimedia.org/r/436831 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [17:04:43] mutante: heh, could you remove the /var/lock/scap-global-lock from deploy1001? [17:04:55] oh, sure :) [17:05:08] deploy1001 unlocked [17:05:42] ok, going ahead with the noop sync-file [17:05:46] nice [17:06:42] !log thcipriani@deploy1001 Synchronized README: noop test of new deployment server (duration: 00m 53s) [17:06:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:14] mutante: all looks good [17:07:18] 10Operations: decom tin - https://phabricator.wikimedia.org/T196175#4249218 (10Dzahn) [17:07:45] 10Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#4249233 (10Dzahn) [17:07:53] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4066939 (10Dzahn) [17:07:55] 10Operations: decom tin - https://phabricator.wikimedia.org/T196175#4249232 (10Dzahn) [17:08:18] 10Operations: decom tin - https://phabricator.wikimedia.org/T196175#4249218 (10Dzahn) a:03Dzahn [17:08:21] thcipriani: :)) [17:09:39] thanks for testing and going through this with me [17:10:19] sure thing, I watch other folks work with the best of them, kudos on the upgrade :) [17:10:52] mutante: let's wait with decomming tin for a few days, though [17:11:21] 10Operations: decom tin - 
https://phabricator.wikimedia.org/T196175#4249251 (10Dzahn) [17:11:29] 10Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#3911588 (10Dzahn) [17:11:32] 10Operations: decom tin - https://phabricator.wikimedia.org/T196175#4249218 (10Dzahn) 05Open>03stalled p:05Triage>03Normal [17:11:40] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4249258 (10Dzahn) [17:12:11] moritzm: yes, i will. i am preparing that and stalling it for a grace period [17:12:41] i leave tin in the network/constants and site [17:12:55] so revert would still not be too messy [17:13:01] ack [17:15:09] 10Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#4249279 (10Dzahn) It has switched from tin to deploy1001 (again, this time hopefully for good) today. All the details were in T175288. Just some more cleanup here for tin maybe. A decom task for tin will be T196175. [17:15:14] resolves "setup/install/deploy deploy1001 as deployment server" but keeps "replace tin (new hardware)" [17:16:04] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4249284 (10Dzahn) 05Open>03Resolved [17:17:15] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4066988 (10Dzahn) deploy1001 is now the active deployment server. from here it should just be about removing tin. we will wait a... [17:20:27] mutante: when all tests are complete, could you also send a mail to ops@ to make all deployers aware? 
[17:23:25] ok, will do [17:24:38] 10Operations, 10Patch-For-Review: Ship host syslogs to ELK - https://phabricator.wikimedia.org/T193766#4249310 (10herron) For some reason the new icinga check for this called "Logstash syslog TLS listener on port 16514" is erroring with: ``` $ /usr/lib/nagios/plugins/check_ssl -H logstash1007.eqiad.wmnet -p 1... [17:24:44] also, when does deployment.eqiad.wmnet get updated? [17:24:52] like now :) [17:25:08] k :) [17:26:02] (03PS1) 10Dzahn: switch deployment.eqiad from tin to deploy1001 [dns] - 10https://gerrit.wikimedia.org/r/436835 (https://phabricator.wikimedia.org/T175288) [17:26:22] (03PS2) 10Dzahn: switch deployment.eqiad from tin to deploy1001 [dns] - 10https://gerrit.wikimedia.org/r/436835 (https://phabricator.wikimedia.org/T175288) [17:27:03] deployment.codfw as well.. right [17:27:10] it points just over to eqiad [17:27:31] it doesnt mean that is deploy2001 [17:28:49] (03PS3) 10Dzahn: switch deployment.[eqiad|codfw] from tin to deploy1001 [dns] - 10https://gerrit.wikimedia.org/r/436835 (https://phabricator.wikimedia.org/T175288) [17:28:59] yeah, that's right [17:29:30] (03PS4) 10Dzahn: switch deployment.[eqiad|codfw] from tin to deploy1001 [dns] - 10https://gerrit.wikimedia.org/r/436835 (https://phabricator.wikimedia.org/T175288) [17:29:51] (03CR) 10Dzahn: [C: 032] switch deployment.[eqiad|codfw] from tin to deploy1001 [dns] - 10https://gerrit.wikimedia.org/r/436835 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [17:31:48] we have the big "this is not the active server" motd.. but we don't have one for the active server, also no role comment [17:33:20] [deploy2001:~] $ host deployment [17:33:20] deployment.eqiad.wmnet is an alias for deploy1001.eqiad.wmnet. 
[17:33:22] done [17:34:11] !log deployment.eqiad/codfw DNS names switched from tin to deploy1001 [17:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:38] (03PS1) 10Herron: icinga: remove Logstash syslog TLS listener check for troubleshooting [puppet] - 10https://gerrit.wikimedia.org/r/436837 (https://phabricator.wikimedia.org/T193766) [17:35:56] (03CR) 10Dzahn: [C: 032] Remove obsolete partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/436767 (https://phabricator.wikimedia.org/T156944) (owner: 10Muehlenhoff) [17:37:09] (03PS3) 10Dzahn: Remove obsolete partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/436767 (https://phabricator.wikimedia.org/T156944) (owner: 10Muehlenhoff) [17:37:28] (03PS4) 10Dzahn: Remove obsolete partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/436767 (https://phabricator.wikimedia.org/T156944) (owner: 10Muehlenhoff) [17:37:50] (03CR) 10Dzahn: [C: 032] "already removed the netboot.cfg line in https://gerrit.wikimedia.org/r/#/c/436831/ rebased - can be merged before the parent" [puppet] - 10https://gerrit.wikimedia.org/r/436767 (https://phabricator.wikimedia.org/T156944) (owner: 10Muehlenhoff) [17:38:15] (03CR) 10Herron: [C: 032] icinga: remove Logstash syslog TLS listener check for troubleshooting [puppet] - 10https://gerrit.wikimedia.org/r/436837 (https://phabricator.wikimedia.org/T193766) (owner: 10Herron) [17:39:03] (03PS5) 10Dzahn: Remove obsolete partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/436767 (https://phabricator.wikimedia.org/T156944) (owner: 10Muehlenhoff) [17:39:16] (03PS6) 10Dzahn: Remove obsolete partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/436767 (https://phabricator.wikimedia.org/T156944) (owner: 10Muehlenhoff) [17:56:18] are fingerprints at https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/deploy1001.eqiad.wmnet correct? 
[17:56:33] because I try to connect to deploy1001 and get something else [17:56:52] ECDSA key fingerprint is SHA256:fC3OkgwnAX3FbkyyVQCfdpG0W/41rwhZx2sppYsLbN0. [17:58:05] mutante ^^ [17:59:06] It's... asking me for a password [17:59:14] oh, wait [17:59:16] i'm already on tin [18:00:07] SMalyshev: I'm betting it's been reinstalled since then [18:00:08] deploy1001.eqiad.wmnet ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBIP0HKyRoT1CaCh3DInn6cwmpjtEX7kQuEJaadHMdWNBjubU8RmNRUCMtXGnZricSq+p/79ST2iDQT/9ihPfYD0= [18:00:12] Debian GNU/Linux 9 auto-installed on Wed Apr 11 16:32:01 UTC 2018. [18:00:23] Page is dated two weeks before [18:00:52] Reedy: yep, probably.. in this case somebody better to update the page :) [18:00:54] SMalyshev: i updated https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/deploy1001.eqiad.wmnet [18:01:21] mutante: thanks, works for me now [18:01:39] great, commenting on that list mail [18:02:17] bastions have all the hosts' ssh keys in their /etc/ssh/ssh_known_hosts [18:02:29] so what I do is I copy that regularly locally [18:03:13] to ~/.ssh/known_hosts.d/wmf-prod, and then I have UserKnownHostsFile ~/.ssh/known_hosts.d/wmf-prod in my .ssh/config for *.wmnet *.wikimedia.org [18:04:16] just mentioning my workflow, in case that's easier for y'all :) [18:04:35] oh and thanks SMalyshev for actually verifying the host key :) [18:04:43] and reporting it here [18:04:55] 10Operations, 10Analytics, 10hardware-requests: Site: eqiad | hardware request for a dedicated stat analytics host for the Research team - https://phabricator.wikimedia.org/T196080#4249469 (10RobH) a:03elukey We CANNOT move the GPU between hosts. It is in that chassis (stat1005), specifically ordered to h... 
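An aside on the host-key check discussed above: an OpenSSH SHA256 fingerprint is the base64 of the SHA-256 digest of the decoded public key blob, with the trailing padding removed. That means a known_hosts-style line such as the one pasted in channel can be compared against the fingerprint ssh prints at connect time. The following is only a minimal sketch under that assumption; the function name `sha256_fingerprint` is hypothetical and not part of any tooling mentioned in the log.

```python
import base64
import hashlib


def sha256_fingerprint(known_hosts_line: str) -> str:
    """Return an OpenSSH-style SHA256 fingerprint.

    Expects a line shaped like 'hostname keytype base64key', as found in
    ssh_known_hosts. The function name is hypothetical (illustration only).
    """
    fields = known_hosts_line.split()
    if len(fields) < 3:
        raise ValueError("expected 'host keytype base64key'")
    # The third field is the base64-encoded public key blob.
    key_blob = base64.b64decode(fields[2])
    digest = hashlib.sha256(key_blob).digest()
    # OpenSSH prints the digest as base64 without the trailing '=' padding.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
```

Passing the `deploy1001.eqiad.wmnet ecdsa-sha2-nistp256 ...` line from the log to this function should yield a string in the same `SHA256:...` form that ssh shows, which can then be compared by eye or in a script. `ssh-keygen -l -f <file>` does the same job from the command line.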
[18:14:10] (03CR) 10Dzahn: [C: 032] planet: move plugin dir out of feeds dir [puppet] - 10https://gerrit.wikimedia.org/r/436583 (owner: 10Dzahn) [18:17:35] (03PS4) 10Dzahn: Planet: Update global.erb rawdog config [puppet] - 10https://gerrit.wikimedia.org/r/436790 (owner: 10Paladox) [18:19:15] (03CR) 10Dzahn: [C: 032] Planet: Update global.erb rawdog config [puppet] - 10https://gerrit.wikimedia.org/r/436790 (owner: 10Paladox) [18:33:32] (03PS3) 10Rush: openstack: labtest use labtestcontrol2003 for keystone [puppet] - 10https://gerrit.wikimedia.org/r/433734 (https://phabricator.wikimedia.org/T167559) [18:37:21] (03PS1) 10EBernhardson: Send elasticsearch slowlogs to logstash [puppet] - 10https://gerrit.wikimedia.org/r/436841 (https://phabricator.wikimedia.org/T196180) [18:39:06] PROBLEM - keystone public endoint port 5000 on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 5000: Connection refused [18:39:26] PROBLEM - keystone admin endpoint port 35357 on labtestcontrol2001 is CRITICAL: connect to address 208.80.153.47 and port 35357: Connection refused [18:42:31] (03CR) 10Rush: [C: 032] openstack: labtest use labtestcontrol2003 for keystone [puppet] - 10https://gerrit.wikimedia.org/r/433734 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [18:45:35] PROBLEM - puppet last run on labtestcontrol2001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:46:26] PROBLEM - puppet last run on labcontrol1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:49:20] An error occurred while reading state from /etc/rawdog/zh/state. [18:49:21] This usually means the file is corrupt, and removing it will fix the problem. 
[18:49:27] /etc/rawdog/zh/state: cannot open `/etc/rawdog/zh/state' (No such file or directory) [18:49:37] (03PS1) 10Rush: openstack: specify keystone_host for main deployment [puppet] - 10https://gerrit.wikimedia.org/r/436842 [18:49:39] funny.. it's "corrupt" and "not existing" in one [18:50:24] (03CR) 10Rush: [C: 032] openstack: specify keystone_host for main deployment [puppet] - 10https://gerrit.wikimedia.org/r/436842 (owner: 10Rush) [18:50:25] PROBLEM - puppet last run on labcontrol1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:50:31] (03Abandoned) 10Imarlier: webperf: separate permissions from specific apps [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [18:52:21] (03Restored) 10Imarlier: webperf: separate permissions from specific apps [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [18:52:26] RECOVERY - keystone admin endpoint port 35357 on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 783 bytes in 0.082 second response time [18:52:31] (03PS9) 10Imarlier: webperf: separate permissions from specific apps [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) [18:52:35] PROBLEM - puppet last run on labvirt1020 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [18:53:08] (03CR) 10jerkins-bot: [V: 04-1] webperf: separate permissions from specific apps [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [18:53:16] RECOVERY - keystone public endoint port 5000 on labtestcontrol2001 is OK: HTTP OK: HTTP/1.1 300 Multiple Choices - 781 bytes in 0.076 second response time [18:54:15] PROBLEM - MariaDB Slave Lag: s8 on db2082 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 402.48 seconds [18:54:25] PROBLEM - MariaDB Slave Lag: s8 on db2079 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 409.85 seconds [18:54:26] PROBLEM - MariaDB Slave Lag: s8 on db2081 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 413.96 seconds [18:54:46] PROBLEM - MariaDB Slave Lag: s8 on db2085 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 422.93 seconds [18:54:46] PROBLEM - MariaDB Slave Lag: s8 on db2094 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 423.09 seconds [18:54:46] PROBLEM - MariaDB Slave Lag: s8 on db2080 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 423.37 seconds [18:54:46] PROBLEM - MariaDB Slave Lag: s8 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 425.19 seconds [18:54:55] PROBLEM - MariaDB Slave Lag: s8 on db2086 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 426.92 seconds [18:54:56] PROBLEM - MariaDB Slave Lag: s8 on db2045 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 429.20 seconds [18:55:26] RECOVERY - puppet last run on labcontrol1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:55:45] RECOVERY - puppet last run on labtestcontrol2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:56:35] RECOVERY - puppet last run on labcontrol1002 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:57:40] marlier: it's the red squares on 
https://gerrit.wikimedia.org/r/#/c/433710/9/modules/role/manifests/webperf/processors_and_site.pp [18:58:01] 10Operations, 10Analytics, 10hardware-requests: Site: eqiad | hardware request for a dedicated stat analytics host for the Research team - https://phabricator.wikimedia.org/T196080#4249584 (10Ottomata) That is not a bad idea. Although moving folks between stat boxes is not the easiest thing to do... :) [18:58:39] (03PS10) 10Imarlier: webperf: separate permissions from specific apps [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) [18:58:48] mutante: Yeah, just fixed that [19:00:03] (03PS11) 10Imarlier: webperf: separate permissions from specific apps [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) [19:01:43] marlier: i think we should probably just do this in Hiera [19:02:01] we can assign any admin group to any role or profile [19:02:15] without needing to have a "master role" [19:02:27] if that's still what you wanted to achieve here [19:02:35] the shell access itself is fixed, right [19:05:10] mutante: yes, shell access is fixed. I ditched the "master" role. But still need to separate the webperf::processors_and_site role from the webperf::profiling_tools role. [19:05:17] They do different things. [19:05:31] gotcha [19:05:32] This change set has basically morphed into a few renames. 
[19:05:36] ok [19:10:11] 10Operations, 10Datasets-General-or-Unknown, 10Dumps-Generation: Rack and setup snapshot1008 - https://phabricator.wikimedia.org/T195385#4249606 (10RobH) [19:14:43] !log zh.planet - fixed issue with corrupt state file and permissions - updated and using new design as well now [19:14:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:55] RECOVERY - puppet last run on labvirt1020 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:20:32] (03PS1) 10EBernhardson: Tune CirrusSearch slow logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/436848 (https://phabricator.wikimedia.org/T196180) [19:21:01] !log enable query phase slow logging and increase thresholds for fetch phase slow logging for content/general indices on eqiad and codfw elasticsearch clusters [19:21:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:52] 10Operations, 10Datasets-General-or-Unknown: rack/setup/install snapshot1009 - https://phabricator.wikimedia.org/T196189#4249646 (10RobH) p:05Triage>03Normal [19:28:56] 10Operations, 10Performance-Team, 10monitoring, 10Patch-For-Review: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837#4249663 (10Imarlier) >>! In T158837#4221485, @Krinkle wrote: > @Imarlier I was just thinking about whether the etcd code is live or not for we... [19:31:22] (03PS12) 10Imarlier: webperf: Make the different webperf roles explicit [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) [19:35:08] (03CR) 10Thcipriani: [C: 031] "Thanks for the patch, looks like a sane change to me. We might be able to limit this further, but I'd have to dig a little bit to be sure." 
[puppet] - 10https://gerrit.wikimedia.org/r/428707 (owner: 10Imarlier) [19:37:41] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/436841 (https://phabricator.wikimedia.org/T196180) (owner: 10EBernhardson) [19:38:23] (03CR) 10Krinkle: [C: 031] "If/when this lands, remember to also update the role assigned to webperf01 in beta via horizon.wikimedia.org (not currently in Git)." [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [19:43:45] (03CR) 10Dzahn: "can you also update the name here:" [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [19:48:00] (03PS1) 10Rush: openstack: allow glance to call back for token validation [puppet] - 10https://gerrit.wikimedia.org/r/436853 (https://phabricator.wikimedia.org/T167559) [19:48:41] (03CR) 10Rush: [C: 032] openstack: allow glance to call back for token validation [puppet] - 10https://gerrit.wikimedia.org/r/436853 (https://phabricator.wikimedia.org/T167559) (owner: 10Rush) [19:49:37] (03CR) 10Dzahn: [C: 031] "compiled: http://puppet-compiler.wmflabs.org/11348/ looks all good. 
ready to merge it minus the nitpick for the cumin alias" [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [19:50:29] (03CR) 10Dzahn: [C: 031] "(that's for debdeploy to upgrade packages or not based on their roles, btw)" [puppet] - 10https://gerrit.wikimedia.org/r/433710 (https://phabricator.wikimedia.org/T158837) (owner: 10Imarlier) [20:09:06] 10Operations, 10Continuous-Integration-Infrastructure, 10SRE-Access-Requests: Add Reedy to contint-docker group - https://phabricator.wikimedia.org/T196192#4249733 (10Legoktm) [20:23:29] (03PS1) 10Reedy: Add Reedy to contint-docker group [puppet] - 10https://gerrit.wikimedia.org/r/436860 (https://phabricator.wikimedia.org/T196192) [20:23:35] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2596 MB (5% inode=64%) [20:24:46] RECOVERY - MariaDB Slave Lag: s7 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 6.44 seconds [20:28:25] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1966 bytes in 0.078 second response time [20:38:35] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1952 bytes in 0.069 second response time [20:40:46] RECOVERY - Disk space on contint1001 is OK: DISK OK [20:48:02] (03PS1) 10Chico Venancio: Toolforge: add sqlite3 package to exec_environ [puppet] - 10https://gerrit.wikimedia.org/r/436903 [20:48:56] RECOVERY - Check systemd state on kubernetes2003 is OK: OK - running: The system is fully operational [20:50:16] 10Operations, 10Release Pipeline, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Switch CI Docker Storage Driver to its own partition and to use devicemapper - https://phabricator.wikimedia.org/T178663#3698972 (10Dzahn) contint1001 was close to running out of disk... 
[20:52:15] PROBLEM - Check systemd state on kubernetes2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:54:31] 10Operations, 10Release Pipeline, 10Continuous-Integration-Infrastructure (shipyard), 10Release-Engineering-Team (Kanban): Switch CI Docker Storage Driver to its own partition and to use devicemapper - https://phabricator.wikimedia.org/T178663#4249798 (10hashar) From my previous comment T178663#3699074 , I... [21:11:06] PROBLEM - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1963 bytes in 0.079 second response time [21:23:46] (03PS3) 10Dzahn: decom and remove remnants of tin.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) [21:24:28] (03CR) 10Dzahn: [C: 04-2] "not yet - will wait a bit before this" [puppet] - 10https://gerrit.wikimedia.org/r/420917 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [21:26:25] RECOVERY - wikidata.org dispatch lag is higher than 300s on www.wikidata.org is OK: HTTP OK: HTTP/1.1 200 OK - 1962 bytes in 0.072 second response time [21:30:15] (03PS3) 10Dzahn: planet: rm planet-venus feed templates, rename feeds_rawdog to feeds [puppet] - 10https://gerrit.wikimedia.org/r/436580 (https://phabricator.wikimedia.org/T180498) [21:31:36] (03CR) 10Dzahn: [C: 032] planet: rm planet-venus feed templates, rename feeds_rawdog to feeds [puppet] - 10https://gerrit.wikimedia.org/r/436580 (https://phabricator.wikimedia.org/T180498) (owner: 10Dzahn) [21:32:27] (03PS3) 10Dzahn: planet: move plugin dir out of feeds dir [puppet] - 10https://gerrit.wikimedia.org/r/436583 [21:34:55] 10Operations, 10Wikimedia-Mailing-lists: Official support for upgrade from existing Mailman 2.1 lists to Mailman 3 - https://phabricator.wikimedia.org/T130554#4249897 (10Quiddity) [21:36:06] PROBLEM - puppet last run on planet1001 is CRITICAL: CRITICAL: Puppet has 18 
failures. Last run 3 minutes ago with 18 failures. Failed resources (up to 3 shown): File[/etc/rawdog/ar],File[/etc/rawdog/bg],File[/etc/rawdog/cs],File[/etc/rawdog/de] [21:37:23] (03PS1) 10Dzahn: Revert "planet: rm planet-venus feed templates, rename feeds_rawdog to feeds" [puppet] - 10https://gerrit.wikimedia.org/r/436907 [21:37:46] (03CR) 10Dzahn: "too early, needs to wait until jessie is gone" [puppet] - 10https://gerrit.wikimedia.org/r/436907 (owner: 10Dzahn) [21:37:56] (03CR) 10Dzahn: [C: 032] Revert "planet: rm planet-venus feed templates, rename feeds_rawdog to feeds" [puppet] - 10https://gerrit.wikimedia.org/r/436907 (owner: 10Dzahn) [21:40:43] 10Puppet, 10Beta-Cluster-Infrastructure, 10Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#4249908 (10EddieGP) [21:40:48] 10Puppet, 10Analytics, 10Analytics-Kanban, 10Beta-Cluster-Infrastructure, 10Patch-For-Review: deployment-eventlog05 puppet error about missing mysql heartbeat.heartbeat table - https://phabricator.wikimedia.org/T191109#4249907 (10EddieGP) 05Open>03Resolved [21:41:06] RECOVERY - puppet last run on planet1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [22:04:06] RECOVERY - MariaDB Slave Lag: s7 on db2061 is OK: OK slave_sql_lag Replication lag: 0.00 seconds [22:10:56] PROBLEM - Router interfaces on cr2-esams is CRITICAL: CRITICAL: host 91.198.174.244, interfaces up: 71, down: 1, dormant: 0, excluded: 0, unused: 0 [22:11:35] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 224, down: 1, dormant: 0, excluded: 0, unused: 0 [22:25:47] (03PS1) 10Chico Venancio: horizon: fix Horizion title branding [puppet] - 10https://gerrit.wikimedia.org/r/436951 [22:28:17] PROBLEM - novaobserver has only observer role on labcontrol1001 is CRITICAL: In bastion, user novaobserver should have roles [observer] but has [uobserver, uuser] [22:32:37] RECOVERY - novaobserver has only 
observer role on labcontrol1001 is OK: novaobserver has the correct roles in all projects. [22:42:42] (03CR) 10EddieGP: [C: 031] horizon: fix Horizion title branding [puppet] - 10https://gerrit.wikimedia.org/r/436951 (owner: 10Chico Venancio) [22:43:20] (03CR) 10jerkins-bot: [V: 04-1] horizon: fix Horizion title branding [puppet] - 10https://gerrit.wikimedia.org/r/436951 (owner: 10Chico Venancio) [22:47:02] (03PS1) 10Andrew Bogott: keystonehooks: only add users to bastion if they have the 'user' role [puppet] - 10https://gerrit.wikimedia.org/r/436955 [22:47:11] (03PS2) 10Krinkle: deployment-prep: add webperf to scap::dsh::groups [puppet] - 10https://gerrit.wikimedia.org/r/436586 (https://phabricator.wikimedia.org/T195314) [22:47:31] (03CR) 10jerkins-bot: [V: 04-1] keystonehooks: only add users to bastion if they have the 'user' role [puppet] - 10https://gerrit.wikimedia.org/r/436955 (owner: 10Andrew Bogott) [22:47:33] (03CR) 10jerkins-bot: [V: 04-1] deployment-prep: add webperf to scap::dsh::groups [puppet] - 10https://gerrit.wikimedia.org/r/436586 (https://phabricator.wikimedia.org/T195314) (owner: 10Krinkle) [22:47:36] (03PS3) 10Krinkle: deployment-prep: add webperf to scap::dsh::groups [puppet] - 10https://gerrit.wikimedia.org/r/436586 (https://phabricator.wikimedia.org/T195314) [22:47:46] (03PS4) 10Krinkle: deployment-prep: add webperf to scap::dsh::groups [puppet] - 10https://gerrit.wikimedia.org/r/436586 (https://phabricator.wikimedia.org/T195314) [22:47:52] (03PS2) 10Andrew Bogott: keystonehooks: only add users to bastion if they have the 'user' role [puppet] - 10https://gerrit.wikimedia.org/r/436955 [22:48:02] (03PS5) 10Krinkle: deployment-prep: add webperf to scap::dsh::groups [puppet] - 10https://gerrit.wikimedia.org/r/436586 (https://phabricator.wikimedia.org/T195314) [22:48:21] (03CR) 10jerkins-bot: [V: 04-1] keystonehooks: only add users to bastion if they have the 'user' role [puppet] - 10https://gerrit.wikimedia.org/r/436955 (owner: 10Andrew Bogott) 
[22:48:41] (03PS3) 10Andrew Bogott: keystonehooks: only add users to bastion if they have the 'user' role [puppet] - 10https://gerrit.wikimedia.org/r/436955 (https://phabricator.wikimedia.org/T165337) [22:49:45] (03PS2) 10Chico Venancio: horizon: fix Horizion title branding [puppet] - 10https://gerrit.wikimedia.org/r/436951 (https://phabricator.wikimedia.org/T196199) [22:58:18] (03PS1) 10ArielGlenn: generate temp stubs for page ranges serially from same input stub file [dumps] - 10https://gerrit.wikimedia.org/r/436956 (https://phabricator.wikimedia.org/T196063) [23:24:45] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 [23:25:25] RECOVERY - Router interfaces on cr2-esams is OK: OK: host 91.198.174.244, interfaces up: 73, down: 0, dormant: 0, excluded: 0, unused: 0 [23:43:26] PROBLEM - puppet last run on lvs5003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [23:52:24] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Access to usergroups for Marshall Miller - https://phabricator.wikimedia.org/T194550#4250064 (10MMiller_WMF) @herron @Ottomata -- I would like to use hue.wikimedia.org to query Hadoop, but it looks like my Wikitech login doesn't get me i...